boostorg / wave

Boost.org wave module
http://boost.org/libs/wave
21 stars 49 forks source link

Update marker before and after fill() operation in cpplexer #203

Closed chisolm closed 5 months ago

chisolm commented 5 months ago

I think this change is in line with how s->ptr and marker are expected to interact.

This is in reference to: https://github.com/boostorg/wave/issues/202

jefftrull commented 5 months ago

We'll need to take a careful look at this change, as YYFILL plays a central role in the lexer. I'd be surprised if it's this wrong, given our extensive test suite, but of course it's possible...

jefftrull commented 5 months ago

In particular, the re2c manual calls YYMARKER A pointer-like l-value (usually a pointer of type YYCTYPE*) that stores the position of the latest matched rule. It is used to restore the YYCURSOR position if the longer match fails and the lexer needs to rollback. By contrast YYFILL is used to provide additional input when the current buffer has been emptied. At the time YYFILL is run, the input it is going to return has not been seen by the lexer, and thus IMO it would be inappropriate to change the position of the latest matched rule as a side effect.

I think there may be a more likely explanation: some kind of bug in the fill function

chisolm commented 5 months ago

I would not be surprised if there is a better solution or that this solution doesn't hold for some use cases.

I think I can show that YYMARKER is at least part of the problem and/or solution.

The state machine for picking out the quoted strings is the code I've been working through, so I think I can explain it there.

In that state machine, we go to yy14 on the opening quote: https://github.com/boostorg/wave/blob/543a0e9e1d0e6190a571c940945a68fbded5da56/include/boost/wave/cpplexer/re2clex/cpp_re.inc#L136

yy14: yyaccept = 0; yych = *(YYMARKER = ++YYCURSOR); if (yych <= '\n') { if (yych == '\t') goto yy99; } else { if (yych <= '\f') goto yy99; if (yych >= ' ') goto yy99; }

YYMARKER is set to the next character in the quote, providing the rewind capability, that I think you're mentioning.

The in yy99 and yy98 cyle, which we go to next: https://github.com/boostorg/wave/blob/543a0e9e1d0e6190a571c940945a68fbded5da56/include/boost/wave/cpplexer/re2clex/cpp_re.inc#L775C1-L787C1

We continue consuming characters in the input until YYLIMIT == YYCURSOR and then we need to YYFILL() to increase the buffer size. Then in the fill, we update the buffers:

https://github.com/boostorg/wave/blob/543a0e9e1d0e6190a571c940945a68fbded5da56/include/boost/wave/cpplexer/re2clex/cpp_re.hpp#L39

to:

https://github.com/boostorg/wave/blob/543a0e9e1d0e6190a571c940945a68fbded5da56/include/boost/wave/cpplexer/re2clex/cpp_re.hpp#L197

at this point in the fill we are copying the bit of old data left in the old buffer via the memove and the for loop after this chunk fills the buffer up with up to BOOST_WAVE_BSIZE amount of data.

Now we free the old buffer where YYMARKER is still pointing: https://github.com/boostorg/wave/blob/543a0e9e1d0e6190a571c940945a68fbded5da56/include/boost/wave/cpplexer/re2clex/cpp_re.hpp#L217

The returning back to yy98 we eventually hit a character (newline in my case) that is an error and YYCURSOR is updated from YYMARKER which points into free'd memory.

With that I'd at least say YYMARKER is not being handled correctly.

There is also a secondary problem in: https://github.com/boostorg/wave/blob/543a0e9e1d0e6190a571c940945a68fbded5da56/include/boost/wave/cpplexer/re2clex/cpp_re.hpp#L182C1-L193C10 that updating s->ptr from YYMARKER on the way into fill() will fix as well.

So, I'm basing my concept of how YYMARKER and s->pointer are suppossed to interact based on: https://github.com/boostorg/wave/blob/543a0e9e1d0e6190a571c940945a68fbded5da56/include/boost/wave/cpplexer/re2clex/cpp_re.hpp#L182C1-L193C10 and https://github.com/boostorg/wave/blob/543a0e9e1d0e6190a571c940945a68fbded5da56/include/boost/wave/cpplexer/re2clex/cpp_re.hpp#L375

Seems like s->ptr and marker are intended to track each other.

BTW I'm trying to get my dev environment setup for boost and having some trouble, still working on helping with the test case reproduction.

jefftrull commented 5 months ago

In case you find the re2c input more readable - as I do - the command line we use is as follows:

re2c -b -o cpp_re.inc cpp.re

You want the 0.13.5 version, which appears to now be unavailable on their website :( but 0.13.6 is there and will probably produce similar results

chisolm commented 5 months ago

I made a new branch on my fork: https://github.com/chisolm/wave/tree/YYMARKER_reproduction

I was unable to reproduce the same behavior. In fact it is seemingly exiting with no error code but it is truncating output. I did this testing on linux.

I was unable to get a good break point under gdb, it did not seem to be indexing the .hpp files under re2c as source files. So was was reduced to debugging by printf.

The test file I committed was basically the same, some of the x characters changed to something else to make it easier to see the diffs. There are two calls to fill, one for the initial buffer and the second is the one of interest and it shows marker with the old pointer.

I put in a few prints to show marker is refering to the free'd page. I think this behavior at least should replicate for you fairly easily.

(base) cchisolm@ubuntu:~/boostdev/boost/libs/wave$ ../../bin.v2/libs/wave/tool/build/gcc-7/debug/threading-multi/wavetool-on/wave test/testwave/testfiles/t_9_029.cpp -E count 0

Old block pointers bot 0 0 0 0 0 Old block adjusted pointers bot 0 0 0 0 0 Freeing old block 0 0x30000 0x7fd30797f010 0x7fd30797f010 0x7fd30797f010 New block adjusted pointers bot 0x7fd30797f010 0x7fd30797f010 0x7fd30797f010 0x7fd30797f010 0x7fd30797f010 marker points to: 0 count 196570 "1234567891123456789212345678931234567 1234567891123456789212345678931234567 234567891123456789212345678931234567 34567891123456789212345678931234567 4567891123456789212345678931234567 567891123456789212345678931234567 67891123456789212345678931234567 7891123456789212345678931234567 891123456789212345678931234567 91123456789212345678931234567 1123456789212345678931234567 123456789212345678931234567 23456789212345678931234567 3456789212345678931234567 456789212345678931234567 56789212345678931234567 6789212345678931234567 789212345678931234567 89212345678931234567 9212345678931234567 212345678931234567 12345678931234567 2345678931234567 345678931234567 45678931234567 5678931234567 678931234567 78931234567 8931234567 931234567 31234567 1234567 234567 34567 4567 567 67 7 Old block pointers bot 0x7fd30797f010 0x7fd3079aefea 0x7fd3079aefea 0x7fd3079aefea 0x7fd3079af010 Old block adjusted pointers bot 0x7fd30797f010 0x7fd30797f010 0x7fd30797f010 0x7fd30797f010 0x7fd30797f036 Freeing old block 0x7fd30797f010 0x7fd3079af010 0x7fd307833010 0x7fd307833010 0x7fd307833036 New block adjusted pointers bot 0x7fd307833010 0x7fd307833010 0x7fd307833010 0x7fd307833010 0x7fd307833036 marker points to: 0x7fd3079aefeb /home/cchisolm/boostdev/boost/libs/wave/test/testwave/testfiles/t_9_029.cpp:5715:42: warning: last line of file ends without a newline (base) cchisolm@ubuntu:~/boostdev/boost/libs/wave$ tail test/testwave/testfiles/t_9_029.cpp xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx l1 l2xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"1234567891123456789212345678931234567894 " xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx do we see this data after the fill? (base) cchisolm@ubuntu:~/boostdev/boost/libs/wave$ tail test/testwave/testfiles/t_9_029.i xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx l1 l2xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx(base) cchisolm@ubuntu:~/boostdev/boost/libs/wave$

hkaiser commented 5 months ago

I think this should be much easier to debug when setting BOOST_WAVE_BSIZE to something small.

jefftrull commented 5 months ago

It's funny you should mention that :) I think I've found at least two bugs since yesterday that way. One of them does indeed point to an issue with updating marker (and I think, also limit) when a fill occurs. @chisolm I am coming around to your point of view!

jefftrull commented 5 months ago

@hkaiser I created a test case for this issue that seems to be independent of BOOST_WAVE_BSIZE (though it does require knowledge of its value, for which purpose I removed a #undef). If @chisolm accepts my PRs it will become part of this.

hkaiser commented 5 months ago

@hkaiser I created a test case for this issue that seems to be independent of BOOST_WAVE_BSIZE (though it does require knowledge of its value, for which purpose I removed a #undef). If @chisolm accepts my PRs it will become part of this.

Your fix (https://github.com/jefftrull/wave/tree/bugfix/marker-testcase) looks good to me. Thanks for your work on this!

chisolm commented 5 months ago

@jefftrull Your fix looks good, I really like the logic, it's a clean and simple test.

I'm getting a compiler error on my linux setup when I try to compile and test your test:

test/build/../testwave/fill_boundary.cpp:38:71: error: use of deleted function ‘boost::wave::context<gnu_cxx::__normal_iterator<char*, std::cxx11::basic_string >, boost::wave::cpplexer::lex_iterator<boost::wave::cpplexer::lex_token<> >, boost::wave::iteration_context_policies::load_file_to_string, boost::wave::context_policies::default_preprocessing_hooks>::context(boost::wave::context<gnu_cxx::__normal_iterator<char*, std::cxx11::basic_string >, boost::wave::cpplexer::lex_iterator<boost::wave::cpplexer::lex_token<> >, boost::wave::iteration_context_policies::load_file_to_string, boost::wave::context_policies::default_preprocessing_hooks>&&)’ auto ctx = ctx_t( inp_txt.begin(), inp_txt.end(), "longfile.cpp" );

Plus I get a warning on trigraphs and multiple for a redef of BOOST_FUNCTION_NUM_ARGS.

The most likely culprit is that my build setup is not correct. If it works on yours and will be tested by normal regression tests, I'd be inclined to just accept it. You're the experts here, I'll take your lead on this.

chisolm commented 5 months ago

@jefftrull Ah, I see:

auto ctx = ctx_t( inp_txt.begin(), inp_txt.end(), "longfile.cpp" ); to ctx_t ctx( inp_txt.begin(), inp_txt.end(), "longfile.cpp" );

resolves the issue. Although I don't understand how that can be setup dependent.

jefftrull commented 5 months ago

That's odd, it's treating it as though we were creating a context object and then moving it into place, as opposed to initializing it. What's your compiler version? I am on linux also (I'm 11.4). CI should tell us if there is an issue.

I don't know what the BOOST_FUNCTION_NUM_ARGS issue is but we should certainly avoid it. I wonder if there is a missing include guard somewhere.

chisolm commented 5 months ago

I'm on a ubuntu 18.04 and a gcc 5.4. That does probably explain why it doesn't find the optimization of not moving the initialized object. It makes sense. I had forgot that on the older ubuntu the default gcc packages were really old.

Is there a lower bound on gcc support in boost?

I tested the test with the modified initialization, and it does get the lexing exception. It

jefftrull commented 5 months ago

Well, the CI fails with that issue, on all versions :) I can't explain why it works on my system! Would you mind applying the fix you described?

chisolm commented 5 months ago

Sure, not a problem.

jefftrull commented 5 months ago

@chisolm it looks like there were two separate changes with C++17. One of them made the auto X = Y(...) thing work correctly. The other causes trigraphs inside string literals to not be translated, by default. So for my test to work in all versions of the Standard I have to insert the trigraph one character at a time in the code! See my latest PR.

chisolm commented 5 months ago

I accepted it, you're much more likely than me to know what changes there are in the language spec.

Is the order of approval/acceptance of changes a little off in this setup? Is this just because I initiated a PR off of a fork? Let me know if you'd like to switch this around to a PR you own.

It seems that at least a couple builds failed because of: [errno 2] failed to scan file '': No such file or directory

It doesn't seem to be part of the compile of this file nor the test of this file. On my local build it seems like it occurs before both build and test stages. I tried the following flags to locate it a bit better: ../../b2 --v2 -d2 --debug-configuration variant=debug tool/build test

However I'm not narrowing it down.

chisolm commented 5 months ago

Looks like we just had a successful run. I checked the logs on one run the "failed to scan file" message was just a wild goose chase, I had thought it was a second problem.

jefftrull commented 5 months ago

Yes, agreed, I've seen it before but it doesn't seem to affect the result.

Re: this process, I wanted to be sure to give you credit by having the PR reflect your contribution. Not only did we find the issue you identified, we found two others at the same time. It's remarkable that no one noticed before... This should help a lot with stability for power users with large input files.

chisolm commented 5 months ago

Thanks I appreciate it, I don't get to contribute to open source very often. I got my 2 lines of changes in! :-)