kanielc closed this issue 10 years ago
I've made a small modification as I merged so that the check uses the pre-identified structures list; see 5d57477db32f2671b1197de861c60479934cfcda.
The performance benefit is likely negligible because of cache coherency, as we discussed before, but if we already have that information readily available there's no need to scan over the whole stream in search of them.
I imagine there may be significant performance boosts for some of the deeper, darker files within Boost, where the source files are several megabytes in size and contain hundreds of thousands of tokens.
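For the curious, the swap amounts to something like the following minimal sketch; the Structure record, its fields, and checkTokens are illustrative stand-ins rather than the actual types:

#include <cstddef>
#include <vector>

struct Token { /* lexed token contents */ };

// Stand-in for a pre-identified class/struct/union record.
struct Structure {
  std::size_t begin;  // index of the struct-like keyword in the token stream
  std::size_t end;    // index one past the closing brace
};

void checkTokens(const std::vector<Token>& tokens,
                 std::size_t begin, std::size_t end);

// Before: the outer loop scanned every token looking for class/struct/union.
// After: jump straight to the spans the parser has already recorded.
void runCheck(const std::vector<Token>& tokens,
              const std::vector<Structure>& structures) {
  for (const auto& s : structures) {
    checkTokens(tokens, s.begin, s.end);
  }
}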
Yeah, I realized it made sense when I noticed the versions taking the structure, but was too tired to keep going ;)
That's fair enough; it was a simple change of the outer loop, given how you used the nice and clean std::find functions.
Just timed linting Boost and got an average of 32.190 seconds using the structures list vs. 35.147 seconds without, so there are a couple of seconds squeezed out of the swap.
I noticed two outputs to stderr as the linter was going; I was wondering if you could confirm that you get them too. I'm going to have a little hunt through Boost now to see if it's obvious why the tokenizer fails for these two files.
Exception thrown during checks on .\boost\spirit\home\classic\utility\rule_parser.hpp.
Misplaced backslash in .\boost\spirit\home\classic\utility\rule_parser.hpp:422
Exception thrown during checks on .\boost\type_erasure\tuple.hpp.
Invalid character: ` in .\boost\type_erasure\tuple.hpp:571
Yeah, I get those exceptions too.
The error in rule_parser.hpp appears to be because of horrendous abuse of preprocessor directives... I can't imagine anyone else in their right mind trying to lint this style of code.
Again, the error in tuple.hpp seems to be related to preprocessor macros, this time where a macro parameter is a piece of syntax: a back-tick.
Perhaps these types of single-character token errors could be prevented if we introduced a new token type, TK_UNEXPECTED, which is placed into the stream, allowing parsing to continue. We could still output to stderr, but as a heads-up rather than a full-blown exception.
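Something along these lines, perhaps; a rough sketch where the enum value, the Token layout, and the helper name are only illustrative, not the real tokenizer's:

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

enum TokenType { TK_IDENTIFIER, TK_NUMBER, /* ... */ TK_UNEXPECTED };

struct Token {
  TokenType type;
  std::string value;
  std::size_t line;
};

// Instead of throwing, record the bad character as TK_UNEXPECTED and move on.
void handleUnknownChar(char c, std::size_t line, std::vector<Token>& out) {
  std::cerr << "Unexpected character '" << c << "' at line " << line << "\n";
  out.push_back({TK_UNEXPECTED, std::string(1, c), line});
  // Tokenizing continues; downstream checks can skip or flag TK_UNEXPECTED.
}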
That could work. It may cause us to miss a few cases of sequences matching some valid pattern (if the UNEXPECTED token were ignored), but seeing that we wouldn't have caught those anyway, it's a fair trade.
Looking at the tokenizer, I just realized that in the case of a \ line continuation for a preprocessor directive, it requires the backslash to be immediately followed by a newline character. On line 422 of rule_parser.hpp there are 3 space characters after the backslash, which causes the error.
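If we wanted to tolerate that, the continuation check could skip trailing whitespace before the newline; a hypothetical sketch over a plain character buffer rather than flint's actual reader:

#include <cstddef>
#include <string>

// True if the backslash at index i acts as a line continuation, tolerating
// stray spaces/tabs between the backslash and the newline (rule_parser.hpp:422).
bool isLineContinuation(const std::string& src, std::size_t i) {
  std::size_t j = i + 1;
  while (j < src.size() && (src[j] == ' ' || src[j] == '\t')) {
    ++j;  // skip the stray whitespace after the backslash
  }
  return j < src.size() && src[j] == '\n';
}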
This just leaves the other error in tuple.hpp, which seems to be an issue with arbitrary syntax being handled via macros. I'm not sure there is a lot that can be done about that, sadly.
It appears the back-tick character isn't actually a valid character under the C++ standard, so this is an intentional error raised by the following tokenizer case:
case '`':
  FBEXCEPTION("Invalid character: " + string(1, c) + " in " +
              string(file + ":" + to_string(line)));
  break;
I have a feeling we can deal with this more elegantly. Perhaps we could pass the errorFile object to the tokenizer and give it the ability to add an error to the output.
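Roughly what I have in mind; a sketch only, where ErrorFile and addError are made-up names for whatever interface we settle on:

#include <cstddef>
#include <string>
#include <vector>

// Hypothetical error sink the tokenizer could write to instead of throwing.
struct ErrorFile {
  std::vector<std::string> errors;
  void addError(const std::string& msg) { errors.push_back(msg); }
};

// The back-tick case, rewritten to record the error and keep tokenizing.
void reportInvalidChar(ErrorFile& errorFile, char c,
                       const std::string& file, std::size_t line) {
  errorFile.addError("Invalid character: " + std::string(1, c) +
                     " in " + file + ":" + std::to_string(line));
}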
It's a bit strange that the code compiles but is in gross violation of the standard. I agree though; it's better to note it in the errors than to just throw a vague, file-fatal exception.
I've made the changes to the tokenizer; Boost now lints entirely, and we issue an Error in the output for the invalid character.
Directly navigated the tokens (I didn't use the structures of classes/structs/unions, as I saw that afterwards). Navigating directly seemed simple enough, I guess. Also updated the test file (now that we've sorted it).