Open jgm opened 7 years ago
If you decide to rewrite them by hand, perhaps my code could be useful:
https://github.com/Knagis/CommonMark.NET/blob/master/CommonMark/Parser/Scanner.cs https://github.com/Knagis/CommonMark.NET/blob/master/CommonMark/Parser/ScannerHtmlTag.cs
I did rewrite them by hand, and it wasn't the hardest part of porting the parser. As for maintenance, I haven't had any problems with the few changes that have been required so far.
Would flex provide a better trade-off between convenience and amount of generated source code? (IIRC flex uses tables to represent DFAs, not thousands of if () goto lines.)
"flex - The Fast Lexical Analyser (GitHub)"
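To make the tables-versus-goto point concrete, here is a minimal sketch of a table-driven DFA in C. It is not flex output, and flex's real tables are compressed and laid out differently; the pattern `[A-Za-z]+[0-9]*`, the state names and the function `scan_table` are all made up for illustration.

```c
#include <stddef.h>

/* Character classes and states are hypothetical, chosen for this sketch only. */
enum { C_ALPHA, C_DIGIT, C_OTHER, N_CLASSES };
enum { S_START, S_NAME, S_NUM, S_FAIL, N_STATES };

static int char_class(unsigned char c) {
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) return C_ALPHA;
    if (c >= '0' && c <= '9') return C_DIGIT;
    return C_OTHER;
}

/* Transition table for the pattern [A-Za-z]+[0-9]* */
static const unsigned char transitions[N_STATES][N_CLASSES] = {
    /* S_START */ { S_NAME, S_FAIL, S_FAIL },
    /* S_NAME  */ { S_NAME, S_NUM,  S_FAIL },
    /* S_NUM   */ { S_FAIL, S_NUM,  S_FAIL },
    /* S_FAIL  */ { S_FAIL, S_FAIL, S_FAIL },
};

/* Which states count as a successful match. */
static const unsigned char accepting[N_STATES] = { 0, 1, 1, 0 };

/* Returns the length of the longest match at the start of s, or 0 if none. */
size_t scan_table(const char *s, size_t len) {
    int state = S_START;
    size_t matched = 0;
    for (size_t i = 0; i < len; i++) {
        state = transitions[state][char_class((unsigned char)s[i])];
        if (state == S_FAIL) break;
        if (accepting[state]) matched = i + 1;
    }
    return matched;
}
```

The loop stays the same size no matter how many states the pattern needs; only the table grows. Generated goto-style code instead emits a block of branches per state.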
But generally hand-coded scanners are IMO preferable, for size and configurability reasons (i.e., such scanning is easier to influence by command-line options to the cmark executable etc.). But this can still be done when the syntax is stable and "frozen".
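As a rough sketch of what "easier to influence by options" could look like, here is a hand-written scanner that takes a runtime setting. The `scan_options` struct, its `max_heading_level` field and the function `scan_atx_heading` are hypothetical and not part of cmark's actual API; the rule is a simplified ATX heading opener.

```c
#include <stddef.h>

/* Hypothetical runtime options; cmark does not define this struct. */
struct scan_options {
    int max_heading_level; /* e.g. 6 for standard behaviour */
};

/* Simplified ATX heading opener: a run of '#' followed by a space, tab,
 * newline, or end of input. Returns the heading level, or 0 if the line
 * is not a heading under the given options. */
int scan_atx_heading(const char *s, size_t len,
                     const struct scan_options *opts) {
    size_t i = 0;
    while (i < len && s[i] == '#') i++;
    if (i == 0 || i > (size_t)opts->max_heading_level) return 0;
    if (i < len && s[i] != ' ' && s[i] != '\t' && s[i] != '\n') return 0;
    return (int)i;
}
```

With a generated scanner the limit would be baked into the regex, so changing it means regenerating the file; here it is just a field read at run time.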
I say go for it. Doesn't need to be all or nothing. Could just do a few functions at a time. May be some possible speed improvements as well.
What is your exact concern with shipping this scanners.c file? My only concern would be version control, but I'm pretty sure there are ways around it.
+++ Mathieu Duponchelle [Dec 05 16 04:41 ]:
> What is your exact concern with shipping this scanners.c file? My only concern about this would be version control, but I'm pretty sure there are ways around it.
It's not a big problem. But it's a very big source file. Some people object to that, it seems (see the md4c announcement).
(There's also the problem that it's a generated file in version control, but that hasn't been a big problem so far.)
@jgm, unless there's an actual technical reason to do so, I really don't see why we should do that. If it took multiple gigabytes of memory to compile this file, for example, that could be a compelling reason, but I don't see the point in "compactness" for its own sake. I'd much rather have easy-to-read sources.
While I find the scanners.c file, at ~30000 lines (~400 KB), a bit hefty, the tools do handle it easily. I once looked into a linker map file and found that scanners.c contributed around 50 KB to the executable size, IIRC.
Re-generating this file (and committing it) has never bothered me.
So all in all, I don't lose sleep over it, at least not for the time being ;-)
There's also issue #121, but this is caused by a GCC bug. The biggest problem I see is that re2c generates needlessly repetitive code for quantifiers like {1,31}, which is the main cause of the large object code size.
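For comparison, a hand-written scanner can handle a bounded quantifier like {1,31} with a single counter instead of unrolled states. A minimal sketch, assuming a URI-scheme-like pattern `[A-Za-z][A-Za-z0-9.+-]{1,31}:`; the pattern and the name `scan_scheme` are illustrative and not copied from scanners.re.

```c
#include <stddef.h>

static int is_alpha(unsigned char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

static int is_scheme_char(unsigned char c) {
    return is_alpha(c) || (c >= '0' && c <= '9') ||
           c == '.' || c == '+' || c == '-';
}

/* Matches [A-Za-z][A-Za-z0-9.+-]{1,31}: at the start of s.
 * Returns the match length including the colon, or 0 on failure. */
size_t scan_scheme(const char *s, size_t len) {
    size_t i, count = 0;
    if (len == 0 || !is_alpha((unsigned char)s[0])) return 0;
    i = 1;
    /* The counter enforces the upper bound of 31 repetitions. */
    while (i < len && count < 31 && is_scheme_char((unsigned char)s[i])) {
        i++;
        count++;
    }
    if (count < 1) return 0;                  /* lower bound of {1,31} */
    if (i < len && s[i] == ':') return i + 1;
    return 0;
}
```

The bound lives in one comparison, so changing 31 to another limit does not change the amount of code.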
I regenerated scanners.c using re2c 0.16, which includes some DFA minimization. That cut 65K+ from the generated source (though it's still pretty big).
Currently we generate a number of scanners from regexes using re2c.
This has two advantages:
Disadvantage: Either we require re2c as a build dependency, or we have to ship a gigantic scanners.c file.
Should we simply hand-write the scanners and dispense with re2c and scanners.re?