Consider writing scanners by hand

commonmark / cmark

CommonMark parsing and rendering library and program in C

Other

1.64k stars 546 forks source link

Consider writing scanners by hand #167

Open jgm opened 7 years ago

jgm commented 7 years ago

Currently we generate a number of scanners from regexes using re2c.

This has two advantages:

It is easy to see at a glance what the scanners do, and easy to modify them.
The scanners generated are DFAs and should perform well. Since re2c is well tested, this also lowers testing burden.

Disadvantage: Either we require re2c as a build dependency, or we have to ship a gigantic scanners.c file.

Should we simply hand-write the scanners and dispense with re2c and scanners.re?

Knagis commented 7 years ago

If you decide to rewrite them by hand, perhaps my code could be useful:

https://github.com/Knagis/CommonMark.NET/blob/master/CommonMark/Parser/Scanner.cs https://github.com/Knagis/CommonMark.NET/blob/master/CommonMark/Parser/ScannerHtmlTag.cs

I did rewrite them by hand, it wasn't the hardest part when porting the parser, as for the maintenance, I haven't had any problems with the few changes that were required so far.

tin-pot commented 7 years ago

Would flex provide a better trade-off between convenience and amount of generated source code? (IIRC flex uses tables to represent DFAs, not thousands of if () goto lines.)

"flex - The Fast Lexical Analyser (GitHub)"

But generally hand-coded scanners are IMO preferrable, for size and configurability reasons (ie, such scanning is easier to influence by command-line options to the cmark executable etc.). But this can still be done when the syntax is stable and "frozen".

kainjow commented 7 years ago

I say go for it. Doesn't need to be all or nothing. Could just do a few functions at a time. May be some possible speed improvements as well.

MathieuDuponchelle commented 7 years ago

What is your exact concern with shipping this scanners.c file ? My only concern about this would be version control, but I'm pretty sure there are ways around it.

jgm commented 7 years ago

+++ Mathieu Duponchelle [Dec 05 16 04:41 ]:

What is your exact concern with shipping this scanners.c file ? My only concern about this would be version control, but I'm pretty sure there are ways around it.

It's not a big problem. But it's a very big source file. Some people object to that, it seems (see the md4c announcement).

(There's also the problem that it's a generated file in version control, but that hasn't been a big problem so far.)

MathieuDuponchelle commented 7 years ago

@jgm, unless there's an actual technical reason to do so, I really don't see why we should do that to be honest. If it took multiple gigs of memory to compile this file for example, that could be a compelling reason, but I don't see the point in "compactness" for the sake of it, I'd much rather have easy-to-read sources to be honest.

tin-pot commented 7 years ago

While I find the ~30000 lines (~400 KB) sized scanners.c a bit hefty, the tools do handle it easily. I once looked into a linker map file and found that the scanners.c contributed around 50 KB to the executable size, IIRC.

Re-generating this file (and comitting it) has never bothered me.

So all in all, I don't loose sleep over it, at least not for the time being ;-)

nwellnhof commented 7 years ago

There's also issue #121 but this is caused by a GCC bug. The biggest problem I see is that re2c generates needlessly repetitive code for quantifiers like {1,31} which is the main cause of the large object code size.

jgm commented 7 years ago

I regenerated scanners.c using re2c 0.16, which includes some dfa minimization. That cut off 65K+ from the generated source (though it's still pretty big).