commonmark / cmark

CommonMark parsing and rendering library and program in C
Other
1.64k stars 546 forks source link

Consider writing scanners by hand #167

Open jgm opened 7 years ago

jgm commented 7 years ago

Currently we generate a number of scanners from regexes using re2c.

This has two advantages:

  1. It is easy to see at a glance what the scanners do, and easy to modify them.
  2. The scanners generated are DFAs and should perform well. Since re2c is well tested, this also lowers testing burden.

Disadvantage: Either we require re2c as a build dependency, or we have to ship a gigantic scanners.c file.

Should we simply hand-write the scanners and dispense with re2c and scanners.re?

Knagis commented 7 years ago

If you decide to rewrite them by hand, perhaps my code could be useful:

https://github.com/Knagis/CommonMark.NET/blob/master/CommonMark/Parser/Scanner.cs https://github.com/Knagis/CommonMark.NET/blob/master/CommonMark/Parser/ScannerHtmlTag.cs

I did rewrite them by hand, it wasn't the hardest part when porting the parser, as for the maintenance, I haven't had any problems with the few changes that were required so far.

tin-pot commented 7 years ago

Would flex provide a better trade-off between convenience and amount of generated source code? (IIRC flex uses tables to represent DFAs, not thousands of if () goto lines.)

"flex - The Fast Lexical Analyser (GitHub)"

But generally hand-coded scanners are IMO preferrable, for size and configurability reasons (ie, such scanning is easier to influence by command-line options to the cmark executable etc.). But this can still be done when the syntax is stable and "frozen".

kainjow commented 7 years ago

I say go for it. Doesn't need to be all or nothing. Could just do a few functions at a time. May be some possible speed improvements as well.

MathieuDuponchelle commented 7 years ago

What is your exact concern with shipping this scanners.c file ? My only concern about this would be version control, but I'm pretty sure there are ways around it.

jgm commented 7 years ago

+++ Mathieu Duponchelle [Dec 05 16 04:41 ]:

What is your exact concern with shipping this scanners.c file ? My only concern about this would be version control, but I'm pretty sure there are ways around it.

It's not a big problem. But it's a very big source file. Some people object to that, it seems (see the md4c announcement).

(There's also the problem that it's a generated file in version control, but that hasn't been a big problem so far.)

MathieuDuponchelle commented 7 years ago

@jgm, unless there's an actual technical reason to do so, I really don't see why we should do that to be honest. If it took multiple gigs of memory to compile this file for example, that could be a compelling reason, but I don't see the point in "compactness" for the sake of it, I'd much rather have easy-to-read sources to be honest.

tin-pot commented 7 years ago

While I find the ~30000 lines (~400 KB) sized scanners.c a bit hefty, the tools do handle it easily. I once looked into a linker map file and found that the scanners.c contributed around 50 KB to the executable size, IIRC.

Re-generating this file (and comitting it) has never bothered me.

So all in all, I don't loose sleep over it, at least not for the time being ;-)

nwellnhof commented 7 years ago

There's also issue #121 but this is caused by a GCC bug. The biggest problem I see is that re2c generates needlessly repetitive code for quantifiers like {1,31} which is the main cause of the large object code size.

jgm commented 7 years ago

I regenerated scanners.c using re2c 0.16, which includes some dfa minimization. That cut off 65K+ from the generated source (though it's still pretty big).