dlang-community / Pegged

A Parsing Expression Grammar (PEG) module, using the D programming language.
533 stars 66 forks source link

is it okay for the D grammar to compile in 10 minutes and use 45 gigs of ram? #294

Open exoticus opened 4 years ago

exoticus commented 4 years ago

Compiling the dgrammer.d takes close to 10 minutes and apparently leaks memory like crazy, with dmd taking 45 gigs of ram

I am compiling on osx and here're the stats

Ram: image

Pegged Version:

0.4.4

TIme:

Executed in  636.23 secs   fish           external 
   usr time  599.07 secs   87.00 micros  599.07 secs 
   sys time   21.22 secs  735.00 micros   21.21 secs 

dmd/dub Version:

DMD64 D Compiler v2.092.1: DUB version 1.21.0
veelo commented 4 years ago

What version of Pegged are you using? If it is a beta, can you compare against v0.4.4 please?

exoticus commented 4 years ago

@veelo i'm on 0.4.4 already

For completeness sake, i tried 0.4.5-beta.2 and ended up with the same behavior, note i am not doing anything fancy, just compiling dgrammer.d

veelo commented 4 years ago

note i am not doing anything fancy, just compiling dgrammer.d

This one? https://github.com/PhilippeSigaud/Pegged/blob/master/examples/dgrammar/src/pegged/examples/dgrammar.d

I appended

mixin(grammar(Dgrammar));

unittest
{
    import std.stdio;
    writeln(D(`enum E = "Hello";`));
}

and ran dub test, and it took indeed ca. 5 minutes and 25GB. If your example is longer, I can imagine resource usage will be more like what you are seeing.

But this generates the parser at compile-time, which is probably the reason why this takes so long. With bigger grammars it is better to use asModule, which I will try next.

veelo commented 4 years ago

OK, asModule (after the model of the extended_pascal example) does not help with the time consumption. But it allows to take an extra -lowmem compiler option to take effect, which keeps memory usage around 1GB, at the cost of an increased compilation time of 20 minutes! I don't know if that's worth it.

I don't really know what we should do with this example. As it is now, the example is incomplete and merely defines a string enum, and there is not much point to test this in CI. I have no idea why this example is so demanding, extended_pascal is just marginally smaller, and finishes building in 4 seconds.

FYI regarding your observation "leaks memory like crazy": this is considered a feature of the compiler, which refrains from freeing memory by default. The -lowmem option changes that.

exoticus commented 4 years ago

@veelo

extended_pascal is just marginally smaller, and finishes building in 4 seconds.

oh, i thought that like dgrammar.d any large grammar would be like that, tbh it's unusable, any idea where to even start investigating why this happens with this particular grammar, if that's the case?

this is considered a feature of the compiler, which refrains from freeing memory by default. The -lowmem option changes that.

My bad, dmd taking all that time made me think that there's something somehow stuck and using memory, want to note too that the resulting executable is 20Mb, isn't that bizarre?

veelo commented 4 years ago

Yes, something is going on here that would be interesting to investigate. However, I myself don't consider it worth my time. I think this example grew out of a curiosity of Philippe to see how far the capabilities of Pegged stretch. It was never complete, is now out of date, and also has some extra experiments (such as macros). It is uncertain to me whether the complete D grammar can in fact be accurately defined as a PEG. And if possible, what point is there now that dmd can be used as a library? In my eyes resources are better spent on improving things for grammars that are in actual use, if necessary.

Pegged is very heavy on templates, and it is well known that templates can result in considerable bloat in the executable. There are individuals that think they can develop a better template engine in the compiler (Stefan Koch) but as usual the problem is time (funding).

Pegged itself also bears signs of unfinished experiments and unneeded complexity, possibly things can be improved a bit by a cleanup.

exoticus commented 4 years ago

@veelo really appreciate the input, when i asked that something might be wrong i meant a problem that prevailed in the dgrammer that will probably reveal itself again once another grammar gets big enough, or maybe hits the same combination that triggered this behavior, but understand that if it's an issue with this particular grammar alone as a result of experimentation it shouldn't be a priority. I am just new to pegged, fetched out that grammar as it's the biggest one so matching my planned grammar in terms of size, yet i was surprised of the time/memory, thought this is how it would get with any other big grammar

veelo commented 4 years ago

Your concerns are legitimate, and there is no guarantee that you won't run into problems. As usual with volunteer-run Open Source projects, you may have to put in an unexpected amount of extra work, but at least you have the possibility to do that. I ran into a fundamental problem myself (left-recursion) and had to do what felt like moving a mountain, but succeeded at last. Now left-recursion is not a problem anymore. I ended up looking after Pegged as a maintainer, but it is just one link in a chain of tools that I use and hence does not receive my full attention.

Out of curiosity, is your planned grammar that of an existing language, or will you be developing it as you go? PEGs are not a good match for every language, due to its use of prioritised choice. Pegged does support a longest match alternator to combat that limitation, but that breaks the linear time promise of a pure PEG.

exoticus commented 4 years ago

@veelo totally understand the oss and that no maintainer is obligated to do nothing if he/she doesn't feel like doing it, i was just sharing a concern nothing more, and if i get deep enough into pegged i'll be happy to contribute back, about the language, It's a C-like language that i am developing, the reason i picked PEGs is that coding recursive decent parsers gets really tedious after a while, and you end up with so much moving parts that even small changes become a hassle, so i started looking for a better, context free grammars in general didn't fit because more often than not they have to deal with ambiguity in weird ways, but PEGs on the other hand are recursive decent naturally. One more thing that made me consider it and feel safe that i am not making a stupid choice is that Python is moving to PEG for it's parser/lexer/ast combo and the author made a pretty nice argument for that decision. So yeah will see how it goes

veelo commented 4 years ago

Interesting reference, nice to see they tackled the left-recursion challenge as well. Since you are developing a language as a PEG, I bet you'll have a much better experience than trying to convert an existing language into a PEG. Welcome on board! (The D language specification is reverse engineered from a manually written lexer+parser, and converting that into a PEG is a different game.)

munael commented 10 months ago

Why is PEG unsuitable for the D grammar?

It is uncertain to me whether the complete D grammar can in fact be accurately defined as a PEG.

I'm new to parser generators and combinators, and to PEGs too. My understanding is that the theory behind it models Recursive Descent parsers. Can you share your thoughts from that the time on where in the D grammar a PEG would fail? I've also been looking at the DMD implementation of the lexer/parser but it's quite big.

Thank you :)

cc @veelo

veelo commented 10 months ago

Hi @munael.

I think delimited strings is one aspect of the D grammar that is impossible to capture in a typical PEG parser generator:

string hello = q"HELLO
This is a HELLO delimited string.
Double quotes " may be unbalanced!
HELLO";

This is because the delimiter is unknown until parsing is halfway through the expression to be parsed.

However, Pegged is not typical, and I recently found out that you can use a user defined parser for these kinds of language constructs: #342. This is cool!

So nowadays I would nuance myself and say that it may be possible to create a Pegged generated parser for the entire D language, but I can't be certain until somebody does it. Another aspect is mixins, which may be tricky.

However, Pegged is currently not implemented optimally regarding speed, and trumping DMD in terms of speed is a tall order. So if anybody would be interested in working on this, the motivation would be something else than speed. And then there will always be the issue of keeping it up to date.

munael commented 10 months ago

@veelo

That's a good example 👀 Is it the case that the PEG formalism supports only stateless parsers?

Semantic actions suggest no, but semantic actions act after rules and can't create their own captures as far as I understand it. The implementation may support it, but it doesn't look like a proper PEG.

I would've expected delimited strings to be handled at the lexer level, to be honest. A custom parser could act as one, but what would the pre-parsed tokens in a delimited string look like?

veelo commented 10 months ago

That's a good example 👀 Is it the case that the PEG formalism supports only stateless parsers?

Yes. https://scg.unibe.ch/assets/download/lectures/cc/CC-09-PEGs.pdf

Semantic actions suggest no, but semantic actions act after rules and can't create their own captures as far as I understand it.

Correct.

I would've expected delimited strings to be handled at the lexer level, to be honest. A custom parser could act as one, but what would the pre-parsed tokens in a delimited string look like?

A PEG parser is a lexer and parser in one, so there is no lexer level.