dlang-community / Pegged

A Parsing Expression Grammar (PEG) module, using the D programming language.
534 stars 66 forks source link

Can we create a grammar induction algorithm based upon expected Pegged parsing failures? #320

Closed enjoysmath closed 1 year ago

enjoysmath commented 1 year ago

I don't know much about grammar induction, though I do know that it would be very useful. For example we could take a properly typeset math document like a Lang's Algebra written in the driest English-math language, and automatically create a grammar for it. Reason: writing a 5000 line grammar would be error-prone and take years of spare time.

I was thinking, you start out with a grammar that looks for "." delimited sentences. Maybe some number regex is present, etc. Just the basic commonly used things that you don't want to overburden the induction algorithm with.

Then upon a failure node in the pegged ParseTree, you simply "run the algorithm again, recursively" on the appropriate texts. Of course you'd use rdmd to recompile the grammar, and hopefully that isn't terribly slow at run-time. Any further ideas of how this algorith might work?

For example if A -> B / C / D is an Or-based rule, and the parser runs into a failure at A, then you "induct a another subgrammar with base variable E", then you fixup the prior grammar to be A -> B / C / D / E.

I know this is out-of-scope for pegged, but it would also be nice if it just worked this way out-of-the-box. Again, you provide a simple grammar of stuff you know you'll have in the final grammar, then run the induction algorithm on a large body of text.

Also, since each construct in the grammar will need a D-structure/class backend to "do things with", the inductor should also generate these classes in an organized way. And since the total grammar would be 1-5 thousand lines, you'd break it up into a chunk for each class, and use asModule appropriately.

What are your thoughts on this?

enjoysmath commented 1 year ago

If running rdmd adds a huge overhead, then you try to "find many failures in one pass" and modify the grammar at many places before each compilation.

veelo commented 1 year ago

Sounds like you want a system that learns a language just by reading a lot of it. I have nothing to say on that subject.