TheLartians / PEGParser

💡 Build your own programming language! A C++17 PEG parser generator supporting parser combination, memoization, left-recursion and context-dependent grammars.
BSD 3-Clause "New" or "Revised" License

Documentation and a couple questions #62

Open ethindp opened 1 year ago

ethindp commented 1 year ago

So this library looks really cool and I'd love to use it. However, I'm unsure about the syntax: I know the general PEG syntax, but what extensions, if any, does this library have, and how does its syntax differ from normal PEG (which is really just an extension of ABNF/EBNF)? Also, how does it cope with Unicode? A language I'm struggling to write a compiler for requires Unicode for identifiers, so I need some way of handling that without destroying the world in the process. :D

The examples look pretty neat, but they aren't really enough in terms of describing how the parser works or its limitations. For example, how is whitespace handled? Is it an implicit thing? (The examples would have me believe that this is the case, but it's worth asking here.)

TheLartians commented 6 months ago

Hey, thanks for raising the issue (somehow I'm only seeing this now). I definitely agree that this project (which was one of my first) needs way better documentation. Unfortunately I'm currently very short on time for open-source projects, so I need to prioritise projects with more active users.

As for Unicode support, I'm pretty sure this library only supports parsing one byte at a time, so if you need Unicode you would have to add your own support as a specialised character parser. For UTF-8 encoded strings I think Unicode will implicitly work anyway for many use cases, but there are probably a bunch of edge cases that I'm not considering atm.
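
To make the "specialised character parser" idea a bit more concrete, here is a rough, library-independent sketch of decoding one UTF-8 code point at a given byte offset and using that to measure an identifier. None of these names (`DecodedCodePoint`, `decodeUtf8`, `isIdentifierStart`, `isIdentifierContinue`, `matchIdentifier`) exist in PEGParser; they are made up for illustration, and how such a function would be hooked into a grammar is not shown. The identifier rule is also a deliberate simplification that accepts every non-ASCII code point, whereas a real language would more likely consult Unicode property tables.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper type (not part of PEGParser): one decoded code point.
struct DecodedCodePoint {
  char32_t value;      // the Unicode scalar value
  std::size_t length;  // number of bytes it occupied (1 to 4)
};

// Decodes the UTF-8 sequence starting at byte offset `pos`.
// Returns std::nullopt at end of input or for malformed/truncated sequences.
std::optional<DecodedCodePoint> decodeUtf8(std::string_view input, std::size_t pos) {
  if (pos >= input.size()) return std::nullopt;
  const unsigned char lead = static_cast<unsigned char>(input[pos]);
  std::size_t length = 0;
  char32_t value = 0;
  if (lead < 0x80) {
    length = 1; value = lead;         // plain ASCII
  } else if ((lead & 0xE0) == 0xC0) {
    length = 2; value = lead & 0x1F;  // 110xxxxx
  } else if ((lead & 0xF0) == 0xE0) {
    length = 3; value = lead & 0x0F;  // 1110xxxx
  } else if ((lead & 0xF8) == 0xF0) {
    length = 4; value = lead & 0x07;  // 11110xxx
  } else {
    return std::nullopt;              // stray continuation or invalid lead byte
  }
  if (pos + length > input.size()) return std::nullopt;  // truncated sequence
  for (std::size_t i = 1; i < length; ++i) {
    const unsigned char next = static_cast<unsigned char>(input[pos + i]);
    if ((next & 0xC0) != 0x80) return std::nullopt;      // must be 10xxxxxx
    value = (value << 6) | (next & 0x3F);
  }
  // Reject overlong encodings, UTF-16 surrogates, and out-of-range values.
  static constexpr char32_t minValue[] = {0, 0, 0x80, 0x800, 0x10000};
  if (value < minValue[length]) return std::nullopt;
  if (value > 0x10FFFF || (value >= 0xD800 && value <= 0xDFFF)) return std::nullopt;
  return DecodedCodePoint{value, length};
}

// Simplified assumption: ASCII letters, '_' and ANY non-ASCII code point
// may start an identifier.
bool isIdentifierStart(char32_t c) {
  return c == U'_' || (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z') || c >= 0x80;
}

// Identifier continuation additionally allows ASCII digits.
bool isIdentifierContinue(char32_t c) {
  return isIdentifierStart(c) || (c >= U'0' && c <= U'9');
}

// Returns the byte length of the identifier at the start of `input`,
// or 0 if the input does not begin with an identifier.
std::size_t matchIdentifier(std::string_view input) {
  std::size_t pos = 0;
  while (auto cp = decodeUtf8(input, pos)) {
    const bool ok = (pos == 0) ? isIdentifierStart(cp->value)
                               : isIdentifierContinue(cp->value);
    if (!ok) break;
    pos += cp->length;
  }
  return pos;
}
```

This also hints at why byte-wise parsing often "just works" for UTF-8: every byte of a multi-byte sequence is 0x80 or above, so ASCII delimiters such as spaces or quotes never appear inside an encoded character. The edge cases start when a grammar needs to reason about individual non-ASCII characters, for example in character classes.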

Whitespace (or any other separator symbol) can be set as a valid token that may be parsed between any two rules (essentially transforming the grammar). This still needs to be set explicitly by calling g.setSeparator(<rule>) on the grammar object, where <rule> defines a parser rule for whitespace characters, e.g. g["Whitespace"] << "[\t ]".
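
As a rough illustration of what "transforming the grammar" means here, the following standalone sketch matches a sequence of literal tokens and optionally skips separator characters between them, so a sequence `A B` effectively behaves like `A Separator* B`. This is not PEGParser's implementation; `skipSeparators` and `matchSequence` are hypothetical names, and the sketch only allows separators between tokens, not before the first or after the last one.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: advances `pos` past any characters listed in `separators`.
std::size_t skipSeparators(std::string_view input, std::size_t pos,
                           std::string_view separators) {
  while (pos < input.size() &&
         separators.find(input[pos]) != std::string_view::npos) {
    ++pos;
  }
  return pos;
}

// Matches `tokens` in order against the whole of `input`.
// Separator characters are skipped between consecutive tokens only.
bool matchSequence(std::string_view input,
                   const std::vector<std::string_view>& tokens,
                   std::string_view separators) {
  std::size_t pos = 0;
  for (std::size_t i = 0; i < tokens.size(); ++i) {
    if (i > 0) pos = skipSeparators(input, pos, separators);
    if (input.substr(pos, tokens[i].size()) != tokens[i]) return false;
    pos += tokens[i].size();
  }
  return pos == input.size();  // succeed only if everything was consumed
}
```

With the separator set "\t " the input "1 +\t2" matches the tokens 1, +, 2, while with an empty separator set the same input is rejected and only "1+2" matches. That mirrors why the separator has to be set explicitly: without one, whitespace is just another unexpected character.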

Hope this is still somehow relevant to you or whoever stumbles upon it. Good luck with your compiler!