Hejsil / mecha

A parser combinator library for Zig
MIT License
413 stars 18 forks source link

How to parse indents? #37

Closed notramo closed 2 years ago

notramo commented 2 years ago

I want to parse a format where indents are used instead of brackets, like in Nim or Python. What's the best approach to this task using this library?

Considering later stages of the parsing (AST generation), probably it would be the easiest approach to have an Indent and a Dedent token type. If one line has 3 levels of indentation, then the next only has 2 levels, then it would generate a Dedent token before the first token on the next line. This will probably need some state machine which tracks the level of the last indentation. Is this library capable of doing this?

Hejsil commented 2 years ago

Hmmm. I've never really written an indent sensitive parser, since it sounds like a pain, so I'm not sure what techniques are used.

Had a look around, and other similar parser libraries (PEG and parser combinators are closely related) have had similar discussions on this. Have a look: https://github.com/pegjs/pegjs/issues/217

It doesn't seem like mecha would be the best fit for this

notramo commented 2 years ago

What about using a converter function to push/pop from a state and emit the respective tokens? Is it reliable, or the order of function execution could mess the state machine up?

Another idea is to parse all space at the start of the line to an PrefixSpace token, which includes ^\s*, and do post processing on the tokens to swap the PrefixSpace tokens out to Indent or Dedent.

Hejsil commented 2 years ago

:shrug:

One thing to keep in mind, is that mecha basically does backtracking. If one branch fails to parse, the next one is tried. You would need functions that make sure the state machine is restored to some sensible state before mecha tries to parse a different "branch" of syntax.

Edit: To be clear, it is not really mecha per say, that "backtracks", but that is how the mecha.oneOf function does what it does. It tries all parses, one after another, until one succeeds.

notramo commented 2 years ago

I think parsing it to an Indent token, then post processing it is sufficient.

Thank you!