Open TheMatten opened 3 years ago
I'm quite keen generally on separating lexing from parsing and so wondered if we could do alex
+ megaparsec
. The main difficulty is I'm not sure how you lex Haskell style layout in a pleasing fashion (mostly because I have never tried it, I don't think there is a fundamental limitation). Maybe looking at how GHC does it would help.
I'm quite keen generally on separating lexing from parsing and so wondered if we could do alex + megaparsec.
When it comes to lexing, I'm worried about whitespace sensitivity - it doesn't really appear in Haskell98, but does a lot with extensions. It's not a fundamental limitation, but it will make things less nice.
The main difficulty is I'm not sure how you lex Haskell style layout in a pleasing fashion (mostly because I have never tried it, I don't think there is a fundamental limitation)
Looking at GHC's parser to see how they handle indentation vs braces, they seem to manually recognize both cases. https://gitlab.haskell.org/ghc/ghc/-/blob/master/compiler/GHC/Parser.y#L95 sounds horrifying BTW :sweat_smile: Wouldn't we avoid equivalent of shift/reduce conflicts when using combinators? (You just have to replace recursion with fold on list from time to time)
I wonder what to do about operator parsing - would there be any issues with parsing source that uses them as list of tokens and using separate phase to turn those into trees? It sounds more efficient and elegant than reordering left-associative tree as done in GHC.
What is the advantage of doing a separate lexing step if using parser combinators?
Possibly less space for mistakes and no need to handle whitespace in parser I guess - but Haskell is space-sensitive, so I wonder whether it won't be more convenient to do together.
Is there anything I can do to get the ball rolling on the parser?
Feel free to start - I'm slightly busy ATM, so I didn't really start anything myself. BTW, I'm sort of inclined to try multiple approaches (megaparsec
/megaparsec
+alex
/happy
+alex
), so I guess you can just pick one and we can later sort it out. Nice thing about doing toy project is that there's no reason to hurry :smile:
Once #4 is finished, we want to parse it - let's track related problems here:
megaparsec
andalex
+happy
sound like a good fit for our purposes - personally I may be more inclined to the former, though both options have their pros and cons.What about error reporting? I guess we want to be consistent across the compiler, so we'll probably want custom error printer instead of e.g. one in
megaparsec
, which is specifically tied toParsecT
state. This may actually be question worth a separate issue - I'm interested in style similar torustc
:Related to this, we want to somehow print the relevant source in later phases after parsing. Instead of pretty printing AST back, I would propose storing parsed lines of source as they are and storing line indexes along the position in line in spans in AST - we would then print the line itself (maybe truncated if needed), underlined in place of actual error span.