Question: proper way of matching unicode categories?

ethindp commented 1 year ago

So I'm writing an Ada parser (I've tried a few different tools) and I came across this (thank you, This Week in Rust) and I found it fascinating (talk about an inventive way of writing a parser). The Ada grammar uses Unicode GCs for matching identifiers. (It also has custom separators, some of which aren't in parsel::ast::token, so I'm unsure how to match those). I have a couple options for doing this:

I have a custom lexer already. But I'm unsure how I'd incorporate this into parsel. I could just use that.
I could use embedded Unicode tables. (This would be rather annoying, however, since I'd have to expand it, which could get costly, very quickly.)

Are there any other options or suggestions that I could try? Or, if not, how could I accomplish the incorporation of my Lexer? (It defines a custom Token struct, but I could probably transform it into something that Parsel would like.)

H2CO3 commented 1 year ago

Hey there!

If I understand correctly, this is a lexer-level problem. In which case I would think that solution 1 should work. Parsel is only concerned with parsing, not lexing, as it operates on an already lexed and pre-parsed representation of strings, a TokenStream.

But I'm unsure how I'd incorporate this into parsel.

proc_macro2::TokenStream is the output of the Rust lexer. You'd have to make your lexer produce such a token stream, which can then be fed to Parsel. The only interesting part of this is that you'll have to match parentheses, since token streams are semi-structured (they require balanced parentheses/brackets/braces).

ethindp commented 1 year ago

@H2CO3 This seems like a non-trivial task. According to TokenStream's documentation, it implements Extend<TokenTree> for TokenStream, which is fine. However, a TokenTree consists of either groups, idents, puncts, or literals, but you can't create a group without creating another TokenStream. I'll look into attempting to devise a transformation function, but I'm unsure how easy it'd be (I know that some lookahead will be needed). The language I'm writing a parser for only uses parentheses and square brackets (well, and angle brackets, but those are used only in one place) so I don't need to worry about braces. Is there another way I could do this? Is there a reason this crate is so proc-macro centric and not more generalizable to any kind of stream?

H2CO3 commented 1 year ago

but you can't create a group without creating another TokenStream

Grammars with grouping are inherently recursive. You'll have to use recursion to create the groups. This is exactly the pre-parsing step I was talking about earlier.

Is there a reason this crate is so proc-macro centric and not more generalizable to any kind of stream?

The crate is not proc-macro centric. The reason it uses TokenStream as its input format is that this way it can piggyback on most of the syn ecosystem. If it didn't use the proc-macro2 TokenStream type, I would have to come up with my own equivalent and do the same pre-parsing anyway, because mechanically parsing some constructs (e.g. arbitrary repetition) is very hard or impossible without the lexer already knowing where a group ends.

In a hand-written parser, you solve this problem by looking for a token that surely can't be part of the production. Sub-production parsers generated by Parsel are, however, oblivious to their surrounding context, pretty much by definition. This is what allows them to be mechanistically generated from the AST in the first place.

H2CO3 / parsel

Question: proper way of matching unicode categories? #3