H2CO3 / parsel

Generate parsers directly from AST node types

I love this! #1

Open ckaran opened 1 year ago

ckaran commented 1 year ago

Nothing more than the subject line: I just wanted to say that I love where you're going with this!

I see that you have support for ASTs, but what about concrete syntax trees (CSTs)? I've been toying with the idea of writing a code rewrite engine that can take all my poorly documented code and stub out doc comments using the actual code itself (i.e., parse to a CST, modify the CST, quote it out). I can't just use rust-analyzer for this as I'm eventually going to have to figure out how to handle multiple other languages as well.

Thank you again for working on this!

H2CO3 commented 1 year ago

Well, Parsel's syntax tree nodes are somewhere between abstract and concrete. Nodes have to include every single subproduction, including individual tokens (terminals), in order to be parsed, and so those tokens are printed as well when ToTokens::to_tokens() is called on a syntax tree node. All the "important" information is thus preserved, but due to how proc_macro2::TokenStream stringifies, this solution is not completely lossless and so whitespace and comments are lost.

As things currently stand, you could insert whitespace manually based on the Span information of each syntax tree node, but this doesn't help with comments, and it's probably pretty ugly. I think providing truly, 100% lossless CST nodes would definitely require stepping away from the proc_macro2+syn interface.
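To make the span-based workaround concrete, here is a minimal, std-only sketch. It does not use proc_macro2 or Parsel: `Tok`, `lex`, and `render` are hypothetical names invented for illustration, and the toy lexer only splits on whitespace. It shows that start positions are enough to put spacing and line breaks back, while anything the lexer discarded (here, `//` comments) stays lost.

```rust
/// Hypothetical token carrying the position where it started in the source.
/// `line` is 1-based, `column` is 0-based and counted in characters.
#[derive(Debug, Clone, PartialEq)]
pub struct Tok {
    pub text: String,
    pub line: usize,
    pub column: usize,
}

/// Toy lexer: splits on whitespace, records start positions, and throws
/// away `//` line comments, as a comment-unaware token stream would.
pub fn lex(src: &str) -> Vec<Tok> {
    let mut toks = Vec::new();
    for (i, line) in src.lines().enumerate() {
        // Everything after a line-comment marker is discarded.
        let code = match line.find("//") {
            Some(pos) => &line[..pos],
            None => line,
        };
        let mut start: Option<usize> = None;
        let mut current = String::new();
        let mut col = 0;
        for ch in code.chars() {
            if ch.is_whitespace() {
                // A whitespace character ends the token in progress, if any.
                if let Some(s) = start.take() {
                    let text = std::mem::take(&mut current);
                    toks.push(Tok { text, line: i + 1, column: s });
                }
            } else {
                if start.is_none() {
                    start = Some(col);
                }
                current.push(ch);
            }
            col += 1;
        }
        if let Some(s) = start {
            toks.push(Tok { text: current, line: i + 1, column: s });
        }
    }
    toks
}

/// Rebuild text by padding with newlines and spaces until each token
/// sits at its recorded position again.
pub fn render(toks: &[Tok]) -> String {
    let mut out = String::new();
    let (mut line, mut col) = (1usize, 0usize);
    for t in toks {
        while line < t.line {
            out.push('\n');
            line += 1;
            col = 0;
        }
        while col < t.column {
            out.push(' ');
            col += 1;
        }
        out.push_str(&t.text);
        col += t.text.chars().count();
    }
    out
}
```

Feeding `"fn  main() {\n    let x = 1; // answer\n}"` through `lex` and then `render` restores the double space and the indentation, but the `// answer` comment is gone. Tabs would also come back as single spaces, which is part of why this approach is not truly lossless.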

Yet another question could be: how to parse binary input? That's also not something syn supports. I think the eventual answer will be that we'll have to split up the parsing functionality into several traits, so besides deriving syn::Parse, it would be possible to derive CST and binary parsers as well. An even better solution would be to generalize all of this so that the derived parsers and the helper types (e.g. Punctuated or Maybe) are agnostic of the low-level lexical representation. I'm pretty sure that's a lot of work, but I'm open to eventually generalizing and extending the crate along these dimensions.
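As a rough illustration of what "agnostic of the low-level lexical representation" could look like, here is a std-only sketch. None of this is Parsel's actual API: the `Input` and `Parse` traits, the `LetterP` leaf, and this simplified `Maybe` are all hypothetical, written only to show a helper type defined once and reused over both bytes and characters.

```rust
/// Hypothetical abstraction over the raw input: a cheaply clonable cursor
/// that hands out one item (byte, char, token, ...) at a time.
pub trait Input: Clone {
    type Item;
    fn next_item(&mut self) -> Option<Self::Item>;
}

/// Binary input: a byte slice that shrinks as it is consumed.
impl<'a> Input for &'a [u8] {
    type Item = u8;
    fn next_item(&mut self) -> Option<u8> {
        let slice: &'a [u8] = *self;
        let (&first, rest) = slice.split_first()?;
        *self = rest;
        Some(first)
    }
}

/// Textual input: the standard character iterator.
impl<'a> Input for std::str::Chars<'a> {
    type Item = char;
    fn next_item(&mut self) -> Option<char> {
        self.next()
    }
}

/// Hypothetical parser trait, generic over the input representation.
pub trait Parse<I: Input>: Sized {
    fn parse(input: &mut I) -> Option<Self>;
}

/// Optional production. It is written once and works for every input
/// type, because it only relies on `Clone` to backtrack.
pub struct Maybe<T>(pub Option<T>);

impl<I: Input, T: Parse<I>> Parse<I> for Maybe<T> {
    fn parse(input: &mut I) -> Option<Self> {
        let mut attempt = input.clone();
        match T::parse(&mut attempt) {
            Some(value) => {
                // Commit the consumed input only on success.
                *input = attempt;
                Some(Maybe(Some(value)))
            }
            None => Some(Maybe(None)),
        }
    }
}

/// A leaf production matching the letter `P`. Only leaves need to know
/// about the concrete representation, so they get one impl per input.
pub struct LetterP;

impl<'a> Parse<&'a [u8]> for LetterP {
    fn parse(input: &mut &'a [u8]) -> Option<Self> {
        (input.next_item()? == b'P').then_some(LetterP)
    }
}

impl<'a> Parse<std::str::Chars<'a>> for LetterP {
    fn parse(input: &mut std::str::Chars<'a>) -> Option<Self> {
        (input.next_item()? == 'P').then_some(LetterP)
    }
}
```

With this shape, `Maybe::<LetterP>::parse` accepts either a `&mut &[u8]` or a `&mut Chars`, and a failed attempt leaves the input untouched. A real design would also need error reporting and spans, which this sketch leaves out.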

ckaran commented 1 year ago

> but due to how proc_macro2::TokenStream stringifies, this solution is not completely lossless and so whitespace and comments are lost.

😫

> Yet another question could be: how to parse binary input? That's also not something syn supports. I think the eventual answer will be that we'll have to split up the parsing functionality into several traits, so besides deriving syn::Parse, it would be possible to derive CST and binary parsers as well. An even better solution would be to generalize all of this so that the derived parsers and the helper types (e.g. Punctuated or Maybe) are agnostic of the low-level lexical representation. I'm pretty sure that's a lot of work, but I'm open to eventually generalizing and extending the crate along these dimensions.

I like the idea of fully generalized parsers. I can see a lot of different uses for binary parsers, especially if their output can be represented by CSTs. The big trick is that you can build intelligence into the structs and enums that you derive the parser traits on. This can include additional rules that are hard to put into a parser, but easy to check after the input has been parsed. E.g., say you're parsing a binary format that includes a checksum. The parser will tell you if the message is well-formed, and with the CST you can walk the relevant parts of the message to verify that the checksum holds. Or you can mutate the incoming message in place, calculate a new checksum, and quote it out. Lots of uses!
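The checksum case could look roughly like the sketch below. The wire format is made up for the example (one length byte, the payload, then one checksum byte holding the wrapping sum of the payload), and `Frame`, `ParseError`, and the method names are hypothetical. The parser is hand-written here; the point is only that the node keeps every field, so it can be checked, mutated, and written back out.

```rust
/// Hypothetical lossless node for a made-up format:
/// `[len: u8][payload: len bytes][checksum: u8]`.
#[derive(Debug, Clone, PartialEq)]
pub struct Frame {
    pub len: u8,
    pub payload: Vec<u8>,
    pub checksum: u8,
}

#[derive(Debug, PartialEq)]
pub enum ParseError {
    Truncated,
    TrailingBytes,
}

impl Frame {
    /// Structural parsing only: succeeds for any well-formed frame,
    /// even if the stored checksum is wrong.
    pub fn parse(bytes: &[u8]) -> Result<Frame, ParseError> {
        let (&len, rest) = bytes.split_first().ok_or(ParseError::Truncated)?;
        let n = len as usize;
        if rest.len() < n + 1 {
            return Err(ParseError::Truncated);
        }
        if rest.len() > n + 1 {
            return Err(ParseError::TrailingBytes);
        }
        Ok(Frame { len, payload: rest[..n].to_vec(), checksum: rest[n] })
    }

    /// Wrapping sum of the payload bytes (an assumption of this example).
    fn expected_checksum(&self) -> u8 {
        self.payload.iter().fold(0u8, |acc, b| acc.wrapping_add(*b))
    }

    /// Semantic rule checked after parsing, by walking the parsed node.
    pub fn checksum_ok(&self) -> bool {
        self.checksum == self.expected_checksum()
    }

    /// Overwrite one payload byte in place and refresh the checksum.
    /// Returns `false` if `index` is out of range.
    pub fn patch(&mut self, index: usize, value: u8) -> bool {
        match self.payload.get_mut(index) {
            Some(slot) => {
                *slot = value;
                self.checksum = self.expected_checksum();
                true
            }
            None => false,
        }
    }

    /// "Quote out" the node: every field is emitted exactly as stored.
    pub fn to_bytes(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(self.payload.len() + 2);
        out.push(self.len);
        out.extend_from_slice(&self.payload);
        out.push(self.checksum);
        out
    }
}
```

Because `parse` and `checksum_ok` are separate steps, a frame with a bad checksum still parses and can be inspected or repaired, and `to_bytes` reproduces an unmodified frame byte for byte.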