OpenRefine / OpenRefine

OpenRefine is a free, open source power tool for working with messy data and improving it
https://openrefine.org/
BSD 3-Clause "New" or "Revised" License
10.81k stars 1.95k forks source link

Rewrite the GREL parser with a parser generator #6465

Open wetneb opened 6 months ago

wetneb commented 6 months ago

The code to parse GREL expressions is was written manually and not using any parser generator. We could consider rewriting it using a parser generator, such as javacc. This involves writing a grammar for the language and using a code-generation tool to generate the parser out of the grammar. This is widely regarded a best practice to write parsers for programming languages.

Among the benefits this would bring, I can think of:

Among the downsides, I can think of:

Tagging @tfmorris who has been working in this area recently. Do you see this as a worthwhile pursuit? Assuming we want to do it, I think it would make for a good GSoC topic.

ostephens commented 6 months ago

I wonder if this would or could open up options for supporting GREL syntax highlighting and/or help in the UI? See https://github.com/OpenRefine/OpenRefine/issues/153

wetneb commented 6 months ago

Good point! Having a clear definition of the grammar would likely make syntax highlighting easier, and perhaps auto-completion too. Assuming we can use the same sort of grammar definition within some JS library which would implement a client-side parser (which I haven't looked into).

tfmorris commented 6 months ago

I have fixes in hand for all of the bugs/features listed. I've been kind of reluctant to put too much effort into getting them integrated because of the difficulty with #3257 which I abandoned.

Having a formal grammar would be a good start, but it would involve more than just codifying the current behavior. There are a bunch of tests with comments like "This seems wrong, but it's how the current code works" that should probably be evaluated and decided on.

I think syntax highlighting, code completion, etc would be implemented via a language server and the Language Server Protocol. I don't know if it works off the same type of grammar definition that you would feed a parser generator.

I guess I don't have a strong opinion for or against, but I'd add to the downsides that a new parser implementation would be a likely source of compatibility bugs.

tfmorris commented 6 months ago

p.s. in the context of GSoC, it could be positioned as an exploratory investigation / prototyping exercise, rather than something which is intended to produce production code in one go.

wetneb commented 6 months ago

Thanks for the feedback!

I agree it's not just about codifying the existing behavior. From my perspective I think it would totally be expected that the new parser would differ from the existing one in certain ways, but those ways should be clear improvements or bug fixes over the existing one.

The Language Server Protocol looks like an interesting standard. I like the idea and the fact that it is language independent would intuitively make it a good candidate to enable features for all our expression languages in a unified way. But implementing it would likely require a lot more work as it seems to be much more ambitious in the features it exposes. Some of them, such as enabling refactoring actions, would probably be of marginal interest for an expression editor.

wetneb commented 6 months ago

I had a closer look at this and the parser isn't that complicated. I think there would be more value in generating the lexer (that we call "Scanner") than the parser itself, probably. Anyway since it's not really clear it's worth the trouble, I haven't added it to the GSoC projects list.