elixir-tools / spitfire

Error tolerant parser for Elixir
https://www.elixir-tools.dev
MIT License
69 stars 7 forks source link

Proposal: Introduce some hook to change encoding of parsed result #18

Open zachallaun opened 5 months ago

zachallaun commented 5 months ago

Hi Mitch! First off, thanks for your work on Spitfire. It's shaping up really well and is already a valuable tool!

Second, to hedge a bit, I don't know whether implementing this proposal would be appropriate right now given Spitfire's current state of development, but I wanted to bring it up now regardless.

Proposal

Introduce some hook, be it a callback, behaviour, or something else, that controls the encoding of the parsed result. The default would be to emit Macro.t(), as Spitfire currently does, but the hook would allow parsing Elixir code into a bespoke AST of arbitrary format.

As a concrete example, you might imagine the following:

Spitfire.parse!("1 + 2.5", encoder: &custom_encoder/1) # not sure what actual encoder arity would be
#=>
%BinaryCall{
  op: :+, 
  meta: %{...}, 
  left: %Integer{value: 1, meta: %{...},
  right: %Float{value: 2.5, meta: %{...}
}

Context

Elixir's AST is intentionally minimal. One reason is to facilitate authoring macros. For example:

foo(x: 1, y: 2)

# parses to:
{:foo, [], [[{:x, 1}, {:y, 2}]]}

# as opposed to:
{:foo, [],
 [[{:{}, [], [:x, 1]},
   {:{}, [], [:y, 2]}]]}

This makes it trivial to use keyword lists as options in macros, but also means that code processing an AST has to handle two forms of tuple. This is only one example of where the default AST can be cumbersome, but there are many situations where complex pattern matching, guards, or even metadata inspection are required to precisely differentiate syntax.

Sourceror, for instance, has a long standing issue for an enriched AST.

Additional Considerations

mhanberg commented 4 months ago

Hi, sorry I've been off the computer for a while due to a back issue, just getting back in.

I'll give this a thorough read this week and reply.

zachallaun commented 4 months ago

@mhanberg Zero worries! This is low prio. Feel better.

mhanberg commented 3 months ago

@zachallaun I think the premise is reasonable, not entirely sure how it would look off the top of my head.

Naively, I think you could wrap every line that constructs a node with a function that invokes either the node_encoder or does the default behavior. There are probably some places where the AST is being changed in different places after its constructed, which might make it difficult.

Please submit a spike whenever you get some time, it would be helpful to see what it might look like. It's entirely possible that a pluggable encoder still won't be able to achieve the desire you have, so it'd be good to spike out a more intricate case.

As a concrete example: when parsing Foo.Bar.\nBaz (note the newline), it's not possible to determine which line Bar occurs on without inspecting the source, but with token data, it would be.

Side note, this in particular could potentially be upstreamed to the core yecc parser, and then I could mimic the behavior into Spitfire.