Proposal: Introduce some hook to change encoding of parsed result

zachallaun commented 5 months ago

Hi Mitch! First off, thanks for your work on Spitfire. It's shaping up really well and is already a valuable tool!

Second, to hedge a bit, I don't know whether implementing this proposal would be appropriate right now given Spitfire's current state of development, but I wanted to bring it up now regardless.

Proposal

Introduce some hook, be it a callback, behaviour, or something else, that controls the encoding of the parsed result. The default would be to emit Macro.t(), as Spitfire currently does, but the hook would allow parsing Elixir code into a bespoke AST of arbitrary format.

As a concrete example, you might imagine the following:

Spitfire.parse!("1 + 2.5", encoder: &custom_encoder/1) # not sure what actual encoder arity would be
#=>
%BinaryCall{
  op: :+, 
  meta: %{...}, 
  left: %Integer{value: 1, meta: %{...},
  right: %Float{value: 2.5, meta: %{...}
}

Context

Elixir's AST is intentionally minimal. One reason is to facilitate authoring macros. For example:

foo(x: 1, y: 2)

# parses to:
{:foo, [], [[{:x, 1}, {:y, 2}]]}

# as opposed to:
{:foo, [],
 [[{:{}, [], [:x, 1]},
   {:{}, [], [:y, 2]}]]}

This makes it trivial to use keyword lists as options in macros, but also means that code processing an AST has to handle two forms of tuple. This is only one example of where the default AST can be cumbersome, but there are many situations where complex pattern matching, guards, or even metadata inspection are required to precisely differentiate syntax.

Sourceror, for instance, has a long standing issue for an enriched AST.

Additional Considerations

The perhaps obvious alternative is to transform Elixir AST into whatever format you want after the fact. This has two downsides that I can think of:
1. It is slower to parse, walk, and transform than it would be to parse and output the desired result in one shot.
2. There is additional metadata/context during parsing that is not included in the Elixir AST but that could be valuable. As a concrete example: when parsing Foo.Bar.\nBaz (note the newline), it's not possible to determine which line Bar occurs on without inspecting the source, but with token data, it would be.
It could be valuable to allow this or another hook to maintain and return an accumulator as well. This might be used to collect lint violations (in additional to parse errors). Based on this comment, it seems like you're already planning to return an accumulated value in addition to the parse result.
I expect, if implemented, parsing in the default case would be measurably slower due to the overhead of the additional call whenever a node is being constructed. I'm not sure what an acceptable amount of performance loss is, but I acknowledge that there is a line somewhere. (My gut says something like 1.1-1.2x would be acceptable, while 2x would almost certainly not be.)

mhanberg commented 4 months ago

Hi, sorry I've been off the computer for a while due to a back issue, just getting back in.

I'll give this a thorough read this week and reply.

zachallaun commented 4 months ago

@mhanberg Zero worries! This is low prio. Feel better.

mhanberg commented 3 months ago

@zachallaun I think the premise is reasonable, not entirely sure how it would look off the top of my head.

Naively, I think you could wrap every line that constructs a node with a function that invokes either the node_encoder or does the default behavior. There are probably some places where the AST is being changed in different places after its constructed, which might make it difficult.

Please submit a spike whenever you get some time, it would be helpful to see what it might look like. It's entirely possible that a pluggable encoder still won't be able to achieve the desire you have, so it'd be good to spike out a more intricate case.

As a concrete example: when parsing Foo.Bar.\nBaz (note the newline), it's not possible to determine which line Bar occurs on without inspecting the source, but with token data, it would be.

Side note, this in particular could potentially be upstreamed to the core yecc parser, and then I could mimic the behavior into Spitfire.

elixir-tools / spitfire