doorgan / sourceror

Utilities to manipulate Elixir source code
Apache License 2.0
327 stars 22 forks source link

Proposal: Expand node identification and manipulation API #105

Open zachallaun opened 1 year ago

zachallaun commented 1 year ago

I've recently found myself writing a handful of functions and guards that I think would be useful additions to Sourceror.

The motivation for this is to have a handful of primitives that allow you to differentiate between and navigate the nodes of an AST parsed by Sourceror (and to a lesser extent, the default Code.string_to_quoted options). This is based on an assumption that the fundamental "categories" of nodes in an Elixir AST are:

  1. Literals, which Sourceror may wrap in a :__block__ in order to retain additional metadata
  2. Variables
  3. Special forms
  4. Calls, either local or qualified.

Proposed guards:

Proposed functions:

Additional considerations:

Some of this may be related to the Access work you've been doing. Can it be unified?

doorgan commented 1 year ago

This would be a welcome change! I'd make the following changes though:

Also, what if instead of has_range? we add a fetch_range function with ok/error results? Or we could have both, I'm curious on the motivation for this one

zachallaun commented 1 year ago
  • is_local_call should be named is_unqualified_call, as local or non-local is a runtime property and not necessarily a syntax thing

  • is_var should be named is_identifier

Good points! Agree with both of these.

Also, what if instead of has_range? we add a fetch_range function with ok/error results? Or we could have both, I'm curious on the motivation for this one

My real motivation for this is wanting to know, regardless of what node I have, whether I should be able to call get_range on it.

My first thought was to actually suggest a change to get_range: instead of raising when it can't calculate the range for a node, it should return nil (same semantics as other get_* functions). However, that has two problems that I could think of:

  1. It would be a breaking change.
  2. Since get_range is defined recursively, it would be a non-trivial refactor to guard against nil in all the recursive cases.

Unfortunately, these issues also sort of apply to a fetch_range -- it could rescue FunctionClauseError, but that's pretty gross :P.

Given that, the next-best thing I could think of was a has_range? function!

doorgan commented 1 year ago

I see, though wouldn't the addition of has_range? need a refactor regardless? For a bunch of nodes we need to recursively check the contents to calculate the range(if possible), so this function would need to do something similar.

Regardless, I think has_range? can still be useful so I'm good with the addition :)

zachallaun commented 1 year ago

Hmm... Maybe so! Let's see what happens with the implementation and can decide then what the right API is 🙂

doorgan commented 1 year ago

Sounds good!

doorgan commented 1 year ago

Re: the Access work, that will be built on existing sourceror primitives unless some optimization is required, so it's not something to worry about :)

zachallaun commented 1 year ago

I started working on this a bit, but realized that I need a precise definition for unqualified calls.

For context, here's some "prior art" from ElixirSense:

defguardp is_call(call, params)
          when is_atom(call) and is_list(params) and
                 call not in [:., :__aliases__, :"::", :{}, :|>, :%, :%{}]

I think this generally makes sense, but the definition doesn't strike me as complete. For instance, it doesn't account for | usage in map-update syntax (%{foo | bar: :baz}), or the pin operator ^, which can only be used in a match operation or macro.

So what makes something a "call" vs. a special form? I think the critical element that can be used to evaluate something is whether that thing is definable by users without modifying the compiler.

By that definition, the following would not be considered calls because they are reserved words or forms that cannot be overridden:

This is a much larger set than what ElixirSense is using.

Given this, here's a possible definition:

Definition 1

Calls are functions, macros, and operators that operate on arguments and are definable or overridable in Elixir code.

Does this make sense and is it a useful definition?

An alternative could be to define calls more broadly and to add an additional is_special_form that identifies calls that, in essence, cannot be overridden.

Definition 2

Calls are functions, macros, operators, or special forms that operate on arguments.

If we accept this second definition, I'd think that :__aliases__ would be a call (and special form), while :__block__ would be neither.

Am I overthinking this? 😅

doorgan commented 1 year ago

@zachallaun those are good calls, but from my point of view Sourceror works at a lower level A "call" vs "special form" is a compiler and runtime thing, but Sourceror works at the syntax level, where those semantics don't matter

As far as we are concerned, with foo <- bar do ... end could be a function, a macro, or none at all

In Elixir AST you can boil it down to two scenarios: You have AST representing a node with arguments: {type, metadata, arguments} where type is an atom like :foo This is a call, and includes operators, constructors like % or %{}, etc

In some cases, a call type is another call instead: {{:., dot_metadata, dot_arguments}, metadata, arguments} Those are qualified calls, and represent the foo.bar construct, where the . is what makes it "qualified" This is the case for foo.bar, Foo.bar, and even Foo.{Bar} which is a "qualified tuple" but a call nonetheless

So from Sourceror's point of view, a call or unqualified_call is any 3-tuple node where the type is an atom and the arguments a list, a qualified call is the same but with a dot call as the type

Now, from the point of view of a tool like ElixirSense it does make more sense to narrow down the definition because it's dealing with a higher level where the language semantics do matter.

I do welcome functions that deal with a higher level for the sake of making it easier to navigate the AST considering the semantics, maybe in a separate module under Sourceror that is aware of default elixir semantics, but functions in the Sourceror module itself deal with the syntax only and assume al syntax is meaningless, as in fn or when may not be reserved in some context like macros.

doorgan commented 1 year ago

An useful thing however may be to exclude :__block__ from any definition of call because those are workarounds to represent other syntax constructs, this would match your second definition

I'm torn on :__aliases__ though, because it doesn't accept other AST as arguments and it's pretty much another hack to make it easy to extract the alias segments

zachallaun commented 1 year ago

Thanks for sharing your thoughts! This largely makes sense. I'm interested in identifying higher level semantics, but agree that it makes sense in a separate module. The existing Identifier module kinda does this, right? Perhaps it makes sense in there?

I think :__aliases__ is not a call because it's not transformative in any way -- as you mentioned, it's basically a syntax hack to represent a slightly more complex, but still "atomic", element.

doorgan commented 1 year ago

Actually now that you mention this, Sourceror.Identifier would be a great place for the functions in this proposal. That module started as a bag for vendored helpers of Code.Normalizer and I figured I could make it public

I do agree with you that :__aliases__ is not a call, so let's put it out of the notion of a "call" at the syntax level

I think the next step for semantics aware functions would be to come up with a name/place for such functions, maybe Sourceror.Code? And we rename the existing and private module with that name to something like Sourceror.Compat.Code

We can't have environment aware functions like ElixirSense may have, but we definitely can provide functions that operate on AST informed by vanilla elixir semantics

zachallaun commented 1 year ago

Sounds good!

Another thing we need is a good definition for are literals.

I can imagine two versions, one that's lower level and lives in Sourceror.Identifier and one that's higher level and lives in the new Sourceror.Code.

The lower-level would be is_literal, and it returns true for AST literals. This has a narrow definition: numbers, atoms, strings, or :__block__ nodes that contain only one child -- a number, atom, or string.

The higher-level definition would be data? and would live in Sourceror.Code. This would be similar to Macro.quoted_literal?/1, with the exception that, like is_literal, it considers :__block__ nodes containing a single literal to also be literals.

Thoughts?

zachallaun commented 1 year ago

I'm actually thinking these should be called:

Sourceror.Identifier.is_atomic_literal/1
Sourceror.Code.quoted_literal?/1
zachallaun commented 1 year ago

Regarding moving the existing Sourceror.Code -- Formatter and Normalizer already is in a lib_vendored. What if we make them all Sourceror.Vendored.*? So:

Sourceror.Vendored.Code
Sourceror.Vendored.Code.Formatter
Sourceror.Vendored.Code.Normalizer
doorgan commented 1 year ago

That sounds like a good change, so yes we can do that!