elixir-lang / elixir

Elixir is a dynamic, functional language for building scalable and maintainable applications
https://elixir-lang.org/
Apache License 2.0

Environments for code fragments/buffers (i.e. elixirsense) #12645

Closed josevalim closed 6 months ago

josevalim commented 1 year ago

Hi everyone,

This is a second take on "building environments for code fragments" for IDEs (a follow-up to #12566). The goal of this issue is to propose a rich data structure for tracking fragments/buffers. Let's start with some background.

Status quo

Currently, Code.Fragment contains functions that analyze where we are in a code snippet, such as cursor_context/2, surround_context/3, and container_cursor_to_quoted/2.

So while Elixir can tell that you are inside an alias Enumera call or that you are in the middle of typing Code.Fr, it does not provide any mechanism for telling you that Code.Fragment is a valid autocompletion for Code.Fr.
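For example, here is how cursor_context classifies the two snippets above (the return shapes follow the Code.Fragment docs; exact charlist formatting varies across Elixir versions):

iex> Code.Fragment.cursor_context("alias Enumera")
{:alias, ~c"Enumera"}

iex> Code.Fragment.cursor_context("Code.Fr")
{:dot, {:alias, ~c"Code"}, ~c"Fr"}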

In order for us to provide such completions, we need to consider three environments:

  1. The fragment/buffer environment. This is a file in your editor, a cell in Livebook, or a prompt in IEx. This file may or may not be valid Elixir.

  2. The eval environment. This comes from previous cells in Livebook and previously executed code in IEx. AFAIK, it does not exist in IDEs.

  3. The global environment. This is all modules, dependencies, etc. Exactly how those modules are compiled and searchable is a separate problem.

The goal of this issue is to build infrastructure to solve the fragment/buffer environment.

Fragment/buffer environment

Take the following code (think of text in your editor):

defmodule Foo do
  import Baz

  def bat do
    var = 123
    {<CURSOR>
  end

  def local_function do
    ...
  end
end

Our goal is to know the fragment/buffer environment so we can use it for completion and other features. For example: what has been imported from Baz that is available inside bat? What are the local functions visible from bat? And so on.

After talking with @michalmuskala, we want to approach this problem using the following techniques:

  1. First, the whole file is loaded as a fragment/buffer

  2. We will process the fragment using a robust parser that will be capable of ignoring incomplete expressions, such as { in the example above

  3. Once the fragment is processed, we will build a tree of scopes. For example, both bat and local_function, each delimited by their do/end blocks, depend on the scope from defmodule Foo do

  4. As the user changes those files, we will replicate those diffs on top of the fragment. For example, if a line is added, then we add said line to the buffer in a way that lets us reuse scopes and minimize additional computations

Therefore, the goal is to always be able to quickly answer which functions, variables, imports, and aliases are available at any point in a fragment, ultimately reducing latency in IDE features. I believe this will allow us to replace elixirsense in a way that is robust and performant.
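To make the scope tree concrete, here is one hypothetical shape it could take for the buffer above (every name below is illustrative, not a proposed API; ranges are {start_line, end_line} of each do/end block):

scope_tree = %{
  scope: {:defmodule, Foo},
  range: {1, 12},
  imports: [Baz],
  children: [
    %{scope: {:def, :bat}, range: {4, 7}, vars: [:var], children: []},
    %{scope: {:def, :local_function}, range: {9, 11}, vars: [], children: []}
  ]
}

With such a structure, editing inside bat only invalidates the {4, 7} subtree; the sibling and parent scopes can be reused.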

The tasks above are major efforts in themselves, so we will need further issues to break them down, but this issue introduces a general direction.

PS: it is still unclear whether the buffer will track function/macro calls - in order to provide go-to definition and similar features - or only the environment. However, note that today it is not trivial to provide mouse-over suggestions with surround_context because it doesn't tell us the arity. Livebook addresses this by explicitly looking for a preceding |> operator, but that does not cover all cases.

/cc @scohen @scottming @lukaszsamson @mhanberg

mhanberg commented 1 year ago

Hi @josevalim! Thanks for including me on this.

For now I don't have too much to add other than observe, but in the coming weeks I will be much more up to speed.

Thanks again!

lukaszsamson commented 1 year ago

The eval environment. This comes from previous cells in Livebook and previously executed code in IEx. AFAIK, it does not exist in IDEs.

It depends on how you classify things. Is app config part of the eval env? Is static analysis? ElixirSense can infer that my_var has the property hour in the following snippet:

my_var = %DateTime{}
my_var.
              ^
josevalim commented 1 year ago

Anything that you conclude from looking outside of the current buffer (such as app config) is either part of the eval environment (an environment that directly affects the __ENV__ the buffer is evaluated in) or part of the global environment. So knowing that a variable set in the buffer holds a certain struct is part of the buffer environment.

scottming commented 1 year ago

Hi, José. Thank you for putting more effort into supporting LS.

Let me try to understand your question and goal. The main goal is:

to propose a rich data structure for tracking fragment/buffers

And the first step is:

The goal of this issue is to build infrastructure to solve the fragment/buffer environment.

And after achieving it, it will be easier for LS authors to implement completion and navigation features. I have no objections to this overall goal; I believe it will indeed make our lives easier.

How does Lexical handle completion?

Next, I want to talk about how Lexical handles completion, and what problems we encountered when using Code.Fragment. I think this will be helpful to you when considering use cases while implementing the above goals.

A key feature of Lexical's completion is context awareness: for example, only modules are completed after alias <CURSOR>; only structs are completed after %St<CURSOR>; the parameter list is reduced after |> fun<CURSOR>; and so on. It can be divided into two stages:

  1. one stage is to determine what kind of context the cursor is in;
  2. the second stage is to provide available completions for that context.

Currently, we use the Env.in_context?/2 function, but it relies on Elixir's private API :elixir_tokenizer.tokenize/4. I know this is not good, and we are trying to avoid it as much as possible. For example, in the Env.in_context?(env, :struct_arguments) implementation, we used container_cursor_to_quoted + Macro.path instead. However, we cannot completely avoid relying on tokenize/4 for all context implementations, such as this function capture example:

      text = ~q(
      defmodule MyModule do
        &MyModule.fun|
      end
      )

      env = new_env(text)

      quoted =
        env.document
        |> Document.fragment(env.position)
        |> Code.Fragment.container_cursor_to_quoted()

      quoted |> dbg()

The result is

quoted #=> {:ok,
 {:defmodule, [line: 1],
  [{:__aliases__, [line: 1], [:MyModule]}, [do: {:__cursor__, [line: 2], []}]]}}

We can't use that result to determine that the cursor is in a capture context, even though it is. This is the problem we encountered in the first stage.

As for the second stage, I found it difficult to know what was around my cursor using the current API alone. Take struct fields as an example: in %AliasedModule.Struct{field: |}, the cursor is already in a struct_arguments context, but then I have to know what AliasedModule.Struct actually is. To achieve this, I had no choice but to traverse the entire AST to implement an alias_mapping/2.
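To illustrate what such an alias_mapping/2 has to do today, here is a simplified sketch of that full-AST traversal (it ignores multi-alias {} groups, as:, and scoping, all of which a real implementation must handle):

defmodule AliasWalk do
  def alias_mapping(source) do
    {:ok, ast} = Code.string_to_quoted(source)

    # Collect every plain `alias A.B.C`, mapping C => A.B.C.
    {_ast, mapping} =
      Macro.prewalk(ast, %{}, fn
        {:alias, _meta, [{:__aliases__, _, parts}]} = node, acc ->
          {node, Map.put(acc, List.last(parts), Module.concat(parts))}

        node, acc ->
          {node, acc}
      end)

    mapping
  end
end

AliasWalk.alias_mapping("defmodule Demo do\n  alias Foo.Bar.Baz\nend")
#=> %{Baz: Foo.Bar.Baz}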

This may also pose similar problems when implementing navigation. For example, if my cursor is in Parent.MyM|odule.Child, then I need to know that the current cursor position is in the Parent.MyModule alias; currently, the API provided by Code.Fragment cannot achieve this. As for the confusing results provided by cursor_context, Steve has already explained them in detail in another issue.

josevalim commented 1 year ago

It can be divided into two stages:

Exactly. This issue is about the second stage.

We can't use that result to determine that the cursor is in a capture context, even though it is. This is the problem we encountered in the first stage.

Every time you run into a scenario like this, please open up an issue. I will tackle the & one shortly. So please submit bug reports for every case, so you no longer need to rely on the tokenizer. :)

To achieve this, I had no choice but to traverse the entire AST to implement an alias_mapping/2.

Correct! That's what we want to solve with this issue in a way that is general and efficient.

This may also pose similar problems when implementing navigation. For example, if my cursor is in Parent.MyM|odule.Child, then I need to know that the current cursor position is in the Parent.MyModule alias

This should be handled by surround_context + container_cursor_to_quoted (that's how we do it in Livebook).

josevalim commented 1 year ago

The & behaviour is fixed on Elixir main and v1.15 branches.

scottming commented 1 year ago

The & behaviour is fixed on Elixir main and v1.15 branches.

Cool, thanks; so fast.

So please submit bug reports for every case, so you no longer need to rely on the tokenizer. :)

Ok, I'll do that.

This should be handled by surround_context + container_cursor_to_quoted (that's how we do it in Livebook).

Ok, I think that part of the code is in identifier_matcher.ex#L362; I'll explore it when I do the navigation task.

scohen commented 1 year ago

This is an interesting approach, @josevalim. I really like the tree approach, and in order to keep things fast, it might help to review how the LSP protocol works, as I would like to avoid any impedance mismatches between the proposal and the LSP spec.

Every time the user types a character, a TextEdit is generated and sent to the language server. This edit is then applied to the document, at which point things like completion happen. Text edits contain two positions, a start and an end, and positions come before the character they mention (a character of 0 occurs before the start of a line). It's also worth noting that a "character" isn't really a character, but a UTF-16 code unit offset (dealing with UTF-16 is very... fun).
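As a point of reference, here is a minimal sketch of applying one such edit, assuming zero-based {line, character} positions and sidestepping the UTF-16 issue by treating characters as UTF-8 graphemes:

defmodule Edits do
  # Replace the exclusive range {start, stop} in text with new_text.
  def apply_edit(text, {{start_line, start_char}, {end_line, end_char}}, new_text) do
    lines = String.split(text, "\n")
    {before, rest} = Enum.split(lines, start_line)
    {edited, aft} = Enum.split(rest, end_line - start_line + 1)

    head = edited |> List.first() |> String.slice(0, start_char)
    tail = edited |> List.last() |> String.slice(end_char..-1//1)

    Enum.join(before ++ [head <> new_text <> tail] ++ aft, "\n")
  end
end

Edits.apply_edit("hello\nworld", {{0, 5}, {1, 0}}, " ")
#=> "hello world"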

If I'm understanding you correctly, we could couple this new data structure to a document when it's opened by the user, so we'd have both a textual representation (the Document in Lexical) and an AST/context representation that we're discussing here. Then, when an edit comes in, it is simultaneously applied to both the textual and AST representations.

Generally, I think this would be fine for us to do; however, edits can affect either large portions of a document (think cut and paste, or search and replace) or very small ones (every character typed or deleted by the user comes to the server as a text edit). I would think that over time it might become necessary to occasionally collapse the edit history of the documents in order to save memory, as we don't really care much about history when building language servers.

One of the tougher problems I encountered before I started the Lexical project was implementing "order aliases". Aliases are a surprisingly complicated problem, and I think they illustrate why this scope approach would be greatly beneficial. If I'm understanding correctly, given the following document:

defmodule Testing do
  alias Foo.Bar.Baz
  alias Baz.{Top.Bottom,
    Bottom.Right
  }
  alias Right.Left
end

we'd be able to say things like: aliases_at(doc, {6, 16})

and that would return:

%{
  "Right" => "Foo.Bar.Baz.Bottom.Right",
  "Bottom" => "Foo.Bar.Baz.Bottom",
  "Baz" => "Foo.Bar.Baz"
}

If so, this would be a great help. We currently have to build this functionality ourselves.

josevalim commented 1 year ago

You got it right. Further context:

  1. We will keep both textual and AST representations (we need both anyway)

  2. We won't keep the edit history ourselves

  3. Help: Can you please link to the TextEdit spec so we use that as a driving force?

  4. Question: column precision for aliases may be very hard. Is line precision enough (i.e. we show aliases defined in previous lines)?

scohen commented 1 year ago

Help: Can you please link to the TextEdit spec so we use that as a driving force?

I sure can. Here's the "docs" for text edits. One thing to note (and that I've asked about before) is that text edits and their positions use zero-based lines and "characters", while almost all of Elixir uses one-based lines and columns. I would love it if we could tell the compiler to be zero-based. I've relegated most of the conversions to framework code, but it'd be cool to not have to do this at all.

Question: column precision for aliases may be very hard. Is line precision enough (i.e. we show aliases defined in previous lines)?

Check my work, but I don't think we can get away with lines alone, as this is valid in Elixir:

alias Foo.Bar.{Baz, Baz.First, Baz.Second}

the aliases for Baz.First and Baz.Second would definitely benefit from knowing that Baz was aliased. This works pretty darn well in IEx, though.

josevalim commented 1 year ago

Check my work, but I don't think we can get away with lines alone, as this is valid in Elixir:

I am wondering if this is actually an Elixir bug? I am not sure if we should do left-to-right aliasing like that.

EDIT: On the other hand, {alias(:bar, as: Bar), Bar} returns {:bar, :bar}, so it looks correct.

zachallaun commented 1 year ago

4. Question: column precision for aliases may be very hard. Is line precision enough (i.e. we show aliases defined in previous lines)?

Apologies if I'm missing something, but would it be sufficient to store the line/col start and end on each node of the "tree of scopes" as it's being constructed? Then answering questions about scope for an exact line/col can be done by drilling down to the leaf containing that line/col and collecting the path of nodes along the way.
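That drill-down could look roughly like this, assuming each node carries an inclusive {start, end} position range and a list of children (the node shape here is hypothetical):

defmodule ScopeLookup do
  # Returns the path of scopes containing pos, outermost scope first.
  def path_at(node, pos) do
    if contains?(node.range, pos) do
      case Enum.find(node.children, &contains?(&1.range, pos)) do
        nil -> [node]
        child -> [node | path_at(child, pos)]
      end
    else
      []
    end
  end

  # {line, column} tuples of equal size compare lexicographically in Elixir.
  defp contains?({first, last}, pos), do: pos >= first and pos <= last
end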

scohen commented 1 year ago

I am wondering if this is actually an Elixir bug? I am not sure if we should do left-to-right aliasing like that.

As far as I can tell, aliases have always supported this (which is just a special case of top-down aliases), but I didn't invent the language, so I don't know what the intention was. 😉 I'm not a fan of this kind of code, but it doesn't seem like a language bug to me.

These are all equivalent:

alias Foo.Bar.{Baz, Baz.First, Baz.Second}
alias Foo.Bar.{
  Baz,
  Baz.First,
  Baz.Second
}
alias Foo.Bar.Baz
alias Baz.First
alias Baz.Second

I'm pretty sure I've seen code like the above in the wild, so it might be too late to take it away.

As an aside, my "organize imports" code sought to normalize all aliases to eliminate interdependent aliases, and transform all of the above cases into:

alias Foo.Bar.Baz
alias Foo.Bar.Baz.First
alias Foo.Bar.Baz.Second

It worked too, but it was extremely complicated.

lukaszsamson commented 1 year ago

@josevalim There is no bug here. @scohen this syntax is a bit simpler than what you described. There is actually no left-to-right aliasing inside {}, and Baz being aliased does not make any difference (at least on 1.14):

iex(1)> alias Foo.Bar.{Baz, Baz.First, Baz.Second}
[Foo.Bar.Baz, Foo.Bar.Baz.First, Foo.Bar.Baz.Second]
iex(2)> __ENV__.aliases
[{Baz, Foo.Bar.Baz}, {First, Foo.Bar.Baz.First}, {Second, Foo.Bar.Baz.Second}]
iex(1)> alias Foo.Bar.{Baz.First, Baz.Second}     
[Foo.Bar.Baz.First, Foo.Bar.Baz.Second]
iex(2)> __ENV__.aliases                      
[{First, Foo.Bar.Baz.First}, {Second, Foo.Bar.Baz.Second}]

But yeah, there are a few other interesting cases with aliases (unaliasing, submodule aliases, __MODULE__ in alias directives, aliasing erlang modules, alias via require, as:)

scohen commented 1 year ago

Err, mine (Elixir 1.14.3) does differ between the two attempts:

iex(1)> alias Foo.Bar.{Baz, Baz.First, Baz.Second}
[Foo.Bar.Baz, Foo.Bar.Baz.First, Foo.Bar.Baz.Second]
iex(2)> __ENV__.aliases
[{Baz, Foo.Bar.Baz}, {First, Foo.Bar.Baz.First}, {Second, Foo.Bar.Baz.Second}]
iex(1)> alias Foo.Bar.{Baz.First, Baz.Second}
[Foo.Bar.Baz.First, Foo.Bar.Baz.Second]
iex(2)> __ENV__.aliases
[{First, Foo.Bar.Baz.First}, {Second, Foo.Bar.Baz.Second}]

(no Baz)

Those look different to me, but that's not the point I'm making here. The thing is that this is allowed, and it's important to know that Baz has been aliased to Foo.Bar.Baz when making completions inside the curly braces, and thus, it's important to know the column (I think).

I do think I see your point though:

iex(3)> alias Foo.Bar.{Baz.Super.Deeply.Nested, Nested.One, Nested.Two}
[Foo.Bar.Baz.Super.Deeply.Nested, Foo.Bar.Nested.One, Foo.Bar.Nested.Two]
iex(4)> __ENV__.aliases
[
  {Nested, Foo.Bar.Baz.Super.Deeply.Nested},
  {One, Foo.Bar.Nested.One},
  {Two, Foo.Bar.Nested.Two}
]

A bit surprising.

josevalim commented 1 year ago

Those look different to me, but that's not the point I'm making here. The thing is that this is allowed, and it's important to know that Baz has been aliased to Foo.Bar.Baz when making completions inside the curly braces, and thus, it's important to know the column (I think).

From looking at the examples, this does not seem to be the case. They are all resolved from the root Foo.Bar. If you swap the order around, you get the same result. :) Good.

scohen commented 1 year ago

So, I think lines will be sufficient, then.

scohen commented 1 year ago

@josevalim I'm starting work on find references, and the first thing I've come across is needing context-aware knowledge of what exactly is under the cursor.

For example, say I have the following:

defmodule ParentModule do 
  defstruct [:key, :value]

  def function(%__MODULE__|{} = parent) do 
    ...
  end
end

(the cursor is the | above). It would be great for me to be able to ask something "hey, what's going on under the cursor?" and have it tell me "why, that's a struct reference, and the module is __MODULE__".

I've tried using Code.Fragment for this purpose, but it returns {:local_or_var, '__MODULE__'}, which omits the important information that this is inside a struct. That information is necessary, because the definition of the module is in a different place than the definition of the struct.

I've noticed that both elixir_sense and lexical need and implement this functionality in the same way: they use the tokenizer. I know that you don't like this, since the tokenizer is private, so I'd like to find some way to use a public API in the future.

So a couple of questions:

  1. Do you agree that this is something that core Elixir should do? If not, what approach would be tenable aside from using the tokenizer?
  2. Is this the right place to chat about it, or should I make a new issue?

Thanks!

By the way, I've implemented an initial version of the environment that works with aliases, if you're interested.

josevalim commented 1 year ago

@scohen I think the ElixirForum is a good place to have these conversations. However, the answer is not going to be different from what I have answered in the past: you will always need a combination of cursor_context and container_cursor_to_quoted. I also think elixir_sense is moving away from the tokenizer, but doing so requires depending on more recent Elixir versions, so it has been a gradual process.
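For the %__MODULE__ example above, that combination looks roughly like this (the surround_context shape follows the Code.Fragment docs; that the struct container survives in the quoted result is my expectation for recent Elixir versions, so verify on your target version):

fragment = "def function(%__MODULE__{"

# What token is under the cursor?
Code.Fragment.surround_context(fragment, {1, 15})
#=> %{begin: {1, 15}, context: {:local_or_var, ~c"__MODULE__"}, end: {1, 25}}

# What encloses the cursor? A :% node on the path to __cursor__ means
# the cursor sits inside a struct reference.
{:ok, quoted} = Code.Fragment.container_cursor_to_quoted(fragment)
path = Macro.path(quoted, &match?({:__cursor__, _, []}, &1))
Enum.any?(path, &match?({:%, _, _}, &1))
#=> true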

scohen commented 1 year ago

This is more a question of dueling implementations.

josevalim commented 1 year ago

It is up to you, really. I won't be able to start on this immediately and, if this feature needs changes to the tokenizer/parser, it will only be available in Elixir v1.16. You can either:

  1. Wait 6 months until this feature is released (you can contribute your existing work if you believe it can speed things up)

  2. Work on your version because you can't afford to wait until v1.16

scohen commented 1 year ago

Gotcha. Well, we definitely don't want to wait six months; I'll post what I have and see if it helps and whether it's the direction you want to take things.

mhanberg commented 10 months ago

@josevalim hope you're doing well.

I was experimenting with Code.Fragment.container_cursor_to_quoted/2 today and ran across some behavior where I can't tell if it's a bug or if I'm misunderstanding a nuance of the function.

I have the "repro" in a .livemd gist that you can load into livebook, but I'll copy paste it here for posterity. https://gist.github.com/mhanberg/188c073bfe31965d147590b0011aef87


container_cursor_to_quoted


The following demonstrates what Code.Fragment.container_cursor_to_quoted/2 will do when you have the cursor inside of an anonymous function, and there are previous siblings earlier in the AST.

I expected the child expressions inside the parent container to be preserved in the AST (meaning the anonymous function, match, and case expressions), but I'm not entirely sure if that is intended.

code =
  """
  defmodule Foo do
    def bar() do
      out_of_reach = "yes"

      String.capitalize(out_of_reach)
    end

    def foo() do
     foo = ["one", "two", "three"]

     foo
     |> Enum.map(fn item ->
       {one, two, three} = 
          case item do
            _ ->
              {1, 2, 3}
          end 

       l
  """
  |> String.trim()

{:ok, ast} = Code.Fragment.container_cursor_to_quoted(code, columns: true)

ast |> Macro.to_string() |> IO.puts()

output:

defmodule Foo do
  def bar() do
    out_of_reach = "yes"
    String.capitalize(out_of_reach)
  end

  def foo() do
    foo = ["one", "two", "three"]
    foo |> Enum.map(__cursor__())
  end
end
josevalim commented 10 months ago

Yes, this is intentional. The goal of container_cursor_to_quoted is to find the container AST, i.e. the call in which the current cursor is located. It is not meant for parsing code. If you want parsing, you probably want #11967.

mhanberg commented 10 months ago

Just to clarify a bit further, to make sure that I fully understand: I would have expected this to behave like the following code.

The point I want to make clear is that in this example, the parent container is the first clause of the case expression, which has its siblings maintained.

In the above example, I believed the parent container to be the anonymous function, but instead it appears to be the parameter list of Enum.map.

fn i, f ->
  case i do
    x -> 
      x + x
      f

which results in

fn i, f ->
  case i do
    x ->
      x + x
      __cursor__()
  end
end

I should have also led with the context of my question; apologies for that. I am implementing autocomplete for Next LS and was attempting to use container_cursor_to_quoted to get the AST above the cursor, so that I could collect the local variables that are in scope to offer as completion items.

If your intuition still says I'm conceptually thinking about #11967, I'll take a look at that.

Thanks José 🙇

josevalim commented 10 months ago

Correct. It works for case because case is considered to be the call, so we keep all of its context. But fn is never considered a call by itself; it is always elided. For an fn, we still want to know it is an argument to Enum.map, so we can show the proper completion/signature.

I am implementing autocomplete for Next LS and was attempting to use container_cursor_to_quoted to get the AST above the cursor, so that I could collect the local variables that are in scope to offer as completion items.

Yeah, container_cursor_to_quoted is not going to help with that, and that's what this current issue is meant to solve. As per the ongoing discussion, there is a lot of work in making sure this functionality is implemented efficiently and correctly, so you are not parsing the whole file on every keystroke.

josevalim commented 10 months ago

For what it's worth, we could make it keep fn, but it wouldn't necessarily make this function useful for your use case. :)

zachallaun commented 10 months ago

For what it's worth, we could make it keep fn, but it wouldn't necessarily make this function useful for your use case. :)

I'm personally in favor of this! It makes more sense to me and provides more information, without (I don't think) giving anything up, since you can still use Macro.path to see that you're in the first argument position of Enum.map.

mhanberg commented 10 months ago

@josevalim I see!

I think I'm fully up to date now on this topic, thank you for clarifying and apologies that it took me so long.

I think I can make do for now, but will circle back very soon to help move this forward.

Thanks!

mhanberg commented 10 months ago

@josevalim what are you imagining for the return value of this "robust" parser?

Would it return the standard AST shape, but have environment information in the meta?

And then I think you said something about incrementally updating the syntax tree; I'm curious whether you have any thoughts on the API design of that.

And with that being said, in your mind, what is the minimal patch you would like to see as the first round of this new feature?

Apologies if I'm misunderstanding something; it's been a long thread and I'm still getting my bearings, haha.

josevalim commented 10 months ago

I am thinking the data structure would be opaque and you would be able to ask questions of this opaque structure. I am not sure yet what the first patch should be; it seems I need to put more work/thought into this to move forward.

mhanberg commented 10 months ago

Gotcha, that makes sense! No worries, I'll be here to answer any usage questions you have (which is the whole point of this tracking issue 😅)

The idea of making it opaque sounds good to me though, as a function-based API would be easier to upgrade than if I had code pattern matching on the structure everywhere.

And FWIW, for my use case in Next LS, since I distribute it via burrito, I'll be able to track the main branch of Elixir and not have to wait for a final release.

zachallaun commented 10 months ago

I am not sure yet what the first patch should be; it seems I need to put more work/thought into this to move forward.

One thing that might be helpful is getting more concrete with a proposed API.

I could imagine an initial "query API" that takes inspiration from Macro.Env and Code.Fragment, but instead of an env/AST, the functions would take a buffer and a position.

An update API would take a buffer, a string, a start position, and an end position (representing an exclusive range) and replace the content of that range in the buffer with the given string.

Figuring out a concrete API that we're trying to implement would answer a lot of questions, I think.
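To seed that discussion, here is a purely hypothetical sketch of the shape such an API could take (every name below is invented for illustration; nothing here exists in Elixir, and the stubs only show the surface):

defmodule BufferSketch do
  # An opaque buffer: parse once, patch incrementally, query by position.
  defstruct source: ""

  # Parse a fragment/buffer into the opaque structure.
  def new(source), do: %__MODULE__{source: source}

  # Replace the exclusive range [start, stop) with text; a real
  # implementation would splice the text and re-balance the scope tree.
  def update(%__MODULE__{} = buffer, _start, _stop, _text), do: buffer

  # Queries in the spirit of Macro.Env, keyed by {line, column}.
  def aliases_at(%__MODULE__{}, _position), do: %{}
  def imports_at(%__MODULE__{}, _position), do: []
  def variables_at(%__MODULE__{}, _position), do: []
end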

josevalim commented 10 months ago

@zachallaun I think a proof of concept would be:

  1. being able to parse a fragment
  2. being able to apply patches to the fragment
  3. and then ask what the aliases or imports are at a given line/column?

lukaszsamson commented 10 months ago

I'd add to the list:

  4. variables and module attributes, and where they are defined/used (see the sketch below). It's surprisingly difficult to accurately pinpoint variables and their references from the AST alone. Macro.Env is not helpful here either.
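To illustrate the difficulty: a naive walk can find the three-element tuples that look like variables, but it cannot tell a binding from a use, nor see bindings introduced by macros (a sketch):

ast = quote do: fn x -> x + y end

{_ast, names} =
  Macro.prewalk(ast, [], fn
    # A {name, meta, context} tuple with an atom context is a variable node,
    # but nothing here distinguishes definition sites from references.
    {name, _meta, context} = node, acc when is_atom(name) and is_atom(context) ->
      {node, [name | acc]}

    node, acc ->
      {node, acc}
  end)

Enum.reverse(names)
#=> [:x, :x, :y] - the binding of x and its later use are indistinguishable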
josevalim commented 10 months ago

Yeah, for sure, but we were talking about a proof of concept, and aliases/imports is a smaller starting point. :)

scohen commented 10 months ago

The Lexical team went ahead and implemented something for aliases and imports already, though it depends on the sourceror package. Currently, the implementation is simplistic and walks the source code every time it's called, but it should be relatively easy for it to emit a structure that can be queried. I intend to implement that this week.

Would you be interested in this as a starting point?

lukaszsamson commented 10 months ago

elixir_sense also extracts that info from the AST in https://github.com/elixir-lsp/elixir_sense/blob/master/lib/elixir_sense/core/metadata_builder.ex. It breaks with code using metaprogramming and basically requires reimplementing what the compiler does.

zachdaniel commented 10 months ago

I have a question about this which may lead to feedback/ideas: Is there some possible way to have this evaluate macros with do blocks, perhaps with an empty do block if it's not syntactically complete? The reason I bring this up:

Ash has lots of nested DSLs, built by spark (https://github.com/ash-project/spark). For example:

json_api do
  routes do
    index :...
  end
end

Autocomplete with elixir_sense can detect the json_api at the top level, but it currently will never see routes and index in that scenario. Spark DSLs are extensible and even patchable, so what this means is that what is imported by each macro is not static; it requires evaluating the macro to know. We've built a custom elixir_sense extension to solve this problem.

I'm not sure what the appropriate solution would be, or if one is even reasonable, but if so, it would be very helpful and could eliminate tons of pretty brittle code.

EDIT: some context is that we can't ensure that all DSL extensions provide only unique names everywhere, so we can't just import everything at the top level. We have to import inside of each macro.

zachallaun commented 10 months ago

Is there some possible way to have this evaluate macros with do blocks, perhaps with an empty do block if it's not syntactically complete?

Agreed that this would be extremely nice to have. I know that elixir_sense has an entire module dedicated to macro expansion, and you have to do some pretty gnarly stuff to expand even something like use MyModule, like evalling the string "require MyModule; __ENV__" so that you can get a Macro.Env that knows how to expand MyModule.__using__.
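Roughly, that trick looks like this (using ExUnit.Case as a stand-in module; Code.eval_string/1 returns {result, binding}):

# Evaluate a tiny program so the resulting env has ExUnit.Case required.
{env, _binding} = Code.eval_string("require ExUnit.Case; __ENV__")

# Expanding `use` against that env yields the require + __using__/1 call
# that would actually run; Macro.expand stops at the resulting block.
Macro.expand(quote(do: use ExUnit.Case), env)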

That said, I think it might be addressed at a later point. I believe this issue is focusing on the static analysis that can be done on a file without any evaluation/global context at all -- just answering questions about what is what from a syntax perspective. So in your example @zachdaniel, if I'm understanding correctly, this issue seeks to provide you with information like "json_api, routes, and index are imported calls", but not information about where they came from.

We'll have to combine "what we know from the text" with "what we know from the runtime" at a later point to enrich things.

zachdaniel commented 10 months ago

Well, it does import real modules, so I'd be more than fine if it just evaluated to determine what modules are imported in that do block (somehow). Something I've also been considering lately, which is almost certainly far too tacky to be built into core but could perhaps be built into language servers, is a module that gets called saying "hey, I'm evaluating this macro, give me information about the env inside of its do block". That could generalize my work with the elixir_sense extension and make it much simpler. Granted, our elixir_sense extension does additional things above and beyond that, but this would be so much simpler.

zachdaniel commented 10 months ago

But yes, understood that it is out of scope for this, will save the conversation for another time 👍

scohen commented 10 months ago

One request that I have is that we make whatever solution emerges more like Macro.Env and less like the map-based interface that IEx uses. I find maps to be much too flexible an API: it's not clear what keys the maps have or what values might be associated with those keys.

mhanberg commented 9 months ago

I've begun working on

We will process the fragment using a robust parser that will be capable of ignoring incomplete expressions, such as { in the example above

I will have a repo to share soon; please let me know @josevalim if you have any specific things in mind for it.

Happy holidays everyone!

josevalim commented 9 months ago

The code will be important, of course, but I think the most important part will be a test suite with many cases we would like to be automatically fixed. :)

mhanberg commented 9 months ago

The code will be important, of course, but I think the most important part will be a test suite with many cases we would like to be automatically fixed. :)

Roger!

mhanberg commented 6 months ago

I have a fully compliant and error-tolerant parser ready (obviously modulo any bugs or syntax I haven't encountered yet) here: https://github.com/elixir-tools/spitfire.

It has a test suite with plenty of test cases, but it is also tested against 30+ popular Elixir projects on GitHub and can parse them identically to Code.string_to_quoted(token_metadata: true, columns: true) as of Elixir commit 52eaf1456182d5d6cce22a4f5c3f6ec9f4dcbfd9, which includes the bug fixes that José and I have made since I started working on the parser.
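The equivalence check is roughly the following (Spitfire.parse/1 is assumed from the elixir-tools/spitfire README; confirm the actual entry point and options there):

source = File.read!("lib/my_app/example.ex")

{:ok, reference} =
  Code.string_to_quoted(source, token_metadata: true, columns: true)

{:ok, tolerant} = Spitfire.parse(source)

tolerant == reference
#=> expected to be true for source files that compile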

There is still room for more and better error tolerance, but I am now moving on to the next step, which is thinking about the data structure/algorithm for building the environment and the API for querying it.

The one thing for the future that I'm not sure how to accomplish yet is expanding macros, as they (particularly the use macro) often inject imports and aliases into a module, bringing them into scope.

@josevalim please let me know if you have any thoughts so far or concerns! I've already gone further than I originally planned before checking back in 😅.

josevalim commented 6 months ago

@mhanberg excellent job!

My only concern is that you depend on :elixir_tokenizer, which is a private API. I wonder if it is worth vendorizing it (and :elixir_interpolation) to make it resilient to changes in Elixir's internals.

Btw, have you had the chance to benchmark the parser? A hand-written parser can lead to better error messages and custom fault tolerance, but I worry that it makes maintenance harder. I also wonder if it leads to better performance.

which is thinking about the data structure/algorithm for building the environment and the API for querying it.

May I propose something else as the next step?

One of the things I have described above is that we need to track "blocks". For example, if you write:

defmodule Foo do
  def bar do
    ...
  end
end

If you make changes inside bar, you don't need to worry about changes elsewhere. So one of the things we need to solve is: how do we receive diffs from the editor and then apply them to whatever representation we will have of the source code structure?

Building the environment and querying the data is most likely something that needs to exist in Elixir itself, because expanding the macros and collecting AST info will depend on private APIs, so we need to figure out the appropriate moment and API to hand those off to Elixir.

lukaszsamson commented 6 months ago

There is still room for more and better error tolerance, but I am now moving on to the next step, which is thinking about the data structure/algorithm for building the environment and the API for querying it.

elixir_sense has a pretty comprehensive implementation that extracts info from the AST, though I admit the data model used there is far from optimal. Most of the test suite can be found in https://github.com/elixir-lsp/elixir_sense/blob/master/test/elixir_sense/core/metadata_builder_test.exs

My only concern is that you depend on :elixir_tokenizer, which is a private API. I wonder if it is worth vendorizing it (and :elixir_interpolation) to make it resilient to changes in Elixir's internals.

ElixirLS and Lexical already vendor it, as does the Sourceror library. There is a need for the tokenizer as a library. It could also improve on error tolerance.

So one of the things we need to solve is: how do we receive diffs from the editor and then apply them to whatever representation we will have of the source code structure?

LSP editors send content modification messages (https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#textDocument_didChange); the server applies those messages and constructs the current version of the document. Those messages are at the character-range level, so theoretically we could map them to token ranges and then to AST nodes.

The one thing for the future that I'm not sure how to accomplish yet is expanding macros, as they (particularly the use macro) often inject imports and aliases into a module, bringing them into scope.

elixir_sense has some limited support for macro expansion; see https://github.com/elixir-lsp/elixir_sense/blob/d7b12a11df4bcacfa64afb23b5566e64509b740d/lib/elixir_sense/core/metadata_builder.ex#L1298-L1305

mhanberg commented 6 months ago

excellent job!

Thanks! 😄

My only concern is that you depend on :elixir_tokenizer, which is a private API. I wonder if it is worth vendorizing it (and :elixir_interpolation) to make it resilient to changes in Elixir's internals.

I had considered it, because there are some changes that could be made to tokenize as it parses, rather than tokenizing everything up front. I punted on it because I wasn't sure which parts of this work would get upstreamed into core (the Spitfire parser itself, one of the other components yet to be built, or none or all of it).

If Spitfire or some section of it was in core, I would assume it would be able to use private API. I am open to anything tho!

Btw, have you had the chance to benchmark the parser? A hand-written parser can lead to better error messages and custom fault tolerance, but I worry that it makes maintenance harder. I also wonder if it leads to better performance.

No benchmarking yet; I've only observed parsing times for whole projects (including tokenizing time). For example, it parses all 272 source files in elixir-lang/elixir (134,242 LOC) in 265ms and 264 files in absinthe (24,753 LOC) in 17ms.

I'll do some actual benchmarking here soon

May I propose something else as the next step?

One of the things I have described above is that we need to track "blocks". For example, if you write: ...

I was planning on punting on this until later because I wasn't exactly sure how to accomplish it 😅.

I'm open to moving to this part now though, if that is what your gut is telling you. I had just assumed it was an optimization and opted for getting the "full stack" thing working first.

I still need to do some research into how this is accomplished elsewhere, but the immediate thoughts/concerns I have are:

I guess it's just the one concern for now, actually.

Building of the environment and querying of the data is most likely something that needs to exist in Elixir, because expanding the macros and collecting AST info will depend on private APIs, so we need to figure out the appropriate moment and API to hand-off those to Elixir.

Another callback to my previous comment about not knowing where the lines for inclusion into core should be drawn.


As I was source diving more compiler internals last night, the thought occurred to me that perhaps, to accomplish this, we are somewhat building another compiler frontend. Also, while reading through the source code of LSPs in other languages, I saw that rust-analyzer's description is "A Rust compiler front-end for IDEs".

It seems like eventually the steps may be similar to what exists today, except it would use the error-tolerant parser and, after expanding everything, it wouldn't compile to Erlang but would instead leave the result as a data structure and/or emit tracer events.

I'm curious to hear your thoughts on that, but we can leave it for another time so as not to get off topic.


Thank you José!