elixir-lang / elixir

Elixir is a dynamic, functional language for building scalable and maintainable applications
https://elixir-lang.org/
Apache License 2.0

Environments for code fragments/buffers (i.e. elixirsense) #12645

Closed josevalim closed 8 months ago

josevalim commented 1 year ago

Hi everyone,

This is a second take on "building environments for code fragments" for IDEs (a follow-up to #12566). The goal of this issue is to propose a rich data structure for tracking fragments/buffers. Let's start with some background.

Status quo

Currently, Code.Fragment contains functions that analyze where we are in a code snippet:
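
For example (illustrative calls to the existing Code.Fragment API; the outputs are shown as recent Elixir versions print them):

Code.Fragment.cursor_context("alias Enumera")
#=> {:alias, ~c"Enumera"}

Code.Fragment.cursor_context("Code.Fr")
#=> {:alias, ~c"Code.Fr"}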

So while Elixir can tell that you are inside an alias Enumera call or that you are in the middle of typing Code.Fr, it does not provide any mechanism for telling you that Code.Fragment is a valid autocompletion for Code.Fr.

In order for us to provide such completions, we need to consider three environments:

  1. The fragment/buffer environment. This is a file in your editor, a cell in Livebook, or a prompt in IEx. This file may or may not be valid Elixir.

  2. The eval environment. This comes from previous cells in Livebook and previously executed code in IEx. AFAIK, it does not exist in IDEs.

  3. The global environment. This is all modules, dependencies, etc. Exactly how those modules are compiled and searchable is a separate problem.

The goal of this issue is to build infrastructure to solve the fragment/buffer environment.

Fragment/buffer environment

Take the following code (think of text in your editor):

defmodule Foo do
  import Baz

  def bat do
    var = 123
    {<CURSOR>
  end

  def local_function do
    ...
  end
end

Our goal is to know what the fragment/buffer environment is so we can use it for completion and other features. For example: what has been imported from Baz that is available inside bat? What are the local functions visible from bat? And so on.

After talking with @michalmuskala, we want to approach this problem using the following techniques:

  1. First the whole file is loaded as a fragment/buffer

  2. We will process the fragment using a robust parser that will be capable of ignoring incomplete expressions, such as { in the example above

  3. Once the fragment is processed, we will build a tree of scopes. For example, both bat and local_function, each delimited by their do/end blocks, depend on the scope from defmodule Foo do

  4. As the user changes those files, we will replicate those diffs on top of the fragment. For example, if a line is added, then we add said line to the buffer in a way we can reuse scopes and minimize additional computations

Therefore, the goal is to always be able to quickly answer which functions, variables, imports, and aliases are available at any point in a fragment, ultimately reducing latency in IDE features. I believe this will allow us to replace elixirsense in a way that is robust and performant.
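
To make the scope-tree idea concrete, here is a hypothetical sketch (the Scope struct and visible_at/2 are invented for illustration, not a proposed API): each do/end block becomes a node that inherits from its parent, so answering "what is visible at this line?" is a walk from the root to the innermost scope covering that line.

defmodule Sketch.Scope do
  defstruct range: {1, 1}, imports: [], aliases: [], vars: [], children: []

  # Everything visible at a line: this scope's own bindings plus whatever
  # the (at most one) child scope covering the line contributes.
  def visible_at(%__MODULE__{range: {lo, hi}} = scope, line)
      when line >= lo and line <= hi do
    inner =
      Enum.find_value(scope.children, %{imports: [], aliases: [], vars: []}, fn child ->
        visible_at(child, line)
      end)

    %{
      imports: scope.imports ++ inner.imports,
      aliases: scope.aliases ++ inner.aliases,
      vars: scope.vars ++ inner.vars
    }
  end

  def visible_at(_scope, _line), do: nil
end

For the Foo example above (cursor on line 6 of the snippet), the tree and the query would look like:

tree = %Sketch.Scope{
  range: {1, 12},
  imports: [Baz],
  children: [
    %Sketch.Scope{range: {4, 7}, vars: [:var]},  # def bat
    %Sketch.Scope{range: {9, 11}}                # def local_function
  ]
}

Sketch.Scope.visible_at(tree, 6)
#=> %{imports: [Baz], aliases: [], vars: [:var]}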

The tasks above are major efforts in themselves, so we will need further issues to break them down, but this issue introduces a general direction.

PS: it is as yet unclear whether the buffer will track function/macro calls - in order to provide go-to definition and similar features - or only the environment. However, note that today it is not trivial to provide mouse-over suggestions with surround_context because it doesn't tell us the arity. Livebook addresses this by explicitly checking for |> operators, but that does not cover all cases.
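
For instance (a quick illustration; the result shape follows the Code.Fragment docs), surrounding a dot call names the module and function but says nothing about how many arguments the call receives:

Code.Fragment.surround_context("Enum.map(list, fun)", {1, 6})
#=> %{begin: {1, 1}, context: {:dot, {:alias, ~c"Enum"}, ~c"map"}, end: {1, 9}}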

/cc @scohen @scottming @lukaszsamson @mhanberg

josevalim commented 8 months ago

I'm open to moving to this part though, if that is what your gut is telling you. I had just assumed it was an optimization and opted for getting the "full stack" thing working first.

Hrm, I think you are right. It may make sense to optimize it only later. Doing a single pass on the code and expanding all macros at once may be fine for the initial version. My only concern is that you will have to reverse engineer all of our changes to Macro.Env, and one of the concerns elixir_sense maintainers reported in the past was keeping up with Elixir changes over time (I am not sure if this concern is still relevant). So, if we could accommodate more in Elixir, it would bring more stability to everyone (as well as avoid relying on private APIs). But I'd say there is still a very open question of where to draw the line.

how to handle line/column related metadata? if we incrementally tokenize/parse changed ranges of the document, the metadata will get out of sync.

@michalmuskala do you have practical suggestions on how to handle this? Do you simply bump everyone's line offset accordingly?

zachallaun commented 8 months ago

@michalmuskala do you have practical suggestions on how to handle this? Do you simply bump everyone's line offset accordingly?

I vaguely recall that rust-analyzer handles this by storing line/column information as relative instead of absolute values and then lazily rebuilding the absolute values as you traverse the structure. So a node contains only its own extents (e.g. 10 cols, 1 line, or for multiline nodes, 5 cols start, 3 lines, 9 cols end). This allows you to replace nodes without concern for updating offsets elsewhere.

josevalim commented 8 months ago

I vaguely recall that rust-analyzer handles this by storing line/column information as relative

Hrm, relative to what? Maybe relative to constructs that start a new scope, such as a new function or a new conditional?

zachallaun commented 8 months ago

I vaguely recall that rust-analyzer handles this by storing line/column information as relative

Hrm, relative to what? Maybe relative to constructs that start a new scope, such as a new function or a new conditional?

I'm definitely missing something in my recollection, but I think you could do this by storing, for each node, the position relative to its parent's start line/column as well as its own extent. I'll try to find where I read about this.

Edit: Nope, that wouldn't work because a node changing its extent would affect the relative position of all later siblings, so you have the same issue.

Edit the 2nd: I believe this reference is where the idea came from.

lukaszsamson commented 8 months ago

one of the concerns elixir_sense maintainers reported in the past was keeping up with Elixir changes over time (I am not sure if this concern is still relevant)

It still is but it's not that bad. The tokenizer is not a stable API and AST metadata changes with every release but the high level concepts they represent (modules, protocols, defs, structs, attributes, vars) haven't changed much since elixir ~1.3

As I was source-diving more compiler internals last night, the thought did occur to me that perhaps, to accomplish this, we are somewhat building another compiler frontend. Also, while reading through the source code of LSPs in other languages, I saw that rust-analyzer's description is "A Rust compiler front-end for IDEs"

Exactly. The tools end up reimplementing parts of the compiler (e.g. alias resolution, variable tracking, function call tracking, type inference) to be able to answer questions about high level concepts represented by code. And rust-analyzer isn't the only example of a compiler frontend serving as a base for LSP servers. MS did something similar with Roslyn for C#.

I vaguely recall that rust-analyzer handles this by storing line/column information as relative instead of absolute values and then lazily rebuilding the absolute values as you traverse the structure

@zachallaun @josevalim I've seen something similar in LSP semantic tokens that editors can use for semantic highlighting. The idea is to store token position relative to the previous token. That way whenever a token is added, moved or removed, only this token or its neighbours are affected. Maybe this could be extended to syntax trees as well. If an edit is made in a block, it should not affect the inside of a next or previous block. We would need to store which tokens (or at least first and last token) generate a particular AST node.
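
For reference, this is roughly how the LSP wire format does it (a minimal encoder sketch; the module name and the token type integers are invented, but the five-integers-per-token layout is from the LSP spec):

defmodule Sketch.SemanticTokens do
  # Each token {line, col, length, type} (0-based) becomes
  # [delta_line, delta_start, length, type, modifiers], where delta_start
  # is relative to the previous token's start when both share a line.
  def encode(tokens) do
    {data, _last} =
      Enum.map_reduce(tokens, {0, 0}, fn {line, col, len, type}, {prev_line, prev_col} ->
        delta_line = line - prev_line
        delta_start = if delta_line == 0, do: col - prev_col, else: col
        {[delta_line, delta_start, len, type, 0], {line, col}}
      end)

    List.flatten(data)
  end
end

Sketch.SemanticTokens.encode([{0, 0, 3, 1}, {0, 4, 5, 2}, {2, 2, 4, 1}])
#=> [0, 0, 3, 1, 0, 0, 4, 5, 2, 0, 2, 2, 4, 1, 0]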

michalmuskala commented 8 months ago

@michalmuskala do you have practical suggestions on how to handle this? Do you simply bump everyone's line offset accordingly?

As far as I know, most of the interactive-first IDEs have a setup with effectively 2 AST data structures - this was popularised by Roslyn (the C# compiler) and is roughly similar for Rust Analyzer & ELP (through Rowan), and Swift, and perhaps others.

I can talk more about what it looks like in RA & ELP.

First, the AST only stores byte offsets. Storing line + column information is extremely expensive. Early in building ELP we noticed that a large amount of the LSP's memory was spent just storing position information. In particular, in Erlang [{line, Int}, {column, Int}] is 80 bytes; if you want to store a full span, this becomes 160 bytes (and we really wanted full spans for good errors). In RA & ELP, most of the time position information is stored as 32-bit byte offsets, so 4 bytes for a simple position, 8 bytes for a full (beginning, end) span. We convert to line + column information at the edge with a simple side-lookup table that stores the offsets of all newlines in the file and a simple binary search.
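
A minimal sketch of that side-lookup table (module name assumed; newline offsets are collected once, and any byte offset then converts to a 1-based {line, column} via binary search):

defmodule Sketch.LineIndex do
  # Byte offsets at which each line starts (line 1 starts at offset 0).
  def new(source) do
    starts = for {pos, _len} <- :binary.matches(source, "\n"), do: pos + 1
    List.to_tuple([0 | starts])
  end

  def position(line_starts, offset) do
    search(line_starts, offset, 0, tuple_size(line_starts) - 1)
  end

  # Binary search for the greatest line start <= offset.
  defp search(starts, offset, lo, hi) when lo < hi do
    mid = div(lo + hi + 1, 2)

    if elem(starts, mid) <= offset do
      search(starts, offset, mid, hi)
    else
      search(starts, offset, lo, mid - 1)
    end
  end

  defp search(starts, offset, lo, _hi) do
    {lo + 1, offset - elem(starts, lo) + 1}
  end
end

idx = Sketch.LineIndex.new("defmodule Foo do\n  :ok\nend\n")
Sketch.LineIndex.position(idx, 19)
#=> {2, 3}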

The first AST, and the one that is persisted, is built on "Green Nodes". Those nodes only store their own length in bytes & a list of child nodes. In particular you can cheaply replace a node in the "green tree" and you only need to update the spine of the tree. By storing only the size of the node itself (including the size of children) it naturally becomes "relative" and is not invalidated by unrelated changes. Furthermore, the nodes don't have "identity", so they get de-duplicated - e.g. every def in the file becomes exactly the same def node referenced from multiple places. This further lowers memory pressure. This is also done for whole trees, e.g. all 1 + 1 pieces will be the same node (and both 1 nodes inside that will be the same node).

Storing the sizes also allows pretty quickly finding the right node given a position - starting from the root, you can do a binary search on each layer to get to the right node in O(log n) time.

However, this is not very practical for actual traversal, since position information is useful to have. To get it, during traversal you build a "red tree" that maintains proper position information from the beginning of the file. This "red node" acts, in a way, similar to a zipper - storing the current computed information + references to current, parent, and sibling green nodes. The red nodes are never stored/persisted, instead you store a (node_type, begin_offset, end_offset) tuple that can be re-formed efficiently into a red node in the next traversal from a green tree. In RA/ELP those red nodes are purely ephemeral, I think Roslyn has some caching mechanism.
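
To ground this in Elixir terms, here is a toy sketch of the two structures (names and shapes invented for illustration; real implementations also intern and deduplicate green nodes, which plain structs don't capture):

defmodule Sketch.Green do
  # A green node knows only its kind, its total byte length, and its
  # children - no absolute positions, so subtrees can be shared/replaced.
  defstruct [:kind, :len, children: []]
end

defmodule Sketch.Red do
  # An ephemeral "red" wrapper: a green node plus its absolute start offset,
  # computed while walking down from the root.
  defstruct [:green, :offset]

  def root(green), do: %__MODULE__{green: green, offset: 0}

  # Children materialize lazily, accumulating absolute offsets left to right.
  def children(%__MODULE__{green: g, offset: off}) do
    {reds, _end} =
      Enum.map_reduce(g.children, off, fn child, acc ->
        {%__MODULE__{green: child, offset: acc}, acc + child.len}
      end)

    reds
  end

  # Descend to the innermost node covering a byte offset (a linear scan
  # here; the implementations described above binary-search each layer).
  def node_at(%__MODULE__{} = red, pos) do
    covering =
      Enum.find(children(red), fn c ->
        pos >= c.offset and pos < c.offset + c.green.len
      end)

    if covering, do: node_at(covering, pos), else: red
  end
end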

zachallaun commented 8 months ago

Thank you for the proper explanation, @michalmuskala! This is indeed what I was remembering above.

Harking back to the original premise of this issue, I think this would fit well into a %Buffer{} of some kind. The buffer would not store AST as we currently know it in Elixir, but the "green nodes" that Michal described. Accessing/traversing the buffer might yield something like a %Buffer.Cursor{} (a zipper) that has a reconstructed node if valid and an error if not...?

josevalim commented 8 months ago

Thank you @michalmuskala! One of the things that are unclear to me is if the structure above is used for all files or only the currently open buffers. I understand storing line+column (especially as a keyword) can be expensive, but I think that we can treat a buffer and a compiled module differently. Especially because in the buffer case you need to know line by line which imports, aliases, and so on are available to show for completion, and you don't need that for compiled modules. So, in your experience, is storing line+column expensive even for buffers?


In any case, it is clear there is plenty of opportunity for experimentation, and depending on the release cycle of Elixir would only slow things down. So I think the best way Elixir can support this is by making sure we grow our compiler API surface enough to reduce the use of private APIs.

Any sort of buffer analysis tool in Elixir needs to expand macros but also specially handle some nodes. For example, Kernel.defmodule is a special node: it doesn't make sense to expand it, but rather to analyze it further. Kernel.def is another.

Therefore, Elixir definitely needs to provide a lower-level API for expanding macros. We do have Macro.expand_once/2, but it does the whole thing, including tracing, error handling, etc., which may not be desired.

Then you also need to be able to handle all special forms. To be honest, that does not sound very complicated. Here is a rudimentary pass for the type system in less than 300 LOC. It is basically an AST traversal where you must consider which parts of the special forms are patterns/guards and how each special form modifies the environment.

The environment modification is another part that I worry about. Some of the fields in the environment are very straightforward to manage, such as module, function, file, etc. But the fields aliases, functions, macro_aliases, macros, requires, and versioned_vars are pretty much private. So, at the very minimum, we need to add APIs for storing new aliases, imports, requires, and variables in the environment that you can use throughout expansion.
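
For a sense of what such additions could look like, here is a sketch modeled on the Macro.Env helpers later proposed in #13361 (define_alias/define_import and friends; exact signatures and availability depend on your Elixir version):

env = Macro.Env.prune_compile_info(__ENV__)

# Record the effect of `alias Foo.Bar` while expanding a buffer:
{:ok, env} = Macro.Env.define_alias(env, [line: 2], Foo.Bar, as: Bar)

# Record the effect of `import Integer, only: [digits: 1]`:
{:ok, env} = Macro.Env.define_import(env, [line: 3], Integer, only: [digits: 1])

env.aliases
#=> [{Bar, Foo.Bar}]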

Does that sound like a viable first step to everyone? @lukaszsamson, is there anything elixir_sense needs in particular you believe is worth adding to the scope?

mhanberg commented 8 months ago

I think exposing more compiler apis is a good next step 👍.

scohen commented 8 months ago

@josevalim I think your last post has gotten me thinking. We've spent an inordinate amount of time answering the question of "Given a document, what does the environment look like at a given line and column", and have had to recapitulate a lot of what happens in the compiler and is surfaced via compilation tracers. In addition to what you've highlighted above, imports were also particularly tricky. It would be fantastic to get an API like the following:

@spec env_at(document :: String.t(), line :: pos_integer(), column :: pos_integer()) :: Macro.Env.t()

I also think it might be nice to somehow have this in a standalone library. While it's great to get changes into core Elixir, it does mean that we have to do things two ways: one way for versions that don't support the new API and another for versions that do. Lexical supports Elixir > 1.13, and it will be years before it can transition to any new way of doing things post 1.16.

josevalim commented 8 months ago

@scohen, that's pretty much what we have been discussing. :) We want to have a similar API to what you propose for any active buffer (we don't want to do that for all modules because that would be very expensive; for compiled modules, you only need to know the result, i.e. that a function was imported from X, not all possible imports).

And I agree, this solution needs to be available outside of Elixir, for availability and development speed. I'd recommend you folks discuss at some point what a shared solution would look like when it comes to the global environment and buffer environment, as originally outlined in this issue. For now I will personally only focus on providing the basic environment and macro manipulation functions in Elixir to aid building the buffer environment. For the compiler environment, tracing is pretty much the way to go (we need to add new traces related to Mix, though). :) I will submit a proposal for the new functions relatively soon.

mhanberg commented 8 months ago

Just to clarify for everyone, I am actively working on this issue, and it lives in a 3rd party library called Spitfire.

I am going to continue to harden the parser while José/me/whoever creates a public API for more compiler features in core, and when those become available, I will use them in Spitfire to produce more or less the function spec that scohen shared (which I more or less already had living on my computer when I came back to this issue to share my progress, haha).

In the end, I am very open to looking at Spitfire and seeing if any parts (the parser, the API) can be upstreamed into core, but for now it sounds like keeping it in Spitfire is what we want in order to make faster progress.

michalmuskala commented 8 months ago

Thank you @michalmuskala! One of the things that are unclear to me is if the structure above is used for all files or only the currently open buffers. I understand storing line+column (especially as a keyword) can be expensive, but I think that we can treat a buffer and a compiled module differently. Especially because in the buffer case you need to know line by line which imports, aliases, and so on are available to show for completion, and you don't need that for compiled modules. So, in your experience, is storing line+column expensive even for buffers?

For RA and ELP there's generally no difference between the current file and other files - everything goes through the same API. In particular, they don't rely on the regular compiler to do analysis (other than some diagnostics triggered on-save). This generally isn't a big problem apart from macros, and makes the analysis significantly faster (and possible to do as-you-type, rather than on-save). Conceptually, once the LSP state is loaded, the entire state is held in the process, and disk or external programs are not consulted for anything.

In Erlang's case regular macros are relatively easy to re-implement on top of our syntax tree. For parse transforms we just don't support them, outside of a limited number of standard parse transforms that we effectively re-implement to work on top of our syntax tree. For eqWAlizer we actually do use the Erlang compiler ASTs to do the analysis (since it was built at a time when ELP didn't exist yet). To do that, ELP runs a small escript as a sidecar process - there we maintain a full fork of the relevant compiler passes (lexer, pre-processor, parser, and erl_lint) to customise behaviour, in particular maintaining positions as byte offsets and smarter include resolution. Long-term we intend to modify those forks further to use ELP itself for include resolution and for providing original source, rather than using state from disk, since we identified this as a source of flakiness (LSP state and on-disk state can diverge, especially during rebasing and other many-file operations).

For Rust it's somewhat more complex. For macro_rules they also re-implement them on top of their syntax trees. For procedural macros (arbitrary Rust programs, somewhat closer to Elixir macros), they have a very well defined compiler/build system interface and have to be contained in a separate crate as a dependency, so they generally don't need to be dynamically re-compiled during regular operation. That said, they have, in places, limited support and quite a lot of complex integration logic to map from internal syntax trees to the public interface of the compiler and back, while trying to preserve position information.

josevalim commented 8 months ago

Perfect, thank you. I think it may be worth keeping them separate in Elixir, due to the amount of work that may be required to store precise imports, aliases, and variables per position in the presence of macros. I guess we will learn as we move forward.

scohen commented 8 months ago

Just to clarify for everyone, I am actively working on this issue, and it lives in a 3rd party library called Spitfire.

Yes, I understood that; my worry is that Spitfire will then depend on what José comes up with, though I believe you're in the same boat as Lexical and ElixirLS, so upon further reflection, my concern might not be particularly warranted. I was actually wondering if we could have the Elixir changes as a separate library as well. My memory might be a little fuzzy, but I think I remember GenStage had a particularly clever packaging model where it was released with an Experimental prefix that you could alias away via alias Experimental.GenStage, with the eventual goal of placing it inside the Elixir standard library.

Maybe we can do something similar with the new compiler enhancements? I'd like to ship them sooner rather than later.

josevalim commented 8 months ago

What @lukaszsamson does in ElixirLS is to have a copy of the functionality inside ElixirLS. For example, if new functions are added to Macro.Env, he keeps a copy as ElixirLS.Macro.Env (for example) and removes it once everything is up to date.

lukaszsamson commented 8 months ago

Does that sound like a viable first step to everyone?

It does for me

Some of the fields in the environment are very straightforward to manage, such as module, function, file, etc.

It may be straightforward to manage on the compiler end, but mapping a cursor position to an env is not that easy. Due to macros it is a one-to-many relationship, e.g. in this code the cursor env module is both MyProtocol.List and MyProtocol.Map:

defimpl MyProtocol, for: [List, Map] do
  # |
end

Or what are the env module and current function for quoted code? Looking at the code below, it's easy to tell the function is go. But what imports and locals are available inside go's body?

defmodule My do
  defmacro __using__(_opts) do
    quote do
      def go(x) do
        # |
      end
    end
  end
end

is there anything elixir_sense needs in particular you believe is worth adding to the scope?

@josevalim there are other things that are useful. Besides what you mentioned, elixir_sense tries to extract more from the AST.

A lot of the above are not things that the compiler is aware of. They are built by standard library macros, but those macros do not provide the necessary introspection features.

Speaking of the API, I'd see it similarly to @scohen's proposal. I'd change the return type to a list of possible envs and the first argument to some syntax tree (Macro.t() or something else) so we don't need to reparse the document on every invocation.

@spec env_at(ast :: Macro.t(), line :: pos_integer(), column :: pos_integer()) :: [Macro.Env.t()]

but the column parameter is tricky. Consider this simple example. In which column can we say that variable foo is defined?

%{foo: foo} = some_fun()

josevalim commented 8 months ago

A lot of the above are not things that the compiler is aware of. They are built by standard library macros but that macros do not provide necessary introspection features.

Exactly. None of this is compiler behaviour. The best way is probably for the language server tooling to intercept those macros instead of expanding them, and then add it to its own environment.

josevalim commented 8 months ago

Hi everyone, I have created #13361 for a minimal proposal on the basic building blocks.

Outside of that, my current mindset is the following. This issue argues we need two contexts: buffer context and global context. The global context can be built with tracers and reflection, and there are no plans to make it part of Elixir. We are always happy to add more traces and reflections though.
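
As a concrete example of the tracer half (compilation tracers are an existing Elixir mechanism; the ETS table and what gets recorded here are illustrative):

defmodule GlobalContext.Tracer do
  # Called by the compiler for each event; this one records which remote
  # functions each module calls, for later go-to-definition style queries.
  def trace({:remote_function, meta, module, name, arity}, env) do
    :ets.insert(:global_context, {env.module, module, name, arity, meta[:line]})
    :ok
  end

  def trace(_event, _env), do: :ok
end

# Register before compiling, e.g. from a Mix task:
:ets.new(:global_context, [:named_table, :public, :bag])
Code.put_compiler_option(:tracers, [GlobalContext.Tracer])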

The buffer context could be made part of Elixir, but right now it should not be, as doing so would only slow down progress. Instead, we are adding functions so you depend less on Elixir internals. If we make this part of Elixir in the future, we would probably make it so it receives Elixir AST as input, so it works both with Elixir's parser and with external fault-tolerant parsers such as Spitfire. While inclusion in Elixir can be useful, as we would be able to use it in both Livebook and IEx, it is not a high priority at the moment, especially because IDEs require much more information.

Thank you for all of the excellent discussions.

Closing in favor of #13361.