Is this project active? I have some ideas

elbywan / crystalline

A Language Server Protocol implementation for Crystal. 🔮

MIT License

448 stars 24 forks source link

Is this project active? I have some ideas #53

Open asterite opened 1 year ago

asterite commented 1 year ago

Hi!

First of all, thank you so much for this project!

I have some ideas of how to provide autocompletion that wouldn't take that much time, although it might not be 100% accurate. I thought about creating a separate project but maybe it's better if I contribute to this one.

But before sharing my ideas, I'd like to understand a few things:

Is this project active? Would it accept contributions?
Is there an overview of how things work now?

After that, I could share some ideas I have... if they are very different from how things are working now :-)

Thank you!

elbywan commented 1 year ago

Hey @asterite,

First of all, thank you so much for this project!

Thank you for the kind words (and for creating the crystal language)! ❤️

Is this project active? Would it accept contributions?

It is mostly in life support mode but definitely in working state, I am using crystalline on all my crystal projects.

I am a bit dissatisfied with the time the compiler takes to perform the semantic analysis pass on a large codebase even though it still works quite fine for small to medium scale, and I did not find a way to solve that so I stopped working on the internals for the moment.

Is there an overview of how things work now?

Not right now even though the code is fully commented, but I can make something up to explain the big picture for sure. I just can't today (I'm sharing my time between the family, my work and another soon to be open sourced project), but I can find some time tomorrow or later this week if it's fine with you.

After that, I could share some ideas I have... if they are very different from how things are working now :-)

Looking forward to that! 💡

asterite commented 1 year ago

I just can't today (I'm sharing my time between the family, my work and another soon to be open sourced project), but I can find some time tomorrow or later this week if it's fine with you.

Oh, take your time! I'm in the same situation as you. I wanted to start working on this slowly, whenever I have time, no rush at all :-)

I am a bit dissatisfied with the time the compiler takes to perform the semantic analysis pass on a large codebase

Yup! Given that it's going to be hard to change that, I think the idea would be to avoid doing a full semantic pass. I see that Workspace#compile has a top_level = false argument. In my mind top level should always be used, as that gives you the program's structure (classes, modules, methods and their types, etc.) Full semantic analysis gives you method instantiations and types in the top-level program (like a = 1 in the top level) but I think that should never be necessary for autocompletion.

Looking forward to that! 💡

So here's the rough idea (most of these steps probably already work like that in crystalline):

The top priority is to make it work fast, sacrificing accuracy if necessary. In my mind it's better to have something that works fast but misses an autocompletion from time to time, that something that works all the time but has you waiting seconds to show up, and consumes a lot of memory.
The server keeps the types and methods structure for a given project, cached somewhere.
When there are file changes, or typing on a file, the server will update that information, but lazily (background thread or something like that.) The idea is that changes are usually small, and come slow, so if we have a slightly outdated model it should still be good for autocompletion, even though not 100% accurate.
We only rely on top-level semantic, never on full semantic analysis
Specifically for autocompletion we implement a new, lighter semantic analysis pass, that doesn't reuse the compiler's one. The idea here is to avoid again doing full semantic analysis because it's slow.

So how does the semantic analysis work?

Let's start with an example.

Say you have this:

def foo(x : Int32)
  x.| # `|` is the cursor's position
end

Here we'll notice that x has a type restriction of Int32. We solve that (we have that in our cached model) and provide the methods that Int32 has. Easy!

But of course this is a very simple example.

What about this?

def foo(x)
  x.|
end

Well, here we can't know anything about x. Well, we could know if we knew what instantiations we have for foo... but that requires a full semantic analysis, so it's a no-no.

What can we do? Just assume we don't know anything about x. x could be any type. And because it can be any type, we provide a list of methods for every type out there. Well, maybe those would be too many methods, so we could keep maybe the first 100 or something. As the user types the list of methods will be reduced. Another alternative could be to show only the methods that exist in Object. This is up for debate.

Let's say the user continued typing:

def foo(x)
  y = x.reverse
  y.|
end

Now what? Well, we'll assume that x has a reverse method that takes no arguments. We check which types have these methods. Let's say we found that Array and String have them. We check that the return type of Array#reverse is Array(T) so we can assume y could be that type. We also check that String#reverse returns String (it has that return type) and so y could also be a String.

Then the autocompletions for y.| are all of those for String and Array.

Of course, if in another part of that method there's a call to x.bytesize then we'll know for sure that x must be a String, never an Array, so we can provide better autocompletion.

Here we got lucky because both Array#reverse and String#reverse have return type annotations. What if they don't? Well, in that case we could apply some heuristics... the same ones Crystal uses.

For instance, the definition of Array#reverse is this:

  def reverse : Array(T)
    Array(T).new(size) { |i| @buffer[size - i - 1] }
  end

We can see that the last expression returned is Array(T).new so without doing any semantic analysis we could guess that it returns Array(T). For String it's similar, except that String.new is done inside an if branch (but we could guess that the union type is returned.)

Anything more complex and we just abort and assume we don't know anything about the returned type. This of course is less accurate that doing a full semantic analysis, but at least it can never be slow (these heuristics are very fast.)

Also, we could say that crystalline works better if you type your methods (both input and output types.) That's also helpful for documentation, for readers of the code, etc. So in the end crystalline could also encourage adding more explicit types to the code.

I think the only slowness could come from running top-level which might rely on some macro runs, etc. But that slowness would only appear the first time the entire project is loaded. All subsequent model updates could happen lazily in the background (and we use the existing slightly outdated model in the meantime.)

What do you think about this approach?

elbywan commented 1 year ago

@asterite

What do you think about this approach?

I think that using it would be a tremendous improvement!

The top priority is to make it work fast, sacrificing accuracy if necessary. In my mind it's better to have something that works fast but misses an autocompletion from time to time, that something that works all the time but has you waiting seconds to show up, and consumes a lot of memory.

I fully agree, but we must be aware that since it will be decoupled from the Crystal compiler it will need to be maintained apart and that some diagnostics crystalline reports as of today will slip through with the new way of doing things.

Also, we could say that crystalline works better if you type your methods (both input and output types.) That's also helpful for documentation, for readers of the code, etc. So in the end crystalline could also encourage adding more explicit types to the code.

Totally 👍.

Specifically for autocompletion we implement a new, lighter semantic analysis pass, that doesn't reuse the compiler's one. The idea here is to avoid again doing full semantic analysis because it's slow.

Autcompletion is one thing, but it would also be very important to implement a (limited) form of typechecking to be able to provide code diagnostics.

I think the changes can be integrated quite easily into the existing codebase, but more on that later. First as promised I will try to depict a bit how crystalline works as of today, focusing on the analysis part.

Today

Crystalline is basically a Language Server, a loop waiting for notifications and messages from a client and communicating through STDIN / STDOUT. A good chunk of the Language Server Protocol Specification has been implemented under the /lsp folder.

The server delegates actions to a controller instance which performs an action depending on whatever the client requests. Some mutexes are used to synchronize some operations to prevent for instance saving a file while formatting it, or launching multiple compilations at the same time.

When a file is opened, the LSP stores it in memory which makes it possible to provide diagnostics without having to save the file. The memory representation gets updated as the user makes changes in the editor.

Crystalline also tries to determine what files are part of the workspace by crawling the entry point using a top level pass and checking the require calls. Whenever these files are modified the entry point will be used instead of the file itself just like the user would do when running crystal build [entry point].

So with that said, I will try to focus on the analysis part now.

Here are some (tiny) additions (monkey patches) I made to the compiler:

file_overrides to be able to work on in-memory files instead of reading from disk
fail_slow_compile to prevent raising whenever the compiler encounters an error and return a partially typed AST instead
some other stuff that I don't remember right now

Here is the method used to prepare the analysis on a workspace or a file. It is called at startup, whenever a file is saved or updated and when hovering something.

The compilation output (Crystal::Program) is cached and the goal of the Workspace.compile method is to wrap the actual compiler call and check if the AST is cached and not invalidated (as well as some other minor things like reporting to the client that a compilation is pending for instance and reply with the diagnostics)

Analysis.compile is the underlying method calling the compiler in a dedicated Thread in order to prevent blocking the main server thread, and this is the reason why crystalline needs -Dpreview_mt.

For hover, completion and go to definition requests additional processing is done, Visitors crawl the AST and format the infos to the client.

Now what?

So the good thing is that I think we can reuse most of the existing crystalline logic:

caching the top-level compiler output should be similar (Crystal::Program)
the compiler is already spawned in a background dedicated thread (even though we would need to reply without waiting for it to complete after the first time)
the LSP part should roughly stay the same

I can make a dedicated branch and refactor the code to follow the logic you posted. Roughly the changes would be to:

use top level instead of full semantic analysis
make the analysis non-blocking
remove the visitors and everything relying on the current compiler ast

And then fill the gaps later on.

I'll try to work on this before the end of the week if it's fine with you to provide the branch and the groundwork 😄.

Then we could iterate on the rest, but I'm afraid I would not be very helpful on writing the parser/typer since you have way more experience on this and it has been ages (12+ years) since I've worked on something similar.

It's been 1 hour since I started writing I'm running out of time for now, but please do tell me if you need anything from my end of if something is not clear.

asterite commented 1 year ago

Warning: a lot of text ahead. But before you start reading it: please don't start working on any of this yet! 🙏 The reasons are:

It's not totally clear what we want to do, and how we want to do it
You told me you don't have much spare time so please, please, don't sacrifice that other time for this
Also my intention wasn't to throw some ideas and have you or others implement it. I want to get involved :-)

I pushed a branch here: https://github.com/asterite/crystalline/tree/lightweight

There's nothing much in that branch yet. It's just me experimenting a bit with the code, exploring some ideas. More on this later.

I think that using it would be a tremendous improvement!

That's great to hear! 🙌

I fully agree, but we must be aware that since it will be decoupled from the Crystal compiler it will need to be maintained apart and that some diagnostics crystalline reports as of today will slip through with the new way of doing things.

I've been thinking about this and I realized that we could actually reuse most of the semantic phase. There's a type, MainVisitor, that is used to type the "main" function of the program, but also to type methods with concrete type information. We could reuse that type and when a call needs to be solved instead of doing full semantic analysis on that call we can do the lightweight logic. That way type filtering (when you do if var) works the same way as in the compiler.

That said, if we aren't using the same algorithm as the compiler, but a weaker one, where types might not be fully correct or fully known, then reporting an error might sometimes be misleading.

So another idea could be to have this lightweight mode just be an option you can choose in the settings. Something like "Type-checking mode":

[ ] Heavy: will always be correct, but it can be slow (recommended for small and medium projects)
[ ] Lightweight: less accurate but always fast (recommended for large projects)

That way we don't have to throw away all the wonderful work you've done, which is useful for small and medium projects, and even in large projects people could choose the heavy mode if they use autocompletion sparsely and when they do they'd want to know they get the right thing.

Another idea: time how long it takes for the full semantic analysis. If we notice it's above some threshold we automatically switch to lightweight mode. Maybe that could be another mode ("Let Crystalline decide the mode based on the size of the project".)

A good chunk of the Language Server Protocol Specification has been implemented under the /lsp folder.

What do you think about extracting this to a shard? When I had the idea of implementing a different LSP for Crystal just the idea of implementing the protocol made me lazy about doing it because it's work unrelated to the main logic of providing autocompletion, etc.

There's also https://github.com/jemc/crystal-lsp so there's always some duplicate effort.

An LSP shard will help in this way:

The shard will only be about that
Anyone could use it, not only for a Crystal LSP but for other languages too (in the case of jemc it's used for the mare language)
Because of the above point contributions are more likely to happen, improving the quality of the shard
It will remove a lot of code from crystalline, making it more accessible for contributions ("Oh, I don't need to look at these files because they aren't related to the core crystalline functionality")

If that's fine with you I could extract such shard, maybe put it in https://github.com/crystal-community and then send a PR to use that in crystalline.

BTW, I noticed your LSP protocol implementation has excellent comments that helped me understand some things without having to read about LSP. Thank you! 💙 (that's why I think the LSP shard should be based on your code)

When a file is opened, the LSP stores it in memory which makes it possible to provide diagnostics without having to save the file. The memory representation gets updated as the user makes changes in the editor.

I saw this! I also saw the override in SemanticVisitor. That's great! I thought about how to do this before looking into this project and it was so cool to see you've already figured it out 😄

Crystalline also tries to determine what files are part of the workspace by crawling the entry point using a top level pass and checking the require calls.

This is great too!

fail_slow_compile to prevent raising whenever the compiler encounters an error and return a partially typed AST instead

I have some ideas about this which I'll detail later.

The compilation output (Crystal::Program) is cached

So it seems that one thing I was proposing was already implemented 🙌

Analysis.compile is the underlying method calling the compiler in a dedicated Thread

This is excellent. I had no idea about how to do this. Maybe launching a separate Thread isn't ideal (I think Thread is kind of private API) but with this I mean that Crystal should provide a safe way to do it, it's just not there yet.

So the good thing is that I think we can reuse most of the existing crystalline logic

It seems so!

I can make a dedicated branch and refactor the code to follow the logic you posted

Oh, please don't (for the reasons I mentioned in the beginning.)

I'll try to work on this before the end of the week if it's fine with you

What do you think if we start by creating a branch where we can push things together and experiment? No need to use my branch's code as I removed a lot of your code and I'd actually want to keep the two modes.

In that branch we can start by:

Adding an option to use the lightweight mode in Settings
Switching to lightweight mode, which for now doesn't provide any autocompletion at all (the autocompletion code I have in my branch is useless, I just wanted to see something show up in VSCode)

Then we can iterate on the rest, as you say (but not yet on the things you said.)

I always like to do things in small steps, and these were my ideas:

First let's make the core logic work for a single file, without taking a look at other files. We'll assume that the user won't be defining new types, so only the standard library types will be used. For this we won't need to do a top-level semantic, but we can already do a lot (do the lightweight semantic analysis, provide autocompletion for instance methods, class methods, local variables, constants, types, etc.)
For the above point I think it will be very important to start writing tests. The autocompletion tests will be "given this input file, a cursor position and a program, check the expected autocompletions."
Another thing that would be kind of important to have is a fault-tolerant parser. I saw some code that if you have a dot and the cursor is right after it, remove the rest of the line (and I think the dot) so that it compiles. This is good but I think some other cases will fail, like for example if you have if something.| but there's no end for that if yet. I started playing with the idea of a Fault tolerant parser (specs) but I think that alone won't be enough. So another idea would be to first rewrite a source code so that it's valid, adding missing ends or curly braces, maybe based on indentation. Yesterday I also found this proposal for Ruby which surprisingly lists some ideas that I also had about this.
Once we have the framework for the above (being able to easily extend the logic while also easily writing tests for that) we can start making the logic work with the entry point, doing a top-level semantic analysis for everything, using the cached results, etc.

What do you think?

In the meantime I'll continue experimenting in my branch with the ideas above. So, again, please don't start working on anything 🙏 (or, if you want, you can create a branch for this and make me a contributor 😊, unless you prefer that I work on my existing branch and I can make you a contributor there, either is fine with me)

asterite commented 1 year ago

Actually I'm thinking that some of this can be done in completely separate steps

Extracting the LSP protocol-only code to a shard
Allowing for broken parse inputs.

If it's okay with you I can work on those two things. I think once we have those in place it'll be easier to work on the lightweight semantic analysis.

I'll open separate issues for those to keep the discussion focused, and also to be able to later link PRs to those issues. For the second point in particular I'll post some snippets where autocompletion (or everything else) should work but currently doesn't.

elbywan commented 1 year ago

What do you think?

Adding a "lightweight" mode seems like the best solution, I'm fully aligned with the plan 👍.

What do you think if we start by creating a branch where we can push things together and experiment? No need to use my branch's code as I removed a lot of your code and I'd actually want to keep the two modes.

Let's do that (plus this is exactly what I had in mind). I'll create a v2 branch we can both use.

If it's okay with you I can work on those two things. I think once we have those in place it'll be easier to work on the lightweight semantic analysis.

I'll discuss these topics on the issues you created directly.

Thanks again ❤️!

zw963 commented 1 year ago

```crystal
def foo(x)
  x.|
end
Well, here we can't know anything about x. Well, we could know if we knew what instantiations we have for foo... but that requires a full semantic analysis, so it's a no-no. What can we do? Just assume we don't know anything about x. x could be any type. And because it can be any type, we provide a list of methods for every type out there. Well, maybe those would be too many methods, so we could keep maybe the first 100 or something. As the user types the list of methods will be reduced. Another alternative could be to show only the methods that exist in Object. This is up for debate.

Still no read all discussion, but, i prefer if the type of x is unpredictable, we should not return all methods or methods in Object, instead, we can return only one item, Every, or Dynamic, or whatever, it just tell the user, I can't deduce it, clearly.

Other, it will no different with o = Object.new; o.|, in this case, o is a Object instance.

elbywan commented 1 year ago

Still no read all discussion, but, i prefer if the type of x is unpredictable, we should not return all methods or methods in Object, instead, we can return only one item, Every, or Dynamic, or whatever, it just tell the user, I can't deduce it, clearly.

I strongly disagree (even though I don't think this is the right time to debate about this change anyway).

In the example x could be anything since its type cannot be narrowed down and it seems completely fine to me to provide every method from all known types.

Other, it will no different with o = Object.new; o.|, in this case, o is a Object instance.

It will since in o = Object.new; o.|, the type can be inferred as Object and the autocompletion would only output Object methods.

zw963 commented 1 year ago

Any update for this? can we introduce the current process in here or on forum?

elbywan commented 1 year ago

As far as I know, the plan is clearly expressed in the above comments of this thread.

This shard has no link with the crystal forum or the crystal team, I am the only maintainer and I clearly stated that I was focusing on other things right now. I believe that @asterite also has other concerns for the moment.

There are sub-issues linked to this one that track the progress and that will get updated if we resume the work.

On a side note, I understand that some people may express some impatience/frustration but I would rather like not being pinged when there is nothing meaningful to say.

Please also note that:

I am maintaining it on my (limited) free time
I am distributing it to the community free of charge
This project is open sourced and MIT licensed, anyone can fork / make a PR to contribute
Personally, I am happy with the current crystalline state myself and any improvements/changes are only for the benefits of others

zw963 commented 1 year ago

I care about Crystal, so I care about the progress of Crystal on LSP, because it is very important for newcomers (I don't need it at all), so I ask for this, but, i feel some grumpy smell from the above answer, small communities bring less contributions, and because missing some feature, fewer people use them, a vicious circle.

elbywan commented 1 year ago

I care about Crystal, so I care about the progress of Crystal on LSP, because it is very important for newcomers (I don't need it at all), so I ask for this

I also care a lot about Crystal and I think I made my share of the work by providing quite a bunch of shards to the community. Because I see value in concrete actions like coding and open sourcing actual work, not meaningless discussions.

i feel some grumpy smell from the above answer, small communities bring less contributions, and because missing some feature, fewer people use them, a vicious circle.

I fail to see where the grumpiness is. I stated some facts that I felt like were important to mention in this context.

zw963 commented 1 year ago

I fail to see where the grumpiness is. I stated some facts that I felt like were important to mention in this context.

Sorry for use wrong word for my thought. i just thought you might not want someone to ask about the progress, and all works is for other people, if someone want those feature, he can always contribute on it.

Anyway, i get my answer, features in this thread will not be implemented soon. thanks.

jweckman commented 1 year ago

From a usability standpoint i would make it default to some kind of dynamic heavy/light mode based on the performance of the system in use. I've had to mess around a lot with both pyright and jedi on the python side of things and i'm not happy because pyright has more functionality but i can't use it at work because the code base is so large that it just falls on its face and i have to use jedi.

I think many of us also like to develop remotely on a home server with only 1/10 performance of our main rig, which can mean that we want our same medium sized project to use heavy mode when on a beefy setup and light mode when using a potato. It would be nice if this was abstracted away from new users and available for hard-coding for those that look for it.

Just my 2 cents.

plambert commented 1 year ago

In the case of something like:

def foo(x)
  x.|
end

How useful/useless would it be to consider other variables in the same class with the same name? I find that they typically have the same type, but that is anecdotal. However if there's enough info in bar(x) to know that x is Int32|Symbol|IO there, then that could be really useful.

Also, would a way to pass "hints" be of value? For example:

def foo(x) # crystalline-hint: x : Int32 | String | Nil
  x.|
end

That would tell crystalline that if it can't figure out what x is, it should use Int32 | String | Nil, which would be a help if it is actually something else but the developer doesn't want to explicitly specify the type because "something else" isn't really firm yet. And the hint would be ignored if crystalline knows better... like if a background task determines clearly that it's only ever called with an Int32, so String and Nil are dropped from the union. In a large project, a hint like this could save a lot of time, without introducing actual code changes to accommodate autocompletion…

As I'm neither an expert nor even a very well-informed observer, these are just ideas, and I'm equally excited to learn why that they are terrible ideas as I would be if they were good ones.

dsisnero commented 1 year ago

Is it possible to adopt some of the ideas from rust-analyzer - the extracted https://github.com/salsa-rs/salsa . It is what they use for on demand , incremental updates

AlexDev404 commented 1 month ago

I care about Crystal, so I care about the progress of Crystal on LSP, because it is very important for newcomers (I don't need it at all), so I ask for this, but, i feel some grumpy smell from the above answer, small communities bring less contributions, and because missing some feature, fewer people use them, a vicious circle.

True, since I was taking a look at Crystal, and seeing that the language support is just very limited on editors along with no official LSP draws me away from really using it at all as it's such an integral feature for me.