davidhalter opened this issue 6 years ago
Recently I wrote my own source indexer using jedi. I used a SQLite database with four tables: file, name, definition and reference. Indexing the stdlib took about four minutes using multiprocessing on four cores.
It would be great if the index could be exposed to the public api in some way.
Besides finding all usages of a definition, a database could be used to offer auto imports and fast fuzzy auto-complete.
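To make the idea concrete, here is a minimal sketch of what such a SQLite-backed index could look like. The column layout and the helper names are assumptions based only on the four tables named above; the actual indexer script was never posted in this thread, and a real version would feed these tables from jedi's name inspection.

```python
import sqlite3

# Assumed layout for the four tables mentioned above: file, name,
# definition and reference. The exact columns are illustrative.
SCHEMA = """
CREATE TABLE IF NOT EXISTS file (
    id INTEGER PRIMARY KEY,
    path TEXT UNIQUE NOT NULL
);
CREATE TABLE IF NOT EXISTS name (
    id INTEGER PRIMARY KEY,
    value TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS definition (
    id INTEGER PRIMARY KEY,
    name_id INTEGER REFERENCES name(id),
    file_id INTEGER REFERENCES file(id),
    line INTEGER,
    column INTEGER
);
CREATE TABLE IF NOT EXISTS reference (
    id INTEGER PRIMARY KEY,
    definition_id INTEGER REFERENCES definition(id),
    file_id INTEGER REFERENCES file(id),
    line INTEGER,
    column INTEGER
);
"""

def open_index(db_path=":memory:"):
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn

def add_definition(conn, path, name, line, column):
    """Record one definition; in a real indexer the arguments would
    come from jedi's name objects."""
    cur = conn.cursor()
    cur.execute("INSERT OR IGNORE INTO file (path) VALUES (?)", (path,))
    cur.execute("SELECT id FROM file WHERE path = ?", (path,))
    file_id = cur.fetchone()[0]
    cur.execute("INSERT INTO name (value) VALUES (?)", (name,))
    name_id = cur.lastrowid
    cur.execute(
        "INSERT INTO definition (name_id, file_id, line, column) "
        "VALUES (?, ?, ?, ?)",
        (name_id, file_id, line, column),
    )
    conn.commit()
    return cur.lastrowid

def find_definitions(conn, name):
    """Look up all recorded definitions of a name."""
    return conn.execute(
        "SELECT file.path, definition.line, definition.column "
        "FROM definition "
        "JOIN name ON name.id = definition.name_id "
        "JOIN file ON file.id = definition.file_id "
        "WHERE name.value = ?",
        (name,),
    ).fetchall()
```

With tables like these, "find all usages" becomes a single join against `reference`, and a fuzzy auto-import lookup is a `LIKE` query over `name.value`.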
That's actually pretty fast. Did you index all the subfolders (asyncio, multiprocessing, json, etc.)?
Also can you post the script? I wonder if it's "complete".
@hajs I would still be interested :)
Are there any next steps on this issue? Maybe one more friendly ping for @hajs's solution...
This issue will likely not be implemented in Jedi. I'm probably going to try to re-implement Jedi in Rust, to gain further speed and fix this issue. But we're speaking years here for that to be as solid as Jedi.
@davidhalter are you still interested in this idea? I'm currently planning on building out a database of all potential imports using Jedi for the symbol inspection. Would you be interested in the issues I find? If so, what would be the best format to report errors in?
I'm definitely interested in your findings, but as I said above, it's pretty unlikely that Jedi's architecture is going to change a lot. There are a lot of underlying issues. I'm currently rewriting parso in Rust and having a great time (it's not open source yet, though).
@davidhalter very interested in contributing to rust version of parso and Jedi when you open them up.
Will post it here once it's in a good shape. However I want to do a lot of things the right way this time so I'm keeping it private for now.
I have been working on the parser for the last three months, but I unfortunately don't have a lot of time for it.
Thank you for working on this! In the meantime, would it be appropriate to have `get_signatures` cached the same way as `_get_docstring_signature` is being cached? (as in https://github.com/davidhalter/jedi/commit/bf446f2729c53eb54edab55ae12f4dba252f8bda)
I profiled some language servers using jedi and it appears that the `get_signatures` call is the major bottleneck. I understand that for an improvement I could patch those to use `_get_docstring_signature`, but it includes type annotations and is part of a private interface, so it is not ideal. Would adding `get_cached_signature` or `get_cached_signatures` be in scope, or should we just wait for the upcoming database index?
I profiled some language servers using jedi and it appears that get_signatures call is the major bottleneck
What did you profile? Can you share the results?
This is a tricky one.
Basically it's definitely not possible to do this in a general way, because the Jedi caches need to be invalidated somehow if a library changes. This is exactly what this issue is about.
However, I thought that we could maybe use the cache in Jedi just if `is_big_annoying_library` is true (that would probably help) and just cache signatures in those cases. But even that is probably a bad idea. Jedi is not built to deal with multiple `inference_state` instances.
I think I would just argue that `get_signatures` is not built to be used for every completion. It's something you should use for maybe 10 results, or ideally only for one.
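That advice translates roughly into the pattern below: compute the completion list first, then resolve full signatures for only the handful of results actually shown. The function and the `get_signature` callable are illustrative stand-ins, not jedi API; the cutoff of 10 is the number suggested above.

```python
SIGNATURE_LIMIT = 10  # only resolve signatures for results shown to the user

def label_completions(completions, get_signature):
    """Attach signatures to at most SIGNATURE_LIMIT completions.

    `completions` is any iterable of completion names; `get_signature`
    stands in for the expensive per-completion signature call. Entries
    past the limit get None and can be resolved lazily on demand.
    """
    labelled = []
    for i, name in enumerate(completions):
        if i < SIGNATURE_LIMIT:
            labelled.append((name, get_signature(name)))
        else:
            labelled.append((name, None))
    return labelled
```

This keeps the expensive work proportional to what is visible instead of to the size of the module being completed.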
Thank you for getting back to me. I worked around this by deferring the call to `get_signatures()` to a separate thread and caching: It was tricky, especially with jedi being not exactly thread-safe, but adding a lock solves the issue. I decided to use a custom cache key instead of the default hash implementation (to avoid inclusion of `inference_state`) and to re-schedule a refresh at every user action rather than guess when to invalidate.
Your reply will certainly help to plan for the future, and potentially to upstream such an approach. I got down to well under one second for `numpy`. It might not be perfect, but possibly a good proof of concept of how one could approach this.
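The deferral-plus-lock workaround described above might look something like the following sketch. The class and cache-key shape are assumptions (the actual language-server patch was not posted); `fetch` stands in for a wrapper around jedi's `get_signatures()`, and the single lock reflects the observation that jedi is not exactly thread-safe.

```python
import threading

class SignatureCache:
    """Defer expensive signature lookups to a background thread.

    `fetch` is any callable taking a cache key, e.g. a closure around a
    jedi get_signatures() call. All fetches are serialized through one
    lock because jedi is not thread-safe.
    """

    def __init__(self, fetch):
        self._fetch = fetch
        self._cache = {}
        self._lock = threading.Lock()

    def get(self, key):
        """Return the cached result for `key`, or None if not ready yet."""
        return self._cache.get(key)

    def schedule(self, key):
        """Refresh `key` in the background; called on every user action
        instead of trying to guess when to invalidate. Returns the
        worker thread so callers can join it if they need to."""
        def worker():
            with self._lock:
                self._cache[key] = self._fetch(key)
        thread = threading.Thread(target=worker, daemon=True)
        thread.start()
        return thread
```

A key like `(path, line, column)` avoids hashing anything that drags in `inference_state`, matching the custom-cache-key choice described above.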
Note that with such an approach you're also losing some of Jedi's correctness. I would really recommend using something like https://github.com/davidhalter/jedi/blob/master/jedi/inference/helpers.py#L194-L202 and only applying caching to those libraries. In general almost all other libraries are not an issue, because they do not export a thousand functions in one module. The culprits are always `pandas`, `numpy`, `tensorflow` and `matplotlib`.
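A sketch of that gating idea: since `is_big_annoying_library` is a private jedi helper, a stand-alone check over the module's dotted name is used here instead. The culprit set comes straight from the comment above; everything else is names chosen for illustration.

```python
# Only cache signatures for modules known to export huge APIs; everything
# else stays uncached so normal invalidation behaviour is preserved.
BIG_LIBRARIES = {"pandas", "numpy", "tensorflow", "matplotlib"}

_signature_cache = {}

def module_root(full_name):
    """'numpy.core.multiarray' -> 'numpy'."""
    return full_name.split(".")[0] if full_name else ""

def signatures_for(full_name, fetch):
    """Fetch signatures, caching only for the big libraries.

    `fetch` stands in for the expensive signature computation.
    """
    if module_root(full_name) not in BIG_LIBRARIES:
        return fetch()  # cheap enough elsewhere; always recompute
    if full_name not in _signature_cache:
        _signature_cache[full_name] = fetch()
    return _signature_cache[full_name]
```

The trade-off is exactly the one noted above: stale results are possible for the cached libraries, but those rarely change between releases, while correctness is fully preserved for everything else.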
Thank you. I gave up on the asynchronous approach, and followed your advice to treat the likes of `numpy` differently.
It's something you should use for maybe 10 results or ideally only for one.
I understand. I will try to nudge the popular language servers in this direction (but it might take time, as it is only possible with the recent LSP 3.16 and many clients believe that the label, which is what the signature is being used for, should be available from the beginning). Nonetheless, I will be very happy to see any performance improvements to `get_signatures()`.
@davidhalter I wonder what's the difference between something like https://github.com/tree-sitter/tree-sitter-python and parso? Can't you leverage tree-sitter somehow? It seems to be written in C and has fairly decent Rust bindings.
This sounds like a very interesting task, I'm not sure what the etiquette is in regards to helping out but I would be interested in contributing to the rust re-implementation of Jedi :+1:
For a lot of things (especially usages) jedi's completely lazy approach is not good enough. It is probably better to use a database index cache. The index will basically be a graph that saves all the type inference findings.
This is just an issue for discussion and collection of possible ideas.