davidhalter / jedi

Awesome autocompletion, static analysis and refactoring library for python
http://jedi.readthedocs.io
Other
5.79k stars 508 forks source link

Create a Database Index for Jedi #1059

Open davidhalter opened 6 years ago

davidhalter commented 6 years ago

For a lot of things (especially usages) jedi's completely lazy approach is not good enough. It is probably better to use a database index cache. The index will basically be a graph that saves all the type inference findings.

This is just an issue for discussion and collection of possible ideas.

hajs commented 6 years ago

Recently I wrote my own source indexer using jedi. I used a sqlite databases with four tables: file, name, definition and reference. Indexing stdlib took about four minutes using multiprocessing on four cores,

It would be great if the index could be exposed to the public api in some way.
Besides finding all usages of a definition, a database could be used to offer auto imports and fast fuzzy auto-complete.

davidhalter commented 6 years ago

That's actually pretty fast. Did you index all the subfolders (asyncio, multiprocessing, json, etc)?

Also can you post the script? I wonder if it's "complete".

davidhalter commented 5 years ago

@hajs I would still be interested :)

jakubzitny commented 4 years ago

Are there any next steps on this issue? Maybe one more friendly ping for @hajs's solution..

davidhalter commented 4 years ago

This issue will likely not be implemented in Jedi. I'm probably going to try to re-implement Jedi in Rust, to gain further speed and fix this issue. But we're speaking years here for that to be as solid as Jedi.

CJ-Wright commented 3 years ago

@davidhalter are you still interested in this idea? I'm currently planning on building out a database of all potential imports using Jedi for the symbol inspection. Would you be interested in the issues I find? If so, what would be the best format to report errors in?

davidhalter commented 3 years ago

@davidhalter are you still interested in this idea? I'm currently planning on building out a database of all potential imports using Jedi for the symbol inspection. Would you be interested in the issues I find? If so, what would be the best format to report errors in?

I'm definitely interested in your findings, but as I said above, it's pretty unlikely that Jedi's architecture is going to change a lot. There are a lot of underlying issues. I'm currently rewriting parso in Rust and having a great time (it's not open source yet, though).

AlanSwenson commented 3 years ago

@davidhalter very interested in contributing to rust version of parso and Jedi when you open them up.

davidhalter commented 3 years ago

Will post it here once it's in a good shape. However I want to do a lot of things the right way this time so I'm keeping it private for now.

I have been working on the parser for the last three months, but I unfortunately don't have a lot of time for it.

krassowski commented 3 years ago

Thank you for working on this! In the meantime, would it be appropriate to have get_signatures cached the same way as _get_docstring_signature is being cached? (as in https://github.com/davidhalter/jedi/commit/bf446f2729c53eb54edab55ae12f4dba252f8bda)

I profiled some language servers using jedi and it appears that get_signatures call is the major bottleneck. I understand that for an improvement I could patch those to use _get_docstring_signature, but it includes type annotations and is a part of a private interface so it is not ideal. Would adding get_cached_signature or get_cached_signatures be in scope, or should we just wait for the upcoming database index?

davidhalter commented 3 years ago

I profiled some language servers using jedi and it appears that get_signatures call is the major bottleneck

What did you profile? Can you share the results?

krassowski commented 3 years ago

Please see https://github.com/palantir/python-language-server/issues/823#issuecomment-774695123

davidhalter commented 3 years ago

This is a tricky one.

Basically it's definitely not possible to do this in a general way, because the Jedi caches need to be invalidated somehow if a library changes. This is exactly what this issue is about.

However, I thought that we could maybe use the cache just if is_big_annoying_library is true (that would probably help) in Jedi and just cache signatures in those cases. But even that is probably a bad idea. Jedi is not built to deal with multiple inference_state instances.

I think I would just argue that get_signatures is not built to be used for every completion. It's something you should use for maybe 10 results or ideally only for one.

krassowski commented 3 years ago

Thank you for getting back to me. I worked around this deferring the call to get_signatures() by calling it in a separate thread and caching:

https://github.com/palantir/python-language-server/compare/develop...krassowski:feature/asynchronous/labels-cache

It was tricky, especially with the jedi being not exactly thread-safe but adding a lock solves the issue. I decided to use a custom cache key instead of the default hash implementation (to avoid inclusion of inference_state) and to re-schedule refresh at every user action rather than guess when to invalidate.

Your replay will certainly help to plan for the future, and potentially to upstream such an approach. I got down to <<1 second for numpy. It might not be perfect, but possible a good proof of concept of how one could approach this.

davidhalter commented 3 years ago

Note that with such an approach you're also losing some of Jedi's correctness. I would really recommend to use something like https://github.com/davidhalter/jedi/blob/master/jedi/inference/helpers.py#L194-L202 and only apply caching to those libraries.

In general almost all other libraries are not an issue, because they do not export a thousand functions in one module. The culprits are always pandas, numpy, tensorflow and matplotlib.

krassowski commented 3 years ago

Thank you. I gave up on the asynchronous approach, and followed your advice to treat the likes of numpy differently.

It's something you should use for maybe 10 results or ideally only for one.

I understand. I will try to nudge the popular language servers in this direction (but it might take time as it is only possible with recent LSP 3.16 and many clients believe that the label - which is what the signature is being used for - should be available from the beginning). Nonetheless, I will be very happy to see any performance improvements to get_signatures().

vikigenius commented 3 years ago

@davidhalter I wonder what's the difference between something like https://github.com/tree-sitter/tree-sitter-python and parso ? Can't you leverage tree-sitter somehow ? It seems to be written in C and has fairly decent Rust bindings.

iustin94 commented 3 years ago

This sounds like a very interesting task, I'm not sure what the etiquette is in regards to helping out but I would be interested in contributing to the rust re-implementation of Jedi :+1: