Provide source Language Server Protocol source group

studgeek commented 5 years ago

Placeholder for an idea that has been discussed in Slack a couple of times. LSP would be a very useful way for Sourcetrail to support lots of languages via their LSP implementations.

The work could be done on top of https://github.com/CoatiSoftware/SourcetrailDB. There are some open questions about whether LSP provides enough information, specifically whether it provides project-level info (which files there are to index) and sufficiently detailed type and reference information.

studgeek commented 5 years ago

Note: the work in https://github.com/CoatiSoftware/SourcetrailDB/issues/2 is discussing using an LSP backend.

pidgeon777 commented 5 years ago

I'm interested also! Any news?

LouisStAmour commented 4 years ago

I don't think it's possible:

This example illustrates how the protocol communicates with the language server at the level of document references (URIs) and document positions. These data types are programming language neutral and apply to all programming languages. The data types are not at the level of a programming language domain model which would usually provide abstract syntax trees and compiler symbols (for example, resolved types, namespaces, …). The fact, that the data types are simple and programming language neutral simplifies the protocol significantly. It is much simpler to standardize a text document URI or a cursor position compared with standardizing an abstract syntax tree and compiler symbols across different programming languages.

From https://microsoft.github.io/language-server-protocol/overview

What might give useful information would be https://microsoft.github.io/language-server-protocol/specifications/specification-3-14/#textDocument_foldingRange (folding ranges) plus https://microsoft.github.io/language-server-protocol/specifications/specification-3-14/#textDocument_codeLens (code lens) plus a scan of https://microsoft.github.io/language-server-protocol/specifications/specification-3-14/#textDocument_hover (hover calls), but even the hover text is not actually machine-readable. So you're back to the beginning: the language server exposes UI conventions that an IDE finds useful, but the IDE only superficially understands the code over the LSP. It doesn't really understand it the way, say, a JetBrains IDE semantically scans and understands projects. The benefit is that it's much faster for the IDE; the drawback is that the IDE works at a line/column offset for most queries. As a result, a language server doesn't actually need a complete source graph in memory at all times. Potentially it can get by with just a quick scan for documented symbols and, at hover lookup, look at the nearest symbol under the cursor and pull up or compute a likely match at that time.
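To make the "document URI plus position" point concrete, here is a minimal sketch of what a hover query looks like on the wire. It only builds the message and does not talk to a real server; the file URI and the helper names `frame_request` and `hover_request` are made up for illustration.

```python
import json


def frame_request(request_id, method, params):
    """Encode a JSON-RPC 2.0 request with the Content-Length header LSP uses."""
    body = json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": method,
        "params": params,
    }).encode("utf-8")
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body


def hover_request(request_id, uri, line, character):
    # The query is keyed by a document URI and a zero-based line/character
    # position. No symbol identifier or AST node is involved.
    return frame_request(request_id, "textDocument/hover", {
        "textDocument": {"uri": uri},
        "position": {"line": line, "character": character},
    })


# Hypothetical file and cursor position.
message = hover_request(1, "file:///project/src/main.cpp", 4, 2)
print(message.decode("utf-8"))
```

Everything an indexer wants to learn has to be asked one position at a time like this, which is the limitation described above.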

Now, if there are standard LSP implementation patterns or libraries, it might be worth trying to load a language server's source code to take advantage of any possibly standard internal data models. But you're operating outside the LSP at that point.

slimsag commented 4 years ago

(disclaimer: I work at Sourcegraph, but I am posting this just in my free time and have no stake in any of this -- I just think Sourcetrail is a cool project -- not a corporate shill 😄)

There is ongoing work to expose call hierarchy information via LSP: https://github.com/microsoft/language-server-protocol/issues/468. That has also just been implemented in the node language server and typescript language server, and it is likely other language servers will follow suit soon. Would this information be sufficient for what Sourcetrail provides?

Also, there is LSIF (you can read more about it here: https://lsif.dev/ or watch this short talk from my coworker), which we @ Sourcegraph are maintaining several indexers for. The idea here is that as part of your CI you run an indexer which runs the language server, gathers all the relevant data into the LSIF format, and then uploads it somewhere for later realtime use (without invoking a language server, which can be slow). One such location for these uploads will be Sourcegraph.com and I am sure there will be / are others (but I am not aware of any currently).

If the call hierarchy information would be sufficient for Sourcetrail, then perhaps Sourcetrail could just consume LSIF data dumps directly.

If users start uploading LSIF data for all open source repositories to Sourcegraph.com (which is our hope as that lets us provide precise and fast jump-to-definition / hovers / find references) then it would also be easy for you to fetch those dumps via Sourcegraph's GraphQL API here (currently experimental) and you could have the overall experience be something like:

1. Add an LSIF indexer to your CI pipeline / travis config / etc.
2. (optional) Upload the LSIF data to Sourcegraph.com
3. Launch Sourcetrail, specify an LSIF dump file path or enter a github.com URL, and it'll fetch from Sourcegraph.com's uploaded ones and it Just Works™
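As a rough idea of what consuming a dump from the steps above might involve, here is a minimal sketch that reads a line-delimited LSIF dump and resolves each range to the range(s) of its definition. The sample dump is hand-written and hypothetical, the function names are made up, and a real dump contains many more vertex and edge kinds than the few handled here.

```python
import json

# Hypothetical hand-written dump: one definition range (id 2) and one
# reference range (id 3) that share a resultSet (id 4).
SAMPLE_DUMP = """
{"id":1,"type":"vertex","label":"document","uri":"file:///src/a.py","languageId":"python"}
{"id":2,"type":"vertex","label":"range","start":{"line":0,"character":4},"end":{"line":0,"character":7}}
{"id":3,"type":"vertex","label":"range","start":{"line":5,"character":2},"end":{"line":5,"character":5}}
{"id":4,"type":"vertex","label":"resultSet"}
{"id":5,"type":"edge","label":"next","outV":2,"inV":4}
{"id":6,"type":"edge","label":"next","outV":3,"inV":4}
{"id":7,"type":"vertex","label":"definitionResult"}
{"id":8,"type":"edge","label":"textDocument/definition","outV":4,"inV":7}
{"id":9,"type":"edge","label":"item","outV":7,"inVs":[2],"document":1}
{"id":10,"type":"edge","label":"contains","outV":1,"inVs":[2,3]}
"""


def load_lsif(text):
    """Split a line-delimited LSIF dump into a vertex table and an edge list."""
    vertices, edges = {}, []
    for line in text.splitlines():
        if not line.strip():
            continue
        element = json.loads(line)
        if element["type"] == "vertex":
            vertices[element["id"]] = element
        else:
            edges.append(element)
    return vertices, edges


def definitions_by_range(vertices, edges):
    """Map each range id to the ids of the ranges that define its symbol."""
    next_of, definition_result_of, items_of = {}, {}, {}
    for edge in edges:
        if edge["label"] == "next":
            next_of[edge["outV"]] = edge["inV"]
        elif edge["label"] == "textDocument/definition":
            definition_result_of[edge["outV"]] = edge["inV"]
        elif edge["label"] == "item":
            items_of.setdefault(edge["outV"], []).extend(edge["inVs"])

    resolved = {}
    for vertex_id, vertex in vertices.items():
        if vertex["label"] != "range":
            continue
        # Follow "next" edges until a vertex with a definition edge is found.
        current = vertex_id
        while current not in definition_result_of and current in next_of:
            current = next_of[current]
        result = definition_result_of.get(current)
        if result is not None:
            resolved[vertex_id] = items_of.get(result, [])
    return resolved


vertices, edges = load_lsif(SAMPLE_DUMP)
print(definitions_by_range(vertices, edges))  # {2: [2], 3: [2]}
```

Note that this only recovers "range 3 refers to the symbol defined at range 2". Whether that is enough for Sourcetrail's graph is exactly the open question in this thread.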

LouisStAmour commented 4 years ago

LSIF and call hierarchy information are both interesting, LSIF in particular. As mentioned in the above issue, #750, I've been looking at how we can get enough info into the graph to derive framework-specific details and annotations, but I think that's a nice-to-have and these other options would be faster to start.

To that end, if you switch from a model where language details are modelled in Sourcetrail to a model where markup is shown in Sourcetrail, it's possible you'd achieve much faster implementation using LSP or LSIF, since both are primarily designed for human-interactive consumption within an IDE or code viewer. But you'd be limiting yourself to the exposed semantics which don't actually include a complete AST because the details for each language are left implementation-specific.

I personally think the most interesting graphs are yet to be added to Sourcetrail or Sourcegraph, such as incorporating data flow, ERD, micro-service or deployment architecture (projects in a larger context) and request/response models that can cross languages (JS API calls to backends, for example). It's possible LSP is a great model to follow, but it's not easily multi-language compatible, because it doesn't expose semantics rich enough to build human-readable responses from compilers. Instead, it trusts the compiler to put things in terms humans understand, and any richer details need to be parsed from that human-readable markup. This works fine for language basics, but if you wanted to add additional annotations from third parties, it seems to me it would quickly become much harder, unless you start chaining LSP servers. That in turn would duplicate language parsing work unless the servers share the same implementation base, which seems impractical when trying to combine multiple languages.

I might have taken this to a theoretical extreme here, and I do want to try implementing based on LSP call graph/LSIF using the existing SourcetrailDB SDK. But I think LSP's core assumption, that all an IDE needs is information about a language from a single language, is itself a limitation of the current API. Unless a generic way to pass or work from a CST/AST is added as part of implementation guidance, and endpoints specific to the needs of Sourcetrail (such as, obviously, a call graph) are added, it seems to me the current approach simply won't scale well to add features beyond whatever VS Code currently supports. In other words, the plugin system of VS Code is hiding deficiencies in the LSP approach, but it's harder for us to support VS Code plugins as a source of information, for obvious reasons.

mlangkabel commented 4 years ago

@slimsag, thanks for chiming in! I've been looking at the LSP and also at LSIF, and it looks like both of these protocols only provide information about:

- the definition of symbols (e.g. what's the symbol name, type, location, etc.)
- the location of references of a defined symbol (e.g. source location from line 5 col 3 to line 5 col 7 references symbol X)

What is missing for Sourcetrail is:

- The context of the referenced symbol (e.g. which method called the referenced function). But this may be a minor issue, as the context itself is a symbol that is defined at a certain location. So if a reference is located inside the source range of a function definition, we can derive that it is exactly that function that references the symbol.
- The reference kind (e.g. is my referenced function called at the current location, or is it, for example, just used to assign a function pointer?). This one is tricky if the LSP or LSIF do not store this information. As illustrated by the example, the ReferenceKind (call, usage, etc.) cannot just be derived from the referenced symbol.
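The range-containment idea from the first point could be sketched roughly as follows. This assumes the data source reports the full source range of each definition (as `DocumentSymbol.range` does in LSP) rather than just the range of its name; the `Span` type, the `enclosing_symbol` helper, and the symbol names are invented for the example. It does nothing for the second point, the reference kind.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: tuple  # (line, character), zero-based
    end: tuple    # (line, character), zero-based

    def contains(self, other):
        # Tuples compare line first, then character.
        return self.start <= other.start and other.end <= self.end


def enclosing_symbol(definitions, reference):
    """Return the name of the innermost definition whose range contains the reference."""
    candidates = [(name, span) for name, span in definitions if span.contains(reference)]
    if not candidates:
        return None
    # Innermost means the latest start; ties are broken by the earliest end.
    name, _ = max(
        candidates,
        key=lambda item: (item[1].start, tuple(-part for part in item[1].end)),
    )
    return name


# Hypothetical definitions: a class, a method nested in it, and a free function.
definitions = [
    ("Widget", Span((0, 0), (20, 1))),
    ("Widget::draw", Span((4, 4), (9, 5))),
    ("helper", Span((22, 0), (25, 1))),
]

# A reference from line 5 col 3 to line 5 col 7 lies inside Widget::draw.
print(enclosing_symbol(definitions, Span((5, 3), (5, 7))))    # Widget::draw
# A reference outside every definition has no context symbol.
print(enclosing_symbol(definitions, Span((21, 0), (21, 3))))  # None
```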

Please correct me if I'm wrong. I would really enjoy being wrong on this one!

CoatiSoftware / Sourcetrail

Provide source Language Server Protocol source group #685