Integrate with GitHub's Semantic

XVilka commented 5 years ago

Semantic is a Haskell library and command line tool for parsing, analyzing, and comparing source code.

In a hurry? Check out our documentation of example uses for the semantic command line tool. Integrating it would allow better support for other languages as well:

Priority	Language	Parse	Assign	Diff	ToC	Symbols	Import graph	Call graph
1	Ruby	✅	✅	✅	✅	✅	✅	🚧
2	JavaScript	✅	✅	✅	✅	✅	✅	🚧
3	TypeScript	✅	✅	✅	✅	✅	✅	🚧
4	Python	✅	✅	✅	✅	✅	✅	🚧
5	Go	✅	✅	✅	✅	✅	✅	🚧
	PHP	✅	✅	✅	✅	✅
	Java	🚧	🚧	🚧	🔶	✅
	JSON	✅	✅	✅	N/A	N/A	N/A	N/A
	JSX	✅	✅	✅	🔶
	Haskell	🚧	🚧	🚧	🔶	🚧
	Markdown	✅	✅	✅	🔶	N/A	N/A	N/A

✅ — Supported
🔶 — Partial support
🚧 — Under development

It uses tree-sitter for parsing.

LouisStAmour commented 5 years ago

This is very interesting. I like how they're approaching the problem in a multi-language fashion. Given the similarities between the TypeScript and C# compilers, such as how they treat comments ("trivia"), I was going to base my C# work on that of tsc.

I'll start by saying that I'm not officially part of Sourcetrail in any capacity, just a volunteer interested in adding language support. What follows is a bit lengthy, as I'm trying to work out the pros/cons of this approach, especially given alternatives, such as open source toolchains from IntelliJ. Since it's so long, I'll summarize here:

In summary, I think it's worth trying to integrate it, as it's open source and might have a fair bit of payoff. Equally, though, I think Semantic will fall short of its goals beyond GitHub, for two reasons. First, it's not extensible enough by outside contributors. Second, it might by design toss away too much source code knowledge in the pursuit of a clean CFG, when users would likely appreciate language- and framework-specific feature annotation in such a graph, up to and including source code changes over time (between build steps) and information derived from executable steps such as framework-specific loaders of different config files. I also think the long-term solution for Sourcetrail might be generic graph support, or an LSP-like protocol specifically for graph traversal and UIs like Sourcetrail.

While I think it's a legitimate concern that first-party compilers will often support language features first or best, in creating a tool built on such compilers one has to either make generalizations or support all the new features all the way into the graph and how it's shown in the UI. If we thus say that generalizations are useful, and we want cross-language support, then with one compiler that can handle multiple languages using the same AST conventions, you can support the generalized AST without worrying about precise details. This is, I believe, how the IntelliJ series of editors builds out language support, but as with IntelliJ, the approach has the drawback that you can't immediately use a new language until support is updated. LSPs were interesting because they encode what an IDE needs but are (generally) maintained by the compiler teams. They just never specified a way of publishing an AST or CFG, presumably because VS Code doesn't natively support those features and compiling them takes more time (which makes them harder to use with live, up-to-the-second responses).

The biggest advantage here is that if GitHub broadly adopts this and continuously builds it into their products, the weight and importance of GitHub means that, if it's easy to do, compiler vendors and the open source community will continuously update this kind of library as new features are added. The biggest disadvantage is that unless you version the generalizations used in your AST, you might eventually hit a limit where a language has one or more features that are hard or awkward to represent within your existing "ontology", as it were.

Still. That's not a reason not to use it. I will say it's probably not "free" to use. Like IntelliJ's IDEs, the power doesn't actually come from understanding the AST or CFG, it comes from further upstream understanding. For instance, a common model or ontology (graph node/edge types) for MVC could be developed, or service endpoints, or project file types, and each of these might be "plugins" between both a system like Semantic and a UI like Sourcetrail, though you'd probably need to make the Sourcetrail UI more flexible to allow for arbitrary node types to be registered, and to likely have support for language tied not to files but to source locations.

Considering further, completeness in reasoning is perhaps the biggest concern. https://github.com/github/semantic/issues/57 is an example. We're looking for magic that lets us go from "right-click a string in an IDE and specify the context as embedded CSS, or embedded SQL" to "never need to manually specify anything, it just 'works'". We want to go from ambiguity to encoded conventions to the kind of understanding that helps you write documentation automatically. Ideally, you'd be able to build custom linting tools on top of such understandings, easily map running code to deploy scripts, documentation and model functionality to test cases, and so on.

To that end, you'd be looking for a system that doesn't just understand TypeScript or Java, but understands Drupal, Wordpress, React or Spring. And like Language Server Protocol, how far is a project willing to go? Will they stop at single-language CFGs, move on to multi-language CFGs, map ambiguous or less defined runtime environments, and start building plugins to support the different ontologies or conventions of specific frameworks?

An ideal system is configurable enough via UI and API that you can define and map real world abstractions to data types and data flow graphs, in addition to control flow graphs. Graphs themselves are more general purpose than control flows, as you could scan a database to create an ERD, or a Kubernetes system for a graph of deployments, pods, and other K8S concepts. This is, I suppose, why I keep coming back to "ontologies" here, even though it's perhaps not that common to say it this way.
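To make the "ontology" idea a little more concrete, here is a minimal Python sketch of a graph whose node and edge kinds are registered by whoever produces them. Everything in it (the `Graph`, `Node` and `Edge` classes and the kind names such as `k8s.pod`) is hypothetical and only for illustration; it is not Sourcetrail's or Semantic's data model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    id: str
    kind: str   # e.g. "function", "k8s.pod" (hypothetical kind names)
    label: str


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    kind: str   # e.g. "calls", "manages" (hypothetical kind names)


@dataclass
class Graph:
    node_kinds: set = field(default_factory=set)
    edge_kinds: set = field(default_factory=set)
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def register(self, node_kinds=(), edge_kinds=()):
        """Let an indexer or plugin declare the vocabulary it will emit."""
        self.node_kinds.update(node_kinds)
        self.edge_kinds.update(edge_kinds)

    def add_node(self, node):
        if node.kind not in self.node_kinds:
            raise ValueError(f"unregistered node kind: {node.kind}")
        self.nodes[node.id] = node

    def add_edge(self, edge):
        if edge.kind not in self.edge_kinds:
            raise ValueError(f"unregistered edge kind: {edge.kind}")
        if edge.source not in self.nodes or edge.target not in self.nodes:
            raise KeyError("both edge endpoints must be added first")
        self.edges.append(edge)


graph = Graph()

# A source-code indexer and a Kubernetes scanner share one graph,
# each bringing its own vocabulary.
graph.register(node_kinds={"function"}, edge_kinds={"calls"})
graph.register(node_kinds={"k8s.deployment", "k8s.pod"}, edge_kinds={"manages"})

graph.add_node(Node("main", "function", "main()"))
graph.add_node(Node("helper", "function", "helper()"))
graph.add_edge(Edge("main", "helper", "calls"))

graph.add_node(Node("web", "k8s.deployment", "web"))
graph.add_node(Node("web-1", "k8s.pod", "web-1"))
graph.add_edge(Edge("web", "web-1", "manages"))
```

The point of the registration step is that a UI could ask the graph which kinds exist and decide how to draw each one, rather than having a fixed list of types compiled in.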

If the system with the best plugin support wins, then my suspicion is that such a system either needs native bindings in multiple languages so coders familiar with those languages can implement bindings and concepts for that language's frameworks, and framework authors can maintain them, or it needs a very common language as a go-between, such as JavaScript.

The dynamic parts of languages are also trouble. I'm thinking of how Babel or Webpack frequently insert themselves into codebases in ways that affect the runtime, but only under certain conditions. So it's very possible we'll need "partial CFGs", which temporarily skip layers to make the code easier to use and reason about as source, versus "full CFGs", which include all the layers but are only useful to scripts and plugins that work on production code.
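As a rough illustration of what "skipping layers" could mean, the Python sketch below collapses a full graph into a partial one by dropping nodes of unwanted kinds and reconnecting the remaining nodes across them. The `simplify` function, the node kinds and the example names are all assumptions made up for this sketch, not anything Semantic or Sourcetrail provides.

```python
def simplify(nodes, edges, keep_kinds):
    """Collapse a graph onto the nodes whose kind is in keep_kinds.

    nodes: dict mapping node id -> kind
    edges: set of (source, target) pairs
    Returns (kept_node_ids, simplified_edges). A path that runs only
    through dropped nodes becomes one direct edge between kept nodes.
    """
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)

    kept = {node for node, kind in nodes.items() if kind in keep_kinds}
    simplified = set()
    for start in kept:
        stack = list(successors.get(start, ()))
        seen = set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if current in kept:
                # Stop at the first kept node on this path.
                simplified.add((start, current))
            else:
                # Dropped node: keep walking through it.
                stack.extend(successors.get(current, ()))
    return kept, simplified


# "Full" graph: a bundler-generated shim sits between two functions.
full_nodes = {
    "main": "function",
    "__bundler_shim__": "generated",
    "render": "function",
    "log": "function",
}
full_edges = {
    ("main", "__bundler_shim__"),
    ("__bundler_shim__", "render"),
    ("render", "log"),
}

partial_nodes, partial_edges = simplify(full_nodes, full_edges, {"function"})
```

Here `partial_edges` contains `("main", "render")` and `("render", "log")`: the generated layer is hidden, at the cost of no longer knowing it was ever there, which is exactly the information-loss trade-off discussed in this thread.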

Another way to put this is that the same source code or project likely has multiple trees and graphs, the cheapest of which to compute is the concrete syntax tree, but from there you might have to model the compiler or build steps in addition to dependencies, to come up with a "perfect" set of graphs or transformations as the code moves from one stage to the next. I'm imagining this as a slider, for each file, where you can go from completely untreated source code on one side, to generated code, compiled code, different configurations of deployed code, and maybe even source maps to code compiled for different targets or module systems. Each has links from one representation to the next, through the build steps, as well as its own unique (optimized?) and perhaps less-readable control graphs.

An example of how flexible Semantic is currently, and a discussion of some of the above, comes up in https://github.com/github/semantic/issues/217 where they note "an approach with tree-sitter as the lingua franca is a lower maintenance burden than trying to corral N different language servers into a common vocabulary" - so they're using http://tree-sitter.github.io/tree-sitter/. An interesting look at how C# support is being expanded can be found here: https://github.com/github/semantic/issues/156#issuecomment-554516576 which points to the pretty excellent documentation at https://tree-sitter.github.io/tree-sitter/creating-parsers#writing-the-grammar

So I'd start perhaps by saying that building syntax highlighting and an initial pass at code "blocks" on a concrete syntax tree from tree-sitter or an LSP server would be ideal, because you really want syntax-specific understanding, and both handle more than a simple regex would.

It's less clear (to me) whether it makes more sense to go the GitHub/Semantic route of building a CFG without language-specific bindings, or take the (more sensible?) approach of creating a C API and then allowing for language-specific adjustments or plugins in mapping the concrete syntax tree to more abstract ontologies and graphs.

Semantic's design is documented here: https://github.com/github/semantic/blob/master/docs/program-analysis.md and it looks like its biggest limitation is that all mapping from the concrete syntax trees produced by tree-sitter to abstractions useful for further "evaluation" is done by their assignment code: https://github.com/github/semantic/blob/master/docs/assignment.md My Haskell is rusty, but the "Relevant Links" in that last doc are very useful, particularly https://github.com/github/semantic/tree/master/src/Data/Syntax as it documents the ontology they're mapping the tree-sitter output to, along with its restrictions and assumptions. Then each language-specific assignment presumably takes care of mapping concrete syntax to these abstractions.
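For readers who don't want to dig through the Haskell, the general shape of such a mapping step can be sketched in a few lines of Python. To be clear about the assumptions: this uses Python's built-in `ast` module as a stand-in for tree-sitter output, and the shared term names (`"Function"`, `"Class"`, `"Import"`) and the `assign` function are invented for this sketch. It is not how Semantic's assignment code is actually written.

```python
import ast

# Hypothetical shared vocabulary; NOT Semantic's actual Data.Syntax types.
PYTHON_TO_SHARED = {
    ast.FunctionDef: "Function",
    ast.AsyncFunctionDef: "Function",
    ast.ClassDef: "Class",
    ast.Import: "Import",
    ast.ImportFrom: "Import",
}


def assign(source):
    """Map language-specific syntax nodes onto shared terms.

    Returns (shared_term, name_or_None, line) tuples ordered by line.
    Any node type without an entry in the table is silently dropped,
    which is where information gets lost.
    """
    terms = []
    for node in ast.walk(ast.parse(source)):
        shared = PYTHON_TO_SHARED.get(type(node))
        if shared is not None:
            name = getattr(node, "name", None)  # import nodes have no name
            terms.append((shared, name, node.lineno))
    return sorted(terms, key=lambda term: term[2])


SOURCE = '''import os

class Greeter:
    def greet(self):
        return "hi"
'''

terms = assign(SOURCE)
```

A second language would get its own table mapping its node types onto the same shared terms. Anything a language has that the shared vocabulary lacks either gets squeezed into an existing term or is dropped, which is the limitation described above.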

I guess where I'm going with this is every time you simplify your graph or tree, you're usually doing so by discarding information. This is totally fine -- Sourcetrail is built on the idea that you'd prefer a simplified graph that focuses on one thing at a time instead of a complicated graph that shows you overwhelming detail. The difference is, if your source isn't the compiler, or in some cases, the runtime (Smalltalk comes to mind, or reflection), it's entirely likely you won't have all the semantic information needed unless you rebuild the compiler's own inferences and keep them up-to-date.

And if one of the first steps you do is discard information instead of mapping it to an intuitive UI/understanding as perhaps LSP does, you're left with an incomplete picture and would have to go back to the source material, or keep filling in details as they come up as necessary. LSP encodes generalizations common to IDEs, so it's less useful for code scanning. The question then is, how many generalizations will Semantic have such that it's more fit to GitHub's source navigation and syntax highlighting than to ours, in Sourcetrail's more graphical representations and groupings?

Arguably, though, there's no harm in trying -- we can write experimental bindings and just see how it goes. My suspicion at this point, though, is that Haskell as a single-language dependency is going to make it harder to attract contributors from each native language group you're looking to support, and that if your output isn't itself ontology-based through plugins that can presumably interface with native-language tools and frameworks, your ability to quickly build ontological representations of source code will be hindered by how much more effort you'll have to spend writing the Haskell. Just my two cents at this point...

As another alternative to a plugin system, you can outsource everything language/indexing related to individual indexers that operate on an extensible graph data format. Microsoft has done some of that in VS Enterprise: they have a directed graph markup language (DGML), so you can generically describe a directed graph and they render it: https://docs.microsoft.com/en-us/visualstudio/modeling/map-dependencies-across-your-solutions?view=vs-2019 Similarly, IntelliJ's IDE plugins for diagram support appear to use a yWorks renderer with a custom UML XML format, but if they enabled its export, GraphML might be a logical choice. For a simpler format, there's always RDF to describe a graph, or JSON-LD.

The trick would be to create a vocabulary or ontology that the Sourcetrail UI supports extensibly enough that it can keep track of language- or framework-specific custom types of nodes and edges, presumably based on existing visualization UI for types, or with some kind of styling specification or drawing code attached (I'd suggest UI plugins but they always seem to slow things down...).

You might want to write sample code for handling live updates via graph diffs, but it strikes me that we might be spending too much time trying to semantically map concepts across languages when it might be easier to split this in two parts: the first produces the possibly over-complicated graph, and the second simplifies the graph for display. Making the second efficient is a concern; it might require either creating simplified graphs or an LSP-style design where we ask a language server for a graph in a just-in-time fashion.
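On the "live updates via graph diffs" point, a minimal Python sketch might look like the following. The `GraphDiff`, `diff_graphs` and `apply_diff` names are hypothetical, and the graph is reduced to plain sets of node ids and `(source, target)` edge pairs; a real format would also need to carry node and edge kinds and attribute changes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraphDiff:
    added_nodes: frozenset
    removed_nodes: frozenset
    added_edges: frozenset
    removed_edges: frozenset


def diff_graphs(old_nodes, old_edges, new_nodes, new_edges):
    """Compute what an indexer would send after re-indexing."""
    old_nodes, new_nodes = frozenset(old_nodes), frozenset(new_nodes)
    old_edges, new_edges = frozenset(old_edges), frozenset(new_edges)
    return GraphDiff(
        added_nodes=new_nodes - old_nodes,
        removed_nodes=old_nodes - new_nodes,
        added_edges=new_edges - old_edges,
        removed_edges=old_edges - new_edges,
    )


def apply_diff(nodes, edges, diff):
    """Update the UI's copy of the graph from a diff."""
    new_nodes = (set(nodes) - diff.removed_nodes) | diff.added_nodes
    new_edges = (set(edges) - diff.removed_edges) | diff.added_edges
    return new_nodes, new_edges


# Before: main calls helper. After: helper is replaced by util.
before_nodes = {"main", "helper"}
before_edges = {("main", "helper")}
after_nodes = {"main", "util"}
after_edges = {("main", "util")}

diff = diff_graphs(before_nodes, before_edges, after_nodes, after_edges)
updated_nodes, updated_edges = apply_diff(before_nodes, before_edges, diff)
```

An indexer would only need to ship the `GraphDiff` after each change, and the UI could apply it to the graph it already holds instead of reloading everything.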

LouisStAmour commented 5 years ago

Speaking of a language server design, @slimsag mentions LSIF and call hierarchy in LSP as possible options also, in https://github.com/CoatiSoftware/Sourcetrail/issues/685#issuecomment-555917831

I’m thinking it would be good to investigate all of these approaches for quick wins and possibly reduced maintenance, but I do think the approach that can most quickly give us framework/project level details across a project/repo in an easily extensible way would be a “future-proofed” approach for Sourcetrail and related tools. Just the language details alone won’t be enough for understanding over time.

jraah542 commented 4 years ago

Semantic is a Haskell library and command line tool for parsing, analyzing, and comparing source code.

In a hurry? Check out our documentation of example uses for the semantic command line tool. It will allow better support for other languages as well:

Priority	Language	Parse	Assign	Diff	ToC	Symbols	Import graph	Call graph	Control flow graph
1	Ruby	✅	✅	✅	✅	✅	✅	🚧
2	JavaScript	✅	✅	✅	✅	✅	✅	🚧
3	TypeScript	✅	✅	✅	✅	✅	✅	🚧
4	Python	✅	✅	✅	✅	✅	✅	🚧
5	Go	✅	✅	✅	✅	✅	✅	🚧
	PHP	✅	✅	✅	✅	✅
	Java	🚧	🚧	🚧	🔶	✅
	JSON	✅	✅	✅	N/A	N/A	N/A	N/A
	JSX	✅	✅	✅	🔶
	Haskell	🚧	🚧	🚧	🔶	🚧
	Markdown	✅	✅	✅	🔶	N/A	N/A	N/A

✅ — Supported

🔶 — Partial support

🚧 — Under development

It uses tree-sitter for parsing.

CoatiSoftware / Sourcetrail

Integrate with GitHub's Semantic #750