cursorless-dev / cursorless

Don't let the cursor slow you down
https://www.cursorless.org/
MIT License

Support multi-language documents #409

Open awulkan opened 2 years ago

awulkan commented 2 years ago

Now that Cursorless supports HTML (great job!) it would be amazing if it could also support nested languages inside of HTML. For example JS/TS inside of <script> tags, and CSS/SCSS inside of <style> tags.

This would be super helpful for web developers, because plenty of popular frameworks (such as Vue) use single file components, where HTML, JS and CSS are all located in the same file.

I assume that this documentation is relevant to this issue: https://tree-sitter.github.io/tree-sitter/using-parsers#multi-language-documents

Will-Sommers commented 2 years ago

One interesting possible follow-on to this is the ability to define an intermediary language that would allow for more fluent chaining of commands. For example, when making a list in Ruby or JS/TS, I assume I need to add commas.

[1, 2, 3, 4]

What if I could create the following and then say "spamma items"? I'm not sure this would work now, because the underlying list would not be parsed as such.

[1 2 3 4] 

This example seems trivial, but I'm finding myself using "bring" more and more, and this might be cool for that.

pokey commented 2 years ago

@Will-Sommers yes that would be nice to be able to do, but I guess I'm not sure how we'd know to treat this fragment as a separate language if it appeared in a Ruby document

Will-Sommers commented 2 years ago

Heyo, I reached out and asked one of the GH engineers about multi-language documents and it looks like there is support within tree-sitter. Here's the example he linked.

That being said, being able to parse is just one step. Right now it looks like we rely on the editor's language selection to indicate which language we are in and which parser to use.

Steps for further investigation:

I think that something like syntax within markdown might be sort of difficult unless you use an extended info string à la GitHub Flavored Markdown.

pokey commented 2 years ago

Yeah that injection stuff looks like the way to go. Fwiw VSCode knows that the segment is a different language, as evidenced by the fact that it's able to do syntax highlighting, but I don't think we can get that info. Worth a quick look I guess. But if we can't use that, then def the tree-sitter injection stuff seems like the way to go

Will be interesting to see if you can get that injection stuff to work

Yeah for markdown I'd use the extended info string; looks like tree-sitter has support for determining language from the text of a token, as described in that doc you sent
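For the markdown case, a minimal sketch of pulling a language name out of a fenced block's info string could look like the following. The helper name `languageFromInfoString` and the alias table are hypothetical, not something Cursorless or tree-sitter provides; the only assumption taken from the markdown convention is that the first word of the info string names the language.

```typescript
// Hypothetical alias table mapping common short names to language ids.
// The entries here are illustrative only.
const INFO_STRING_ALIASES = new Map<string, string>([
  ["js", "javascript"],
  ["ts", "typescript"],
  ["py", "python"],
  ["rb", "ruby"],
]);

// Hypothetical helper: take the first word of a fenced code block's info
// string and treat it as the language name.
function languageFromInfoString(infoString: string): string | undefined {
  const firstWord = infoString.trim().split(/\s+/)[0];
  // An empty info string means the fence did not declare a language.
  if (!firstWord) {
    return undefined;
  }
  // Lowercasing is an assumption so that "Ruby" and "ruby" resolve the same.
  const name = firstWord.toLowerCase();
  return INFO_STRING_ALIASES.get(name) ?? name;
}
```

Anything after the first word (the "extended" part of the info string) is ignored here, so a fence like `ts {1,3}` would still resolve to a language.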

Will-Sommers commented 2 years ago

Heyo, it does look like this can be exposed to us via VSCode using a language server extension, either one that we import and depend on or one that ships with VSCode and does things like multi-language syntax highlighting. Here are the two VSCode approaches outlined.

I think this is a better approach than relying on a parser, since it will give more flexibility. What we really need to be able to say is: we're in language x or y when the cursor is within this block. I think it would then be easy to let Cursorless, as it stands, take over with a specific languageId.
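To make the "which language is the cursor in" idea concrete, here is a rough sketch. It assumes we already have a list of embedded regions from somewhere (a language server or a parser); the `EmbeddedRegion` type and `languageIdAt` function are hypothetical names made up for illustration.

```typescript
// Hypothetical description of one embedded block inside a host document.
interface EmbeddedRegion {
  languageId: string;
  startOffset: number; // inclusive
  endOffset: number; // exclusive
}

// Return the language id that applies at `offset`, falling back to the
// document's own language when no embedded region contains the offset.
function languageIdAt(
  offset: number,
  documentLanguageId: string,
  regions: readonly EmbeddedRegion[],
): string {
  let best: EmbeddedRegion | undefined;
  for (const region of regions) {
    const contains = region.startOffset <= offset && offset < region.endOffset;
    if (!contains) {
      continue;
    }
    // Prefer the smallest containing region so that nested embeddings
    // resolve to the innermost language.
    const size = region.endOffset - region.startOffset;
    if (best === undefined || size < best.endOffset - best.startOffset) {
      best = region;
    }
  }
  return best?.languageId ?? documentLanguageId;
}
```

Whether the end of a region should count as inside it (a cursor sitting right after the last character) is a design choice; this sketch treats the end as exclusive.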

I think the extra flexibility will help support custom code blocks that have slight differences in their specs, like styled-jsx (language-server example), emotion or styled-components.

I think the compelling use cases to support here are likely:

I think that the first two are tied in my mind, but the second case seems more commonly used than the first these days.

pokey commented 2 years ago

Looking at that VSCode example, it looks like it may be a bit complex to get that direction working. It seems we need to fork each lsp extension we want to use?

One thing to think about for either approach is how we handle referring to a scope type in the parent tree. For example, if we're inside an embedded code block in markdown, and say "take section" to select the markdown section that contains our code block.
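One way to picture the parent-scope question is to treat each language as a layer with a pointer to the layer it is embedded in, and walk outward until some layer knows the requested scope type. This is only a sketch: `LanguageLayer`, `findLayerForScope`, and the scope type names in the usage note are hypothetical.

```typescript
// Hypothetical model: one layer per language, linked to its host layer.
interface LanguageLayer {
  languageId: string;
  scopeTypes: ReadonlySet<string>;
  parent?: LanguageLayer;
}

// Starting from the innermost layer at the cursor, walk outward and return
// the first layer that supports the requested scope type.
function findLayerForScope(
  start: LanguageLayer,
  scopeType: string,
): LanguageLayer | undefined {
  for (
    let layer: LanguageLayer | undefined = start;
    layer !== undefined;
    layer = layer.parent
  ) {
    if (layer.scopeTypes.has(scopeType)) {
      return layer;
    }
  }
  return undefined;
}
```

With a markdown layer that knows "section" and an embedded python layer that does not, asking for "section" from inside the code block would resolve to the markdown layer. It leaves open what to do when both layers define the same scope type.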

Also, in the future we may want to be able to compute a list of the language ids of all visible editors. This way we can narrow down the set of scope types that are active in talon lists. That will potentially become important as we support more and more scope types, eg for things like latex where you want "environment", etc. Prob not worth engineering too hard for right now, but just a slight consideration to keep in mind
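A rough sketch of that narrowing, assuming the language ids have already been collected (in VSCode they could come from `vscode.window.visibleTextEditors` and each editor's `document.languageId`). The function name `activeScopeTypes` and the per-language scope table are hypothetical.

```typescript
// Compute the set of scope types that should be active, given the language
// ids of all visible editors and a (hypothetical) table of scope types per
// language. The result is de-duplicated and sorted for stable output.
function activeScopeTypes(
  visibleLanguageIds: readonly string[],
  scopeTypesByLanguage: ReadonlyMap<string, readonly string[]>,
): string[] {
  const active = new Set<string>();
  for (const languageId of visibleLanguageIds) {
    // Languages without an entry simply contribute nothing.
    for (const scopeType of scopeTypesByLanguage.get(languageId) ?? []) {
      active.add(scopeType);
    }
  }
  return Array.from(active).sort();
}
```

With multi-language documents, the list of language ids would presumably also need to include every embedded language found in the visible editors, not just each document's top-level language.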

pokey commented 2 years ago

Fwiw in cases where injections that ship with tree-sitter repos are lacking, we can borrow from these:

Also worth looking at how these projects implement injections to see if it's helpful

Will-Sommers commented 2 years ago

Cool! Thanks for the links, I'll take a look at them tonight and see a bit more. Unrelated to this main issue, I think there could also be some interesting things here as well wrt naming of nodes.

This is really cool. I need to think a bit more about it.

Will-Sommers commented 2 years ago

Heyo, just some notes here:

So it looks like what Neovim and the tree-sitter test suite do is pass the queries included in each tree-sitter project into each parser's language. This returns a new query object, which can then be used to match against the rest of the file, returning a set of matches.

Within each match there's a field called setProperties, which is where the injection language is stored. From there, there are one or more captures for the match; each returns a Node of raw_text type, which has the normal startPosition/endPosition fields. Also, the entire contents are available via text.

e.g.

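// Parse an HTML document that embeds JavaScript in a <script> tag.
const parse = languages["html"].parser.parse(`<html><script>
 let a = 10;
  </script></html>`);

// Build an injection query against the HTML grammar.
const query = parse.getLanguage().query(`((script_element
  (raw_text) @injection.content)
(#set! injection.language "javascript"))`);

// The first match carries the injected language as a property.
query.matches(parse.rootNode)[0].setProperties;
// > {injection.language: 'javascript'}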

The injection.language names are specified in each tree-sitter project's package.json. Highlight and injection paths are also specified there.

From the tree-sitter docs, it looks like from here, we need to reference that parse and use that parser's language for a re-parse.

    @injection.content - indicates that the captured node should have its contents re-parsed using another language.
    @injection.language - indicates that the captured node’s text may contain the name of a language that should be used to re-parse the @injection.content.

Quickly looking at the queries files within neo-vim, I think we can crib a bunch from them. It looks like within the .scm files there is access to the tree-sitter api, so there should be matching on node types, text and on....

I'll start to think about how this might change things. For one, it looks like this supposes multiple trees, since one tree is returned from each parse.
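A rough sketch of turning those matches into a list of regions to re-parse, one tree per region. The `*Like` interfaces below are hypothetical stand-ins that only mirror the fields mentioned above (`setProperties`, captures, `startIndex`/`endIndex`, `text`); I'm assuming capture names show up without the leading `@`, and the `collectInjections` helper is made up for illustration.

```typescript
// Hypothetical minimal shapes; real tree-sitter objects carry more fields.
interface NodeLike {
  startIndex: number;
  endIndex: number;
  text: string;
}
interface CaptureLike {
  name: string; // assumed to be e.g. "injection.content", without the "@"
  node: NodeLike;
}
interface MatchLike {
  captures: CaptureLike[];
  setProperties?: Record<string, string | null>;
}

interface InjectionRegion {
  language: string;
  startIndex: number;
  endIndex: number;
  text: string;
}

function collectInjections(matches: readonly MatchLike[]): InjectionRegion[] {
  const regions: InjectionRegion[] = [];
  for (const match of matches) {
    // Static case: language set via (#set! injection.language "...").
    let language: string | undefined =
      match.setProperties?.["injection.language"] ?? undefined;
    // Dynamic case: language named by the text of an @injection.language capture.
    if (language === undefined) {
      const named = match.captures.find((c) => c.name === "injection.language");
      language = named?.node.text.trim();
    }
    if (!language) {
      continue; // no language known for this match, nothing to re-parse
    }
    const resolved: string = language;
    for (const capture of match.captures) {
      if (capture.name === "injection.content") {
        const { startIndex, endIndex, text } = capture.node;
        regions.push({ language: resolved, startIndex, endIndex, text });
      }
    }
  }
  return regions;
}
```

Each returned region would then be handed to the parser for `language`, which is where the multiple trees come from.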

pokey commented 2 years ago

Awesome! Solid research. We should think about how this connects to rewriting our language defs using this query language. Cc/ @wenkokke

One minor thing I'll add: helix also maintains its own injection regexes in case for whatever reason we need to steal those. They're all in one file, see eg rust

pokey commented 2 years ago

It's also worth thinking about how this works with incremental parsing

pokey commented 2 years ago

Also wrt the multiple tree thing, I don't believe there's any reason we can't just graft the trees together ourselves, right?

Tho maybe it's better to leave the trees untouched, and then just maintain our own map from nodes to injected subtrees
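A minimal sketch of the "leave the trees untouched and keep our own map" option. It assumes host nodes expose a numeric `id` that is unique within their tree; `InjectionIndex` and the `*Like` types are hypothetical names, and how well ids survive re-parses is something we'd need to check.

```typescript
// Hypothetical shapes: only the fields this sketch needs.
interface HostNodeLike {
  id: number;
}
interface InjectedTree<TRoot> {
  language: string;
  rootNode: TRoot;
}

// Side table from host nodes (e.g. a raw_text node) to the tree parsed from
// their contents. The host tree itself is never modified.
class InjectionIndex<TRoot> {
  private readonly byHostId = new Map<number, InjectedTree<TRoot>>();

  attach(host: HostNodeLike, injected: InjectedTree<TRoot>): void {
    this.byHostId.set(host.id, injected);
  }

  lookup(host: HostNodeLike): InjectedTree<TRoot> | undefined {
    return this.byHostId.get(host.id);
  }

  // Assumed to be called when the host tree is replaced, since old node ids
  // may no longer be meaningful.
  clear(): void {
    this.byHostId.clear();
  }
}
```

Traversal code would then check `lookup(node)` when it reaches a node and descend into the injected root instead of treating the node as a leaf.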

pokey commented 2 years ago

Also I think I'd argue we should try to push as much as possible into the parse-tree extension rather than cursorless

Will-Sommers commented 2 years ago

Heyo, I'll think about all of these notes. I thought about this a bit more and think that we should first look at languages where the sub-language is part of the grammar and there's a node similar to raw_text that we can rely on.

The other case is handling situations where a transpilation step looks at the text and handles the inner blocks in a separate fashion, a la CSS in JSX.

> It's also worth thinking about how this works with incremental parsing

Yep! This query can happen off of any SyntaxNode, I used the rootNode as an example above.

> Also wrt the multiple tree thing, I don't believe there's any reason we can't just graft the trees together ourselves, right?

Just to be clear, what you're advocating is, for example, taking the SyntaxNode for a script element and then replacing the raw_text child with a different SyntaxNode. There would be two different Tree objects in this case, one coming off of the main rootNode and then one coming off of the newly grafted SyntaxNode.

> Also I think I'd argue we should try to push as much as possible into the parse-tree extension rather than cursorless

I agree in principle on this but I'm curious if it will be the best way. We'll need to see how Tree is handled in this case since that is where Language is stored and referenced.

Will-Sommers commented 2 years ago

re: Tree-Sitter being able to handle this by default.

injection.include-children - indicates that the @injection.content node’s entire text should be re-parsed, including the text of its child nodes. By default, child nodes’ text will be excluded from the injected document.

This is meant to go in the .scm file, but looking at some of the implementations, I don't see it being used. I'm going to reach out to one of the neovim devs and ask them about it.

Update: I reached out to one of the devs via email.

pokey commented 2 years ago

> > It's also worth thinking about how this works with incremental parsing
>
> Yep! This query can happen off of any SyntaxNode, I used the rootNode as an example above.

Just to be clear, by "incremental parsing", I mean updating the parse tree as the document changes, which tree-sitter is able to do efficiently without reparsing the entire document. Here's where we do it in the parse-tree extension: https://github.com/cursorless-dev/vscode-parse-tree/blob/4af875b7cbd72d68c1e1eafe43ddabc3403264ce/src/extension.ts#L109-L134
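For reference, incremental parsing in tree-sitter needs a description of each text change before re-parsing. Below is a sketch of computing that description from plain strings. The field names follow tree-sitter's edit object (`startIndex`, `oldEndIndex`, `newEndIndex` and the matching positions), but `pointAt` and `describeEdit` are hypothetical helpers, and counting columns in string code units is an assumption.

```typescript
interface Point {
  row: number;
  column: number;
}
interface Edit {
  startIndex: number;
  oldEndIndex: number;
  newEndIndex: number;
  startPosition: Point;
  oldEndPosition: Point;
  newEndPosition: Point;
}

// Convert a string offset into a zero-based row/column pair.
function pointAt(text: string, index: number): Point {
  let row = 0;
  let lineStart = 0;
  for (let i = 0; i < index; i++) {
    if (text[i] === "\n") {
      row++;
      lineStart = i + 1;
    }
  }
  return { row, column: index - lineStart };
}

// Describe replacing `removedLength` characters at `startIndex` with
// `insertedText`, and return the resulting document text alongside the edit.
function describeEdit(
  oldText: string,
  startIndex: number,
  removedLength: number,
  insertedText: string,
): { edit: Edit; newText: string } {
  const oldEndIndex = startIndex + removedLength;
  const newEndIndex = startIndex + insertedText.length;
  const newText =
    oldText.slice(0, startIndex) + insertedText + oldText.slice(oldEndIndex);
  const edit: Edit = {
    startIndex,
    oldEndIndex,
    newEndIndex,
    startPosition: pointAt(oldText, startIndex),
    oldEndPosition: pointAt(oldText, oldEndIndex),
    newEndPosition: pointAt(newText, newEndIndex),
  };
  return { edit, newText };
}
```

With a real tree, the edit would be applied via `tree.edit(edit)` and the old tree passed back in with `parser.parse(newText, tree)`. For injected trees, I'd guess each one needs the same treatment with offsets relative to its own region, but that part is untested.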

pokey commented 2 years ago

> I thought about this a bit more and think that we should first look at languages where the sub-language is part of the grammar and there's a node similar to raw_text that we can rely on.
>
> The other case is handling situations where a transpilation step looks at the text and handles the inner blocks in a separate fashion, a la CSS in JSX.

Sorry—not sure I understand the difference between these two cases. Can you elaborate? Or maybe worth chatting on discord?

pokey commented 2 years ago

> Just to be clear, what you're advocating is, for example, taking the SyntaxNode for a script element and then replacing the raw_text child with a different SyntaxNode. There would be two different Tree objects in this case, one coming off of the main rootNode and then one coming off of the newly grafted SyntaxNode.

Not really advocating one direction or the other tbh, just brainstorming

pokey commented 2 years ago

> I agree in principle on this but I'm curious if it will be the best way. We'll need to see how Tree is handled in this case since that is where Language is stored and referenced.

Sorry, I don't follow. Maybe another thing for a discord

Will-Sommers commented 2 years ago

Adding more notes — it looks like NeoVim created their own data structure to track child trees as well as language injection/queries. [link]

Looking more at this, including at the documentation example where an erb file is processed, three trees with overlapping ranges are returned. I think we'll need to write something on our side to handle this.

josharian commented 12 months ago

I think there are going to end up being some interesting design challenges here, for which we are going to have to develop principles as we go.

This came up recently in the context of .talon file support, which kind of is two languages: k/v pairs and talonscript as values of keys. There was discussion about what key should refer to in some cases: the key associated with a chunk of talonscript, or the key inside the bit of talonscript. It's not hard to imagine similar challenges occurring regularly in a general multi language context.

Mark-Phillipson commented 7 months ago

There are also dotnet languages to consider here that also mix with HTML: