Modifying the tree-sitter grammar

bbatsov commented 1 year ago

In response to my article https://metaredux.com/posts/2023/03/12/clojure-mode-meets-tree-sitter.html someone asked if it'd be possible/easy to modify the tree-sitter grammar used by clojure-ts-mode (e.g. to teach the mode about some macros). I know that they can obviously fork the grammar and build custom binaries, but I'm wondering if there's a simpler way to make some changes. I guess we should document this somewhere.

sogaiu commented 1 year ago

Not sure what "teach the mode about some macros" means.

Would you mind elaborating on that?

Is that referring to something other than highlighting?

bbatsov commented 1 year ago

Is that referring to something other than highlighting?

I was thinking both of highlighting (as I assume that tree-sitter by itself can't tell apart a function from a macro and those have to be specified explicitly) and semantic indentation (for macros that take forms as arguments).

sogaiu commented 1 year ago

Thanks for the clarification.

Yes, the current grammar does not distinguish between functions and macros. It also doesn't try to identify special forms. There is a summary of some of the background about why here.

The short of it is that the (multiple) attempts I made before to add support for things like def and defn (even just those two) resulted in high levels of parsing errors and I didn't find that acceptable.

To me the (more) correct parsing was more important, but there's nothing that says that technically you can't have more than one grammar. It's just that I wanted something that would work well to be able to do things like structural editing / navigation decently.

Perhaps you are already familiar with the following, but for the sake of clarity... In the tree-sitter world, one grammar can inherit from another (e.g. tree-sitter-commonlisp inherits from tree-sitter-clojure), so that's one path. What you mentioned earlier about customizing an existing grammar is another path. In either case though, I don't think one can currently expect any kind of runtime tuning (i.e. editing the grammar file, generating parser source from that, and finally creating another dynamic library seems unavoidable [1]).

Also, at least for the pre-Emacs-29 world that used elisp-tree-sitter, I asked the maintainer at one point about using more than one grammar in a single buffer, and my understanding was that it probably could be made to work:

If you want 2 parse trees in the same buffer instead, you would need to define an advice for tree-sitter--do-parse, as well as additional buffer-local variables for the secondary grammar.

I bring up this idea because this route might make it possible to use one grammar for one purpose while using another for another, each possibly more suited (e.g. being more accurate) for a different sort of use. Of course, a single choice that worked well would be nicer.

No idea how things are in Emacs 29+ though.

Now that @dannyfreeman is up-to-speed on how the grammar works, maybe he'd be interested in seeing if adding support for additional constructs is feasible / practical.

Supporting more than the bare basics is done by one of the Fennel grammars and one of the Janet grammars [2], but as far as I know, there is a cost involved in accuracy (I wrote simpler versions of both of those and compared at one point -- but it's been a while so it's possible things have changed).

Compared to Clojure I think both of those lisp-likes have smaller cores and I believe neither grammar tried to cover all constructs but I'd need to check what the status is currently.

If it's found that adding a few things is feasible, I would imagine one of the next things to consider might be what else. We all know how large clojure.core is...

[1] This currently requires the tree-sitter cli (which itself needs node) to create a .c file and then that subsequently needs to be compiled by a C compiler.

[2] IIUC, there are also grammars for Emacs Lisp and Racket that might be worth examining. I don't know how much testing has been done for these.

dannyfreeman commented 1 year ago

To elaborate more, there really isn't a way to extend the grammar without creating new tree-sitter binaries. Supporting things like fancy macros will best be done in Emacs Lisp, or whatever platform is consuming the grammar. Queries can be written to match forms such as a list whose first child is the symbol defn, and then match the symbol (i.e. the name) following defn, to pick up on function definitions, for example.

This sort of thing is going to be required for semantic indentation. I already do it to some extent for syntax highlighting (example)
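To make the query idea above concrete, here is a minimal sketch in Python rather than Emacs Lisp. It does not use the real tree-sitter or treesit APIs: the Node class, the helper functions, and the set of defining symbols are all assumptions made for illustration, and the node type names are only borrowed as labels. It shows the shape of the check, namely "a list whose first child is a defining symbol, followed by a symbol that is the name".

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy stand-in for a syntax tree node (not the tree-sitter API)."""
    type: str
    text: str = ""
    children: list = field(default_factory=list)


def sym(name):
    # Hypothetical helper: build a symbol node.
    return Node("sym_lit", text=name)


def lst(*children):
    # Hypothetical helper: build a list node.
    return Node("list_lit", children=list(children))


# Assumed set of symbols treated as "defining" forms.
DEFINING_SYMBOLS = {"def", "defn", "defmacro"}


def find_definitions(node):
    """Yield (defining symbol, name) for every list that looks like a definition."""
    if node.type == "list_lit":
        kids = node.children
        if (len(kids) >= 2
                and kids[0].type == "sym_lit"
                and kids[0].text in DEFINING_SYMBOLS
                and kids[1].type == "sym_lit"):
            yield kids[0].text, kids[1].text
    for child in node.children:
        yield from find_definitions(child)


# Models: (defn add [a b] (+ a b)) (println "hi")
root = Node("source", children=[
    lst(sym("defn"), sym("add"),
        Node("vec_lit", children=[sym("a"), sym("b")]),
        lst(sym("+"), sym("a"), sym("b"))),
    lst(sym("println"), Node("str_lit", text='"hi"')),
])

print(list(find_definitions(root)))  # [('defn', 'add')]
```

The check is purely structural, which is why it can live in the consuming editor and be adjusted without rebuilding the grammar.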

bbatsov commented 1 year ago

Thanks for the detailed explanations from both of you. Now I understand the situation a lot better. I think it would definitely make sense to document some of those design decisions and limitations, so it's easier for the end users to understand why certain things were done the way they were.

I'll also take a closer look at the resources shared by @sogaiu. Btw, it might also be a good idea to add some general "understanding tree-sitter and how major modes based on tree-sitter work" sections with some pointers to external resources, so potential new contributors would have a good starting point.

sogaiu commented 1 year ago

Re:

resources shared

There is a list of grammar repositories here along with some tree-sitter-related questions / summaries. In addition to the grammars mentioned earlier, there's at least one grammar for "scheme" and a fork-with-changes of the elisp one mentioned above.

There may be still others.

Re:

some general "understanding tree-sitter and how major modes based on tree-sitter work" sections with some pointers to external resources, so potential new contributors would have a good starting point.

I don't know how up-to-date the following is, but apart from looking through existing *-ts-mode.el files, I looked at the content beneath here when working on a different *-ts-mode.el file.

On the Emacs end of things, paying attention to the emacs-devel mailing list and the source repository seems to work for keeping up-to-date. I don't have a good idea of how stable things are / will be -- maybe that's something that could be asked about on the mailing list.

In the not-so-distant past, a discord server was started for discussing tree-sitter things. It was announced here. One of the maintainers (though AFAIK, not the original creator of tree-sitter) hangs out there and has been helpful. This channel of communication might be preferred over the tree-sitter repository's issues / discussions for some types of queries.

One thing I didn't mention earlier is that the grammar currently being used in clojure-ts-mode has existing users in other programs -- here is an incomplete list. I mention this as making major changes to this grammar at this point from a feature perspective may lead to breakage elsewhere, so I'm not so inclined to go in that direction [1]. That's not to say that a different one couldn't be created of course :)

[1] At least not without some way to find out who is using the grammar in what way and establishing good communication channels with those folks...not something that seems practical unfortunately.

dannyfreeman commented 1 year ago

I think it would definitely make sense to document some of those design decisions and limitations, so it's easier for the end users to understand why certain things were done the way they were.

I'll work on this, probably over the weekend. Either expanding on the README or a new markdown doc linked from the README.

I mention this as making major changes to this grammar at this point from a feature perspective may lead to breakage elsewhere, so I'm not so inclined to go in that direction [1].

And just to add to this, it is very hard to make changes to the grammar that are not breaking in some way to one of the downstream users of the grammar. Even adding new nodes could be breaking in some way, because before the change the same text captured by the new node was captured by a different type of node.
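A small sketch of how adding a node type can break a consumer, written in Python with invented data. The consumer function and the "def_form" node type are hypothetical; nothing here reflects an actual planned change to the grammar.

```python
def count_lists(node_types):
    """Hypothetical downstream consumer: counts forms reported as list_lit."""
    return sum(1 for t in node_types if t == "list_lit")


# Node types a parser might report for the source: (def a 1) (inc a)
before = ["list_lit", "list_lit"]

# The same source after an imagined grammar change that introduces "def_form".
after = ["def_form", "list_lit"]

print(count_lists(before))  # 2
print(count_lists(after))   # 1
```

The source text did not change, yet the consumer's answer did, because text that used to be captured as one node type is now captured as another.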

sogaiu commented 1 year ago

Re:

document some of those design decisions and limitations, so it's easier for the end users to understand why certain things were done the way they were. ...

I'll work on this, probably over the weekend. Either expanding on the README or a new markdown doc linked from the README.

In the last few days I revisited:

the (multiple) attempts I made before to add support for things like def and defn (even just those two) resulted in high levels of parsing errors

It turns out there is a feature that allows one to perform queries of supertypes. AFAIK, this isn't included in the official docs, though it is mentioned in the Tree-sitter 1.0 Checklist:

Document the ability to match against supertypes in queries with the expression/identifier syntax.

I have tried it out a bit:

making list_lit a supertype and
having list_lit then be a choice among a def, defn, and something generic

and at least so far it hasn't resulted in large numbers of parse errors.

I'm not sure yet whether this would be compatible with the existing grammar, but it might be worth further investigation.
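As a rough illustration of what matching against a supertype means, here is a toy Python model. The table and the subtype names (def_form, defn_form, generic_list) are invented for this sketch and are not taken from any existing grammar; the real feature lives in tree-sitter's query engine, which this does not call.

```python
# Hypothetical supertype table: "list_lit" stands for any of its subtypes.
SUPERTYPES = {"list_lit": {"def_form", "defn_form", "generic_list"}}


def matches(pattern, node_type):
    """True if node_type is the pattern itself or a subtype of it."""
    return node_type == pattern or node_type in SUPERTYPES.get(pattern, set())


node_types = ["def_form", "generic_list", "vec_lit"]

# Matching on the supertype picks up every kind of list.
print([t for t in node_types if matches("list_lit", t)])

# Matching on a subtype picks up only that more specific kind.
print([t for t in node_types if matches("def_form", t)])
```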

Addendum: It looks like I tried this out a while back and wrote about it here. It's not clear to me with the above approach whether one will be able to tell apart a use of (def a 1) from code that constructs a list with the 3 elements def, a, and 1 (e.g. in a macro definition).

dannyfreeman commented 1 year ago

It's not clear to me with the above approach whether one will be able to tell apart a use of (def a 1) from code that constructs a list with the 3 elements def, a, and 1 (e.g. in a macro definition).

This is the main problem with trying to apply semantic meaning to Clojure code with tree-sitter. Potentially you could detect a def inside a quoted list, with basically 2 divergent parse paths, where everything (lists, symbols, keywords, etc.) has a quoted_ variant that is used when parsing things inside a quoted form. What we will never be able to account for in tree-sitter is a macro like

(defmacro foo [some-list] ...)

and then calling it with

(foo (def x 1))

because tree-sitter has no way of understanding that foo macro, which could easily not be emitting a def form. Tree-sitter only looks forward, never back at what it has already read.

To really get an accurate understanding of Clojure source code, tooling needs context of what has already been parsed. It basically needs to execute the code. Knowing that, I'm content with the grammar as is and with writing some simple queries at run time to guess when something is a definition. It's more flexible to do it in Emacs Lisp. When we inevitably parse some weird def code wrong, it's quick to fix and doesn't force changes onto other consumers of the grammar.

All of this is good info to have in the documentation bbatsov is requesting. Going to try to work on it today. I'll probably draw from a lot of your prior art @sogaiu.
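The limitation described in this comment can be sketched in a few lines of Python. Forms are modeled as nested Python lists with strings standing in for symbols; this is an assumption made for illustration and not how tree-sitter or clojure-ts-mode represent code. A walker can carry a "quoted" flag downward, which roughly mirrors the quoted_ variant idea, but it has nothing to go on when the surrounding form is an arbitrary macro call.

```python
def find_defs(form, quoted=False):
    """Yield (name, inside_quote) for every list whose head is the symbol def."""
    if not isinstance(form, list) or not form:
        return
    if form[0] == "quote":
        quoted = True  # everything below a quote is data, not code
    elif form[0] == "def" and len(form) >= 2:
        yield form[1], quoted
    for child in form[1:]:
        yield from find_defs(child, quoted)


top_level = ["def", "a", 1]               # (def a 1)
quoted_form = ["quote", ["def", "b", 1]]  # '(def b 1)
macro_arg = ["foo", ["def", "x", 1]]      # (foo (def x 1))

for form in (top_level, quoted_form, macro_arg):
    print(list(find_defs(form)))
```

The quoted case is flagged correctly, but the def passed to foo is reported exactly like a top-level definition, even though foo may never emit a def form. Deciding that would require knowing what foo expands to.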

sogaiu commented 1 year ago

@dannyfreeman FYI, maybe you've seen already but I've updated some of the docs in one of the pre-release branches. The content there may be helpful / more accurate than what is currently "released".

bbatsov commented 1 year ago

To really get an accurate understanding of Clojure source code, tooling needs context of what has already been parsed. It basically needs to execute the code. Knowing that, I'm content with the grammar as is and with writing some simple queries at run time to guess when something is a definition. It's more flexible to do it in Emacs Lisp. When we inevitably parse some weird def code wrong, it's quick to fix and doesn't force changes onto other consumers of the grammar.

I'd focus first on the more common cases (provide accurate indentation for the known Clojure special forms and built-in macros that do something with code) and not fret too much about what macros might do in general. I've always been fond of tackling problems a step at a time - getting something perfect from the get-go is a pretty tall order.

If we can add some mechanism where end users can specify the indentation rules for their own macros, as in the current clojure-mode, that'd be pretty nice as well.

dannyfreeman commented 1 year ago

I'd focus first on the more common cases (provide accurate indentation for the known Clojure special forms and built-in macros that do something with code) and not fret too much about what macros might do in general. I've always been fond of tackling problems a step at a time - getting something perfect from the get-go is a pretty tall order.

That's my general strategy. You can see it somewhat in the current clojure-ts-mode font-lock rules that check for common definition forms (def, defn, defmacro, etc). For the time being that's all clojure-ts-mode will be capable of.

If we can add some mechanism where end users can specify the indentation rules for their own macros, as in the current clojure-mode, that'd be pretty nice as well.

I agree, it would be nice to get there. I do not believe we will be able to use the provided treesit-simple-indent-rules mechanism for that; it is not dynamic enough. We'll have to write some more complex code for that. It is an attainable goal, just a more long-term one.
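For what "dynamic" could mean here, a minimal Python sketch of a rule table that users can extend at run time. The registry, the function names, the with-db macro, and the simplified two-case rule are all hypothetical; this is neither clojure-mode's actual indentation algorithm nor a proposal for how clojure-ts-mode should implement it.

```python
# Hypothetical registry: macro name -> number of leading "special" arguments.
INDENT_RULES = {"let": 1, "when": 1, "defn": 1}


def register_indent(macro_name, special_args):
    """Let a user add a rule for their own macro at run time."""
    INDENT_RULES[macro_name] = special_args


def indent_column(head, arg_index, open_paren_col):
    """Column for argument number arg_index (0-based) when it starts a new line."""
    special = INDENT_RULES.get(head)
    if special is not None and arg_index >= special:
        # Body form of a known macro: two columns past the opening paren.
        return open_paren_col + 2
    # Otherwise align under the first argument, as for a plain function call.
    return open_paren_col + 1 + len(head) + 1


# Before registration, with-db is indented like an ordinary function call.
print(indent_column("with-db", 1, 0))  # 9

# After the user registers a rule, its body is indented like a macro body.
register_indent("with-db", 1)
print(indent_column("with-db", 1, 0))  # 2
```

The point of the sketch is only that the rule lookup happens at indentation time against a mutable table, rather than being fixed when the mode's rules are defined.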

dannyfreeman commented 1 year ago

Thanks for the detailed explanations from both of you. Now I understand the situation a lot better. I think it would definitely make sense to document some of those design decisions and limitations, so it's easier for the end users to understand why certain things were done the way they were.

I've got a document going for this now BTW https://github.com/clojure-emacs/clojure-ts-mode/blob/main/doc/design.md

I plan on expanding it more soon, but it is a good start I think.

sogaiu commented 1 year ago

Possibly useful to make a distinction between concrete syntax trees and abstract syntax trees. See this section for some details.

dannyfreeman commented 1 year ago

Added. The named vs anonymous nodes were useful to describe as well.

sogaiu commented 1 year ago

Some minor things:

Looks like there's a bit about "abstract" left from before:

The generated parsers can create abstract syntax trees from source code text.

Also, maybe there's a stray character at the end of the following?

In clojure-ts-mode, "(" and ")" are anonymous nodes.n

dannyfreeman commented 1 year ago

Great catches, I fixed that up. Thank you :)

clojure-emacs / clojure-ts-mode

Modifying the tree-sitter grammar #4