Wilfred / difftastic

a structural diff that understands syntax 🟥🟩
https://difftastic.wilfred.me.uk/
MIT License

Add support for dynamically loading grammar libraries at runtime #123

Open jirutka opened 2 years ago

jirutka commented 2 years ago

Can you please add support for dynamically loading (any) grammar libraries at runtime? The current approach of vendoring every existing grammar and linking them at build time does not scale and is not compatible with Linux distributions.

This issue has already been solved, e.g. in diffsitter (https://github.com/afnanenayet/diffsitter/pull/177) and helix (https://github.com/helix-editor/helix/pull/432).

You can either load a grammar library for the specified language from the system library path, like any other system library (diffsitter’s approach). The path is typically /lib:/usr/lib:/usr/local/lib, but it may vary between Linux distros; the point is that you don’t depend on an exact path, because the lookup is handled by libc. Grammar libraries are expected to be named libtree-sitter-<lang>.so.

Or you can look up grammar libraries directly in some predefined directory (e.g. /usr/lib/tree-sitter or some app-specific dir) named as <lang>.so (helix’s approach).

The former is better, but both are acceptable for Linux distros – in the latter case, we can just symlink <your-directory-for-grammars> to e.g. /usr/lib/tree-sitter where system-wide grammars are installed by the system’s package manager (and used by all programs that use tree-sitter). These approaches are not mutually exclusive – you can support both.
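The two lookup conventions above can be sketched roughly as follows. This is only an illustration that builds candidate names, it does not load anything; the function names are hypothetical, the `.so` suffix is Linux-specific, and the hyphen-to-underscore rule for the `tree_sitter_<lang>` entry point is an assumption based on how grammar names are usually spelled.

```rust
use std::path::{Path, PathBuf};

/// Bare file name to hand to the dynamic loader (system library approach).
/// Because it has no directory part, the loader's own search path applies.
fn system_library_name(lang: &str) -> String {
    format!("libtree-sitter-{}.so", lang)
}

/// Path of a grammar inside a dedicated grammar directory (`<lang>.so`).
fn grammar_dir_path(grammar_dir: &Path, lang: &str) -> PathBuf {
    grammar_dir.join(format!("{}.so", lang))
}

/// Name of the C entry point a grammar library exports.
/// Assumption: hyphens in the language name become underscores.
fn language_symbol(lang: &str) -> String {
    format!("tree_sitter_{}", lang.replace('-', "_"))
}

/// Candidates in the order they would be tried when both approaches
/// are supported: system library first, then the grammar directory.
fn grammar_candidates(grammar_dir: &Path, lang: &str) -> Vec<PathBuf> {
    vec![
        PathBuf::from(system_library_name(lang)),
        grammar_dir_path(grammar_dir, lang),
    ]
}
```

A caller would try each candidate with a dynamic-loading facility (for example `dlopen`, or a crate wrapping it) and then look up the symbol returned by `language_symbol`.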

Also, you can continue to provide some grammars with difftastic (as both diffsitter and helix do; they use git submodules) for the convenience of users who build difftastic themselves and whose distro doesn’t provide tree-sitter grammars yet (for example, Alpine Linux and Arch Linux (AUR) already provide many grammars). The aim is to allow package maintainers to package grammars separately and give users the freedom to choose which grammars they install.

(I opened the same issue also in https://github.com/BrianHicks/tree-grepper/issues/88)

Wilfred commented 2 years ago

Ooh, are you interested in packaging difftastic @jirutka? That'd be really neat.

This seems like a reasonable feature to add, but it'll require a few fiddly details:

(1) Difftastic doesn't just use the parser's C library, it uses the syntax highlighting file(s) (usually named queries/highlights.scm) too.

(2) For each language, difftastic also has configuration that specifies when to use the parser (e.g. file extensions) and how to use it (which AST nodes map to difftastic's AST).

(3) Difftastic doesn't provide any configuration files yet, just a few CLI options and environment variables. Typically difftastic is invoked from git or hg rather than by the user, so I've generally preferred environment variables so far. I'm not sure what a good configuration system looks like for difftastic, so it'll need some design work.

(4) In a few cases I've forked parsers. I try to upstream changes but some parsers are essentially abandoned.

jirutka commented 2 years ago

> Ooh, are you interested in packaging difftastic @jirutka? That'd be really neat.

Yes, I’m an Alpine Linux developer, and difftastic looks like a very promising and useful project.

> (1) Difftastic doesn't just use the parser's C library, it uses the syntax highlighting file(s) (usually named queries/highlights.scm) too.

So does Helix, for example. However, Helix is not the best inspiration in this matter, because they have vendored all queries and it’s unclear which are just copied from the respective grammar projects and which are modified. They install the queries into $HELIX_RUNTIME/queries/<lang>, where $HELIX_RUNTIME is typically /usr/share/helix/runtime when installed from a package provided by a (Linux) distribution.

On Alpine Linux, we package queries (.scm files) together with compiled tree-sitter grammars and install them into /usr/share/tree-sitter/queries/<lang>/.

The ideal solution would be to add support for searching queries in multiple base directories (per-site configurable or hard-coded, searched in this order):

  1. user directory for queries provided/installed by user (difftastic/queries under $XDG_DATA_HOME on Linux; see dirs crate)
  2. (system) directory with custom(ized) queries provided by difftastic (e.g. /usr/share/difftastic/queries on Unix-like systems)
  3. system directory with stock queries provided with the grammars (/usr/share/tree-sitter/queries on Unix-like systems)
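The search order in the list above could be sketched like this. It is a minimal illustration using only the standard library rather than the dirs crate; the function names are hypothetical, the `~/.local/share` fallback follows the XDG default for an unset `$XDG_DATA_HOME`, and the `<lang>/<file>` layout under each base directory is assumed to match the Alpine layout mentioned earlier.

```rust
use std::path::{Path, PathBuf};

/// Ordered base directories for query files, highest priority first.
/// `xdg_data_home` and `home` would normally be read from the environment.
fn query_base_dirs(xdg_data_home: Option<&str>, home: Option<&str>) -> Vec<PathBuf> {
    let mut dirs = Vec::new();

    // 1. User directory: $XDG_DATA_HOME, else ~/.local/share (XDG default).
    let user_data = match (xdg_data_home, home) {
        (Some(x), _) if !x.is_empty() => Some(PathBuf::from(x)),
        (_, Some(h)) => Some(Path::new(h).join(".local/share")),
        _ => None,
    };
    if let Some(d) = user_data {
        dirs.push(d.join("difftastic/queries"));
    }

    // 2. Custom(ized) queries shipped by difftastic itself.
    dirs.push(PathBuf::from("/usr/share/difftastic/queries"));
    // 3. Stock queries installed alongside the grammars.
    dirs.push(PathBuf::from("/usr/share/tree-sitter/queries"));
    dirs
}

/// Return the first `<base>/<lang>/<file>` for which `exists` is true.
/// The existence check is injected so the lookup can be tested without
/// touching the filesystem.
fn find_query(
    dirs: &[PathBuf],
    lang: &str,
    file: &str,
    exists: impl Fn(&Path) -> bool,
) -> Option<PathBuf> {
    dirs.iter()
        .map(|d| d.join(lang).join(file))
        .find(|p| exists(p.as_path()))
}
```

In real use the check would be something like `|p| p.is_file()`, e.g. `find_query(&dirs, "rust", "highlights.scm", |p| p.is_file())`. Because the user directory comes first, a user-installed query shadows both system locations.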

The last two would be basically intended for the distributions packaging difftastic. In the case of manual installations, i.e. when the user builds difftastic (or downloads a tarball with a prebuilt binary and other files) and installs it themselves, difftastic can provide an archive of bundled queries (both stock and custom; basically what you have now) and instruct the user (or provide a script for that) to copy them all to the first directory (which is inside $HOME).

Note: the /usr/share prefix shouldn’t be hard-coded; it should be build-time configurable through some mechanism in cargo, but I don’t remember how cargo handles this.

> (3) Difftastic doesn't provide any configuration files yet, just a few CLI options and environment variables. Typically difftastic is invoked from git or hg rather than by the user, so I've generally preferred environment variables so far. I'm not sure what a good configuration system looks like for difftastic, so it'll need some design work.

Making the paths user-configurable would be great, but I don’t think it’s necessary – convention over configuration would be IMO sufficient. The system locations, primarily used by distributions, can be just build-time configurable (even a constant in a Rust source file with some sensible default, which can be easily patched if needed, would be sufficient, at least for Alpine).
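One possible shape for such a build-time default is a constant filled in by the standard `option_env!` macro, which reads an environment variable when the crate is compiled. This is a sketch only: the variable name `DIFFTASTIC_DATA_PREFIX` and the function name are hypothetical, not something difftastic defines.

```rust
use std::path::PathBuf;

/// Install prefix for shared data, fixed at compile time.
/// `DIFFTASTIC_DATA_PREFIX` is a hypothetical variable name; when it is
/// not set during the build, the default below is compiled in instead.
const DATA_PREFIX: &str = match option_env!("DIFFTASTIC_DATA_PREFIX") {
    Some(prefix) => prefix,
    None => "/usr/share",
};

/// System-wide query directories derived from a prefix: first the queries
/// customized by difftastic, then the stock ones shipped with the grammars.
fn system_query_dirs(prefix: &str) -> [PathBuf; 2] {
    let base = PathBuf::from(prefix);
    [
        base.join("difftastic/queries"),
        base.join("tree-sitter/queries"),
    ]
}
```

A packager could then build with the variable set in the environment (for example `DIFFTASTIC_DATA_PREFIX=/usr/local/share cargo build --release`) and the program would call `system_query_dirs(DATA_PREFIX)`. A distribution that prefers patching can simply change the default string.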

For the user locations, there’s a convenient cross-platform Rust crate dirs that abstracts from the details of each platform.

> (4) In a few cases I've forked parsers. I try to upstream changes but some parsers are essentially abandoned.

Upstreaming is the way to go. :+1: If some parser is abandoned, you can promote your fork as an up-to-date replacement and package maintainers will catch on (unless there are several competing forks and it’s unclear which one should be preferred).

Wilfred commented 1 year ago

https://github.com/Wilfred/difftastic/issues/438#issuecomment-1368348008 discusses the different approaches being used in the tree-sitter ecosystem for this.