getzola / zola

A fast static site generator in a single binary with everything built-in. https://www.getzola.org
MIT License

Investigate tree-sitter to replace syntect #1787

Open Keats opened 2 years ago

Keats commented 2 years ago

Has anyone used it? The last time I looked at tree-sitter it didn't have many grammars but a quick look shows it's getting better. Our syntect syntaxes are stuck on old versions of the grammars because of new features in the Sublime grammar format not supported by Syntect. See https://github.com/nvim-treesitter/nvim-treesitter#supported-languages for a list of supported languages.

An alternative would be a basic textmate highlighter using VSCode syntaxes/themes since that's what everyone seems to be using these days.

Jieiku commented 1 year ago

I took a look at shiki and it looks pretty nice. My understanding is that it uses grammars similar to syntect, but in the case of shiki it is able to make it look exactly like vscode. I like shiki because it seems like it would be more flexible.

rauschma commented 1 year ago

A Rust port of Shiki would be great!

FWIW – Pandoc uses the syntax highlighting library skylighting (written in Haskell). But I don’t know if it would be easier to port or not.

Two more popular libraries:

Keats commented 1 year ago

With shiki you get all the syntaxes/themes from VSCode for free, which is the main draw. Otherwise a port of something like pygments, prism, highlight.js would work but it's less interesting.

Keavon commented 1 year ago

Shiki definitely looks like the best option if an effort to port it will happen!

Keats commented 11 months ago

There's some very promising work on improving tree-sitter start time: https://github.com/tree-sitter/tree-sitter/pull/2374

mwcz commented 11 months ago

I took a look at Shiki to determine how much work a Rust port would be. It doesn't seem too hard, but there's one hitch: TM grammars use Oniguruma regex, and there's no Rust port of that either, just FFI bindings. Porting that would be much more difficult than porting Shiki, since Oniguruma is 85,000 lines of C vs Shiki's 7,000 lines of TS. The FFI bindings could work, but only if @Keats is okay with having Oniguruma be a build-time dependency statically linked into zola.

The above is all moot of course if someone can show the Oniguruma regex syntax to be close enough to regex (or some other rust regex crate) that practically every TM syntax file out there would be supported.
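
One rough way to gauge that compatibility question is to scan the regex patterns in a grammar for constructs that the `regex` crate documents as unsupported, namely look-around and backreferences. The sketch below is a heuristic only: the `Unsupported` struct and `scan_pattern` function are hypothetical names, it does not parse character classes (so `[(?=]` would be a false positive), and it ignores other Oniguruma-only features.

```rust
/// Which constructs unsupported by the `regex` crate were spotted in a pattern.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Unsupported {
    pub look_around: bool,
    pub backreference: bool,
}

/// Heuristic scan of an Oniguruma-style pattern (hypothetical helper).
/// It walks the pattern once, skipping escaped characters, and flags
/// `(?=`, `(?!`, `(?<=`, `(?<!` as look-around and `\1`..`\9` or `\k`
/// as backreferences.
pub fn scan_pattern(pattern: &str) -> Unsupported {
    let chars: Vec<char> = pattern.chars().collect();
    let mut found = Unsupported::default();
    let mut i = 0;
    while i < chars.len() {
        if chars[i] == '\\' {
            if let Some(&next) = chars.get(i + 1) {
                if ('1'..='9').contains(&next) || next == 'k' {
                    found.backreference = true;
                }
            }
            // Skip the escaped character so that `\\1` (a literal backslash
            // followed by the digit 1) is not mistaken for a backreference.
            i += 2;
            continue;
        }
        if chars[i] == '(' {
            let head: String = chars[i..].iter().take(4).collect();
            if head.starts_with("(?=")
                || head.starts_with("(?!")
                || head.starts_with("(?<=")
                || head.starts_with("(?<!")
            {
                found.look_around = true;
            }
        }
        i += 1;
    }
    found
}
```

Running something like this over every `match`, `begin` and `end` pattern in a set of grammars would give a first estimate of how many would need Oniguruma (or another backtracking engine) rather than `regex`.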

Keats commented 11 months ago

That's what we do with syntect already, through https://crates.io/crates/onig

mwcz commented 11 months ago

My bad, I didn't see it mentioned in the guide for installing from source.

An easy starting point for porting shiki seems to be handling TM grammars. I couldn't find a Rust implementation of a TM grammar deserializer, so I started one here: https://github.com/mwcz/textmate-grammar-rs. I could use some help finishing it up. Or, if there is a crate out there that I missed, please correct me. :sweat_smile:

Keats commented 11 months ago

There's no textmate parser in Rust afaik, I had a look before :/ I did something slightly similar with a WIP pygments parser but didn't get very far.

In a world where loading tree-sitter is fast (< 50ms) and could be improved further (e.g. a Zola user could list the languages they use in config.toml so we only load those), which one would we prefer between tree-sitter and shiki?

Advantages of tree-sitter:

Cons of tree-sitter:

Pros of shiki-like:

Cons of shiki-like:
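
The "only load the languages listed in the config" idea mentioned above could look roughly like the following. This is a minimal sketch under stated assumptions: `GrammarRegistry`, `Grammar` and `load_selected` are hypothetical names, the grammar type is a stand-in for a real tree-sitter or TextMate grammar, and no such config key exists in Zola today.

```rust
use std::collections::HashMap;

/// Stand-in for a compiled grammar; a real one would wrap a tree-sitter
/// language or a parsed TextMate grammar (assumption for illustration).
#[derive(Debug)]
pub struct Grammar {
    pub name: &'static str,
}

/// A loader builds its grammar only when it is called.
type Loader = fn() -> Grammar;

/// Hypothetical registry: every known language maps to a loader that has
/// not run yet, so startup cost is paid per requested language.
pub struct GrammarRegistry {
    loaders: HashMap<&'static str, Loader>,
}

impl GrammarRegistry {
    pub fn new() -> Self {
        let mut loaders: HashMap<&'static str, Loader> = HashMap::new();
        loaders.insert("rust", || Grammar { name: "rust" });
        loaders.insert("python", || Grammar { name: "python" });
        loaders.insert("toml", || Grammar { name: "toml" });
        Self { loaders }
    }

    /// Load only the languages named in the site config. Unknown names are
    /// returned separately so the caller can warn about them.
    pub fn load_selected(&self, wanted: &[&str]) -> (HashMap<String, Grammar>, Vec<String>) {
        let mut loaded = HashMap::new();
        let mut unknown = Vec::new();
        for lang in wanted {
            let key = lang.to_ascii_lowercase();
            match self.loaders.get(key.as_str()) {
                Some(loader) => {
                    loaded.insert(key, loader());
                }
                None => unknown.push(key),
            }
        }
        (loaded, unknown)
    }
}
```

With a config listing `["rust", "toml"]`, only those two loaders would run and the Python grammar would never be constructed.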

Jieiku commented 11 months ago

I am curious how long it would take to port shiki to Rust... days? weeks? months? A Rust port of shiki could be a really fun project if it's not too large a project to get something up and running in a reasonable amount of time. (a couple hours to implement tree-sitter in Zola is certainly fast!)

mwcz commented 11 months ago

The TextMate grammar parser is about all I have time for, but I can continue to improve it if someone else (@Jieiku??? :grin:) is interested in doing the rest of Shiki. I really don't want to clutter this tree-sitter issue with updates about TM grammars, so here's a last update, unless things do start moving strongly towards a shiki port.

textmate-grammar-rs update

The TM grammar description is extremely loose and thus raises a lot of questions when implementing a parser. I chose to only capture fields that are defined in the official TM grammar description, which leaves _many many_ non-standard fields uncaptured. Still, I got it to a point where it can parse a good chunk of the grammars included with Shiki.

![image](https://github.com/getzola/zola/assets/364615/7ce67b01-6046-4d34-a941-77f78e4151be)

3 of the failures are minor and can be fixed with a little more serde tweaking. The remaining 38 failures are from grammars with regex patterns incompatible with Oniguruma (eg, `Oniguruma error: invalid backref number/name`). How they work in shiki, or if they work at all, I have no idea. Not really sure what to do about these. :shrug:
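
For readers unfamiliar with the format, here is a rough sketch of what "only the documented fields" can look like as plain Rust types. It is not the layout used by textmate-grammar-rs (which uses serde); the struct and method names are assumptions, and `captures`, `beginCaptures` and `endCaptures` are left out for brevity.

```rust
use std::collections::HashMap;

/// One rule in a TextMate grammar. Only keys named in the TextMate
/// language grammar documentation are modelled here.
#[derive(Debug, Default, Clone)]
pub struct Rule {
    pub name: Option<String>,         // `name`: scope assigned to the match
    pub match_: Option<String>,       // `match`: single-line regex
    pub begin: Option<String>,        // `begin`: start of a multi-line region
    pub end: Option<String>,          // `end`: end of a multi-line region
    pub content_name: Option<String>, // `contentName`: scope for the region body
    pub include: Option<String>,      // `include`: "#key", "$self", or a scope name
    pub patterns: Vec<Rule>,          // `patterns`: nested rules
}

/// Top-level grammar document.
#[derive(Debug, Default)]
pub struct Grammar {
    pub scope_name: String,                // `scopeName`, e.g. "source.rust"
    pub file_types: Vec<String>,           // `fileTypes`
    pub patterns: Vec<Rule>,               // `patterns`
    pub repository: HashMap<String, Rule>, // `repository`: named, reusable rules
}

impl Grammar {
    /// Resolve a local include of the form "#key" against the repository.
    /// Returns `None` for "$self", for external scope names, and for
    /// keys that are not present.
    pub fn resolve_include(&self, include: &str) -> Option<&Rule> {
        let key = include.strip_prefix('#')?;
        self.repository.get(key)
    }
}
```

Any non-standard key in a grammar file simply has no field to land in, which is the trade-off described above.
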

typesanitizer commented 9 months ago

> potentially faster highlighting? (anyone has benchmarks between tree-sitter and syntect?)

At Sourcegraph, we've switched from syntect to tree-sitter for major languages because of performance. I did some benchmarking in Dec 2021, here's the performance report.

We haven't been able to drop syntect because of the long tail of languages not supported by tree-sitter.

Some differences between the Sourcegraph and Zola use cases:

For small snippets, the highlighting performance probably doesn't matter as much, and syntect's typical speed of about 50k SLOC/s per core should be good enough. That said some grammars like Scala and C# were, depending on the code, about an order of magnitude slower, and we'd not infrequently hit 10s timeouts.
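
To put those figures in perspective, a back-of-the-envelope calculation helps. The helper below is purely illustrative (the function name and the example sizes are assumptions, not measurements): at 50k SLOC/s a 100-line snippet takes about 2 ms, while at roughly 5k SLOC/s (one reading of "an order of magnitude slower") a 50,000-line file takes about 10 s.

```rust
/// Estimated highlighting time in microseconds for `lines` lines of code
/// at a throughput of `sloc_per_sec` lines per second (integer arithmetic).
pub fn highlight_micros(lines: u64, sloc_per_sec: u64) -> u64 {
    lines * 1_000_000 / sloc_per_sec
}
```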

Keats commented 9 months ago

Thanks for the perf report! It's really all about the startup time for us in practice so the PR linked above for tree-sitter (or similar) is a requirement to be usable. Not much activity on it sadly

Keats commented 8 months ago

https://github.com/tree-sitter/tree-sitter/pull/2594#issuecomment-1716623829 so it should come eventually!

si14 commented 2 months ago

I was today years old when I realised that even async fn is not highlighted :(

Screenshot 2024-05-13 at 14 33 08

Maybe at some point the build time hit will become tolerable (even if tree-sitter doesn't land those caching PRs) just to have up-to-date parsers?

jalil-salame commented 2 months ago

> I was today years old when I realised that even async fn is not highlighted :(
>
> Maybe at some point the build time hit will become tolerable (even if tree-sitter doesn't land those caching PRs) just to have up-to-date parsers?

This is because syntect doesn't support newer syntax files. I'm working on improving that so it might not be necessary in the future. No promises though, I'm strapped for time...

phisch commented 2 months ago

For me personally, I'd rather have an (even significant) performance hit, but better syntax highlighting. There is also the option to implement both, make syntect the default (for performance), and tree-sitter an optional alternative through a config option.

Current syntax highlighting is just a bit disappointing in most cases I have used so far.
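
The "implement both behind a config option" suggestion could be modelled along these lines. This is only a sketch: the `HighlightBackend` enum, the `backend_from_config` function and the accepted string values are hypothetical, and Zola has no such option today.

```rust
use std::str::FromStr;

/// Hypothetical choice of highlighting engine.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HighlightBackend {
    Syntect,
    TreeSitter,
}

impl Default for HighlightBackend {
    /// syntect stays the default so existing sites keep their build times.
    fn default() -> Self {
        HighlightBackend::Syntect
    }
}

impl FromStr for HighlightBackend {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "syntect" => Ok(HighlightBackend::Syntect),
            "tree-sitter" | "treesitter" => Ok(HighlightBackend::TreeSitter),
            other => Err(format!("unknown highlight backend: {}", other)),
        }
    }
}

/// Pick the backend from an optional config value; a missing value
/// falls back to the default.
pub fn backend_from_config(value: Option<&str>) -> Result<HighlightBackend, String> {
    match value {
        None => Ok(HighlightBackend::default()),
        Some(s) => s.parse(),
    }
}
```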

Walnut356 commented 2 months ago

If anyone wants an (admittedly jank) solution for the time being, .sublime-syntax files are essentially just YAML files with regex instructions inside and aren't all that hard to modify in-place. Zola doesn't really care if it matches whatever Sublime Text actually wants/expects, so you can define new regex matches and/or apply whatever custom scopes you want. If you use the `highlight_theme = "css"` config option, Zola will automatically apply your scopes as CSS classes from the modified sublime-syntax file and then you can manually style those classes yourself. I'm not an expert at regex or CSS, nor have I ever used Sublime Text, but I was able to get this working with a few hours of effort.
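
To make the scope-to-class step concrete, here is a small sketch of how a scope name can be turned into CSS classes. It mimics the "one prefixed class per scope atom" style and assumes the `z-` prefix that Zola's CSS highlighting output uses; the function names and the custom scope in the usage note are illustrative, not Zola's or syntect's actual API.

```rust
/// Turn a scope such as "keyword.control.rust" into space-separated CSS
/// classes, one per dot-separated atom, each carrying `prefix`.
pub fn scope_to_classes(scope: &str, prefix: &str) -> String {
    scope
        .split('.')
        .filter(|atom| !atom.is_empty())
        .map(|atom| format!("{}{}", prefix, atom))
        .collect::<Vec<_>>()
        .join(" ")
}

/// Minimal HTML escaping for text placed inside a span.
fn escape_html(text: &str) -> String {
    text.replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
        .replace('"', "&quot;")
}

/// Wrap a piece of source text in a span carrying the classes for `scope`.
pub fn wrap_span(scope: &str, prefix: &str, text: &str) -> String {
    format!(
        "<span class=\"{}\">{}</span>",
        scope_to_classes(scope, prefix),
        escape_html(text)
    )
}
```

A custom scope like `storage.modifier.async.rust` added to the syntax file would then show up as the classes `z-storage z-modifier z-async z-rust`, which a stylesheet can target with a selector such as `.z-storage.z-modifier`.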

Here's a .zip of the files I'm using for Rust currently - the sublime-syntax file is based off of Rust Enhanced, styled to look like One Dark in VSCode. The modifications aren't pretty, but it does the job. Below is an example screenshot from my website:

image