getzola / zola

A fast static site generator in a single binary with everything built-in. https://www.getzola.org
MIT License

Investigate tree-sitter to replace syntect #1787

Open Keats opened 2 years ago

Keats commented 2 years ago

Has anyone used it? The last time I looked at tree-sitter it didn't have many grammars but a quick look shows it's getting better. Our syntect syntaxes are stuck on old versions of the grammars because of new features in the Sublime grammar format not supported by Syntect. See https://github.com/nvim-treesitter/nvim-treesitter#supported-languages for a list of supported languages.

An alternative would be a basic textmate highlighter using VSCode syntaxes/themes since that's what everyone seems to be using these days.

mwcz commented 2 years ago

I haven't used tree-sitter as a library but it's really nice in nvim.

jakelogemann commented 2 years ago

👍 ts is the bee's knees.

The official tree-sitter-highlight crate has a few nice examples in the README... I haven't used it before, but I'm interested to try...

Keats commented 2 years ago

If it's adopted (I don't know yet, I need to see the theming capabilities and just try it on various inputs), it would be its own package that can probably be on crates.io as well. I'd like to move all the line numbers/highlights etc. into it.

Keats commented 2 years ago

List of parsers from neovim: https://github.com/nvim-treesitter/nvim-treesitter/blob/master/lua/nvim-treesitter/parsers.lua

Jieiku commented 2 years ago

What would the tree-sitter output look like? Would it list a ton of classes in the generated HTML like the current syntect solution does when you use CSS mode? Or would it use classes that refer to CSS variables which we can then set to specific colors and styles, e.g. --z-1, --z-2, --z-3, etc., with a couple of modifiers for bold, italic, and underline: --z-b, --z-i, --z-u?

Keats commented 2 years ago

Ideally it would be the same kind as the current syntect output.

Jieiku commented 2 years ago

All the class definitions make the page source code much larger in size, but I can see how it would make things simpler as far as generating goes. If you simply used classes which refer to colors, then you would have to have some sort of lookup table per programming language. (because a bracket in one language might be colored, but in another language it might not be colored or colored differently)

I just wish there was a simple way to have much leaner generated html for syntax highlighting while using the css method.
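
To make the "leaner HTML" idea concrete, here is a minimal sketch of an emitter that wraps each coloured token in a single short class. It is not how syntect or giallo render anything: the `z-N` class names follow the naming suggested earlier in this thread, and `render_compact`, `escape_html` and the numeric palette slots are hypothetical. A stylesheet would then carry rules such as `.z-1 { color: var(--z-1); }`.

```rust
/// Escape the characters that are unsafe inside HTML text.
fn escape_html(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for ch in text.chars() {
        match ch {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            _ => out.push(ch),
        }
    }
    out
}

/// Render (text, palette slot) pairs as HTML.
/// A slot of `None` means "no colour": the text is emitted without a span.
/// The `z-` class prefix is an arbitrary, hypothetical choice.
fn render_compact(tokens: &[(&str, Option<u8>)]) -> String {
    let mut html = String::new();
    for (text, slot) in tokens {
        match slot {
            Some(n) => {
                html.push_str(&format!(
                    "<span class=\"z-{}\">{}</span>",
                    n,
                    escape_html(text)
                ));
            }
            None => html.push_str(&escape_html(text)),
        }
    }
    html
}
```

The per-language lookup table mentioned above would be what decides which slot a given token gets; that part is left out here.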

Keats commented 1 year ago

I have the HTML renderer working, now to figure out how to use a VS Code theme to link scopes with tree-sitter to know which colour to use...

Keats commented 1 year ago

I think I'll forget the VSCode themes as they can be in JSON, YAML or even JS. It can probably just be a tiny ~20-line key-value ini file since it's not like we are going to have hundreds of scopes.

Not much to show so far but I've set up https://github.com/getzola/giallo which is still very much a clone of the html renderer in tree-sitter-highlight so far since I got stuck on theming. The plan is to hardcode a theme in giallo for now and compare the output with the same theme in VSCode to make sure it's kinda close. The main issue is that the theme example I took (OneDark-Pro.json) has tons of language dependent colours (with scopes like constant.other.php) so it's never going to be really close if it's a common thing to have a lot of language-specific scopes. Does anyone know more about that?
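
A minimal sketch of what such a key-value theme could look like in code, assuming a hypothetical `scope = colour` line format with `;` comments (giallo's real theme format is not described in this thread). The lookup falls back from a language-specific scope such as constant.other.php to its dotted parents, which is one possible way to tolerate themes with many language-specific scopes. The function names and colour values are illustrative only.

```rust
use std::collections::HashMap;

/// Parse `scope = colour` lines. Blank lines and lines starting with `;`
/// are skipped. This file format is an assumption made for the sketch.
fn parse_theme(source: &str) -> HashMap<String, String> {
    let mut theme = HashMap::new();
    for line in source.lines() {
        let line = line.trim();
        if line.is_empty() || line.starts_with(';') {
            continue;
        }
        if let Some((scope, colour)) = line.split_once('=') {
            theme.insert(scope.trim().to_string(), colour.trim().to_string());
        }
    }
    theme
}

/// Look up a scope, falling back to its dotted parents:
/// `constant.other.php` -> `constant.other` -> `constant`.
fn colour_for<'a>(theme: &'a HashMap<String, String>, scope: &str) -> Option<&'a str> {
    let mut current = scope;
    loop {
        if let Some(colour) = theme.get(current) {
            return Some(colour.as_str());
        }
        match current.rfind('.') {
            // Drop the last dotted segment and try again.
            Some(idx) => current = &current[..idx],
            None => return None,
        }
    }
}
```

With a theme that only defines `constant`, a lookup for `constant.other.php` resolves to the colour of `constant`, so the language-specific detail is lost but the token is still coloured.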

Jieiku commented 1 year ago

This solution will be able to work with CSS, right? I ask because the description in the top right of giallo says:

"Syntax highlighter to HTML using tree-sitter, using VSCode theme". The wording does not mention CSS (e.g. "HTML/CSS").

Really appreciate the work on this, I do hope css will still be supported, let me know if you need any help/testing.

I am actually ok without language specific scopes on most things because the resulting output will likely be a lot leaner.

I cannot speak to what is normal because a LOT of my editing over the years was in Notepad, until a few years ago when I switched to using Atom. Just recently I switched to Kate because of how long Atom takes to load.

Keats commented 1 year ago

I do hope css will still be supported

Yes, it's just a matter of exporting a theme as CSS, which is trivial.
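
As a rough illustration of why the export is simple, here is a sketch that turns a scope-to-colour map into CSS rules. The `theme_to_css` name, the `z-` class prefix and the dot-to-dash conversion are assumptions for this example, not giallo's actual output format.

```rust
use std::collections::BTreeMap;

/// Turn scope -> colour pairs into one CSS rule per scope.
/// Dots in a scope become dashes so the scope can serve as a class name.
/// A BTreeMap keeps the output order stable between runs.
fn theme_to_css(theme: &BTreeMap<&str, &str>) -> String {
    let mut css = String::new();
    for (scope, colour) in theme {
        css.push_str(&format!(
            ".z-{} {{ color: {}; }}\n",
            scope.replace('.', "-"),
            colour
        ));
    }
    css
}
```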

Jieiku commented 1 year ago

The main issue is that the theme example I took (OneDark-Pro.json) has tons of language dependent colours (with scopes like constant.other.php) so it's never going to be really close if it's a common thing to have a lot of language-specific scopes. Does anyone know more about that?

So are you mostly asking for feedback from people that use VSCode? Are you asking if it is common to have a lot of language specific scopes, or are you asking if it would be ok to have less language-specific scopes?

I can install vscode in a VM and play around with it for a bit (never used it before)

I did not realize VSCode was open source; installing it was as easy as sudo pacman -S vscode. Unfortunately it is an Electron app, but I went ahead and installed it so I can play around with it for a bit.

Keats commented 1 year ago

Mostly asking for people with knowledge of tree-sitter to see what they know about scopes. Also curious about nvim-treesitter and how themes/scopes are defined.

VSCode

Screenshot 2022-08-19 at 00 00 59

Giallo

Screenshot 2022-08-19 at 00 01 12

Differences are probably due to me getting some scopes wrong when looking at the theme and/or missing some necessary scopes, I'll try to fix it when I'm not tired but it's kind of acceptable.

xse commented 1 year ago

Hey, I've played around with it, tree-sitter is really fast and has lots of languages supported! However something to keep in mind is that it works with programming languages, and does not have syntaxes for stuff that isn't a programming language.

Don't get me wrong it's an awesome tool and on top of that it's really fast, I just think that for a web facing thing it would be nice to have syntax highlighting for the kind of stuff you could have on a website, like for example:

xse@krkrkr ~ $ ls -l /usr/local/share/nvim/runtime/syntax/ | wc -l
     660

PS: Still think tree-sitter is a nice replacement. I understand that the kind of tool able to do that might not be easy to deal with, and it's really not that hard to use :tohtml with a bit of sed to get ready to copy/paste html with inline css for all those things that are not programming languages.

Keats commented 1 year ago

The issue with syntect is that we are stuck with buggy, two-year-old Sublime syntaxes (the JS one, for example, can take forever to highlight a snippet), since the newer ones use new syntax features that syntect does not support.

The choices are:

  1. stay on syntect + the current outdated syntaxes that kinda work (eg no async/await highlight in Rust for example)
  2. move to a Pygments port in Rust for a much simpler highlighting system (I started with that initially); it is still based on regexes, but it is easy to add support for many languages to it
  3. move to a tree-sitter based highlighter which gives us the same highlight as an editor with no regexes and (probably) much faster than syntect

I think 1 is a dead end in the long run as the Sublime Text people can keep changing their spec however they want. I started porting Pygments to Rust a while back and it would be an easy solution for people wanting to add syntaxes since a syntax could be just a yaml/toml file. It would also be annoying to use VSCode/Sublime themes with it as the scopes are very different.

Tree-sitter is nicer in that the highlights are much more accurate, it's easy to port TextMate themes and I wouldn't have to maintain it. It's harder to provide custom syntaxes like Zola currently allows though.

Keats commented 1 year ago

I've started using Helix themes and queries and the result is really good.

With their OneDark theme and the default Rust highlight query:

Screenshot 2022-08-30 at 21 15 50

With their OneDark theme and their Rust highlight query:

Screenshot 2022-08-30 at 21 15 55

The last screenshot is pretty much the same as opening that file in VSCode. Helix is a really great match as they have already a great collection of themes and a lot of improvements to the default queries. I'll see if they are ok with moving those bits out of the main repo for collaboration, otherwise it can be solved with copying and licensing.

Jieiku commented 1 year ago

Yes that bottom one does indeed look really nice!

the-mikedavis commented 1 year ago

Tree-sitter is capable of really nice syntax highlighting but there are some drawbacks to consider.

For the 109 languages supported in Helix, the total size of the compiled parsers is 108.5 MiB. Most compiled parsers are somewhere on the order of hundreds of KiB with some larger parsers on the order of ones or tens of MiB. The queries are altogether very small: only 1.7 MiB for all of them. The parsers are also C and many languages have C++ external scanners, so you would need to add compile-time dependencies on a C++ toolchain.

It's a large amount of work to add support for a language which doesn't have a tree-sitter parser yet. With regular expression based highlighting you can work incrementally - start with a few highlights and add more as you go - but it's hard to write a parser that incrementally covers the full syntax of a language. Language support has become very mature recently with tree-sitter though and there are even parsers for non-programming languages (I have a few for git commits, configs, rebase syntax, diffs).

We're happy to take those tradeoffs with Helix since tree-sitter can be used to build so many features (syntax highlighting, syntax-based motions, textobjects, indentation, rainbow brackets) but those tradeoffs are worth some consideration for Zola. There's a similar project which could be more appropriate: https://lezer.codemirror.net/ but admittedly I haven't used it and I think the language support is less full. Plus then the syntax highlighting would need to be done client-side.

All of that being said, I would really love to see tree-sitter syntax highlighting in Zola. At least selfishly since the Helix website uses Zola :)

Jieiku commented 1 year ago

I don't like solutions that require client-side highlighting (unnecessary JavaScript), the page would load significantly slower. I prefer a solution that makes efficient use of html/css to style the page. I went out of my way to make the back to top button CSS only for the abridge theme so that it would be one less JavaScript file. I am not completely against JavaScript, I make plenty of use of it in abridge, I just don't like using JavaScript when there is a more efficient way of solving a problem (page speed performance).

I wonder if Zola makes use of any other tools/libraries that are also C/C++, or if supporting Helix would be the first one?

Very cool that there are parsers for git commits, configs, rebase syntax, and diffs.

Keats commented 1 year ago

Argh, I didn't know the parsers were that big :o. From Helix 22.05:

-rwxr-xr-x  1 admin    51K Aug 31 20:56 twig.so*
-rwxr-xr-x  1 admin    51K Aug 31 20:56 iex.so*
-rwxr-xr-x  1 admin    51K Aug 31 20:56 eex.so*
-rwxr-xr-x  1 admin    51K Aug 31 20:56 regex.so*
-rwxr-xr-x  1 admin    51K Aug 31 20:56 gowork.so*
-rwxr-xr-x  1 admin    51K Aug 31 20:56 gomod.so*
-rwxr-xr-x  1 admin    51K Aug 31 20:56 embedded-template.so*
-rwxr-xr-x  1 admin    52K Aug 31 20:56 json.so*
-rwxr-xr-x  1 admin    52K Aug 31 20:56 gitignore.so*
-rwxr-xr-x  1 admin    52K Aug 31 20:56 git-rebase.so*
-rwxr-xr-x  1 admin    52K Aug 31 20:56 git-config.so*
-rwxr-xr-x  1 admin    52K Aug 31 20:56 tsq.so*
-rwxr-xr-x  1 admin    53K Aug 31 20:56 toml.so*
-rwxr-xr-x  1 admin    54K Aug 31 20:56 comment.so*
-rwxr-xr-x  1 admin    68K Aug 31 20:56 heex.so*
-rwxr-xr-x  1 admin    68K Aug 31 20:56 cpon.so*
-rwxr-xr-x  1 admin    68K Aug 31 20:56 gitattributes.so*
-rwxr-xr-x  1 admin    84K Aug 31 20:56 graphql.so*
-rwxr-xr-x  1 admin    84K Aug 31 20:56 meson.so*
-rwxr-xr-x  1 admin    84K Aug 31 20:56 git-diff.so*
-rwxr-xr-x  1 admin    84K Aug 31 20:56 dockerfile.so*
-rwxr-xr-x  1 admin    84K Aug 31 20:56 devicetree.so*
-rwxr-xr-x  1 admin    85K Aug 31 20:56 nix.so*
-rwxr-xr-x  1 admin    90K Aug 31 20:56 html.so*
-rwxr-xr-x  1 admin    91K Aug 31 20:56 vue.so*
-rwxr-xr-x  1 admin   100K Aug 31 20:56 git-commit.so*
-rwxr-xr-x  1 admin   101K Aug 31 20:56 css.so*
-rwxr-xr-x  1 admin   102K Aug 31 20:56 tablegen.so*
-rwxr-xr-x  1 admin   110K Aug 31 20:56 svelte.so*
-rwxr-xr-x  1 admin   116K Aug 31 20:56 protobuf.so*
-rwxr-xr-x  1 admin   116K Aug 31 20:56 wgsl.so*
-rwxr-xr-x  1 admin   117K Aug 31 20:56 cmake.so*
-rwxr-xr-x  1 admin   134K Aug 31 20:56 fish.so*
-rwxr-xr-x  1 admin   134K Aug 31 20:56 gdscript.so*
-rwxr-xr-x  1 admin   166K Aug 31 20:56 nu.so*
-rwxr-xr-x  1 admin   181K Aug 31 20:56 ledger.so*
-rwxr-xr-x  1 admin   182K Aug 31 20:56 lua.so*
-rwxr-xr-x  1 admin   182K Aug 31 20:56 nickel.so*
-rwxr-xr-x  1 admin   197K Aug 31 20:56 cairo.so*
-rwxr-xr-x  1 admin   197K Aug 31 20:56 sql.so*
-rwxr-xr-x  1 admin   197K Aug 31 20:56 make.so*
-rwxr-xr-x  1 admin   232K Aug 31 20:56 elm.so*
-rwxr-xr-x  1 admin   233K Aug 31 20:56 yaml.so*
-rwxr-xr-x  1 admin   245K Aug 31 20:56 go.so*
-rwxr-xr-x  1 admin   261K Aug 31 20:56 hare.so*
-rwxr-xr-x  1 admin   262K Aug 31 20:56 gleam.so*
-rwxr-xr-x  1 admin   280K Aug 31 20:56 python.so*
-rwxr-xr-x  1 admin   282K Aug 31 20:56 hcl.so*
-rwxr-xr-x  1 admin   294K Aug 31 20:56 llvm-mir.so*
-rwxr-xr-x  1 admin   294K Aug 31 20:56 java.so*
-rwxr-xr-x  1 admin   295K Aug 31 20:56 javascript.so*
-rwxr-xr-x  1 admin   310K Aug 31 20:56 r.so*
-rwxr-xr-x  1 admin   326K Aug 31 20:56 odin.so*
-rwxr-xr-x  1 admin   342K Aug 31 20:56 erlang.so*
-rwxr-xr-x  1 admin   358K Aug 31 20:56 c.so*
-rwxr-xr-x  1 admin   439K Aug 31 20:56 sshclientconfig.so*
-rwxr-xr-x  1 admin   456K Aug 31 20:56 rescript.so*
-rwxr-xr-x  1 admin   476K Aug 31 20:56 org.so*
-rwxr-xr-x  1 admin   519K Aug 31 20:56 glsl.so*
-rwxr-xr-x  1 admin   537K Aug 31 20:56 scala.so*
-rwxr-xr-x  1 admin   585K Aug 31 20:56 dart.so*
-rwxr-xr-x  1 admin   616K Aug 31 20:56 vala.so*
-rwxr-xr-x  1 admin   621K Aug 31 20:56 bash.so*
-rwxr-xr-x  1 admin   648K Aug 31 20:56 solidity.so*
-rwxr-xr-x  1 admin   717K Aug 31 20:56 php.so*
-rwxr-xr-x  1 admin   763K Aug 31 20:56 rust.so*
-rwxr-xr-x  1 admin   777K Aug 31 20:56 zig.so*
-rwxr-xr-x  1 admin   990K Aug 31 20:56 markdown.so*
-rwxr-xr-x  1 admin   1.0M Aug 31 20:56 ruby.so*
-rwxr-xr-x  1 admin   1.2M Aug 31 20:56 scheme.so*
-rwxr-xr-x  1 admin   1.2M Aug 31 20:56 julia.so*
-rwxr-xr-x  1 admin   1.4M Aug 31 20:56 typescript.so*
-rwxr-xr-x  1 admin   1.4M Aug 31 20:56 tsx.so*
-rwxr-xr-x  1 admin   1.5M Aug 31 20:56 llvm.so*
-rwxr-xr-x  1 admin   1.6M Aug 31 20:56 cpp.so*
-rwxr-xr-x  1 admin   1.6M Aug 31 20:56 latex.so*
-rwxr-xr-x  1 admin   1.7M Aug 31 20:56 elixir.so*
-rwxr-xr-x  1 admin   2.3M Aug 31 20:56 ocaml-interface.so*
-rwxr-xr-x  1 admin   2.7M Aug 31 20:56 ocaml.so*
-rwxr-xr-x  1 admin   2.9M Aug 31 20:56 haskell.so*
-rwxr-xr-x  1 admin   2.9M Aug 31 20:56 c-sharp.so*
-rwxr-xr-x  1 admin   3.3M Aug 31 20:56 perl.so*
-rwxr-xr-x  1 admin   3.5M Aug 31 20:56 swift.so*
-rwxr-xr-x  1 admin   3.6M Aug 31 20:56 kotlin.so*
-rwxr-xr-x  1 admin    15M Aug 31 20:56 lean.so*
-rwxr-xr-x  1 admin    18M Aug 31 20:56 verilog.so*

I'm not planning to add all of those, but the Verilog one alone is the same size as current Zola x) I've had a look at the tree-sitter issue tracker and it does have some issues about generating huge parsers, but the size of some grammars doesn't make much sense to me, and there seems to be a 50 KB minimum size. The size of some of them is pretty fishy, e.g. Markdown being 1M despite HTML being 90K. I don't care too much about the final binary size, but that seems a bit extreme as an increase...

I wonder if Zola makes use of any other tools/libraries that are also C/C++, or if supporting Helix would be the first one?

Zola is super annoying to build on Windows because of libsass requirements. Definitely not the first one.

lf- commented 1 year ago

I have a working prototype of tree-sitter highlighting for Zola with Helix themes on branch https://github.com/lf-/zola/tree/tree-painter

It uses https://github.com/matze/tree-painter/ as a back end instead of the one @Keats was writing a couple months ago, just because it seems to have all the highlighting to HTML done already.

It's probably not upstreamable as is, containing a good many hacks, and also compiles the treesitter stuff statically which is more convenient but makes LTO infeasible due to absurd link times (I've not investigated how to selectively do LTO). Feel free to take any amount of it that you'd like; I don't have resources to clean it up to upstream it.

Also, the perf is Not Good. Even with the LTO build I had, my site build went from 36ms to 800ms. I don't know why it's slow, and probably the best way to figure that out is to instrument Zola with tracing, which is something that I don't have resources for as the bad perf is not bad enough to motivate doing it for my site.

Note: This perf issue is almost certainly not due to tree-sitter itself being bad, but instead something very silly happening in Rust land. For instance, the standalone tree-sitter tool takes 3s to parse the entirety of compiler/ in GHC, 400k lines of Haskell. On one thread. Although that doesn't involve running queries, so maybe that's the bottleneck? Anyway it probably needs profiling.

I don't have the same constraints as Zola is designed for (binary size does not bother me, generation time does not bother me as long as it's not a workflow blocker), and I've got it good enough to power my site, so I'm stopping where I got to.

Regarding the highlight groups, it's distinctly possible that the different clients are using different queries. The treesitter parsers often come with highlight queries, but nvim-treesitter seems to vendor theirs. My speculation is that a big reason for this is that nvim-treesitter has some nonstandard features such as the (#make-range! ..) "predicate", as well as the @spell capture group.

The way that I've debugged these is by using my nvim which has nvim-treesitter-playground installed and using :TSHighlightCapturesUnderCursor on the syntax in question, and comparing it to the CSS output of tree-painter.

Anyway, good luck! Good highlighting is really important to programming blogs, and I almost got rid of Zola over it before realizing it was probably easier to hack it in instead.

Sample:

Before (notice that some -- comments are completely misparsed):

image

After:

image

lf- commented 1 year ago

One difficulty with any form of tree-sitter integration is that building parsers is a nightmare due to Cargo being quite bad at submodules. More details here: https://github.com/matze/tree-painter/issues/3

This could be done either statically linked or dynamically linked, but I would lean toward dynamic linking since it is otherwise impossible to add more parsers to the system without forking it.

But dynamic linking would compromise the current single-executable nature of Zola (not something that I'm bothered about, but I understand it is a design goal).

Keats commented 1 year ago

I can't build your fork for some reason on rustc 1.64 or nightly. I'll have a deeper look when I get more time. Can you tell me how big the generated Zola binary is?

the perf is Not Good. Even with the LTO build I had, my site build went from 36ms to 800ms.

That's surprising. Sounds like something being instantiated too often? I'm expecting the tree-sitter parsers themselves to be faster than tons of regexes from syntect.

This could be done either statically linked or dynamically linked, but I would lean toward dynamic linking since it is otherwise impossible to add more parsers to the system without forking it.

That's the issue, yes. If the size of the generated parsers was manageable, we could just add everything to the library. Of course that's not going to work for home-made languages but hey...

lf- commented 1 year ago

The build issues are for utterly unknown reasons to me. I had them when I used the git source with cargo but not when I used the override in the top level cargo manifest to point to my local clone.

The way I built it was with a local recursive clone of tree-painter next to my zola directory. This is also significantly faster because I configured all the submodules to do shallow clones, bypassing cargo's terrible submodule handling. I again have no idea why cargo picked bad versions of submodules when it was the one doing the clone.

lf- commented 1 year ago

With respect to the perf issues, one cause is that, in my hacks, the renderer is reinitialized for each and every code block (since a single one can't be shared across threads). It's unclear to what extent that drives the bad perf, though, since it is probably only reparsing the theme rather than doing anything truly bad.
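
If reparsing the theme does turn out to matter, one way to avoid it is to parse the theme once and share the result, even if the renderer itself has to stay per-thread. The sketch below uses `std::sync::OnceLock` from the standard library (stable since Rust 1.70). It assumes the parsed theme is plain immutable data; `parse_theme`, `shared_theme` and the colours are stand-ins, not tree-painter's API, and the counter exists only to show that the parse runs once.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how often the (stand-in) theme parse actually runs.
static PARSE_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for an expensive theme parse; a real renderer would read
/// and parse a theme file here.
fn parse_theme() -> HashMap<&'static str, &'static str> {
    PARSE_COUNT.fetch_add(1, Ordering::SeqCst);
    let mut theme = HashMap::new();
    theme.insert("keyword", "#c678dd");
    theme.insert("string", "#98c379");
    theme
}

/// Return the shared theme, parsing it only on the first call.
/// `OnceLock` guarantees the initialiser runs once, even when several
/// threads race to call this function.
fn shared_theme() -> &'static HashMap<&'static str, &'static str> {
    static THEME: OnceLock<HashMap<&'static str, &'static str>> = OnceLock::new();
    THEME.get_or_init(parse_theme)
}
```

Each code block can then build its own renderer around `shared_theme()` without paying for the parse again.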

Keats commented 1 year ago

The tree-painter cli is 24MB with the following langs:

    # "tree-sitter-bash",
    "tree-sitter-c",
    "tree-sitter-c-sharp",
    # "tree-sitter-clojure",
    "tree-sitter-cpp",
    "tree-sitter-css",
    "tree-sitter-dockerfile",
    "tree-sitter-go",
    "tree-sitter-haskell",
    # "tree-sitter-html",
    "tree-sitter-java",
    "tree-sitter-javascript",
    "tree-sitter-json",
    "tree-sitter-julia",
    "tree-sitter-kotlin",
    "tree-sitter-latex",
    "tree-sitter-lua",
    "tree-sitter-md",
    "tree-sitter-nix",
    "tree-sitter-ocaml",
    # "tree-sitter-perl",
    # "tree-sitter-php",
    "tree-sitter-python",
    # "tree-sitter-ruby",
    "tree-sitter-rust",
    "tree-sitter-scala",
    # "tree-sitter-swift",
    "tree-sitter-typescript",
    "tree-sitter-zig",

Not terrible but not great.

My point of view on the choices:

tree-sitter

Pros:

Cons:

Pygments

See https://pygments.org/demo/ for a demo

Pros:

Cons:

lf- commented 1 year ago

A thought I have is that the tree-sitter parser building stuff from Helix should actually be extracted into a separate package that provides all of the build methods, so that everything stays feasibly packageable in development, in Nix/OS packaging with tarballs, and in advanced Nix setups (with the parsers provided by an external build system).

That would maybe make it possible to put zola on crates.io, but it's unclear what runtime directory should be used for the dynamic libraries in such a case (I am plainly unfamiliar with how cargo install works in great detail).

Using dynamic libraries would require more files but it would enable adding more parsers.

Jieiku commented 1 year ago

Seems the main Con with Pygments is the less precise highlighting. I would assume this can be tweaked as needed through the grammars? or are there limitations that simply cannot be overcome with a rust implementation of pygments?

The way rust code looks in the pygments demo is not necessarily how it would look in your implementation if you changed the grammar?

tree-sitter does sound great but also more complicated to implement and maintain.

I am going to assume there would be more community fixes and support with the pygments approach simply because it is easier to extend.

I am inclined to say pygments might be the better choice, but you have obviously worked with both at this point Keats and likely know better than anyone else what would work best.

lf- commented 1 year ago

The chief and largest problem with Pygments as far as I can tell is that it doesn't solve the problem set out: as far as I can tell it's just less-buggy regex highlighting, which poses a problem when sickos design programming languages that aren't representable as regexes. Unlike tree-sitter, it doesn't increase the maximum possible quality of a parser.
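
A small illustration of the kind of construct this refers to: Rust block comments nest, and classic regular expressions cannot match arbitrarily deep nesting because that requires counting (some regex engines add recursion extensions, but that is a different mechanism). The sketch below is a hand-written scanner with a depth counter; `nested_comment_len` is a hypothetical helper, not part of tree-sitter, syntect, or Pygments.

```rust
/// Return the byte length of the block comment that starts at the
/// beginning of `src`, honouring nested `/* ... */` pairs.
/// Returns `None` if `src` does not start with `/*` or if the comment
/// is never closed.
fn nested_comment_len(src: &str) -> Option<usize> {
    if !src.starts_with("/*") {
        return None;
    }
    let bytes = src.as_bytes();
    let mut depth = 0usize;
    let mut i = 0;
    while i + 1 < bytes.len() {
        match (bytes[i], bytes[i + 1]) {
            // An opener increases the nesting depth.
            (b'/', b'*') => {
                depth += 1;
                i += 2;
            }
            // A closer decreases it; depth 0 means the outer comment ended.
            (b'*', b'/') => {
                depth -= 1;
                i += 2;
                if depth == 0 {
                    return Some(i);
                }
            }
            _ => i += 1,
        }
    }
    None
}
```

A pattern that stops at the first `*/` would end the comment in `/* a /* b */ c */` after `b */` and highlight ` c */` as code, whereas the counter keeps going until the outer comment closes.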

There's a lot of effort behind the tree-sitter project as it powers production grade syntax highlighting for several editors.

I don't, however, have a horse in this race, since I'll either switch from zola or keep using my fork.

Jieiku commented 1 year ago

I honestly had no idea that there are things that cannot be solved with regex, given that we define what type of code a given block is. I consider myself an advanced regex user and there is usually a way to make a matching pattern unless I am still not understanding something.

As you pointed out though tree-sitter does sound highly polished.

I am kinda surprised there are not more people discussing this, with why they think this or that is better for a given reason.

@Keats, on these points: "Fixing bugs in grammars is not trivial" and "Using a custom grammar is not going to be supported."

If bugs got fixed or additional grammars were added in the upstream tree-sitter project, those changes would become part of zola correct? (If so then so long as tree-sitter stays well supported then maybe it is not an issue.)

Jieiku commented 1 year ago

There's a lot of effort behind the tree-sitter project as it powers production grade syntax highlighting for several editors.

@lf- I am trying to find a list of editors that use tree-sitter but so far no luck, if you know of some of them please let me know.

lf- commented 1 year ago

Helix, neovim (part of the support is in a plugin but the rest is in core), vscode with a plugin, and atom (probably; the editor was discontinued) are the ones i know about from memory. There's probably more.

Jieiku commented 1 year ago

I was hoping maybe kate was supported with a plugin or something, ran across this post: https://www.reddit.com/r/kde/comments/mfy15q/kate_2104_feature_preview/

There was an interesting reply by ChristophCullmann regarding tree-sitter (granted it was 2 years ago)

Kate is the editor I generally use.

Edit: on https://tree-sitter.github.io/tree-sitter/ you can scroll down to "Parsers for these languages are in development:" to get a list of languages that are still considered in development, so it seems the Perl language has been in development status for at least 2 years.

I now understand what @Keats meant by "Fixing bugs in grammars is not trivial"

Maybe with Zola using tree-sitter, the tree-sitter project would get even more attention and some of those in development languages could get finished faster, but there is really no guarantee of that.

All the languages that I personally use seem to be supported though, so I guess that point is moot for me, but possibly not moot for other people.

edit2: the link Keats put in the OP actually lists more languages https://github.com/nvim-treesitter/nvim-treesitter#supported-languages

Keats commented 1 year ago

The way Pygments works, you can't really improve the highlighting compared to what you see on the website. It's just a very basic regex system that has almost no context. Have a look at https://github.com/pygments/pygments/blob/master/pygments/lexers/rust.py to see how simple it is: mostly a list of keywords and builtins, plus some regexes to handle strings, comments, etc. Next, have a look at the ones from Sublime, which is what we're using (although an older version of the grammar): https://github.com/sublimehq/Packages/blob/master/Rust/Rust.sublime-syntax

tree-sitter works with actual parsers, combined with highlight queries that specify what you want to highlight.
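
To show what "a list of keywords with almost no context" means in practice, here is a deliberately tiny keyword-list lexer. It is a simplification written for this example, not a port of the Pygments Rust lexer: real Pygments lexers also have regex rules and states for strings and comments. The `Kind` enum, `KEYWORDS` list and `lex` function are hypothetical.

```rust
#[derive(Debug, PartialEq)]
enum Kind {
    Keyword,
    Ident,
    Other,
}

/// A short, incomplete keyword list, for illustration only.
const KEYWORDS: &[&str] = &["fn", "let", "async", "await"];

/// Split `src` into runs of word characters and runs of everything else,
/// then classify word runs by looking them up in the keyword list.
fn lex(src: &str) -> Vec<(Kind, &str)> {
    let is_word = |c: char| c.is_alphanumeric() || c == '_';
    let mut tokens = Vec::new();
    let mut start = 0;
    while start < src.len() {
        let rest = &src[start..];
        let word = is_word(rest.chars().next().unwrap());
        // Extend the token while characters stay in the same class.
        let len = rest
            .char_indices()
            .find(|&(_, c)| is_word(c) != word)
            .map_or(rest.len(), |(i, _)| i);
        let text = &rest[..len];
        let kind = if !word {
            Kind::Other
        } else if KEYWORDS.contains(&text) {
            Kind::Keyword
        } else {
            Kind::Ident
        };
        tokens.push((kind, text));
        start += len;
    }
    tokens
}
```

Because the lexer has no notion of context, the `fn` inside the string literal in `let s = "fn";` comes out as a keyword unless a separate string rule is added, which is the kind of gap a real parser does not have.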

ttys3 commented 1 year ago

Helix, neovim (part of the support is in a plugin but the rest is in core), vscode with a plugin, and atom (probably; the editor was discontinued) are the ones i know about from memory. There's probably more.

lapce (https://lapce.dev/) https://github.com/lapce/lapce/blob/master/lapce-core/Cargo.toml

It ships as a single file: lapce, 82.4MB

zed editor (https://zed.dev/tech): Rust + tree-sitter (currently not opensourced, latest version is Zed v0.60.4)

Tree-sitter

We plan to integrate with the Language Server Protocol to support advanced IDE features, but we also think it's important for a code editor to have a rich, native understanding of syntax.

That's why we built Tree-sitter, a fast, general, incremental parsing library that can provide Zed with syntax trees for over 50 languages. Tree-sitter already powers production functionality on GitHub, and we'll use it to deliver syntactically-precise syntax highlighting, tree-based selection and cursor navigation, robust auto-indent support, symbolic navigation, and more.

the Zed.dmg file size is about 77.4MB

lf- commented 1 year ago

I am currently hacking on pulling the treesitter grammar builder out of Helix, and this is the result for stripped parsers compiled in optimized mode:

dev/tree-sitter-grammars - [main] » ls result/ | wc -l
114
dev/tree-sitter-grammars - [main] » du -shL result/          
16M     result/

It seems that the binary size is not actually anywhere near as bad as expected, when compiled as shared libraries at least.

Keats commented 1 year ago

That's pretty sweet. I think tree-sitter is probably the way to go. I'll continue on giallo when I get some time. It does suck for people with custom grammars though, but at some point we can't support everything :/ We could have a syntect fallback but the themes are not compatible with each other so that would be weird.

evtn commented 1 year ago

Hello, I'm trying Zola out right now for a static site with a lot of custom highlighting and I've come across this

I've had some experience with tree-sitter before and I can say that writing custom grammars for it is a pretty doable task, so I don't really follow how integrating tree-sitter would limit custom highlighting, can someone explain it better?

Keats commented 1 year ago

so I don't really follow how integrating tree-sitter would limit custom highlighting, can someone explain it better?

We would need to load the C parsers dynamically. Not a huge deal, helix is doing it but it's additional work which I won't do personally. Someone could definitely come and implement it though.

ulope commented 1 year ago

A (potential) user's perspective: Having the option of easily adding custom / 3rd party highlighters is one of the things that made me consider zola as a replacement for my current mix of (mostly) Pelican and Lektor (both Python) in the first place. If this were to be removed it would be a no-go for me.

Switching to an entirely new parser (with all the downsides listed above) instead of fixing the (apparently one) missing feature in the existing one seems like quite the "throwing the baby out with the bathwater" reaction.

Keats commented 1 year ago

fixing the (apparently one) missing feature in the existing one seems like quite the "throwing the baby out with the bathwater" reaction.

The missing feature is unlikely to be implemented, I think. It's been an open issue on the syntect repo since 2019 and many language syntaxes have been rewritten to use it. It would be much more work to implement it than to switch to something like tree-sitter. At the same time, if someone updates syntect to implement all the new features from Sublime since 2019, I would happily use that. It's not going to be me though.

Keats commented 1 year ago

https://github.com/tree-sitter/tree-sitter/issues/1942 is probably a blocker though

apiraino commented 1 year ago

It seems that the binary size [of the treesitter grammars] is not actually anywhere near as bad as expected, when compiled as shared libraries at least.

If my understanding of how tree-sitter is deployed (i.e. one binary or compiled as lib + grammar files) is correct, I would also argue that binary size is IMHO relatively not so impacting. Once I have tree-sitter deployed for my Vim/Emacs/whatever, it's a single installation serving many purposes.

Keats commented 1 year ago

Another highlighter in JS: https://shiki.matsu.io/ We could likely port it to Rust and switch to VS Code themes?

@the-mikedavis do you have some perf check in helix on how much time it takes tree-sitter to load the syntaxes?

the-mikedavis commented 1 year ago

It's not something we track actively but I put together some rough numbers for a few languages:

| Language | Parser loading time (µs) | Query loading time (µs) |
|---|---|---|
| HTML | 315 | 95 |
| TOML | 145 | 133 |
| CSS | 170 | 480 |
| Go | 205 | 3,300 |
| Markdown (block) | 960 | 6,300 |
| C | 240 | 8,300 |
| Markdown (inline) | 300 | 14,600 |
| JavaScript | 285 | 23,700 |
| C++ | 385 | 37,600 |
| Rust | 280 | 54,000 |
| Elixir | 345 | 203,100 |
| Swift | 430 | 558,700 |

These are running in release mode. Loading the parser is pretty consistently fast but creating the queries (Query::new in the tree_sitter crate) can take a surprising amount of time for some outlier languages like Elixir and Swift. https://github.com/tree-sitter/tree-sitter/pull/1589 improved this a lot though so the query analysis times are pretty sane for even the outlier languages.

Keats commented 1 year ago

Hmm still, 560ms to load just the Swift query is way too much. Most Zola sites render faster than that with syntect highlighting :/

Thanks for checking it!

Keats commented 1 year ago

With that in mind, I'm thinking maybe the shiki approach is better as I'm expecting a textmate parser to be kinda simple and we get all the languages/themes of VSCode for free.

90% of the work is porting https://github.com/microsoft/vscode-textmate to rust but that seems doable:

(venv) ~/C/p/vscode-textmate (main|✔) $ tokei src/
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 TypeScript             30         7301         5878          471          952
===============================================================================
 Total                  30         7301         5878          471          952
===============================================================================

uncomfyhalomacro commented 1 year ago

Has anyone noticed this: https://github.com/edg-l/treelight ? It's fairly new, and I haven't tested it yet.

lf- commented 1 year ago

I suspect it has the same results as tree-painter, and it unfortunately has the same issues with cargo and submodules that make it challenging to use.

One thing that could be done perhaps is to use wasm for the tree sitter parsers: this may make the deployment less of a headache since you could just stuff them into your binary. But I'm not sure if anyone has made the right shape of thing for this. https://github.com/tree-sitter/tree-sitter/blob/master/lib/binding_web/README.md

Keats commented 1 year ago

Doing tree-sitter -> HTML is trivial. The main issue is loading the syntaxes as mentioned above