Support Sublime Text syntax definitions

Summary

Support Sublime Text syntax definitions for highlighting, alongside tmLanguage and tree-sitter.

Motivation

There are many high-quality sublime-syntax definitions out there. In addition, the migration path from tmLanguage to sublime-syntax is easier than from tmLanguage to tree-sitter.

Describe alternatives you've considered

I'm not currently an Atom user. If this could be implemented in a third-party package rather than in core, then please let me know. My limited understanding suggests that this would need to be implemented at least partially in the core.

Additional context

I'm a Sublime user and package developer. Among other things, I mostly rewrote Sublime's JavaScript syntax definition and I am currently its primary maintainer. I also wrote a proof-of-concept Node.js implementation of the sublime-syntax engine. I mention this to indicate that this is not an idle suggestion and that I am aware of the amount of work that would go into such an implementation.

I don't expect that this is something that the core devs would pick up. If there is substantial interest in the idea, and there is no philosophical objection, then I will look into what is required. If it makes sense, I would be willing to work on this, though given my unfamiliarity with Atom's internals I would probably need to work with someone more experienced in such matters.

This is something I have had in my to-do list for at least two years by now. Unfortunately I never got around to ~reverse engineering~ understanding the parsing behavior.

It should be possible to build the additional behavior on top of the first-mate APIs (first-mate is the Node package that does the heavy lifting for Atom's TextMate-style grammars), so few if any modifications should be required on Atom's side. Depending on how complete your proof-of-concept implementation is, sorting out the first-mate parts should require relatively little effort as well. But if you are jumping into it alone, it will be difficult at first regardless.

There should be no philosophical objection and I would not be surprised if even Microsoft suddenly decided that they needed to have support for those grammars in vscode too after Atom gets them.

It is fairly unlikely that a core developer will pick this up; however, there are enthusiastic community members who can provide any information that you require. Core devs typically also respond fairly quickly to technical questions, given their experience working on the relevant components.

I'm going to ping @50Wliu so that he sees this and can perhaps point other interested people to it.

Basically, TextMate grammars are currently consumed via this first-mate API:

const {tags, ruleStack} = grammar.tokenizeLine(text, currentRuleStack, currentLineIsFirstLine)

currentRuleStack is just an internal representation of the current state (the ruleStack result from the previous line) used to tokenize the next line. It is pretty much blindly passed back to the function that returned it. The benefit is that we can skip the lines preceding a modification by passing in the state saved just before the line where the modification was made. The grammar should not hold any state specific to a document.
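As a rough illustration of why handing the rule stack back makes it possible to skip unchanged lines, here is a minimal sketch. The toy grammar, its `inComment` state, and the `retokenizeFrom` helper are invented for this example and are not part of first-mate; only the shape of the `tokenizeLine(text, ruleStack, firstLine)` call follows the snippet above. The early exit assumes that only the text of a single line changed.

```javascript
// Toy stand-in for a grammar (hypothetical): the "rule stack" is just a flag
// saying whether the line ends inside a /* ... */ comment.
const toyGrammar = {
  tokenizeLine(text, ruleStack, firstLine) {
    let inComment = ruleStack ? ruleStack.inComment : false;
    let i = 0;
    while (i < text.length) {
      const marker = inComment ? '*/' : '/*';
      const next = text.indexOf(marker, i);
      if (next === -1) break;
      inComment = !inComment;
      i = next + 2;
    }
    return { ruleStack: { inComment } };
  },
};

// Re-tokenize starting at the changed row, reusing the cached end state of the
// previous row. Stops as soon as a row ends in the same state as before the
// edit, because every later row would then produce the same result as before.
// Returns how many rows had to be re-tokenized.
function retokenizeFrom(grammar, lines, cache, changedRow) {
  let stack = changedRow > 0 ? cache[changedRow - 1].ruleStack : null;
  let rowsDone = 0;
  for (let row = changedRow; row < lines.length; row++) {
    const previous = cache[row];
    const result = grammar.tokenizeLine(lines[row], stack, row === 0);
    cache[row] = result;
    rowsDone++;
    // State comparison is specific to the toy grammar's single flag.
    if (previous && previous.ruleStack.inComment === result.ruleStack.inComment) break;
    stack = result.ruleStack;
  }
  return rowsDone;
}
```

With an empty cache every row is processed once; after that, an edit that does not change a line's end state costs a single `tokenizeLine` call.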

tags are an Atom-specific way of encoding segments of text as integers for performance reasons. Odd and even negative integers mark where a scope starts and ends, respectively, and positive integers say how many character positions to move forward.

There is another function that transforms the negative integers back into scope names; that mapping needs to be built only once, when the grammar is loaded.
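To make the encoding concrete, here is a small sketch of a decoder that follows the description above: odd negative integers open a scope, even negative integers close one, and non-negative integers consume that many characters. The `decodeTags` name and the `scopeForId` callback are assumptions for illustration, and the sketch assumes scopes are closed in the reverse order they were opened.

```javascript
// Turn a tags array back into tokens with their enclosing scope names.
// `scopeForId` is a caller-supplied lookup from a negative start id to a
// scope name (hypothetical here; the real mapping lives with the grammar).
function decodeTags(line, tags, scopeForId) {
  const tokens = [];
  const openScopes = [];
  let offset = 0;
  for (const tag of tags) {
    if (tag >= 0) {
      // Non-negative: a run of `tag` characters under the currently open scopes.
      tokens.push({ value: line.slice(offset, offset + tag), scopes: openScopes.slice() });
      offset += tag;
    } else if (tag % 2 === -1) {
      // Odd negative: a scope starts here.
      openScopes.push(scopeForId(tag));
    } else {
      // Even negative: the most recently opened scope ends here.
      openScopes.pop();
    }
  }
  return tokens;
}
```

For example, with made-up ids where -1 means `source.js` and -3 means `keyword.control.js`, the tags `[-1, -3, 2, -4, 2, -2]` for the line `if x` would decode into a keyword token `if` followed by a plain ` x` token.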

If you can tweak your current proof of concept to follow that API, or make it possible to adapt to it, figuring out the rest of the details should be fairly straightforward regardless of how Sublime Text grammars work internally, as most of it has to do with (de)initialization.

Technically you are not limited to that API and can do everything in Atom as well through defining your own language mode, but I think it might be easier to just modify or imitate what we have for TextMate (at least in order to get something working) as the input/output for both grammar systems should be very similar.

Thanks for the suggestion. This is a duplicate of the discussion in https://github.com/tree-sitter/tree-sitter/issues/139. We don't feel that adding a third grammar system to Atom is going to be beneficial to the ecosystem as a whole. Additionally, while it is very gracious for you to offer to implement it alongside us, the Atom team would have to maintain said system in perpetuity and maintaining two grammar systems is more than enough for us at this point and for the foreseeable future :grinning:

Thanks again for your submission and your passion about Atom.

As one of the participants in that discussion, I don't think that it's about the same thing at all. The question there was whether tree-sitter could be made to support sublime-syntax definitions. The answer to that question is clearly "no". The discussion then derailed into a debate over the technical merits of the two systems. There was no discussion there about whether adding sublime-syntax compatibility is desirable. In fact, when searching prior to creating this issue, I could not find any existing issues that addressed that point. If I have missed such a discussion, then I would appreciate it if you would post a link.

Frankly, I don't think that anyone has examined the benefits and drawbacks to supporting sublime-syntax. I don't know what the maintenance costs would be because of my unfamiliarity with Atom's internals. I don't think that any of the Atom devs can say what the maintenance cost would be because of their unfamiliarity with sublime-syntax. I admit to being optimistic, because the "engine" of sublime-syntax is a fairly dumb automaton runner, just like tmLanguage; I expect that tmLanguage and sublime-syntax definitions can share a single implementation.

At this point, there seems to be interest in the idea and no known technical barriers to implementation. The objection you raise -- "We don't feel that adding a third grammar system to Atom is going to be beneficial to the ecosystem as a whole." -- is one that I can't understand because I have no idea where it comes from. Is this a decision that is documented somewhere? Did it come out of a public discussion?

I don't know what the maintenance costs would be because of my unfamiliarity with Atom's internals. I don't think that any of the Atom devs can say what the maintenance cost would be because of their unfamiliarity with sublime-syntax.

I have an intimate familiarity with the cost of maintaining the Atom grammar systems and supporting multiple official grammars, having been involved in the maintenance of Atom for the past few years and coordinating a lot of our triage work for the past two. I also have a familiarity with Sublime's syntax system, having been a Sublime Text user before becoming an Atom fan and eventually maintainer. Additionally, I produced my own grammars for Sublime Text when I was a user. So I feel I have a reasonably good grasp of what it would take to maintain an additional grammar system within Atom, even one like Sublime Text's.

Regardless, what you are asking is for the Atom maintainer team to take on work. This is work that we do not have the resources to do or commit to long-term. If that changes, we may re-evaluate the situation. Until it does, our answer is going to remain unchanged, no matter how eloquent the argument.

On the other hand, because Atom is open source, you and the other interested parties are free to fork Atom, implement the additional grammar system or even replace the ones we maintain altogether, and prove our estimation incorrect.

At this point, there seems to be interest in the idea and no known technical barriers to implementation. The objection you raise -- "We don't feel that adding a third grammar system to Atom is going to be beneficial to the ecosystem as a whole." -- is one that I can't understand because I have no idea where it comes from.

I guess this could be answered with:

Atom team would have to maintain said system in perpetuity and maintaining two grammar systems is more than enough for us at this point and for the foreseeable future

As commented on the forum:

@fred.curts Atom is moving to tree-sitter and no longer maintains its TextMate grammars. VSCode is stuck on TextMate grammars, with all the horrors that this entails (>500 issues reported for syntax highlighting of TypeScript alone (!!!), ever more issues filed for TextMate grammars used by VSCode and no longer maintained by Atom). Sublime is betting on its own (from what I understand proprietary) highlighting system. As a language author, I can't and don't want to maintain three different grammars. https://forum.sublimetext.com/t/tree-sitter-support/40559/14

While the possibility of writing only a single grammar is indeed very neat/cool/awesome, with all major text editors and IDEs supporting the countless different languages out there, why should the Sublime Text team alone decide/dictate for everybody what course the .sublime-syntax format should take?

The Sublime Text syntax system is proprietary and developed as closed source, whereas Atom is an open source project that would have to keep up with and tolerate every decision the Sublime Text team makes when changing sublime-syntax in future updates.

Why should Atom be bossed around by a proprietary company that makes its decisions in order to stay in the market, bringing in more license buyers rather than helping long-time, dedicated developers with very old and annoying bugs? I am not saying Sublime Text's decisions are wrong; otherwise, they probably would not have enough money to keep moving Jon Skinner's private pet project forward. There is more discussion about this in this forum thread: https://forum.sublimetext.com/t/sublime-text-versus-visual-studio-code-in-2019/41375

Why should all editors and IDEs blindly accept the Sublime Text team's decisions? If decisions about .sublime-syntax are not heard and weighed collectively, with equal voting weight for every major text editor, all we will have is countless text editors implementing their own variations of .sublime-syntax, which will never be 100% compatible with each other. So, either way, there is not much point in having all editors and IDEs use .sublime-syntax if everybody has to blindly (black box) reimplement their own understanding of the .sublime-syntax engine, because they cannot see the source and have to guess from the documentation how it was supposed to be best/effectively/efficiently implemented.

That is, unless the Sublime Text team decides to open source the syntax system, or forms some sort of organization/group/movement dedicated to hearing out, and deciding together with, all major text editors and IDEs how the .sublime-syntax system should move forward and how it should/could/will benefit everybody rather than only the Sublime Text team/company. From my experience, I would say the Sublime Text team is not big enough for such an organization, and they seem to be very headstrong/stubborn about making changes.

Is this a decision that is documented somewhere? Did it come out of a public discussion?

This seems just rude. @lee-dohm, as leader of the Atom community, has pondered this information over the countless years he has dedicated to Atom. But he may be mistaken, and there could be enough people committed to implementing the proprietary .sublime-syntax in the Atom editor. If you would like to know whether @lee-dohm is wrong or not, just open a poll on the Atom forum: https://discuss.atom.io/u?period=all

Versus

https://forum.sublimetext.com/u?period=all

I didn't come here to start a fight, nor is there any reason for one.

No way. We are not fighting; we are just discussing what we believe, or at least what we think we believe. The workflow goes like this:

I say some things.
You read them.
You either get very angry or understand what I am saying.
If you get very angry:
1. And reply "-You know what?! You are a son ** *******! When you were born, your ***** was so small, the doctor thought you were a girl!"
  - And I reply "A son ** ******* is definitely you! And when you were a kid, you were so ugly your mother had to pin a steak on your neck for your own dog to play with you!"
2. Then, we are fighting. Otherwise, you could get very angry, unsubscribe from the topic, and never reply to any of my comments anywhere in the world.
3. Then, we are in a draw forever and ever.
Otherwise, if you get very angry and reply "Your post was very rude because you do that thing. And this thing you say does not make sense to me, because if man had ever gone to the moon, we would see his footsteps on the moon's surface."
Then, we are discussing. And I would either:
1. Reply nothing, because your post was well explained and I understood it and have nothing more to add other than a failed sorry.
2. Reply with a big sorry, because I had written a very scurrilous post.
3. Otherwise, I could get very angry, unsubscribe from the topic, and never reply to any of your comments anywhere in the world.
4. Then, we are in a draw forever and ever.
Then, we can keep up this reply cycle until everything is understood. Otherwise, either I get tired of discussing and stop replying, or you get tired of discussing and stop replying.

I didn't come here to start a fight, nor is there any reason for one. Nor do I have any problem with tree-sitter, nor do I think that sublime-syntax should replace any existing system.

Nor do I believe that implementing sublime-syntax means that Atom would be “bossed around” by Sublime. This is the kind of philosophical objection I was concerned about. But the objection instead seems to be resources, which is not insurmountable.

@Thom1729

But the objection instead seems to be resources, which is not insurmountable.

Resources spent on this are resources not spent on other things. The cost is important to consider, especially if the idea is to have Atom support it in core. However, it should technically be possible for a community package to implement whatever grammar engine they like, so long as it looks and acts like a language mode. The stability of this interface is questionable, as I believe it was only recently introduced to allow a uniform interface to Tree-sitter and TextMate grammars.

Nor do I believe that implementing sublime-syntax means that Atom would be “bossed around” by Sublime.

I believe the point being made here was that supporting Sublime grammars requires knowing the implementation details, which are hidden under proprietary code. While the docs describe how properties are supposed to work, poorly / undocumented edge cases are very likely to exist. I don't find poking a black box to see what happens a good way of discovering the implementation. Additionally, if Sublime Text changes any implementation details, Atom would be forced to change as well or be stuck with a broken grammar system.

I hope this shows why it's not desirable to have this in Core. Like I said though, a community package should be able to do it anyway. I'd be willing to try it out, but normally I'll be using Tree-sitter grammars where possible anyway (they offer better performance and parsing information, at least compared to TextMate grammars, which I think Sublime grammars are based on). In fact, Atom adopting Tree-sitter is probably the biggest reason I have no real interest in Sublime grammars.

On an unrelated note, does your proof of concept support incremental parsing?

However, it should technically be possible for a community package to implement whatever grammar engine they like, so long as it looks and acts like a language mode. The stability of this interface is questionable, as I believe it was only recently introduced to allow a uniform interface to Tree-sitter and TextMate grammars.

That sounds like it should be fine. I'm not actually attached to the idea of integrating this into core; I just assumed that I would need to. If Atom's API permits it, a separate package would be perfectly acceptable.

I believe the point being made here was that supporting Sublime grammars requires knowing the implementation details, which are hidden under proprietary code. While the docs describe how properties are supposed to work, poorly / undocumented edge cases are very likely to exist.

True, which is why I didn't create this issue until I'd already created a working implementation. It's faithful enough (and tested enough) that I discovered hitherto unreported bugs while implementing it. I wouldn't suggest implementing it outside Sublime if I weren't confident that I could create an exact replica, even down to the tokenization behavior.

I don't find poking a black box to see what happens a good way of discovering the implementation.

You and I have different ideas of fun. 😀

Additionally, if Sublime Text changes any implementation details, Atom would be forced to change as well or be stuck with a broken grammar system.

Only insofar as Sublime would also be stuck with a broken grammar system (for users on the previous version) and/or hundreds of broken third-party syntaxes. I would be very surprised if Sublime made a backward-incompatible change.

[Tree-sitter grammars] offer better performance and parsing information, at least compared to TextMate grammars, which I think Sublime grammars are based on

There's a moderately detailed discussion of this in https://github.com/tree-sitter/tree-sitter/issues/139. The two systems are essentially comparable in expressive power. Tree-sitter supports nondeterministic context-free languages, whereas sublime-syntax only supports deterministic context-free languages, but this is not an essential defect; it could be remedied with a simple extension. On the other hand, sublime-syntax allows the use of arbitrary Oniguruma regexps, which lets it parse some non-context-free constructs (notably, heredocs). I understand that tree-sitter uses per-language C extensions to bridge this gap.
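To illustrate the heredoc point: the closing delimiter has to repeat whatever word opened the heredoc, which is exactly what a backreference expresses and what a plain regular or context-free rule cannot. The following is only a sketch using a single multi-line JavaScript regexp; an actual sublime-syntax definition works line by line and would use the captured delimiter in a pushed context instead. The `parseHeredoc` name is made up for this example.

```javascript
// `<<WORD`, then a body, then a line consisting of exactly the same WORD.
// \1 is the backreference to the captured delimiter.
const heredoc = /<<(\w+)\n([\s\S]*?)\n\1$/m;

// Returns the delimiter and body of the first heredoc in `text`,
// or null when no properly terminated heredoc is found.
function parseHeredoc(text) {
  const m = heredoc.exec(text);
  return m ? { delimiter: m[1], body: m[2] } : null;
}
```

Note that a body line reading `EOF` does not end a heredoc opened with `<<END`; only a line matching the captured word does.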

As far as performance goes, I would expect a sublime-syntax implementation to be faster than a tree-sitter implementation in real-world usage, all else being equal. I also expect that all else will not be equal, and that an optimized C implementation of tree-sitter will outperform a JavaScript implementation of sublime-syntax. Perhaps at some point the core sublime-syntax parser could be C-ified, and then we'd get to have a proper race (so to speak).

On an unrelated note, does your proof of concept support incremental parsing?

The proof of concept is basically just a parser; it's not in a good form to be hooked up to things. But incremental parsing is trivial with the sublime-syntax architecture because the parser does not look over line boundaries.

N.B.: Sublime actually has a standard syntax test framework. IIRC, the core JavaScript syntax has over four thousand assertions. In addition, I have scripts to do a complete character-by-character dump of a sublime-syntax parse, and the proof of concept compares its output to those dumps. This is one of the reasons I'm confident that I've nailed down the parsing behavior.

@Thom1729

If Atom's API permits it

About that...

I wasn't suggesting an API (though it would be an interesting idea), I imagined just replacing the languagemode property of the TextEditor. It wouldn't be an optimal way to do it, and I'm not entirely sure it's possible. Looking at the definition, there are a few interconnected parts. However, it should work the same as changing the grammar manually, so I'd check the grammar selector package to see if it does anything special.

The stability comment is fair enough I guess. At least on Atom's end, stability is prioritised over fixing clearly buggy behaviour. There's an issue with TextMate that I really don't like but has been "grandfathered" in by enforcing TextMate compliance.

I'm not sure where your conviction that Sublime will be faster than Tree-sitter comes from though, or if they are really comparable in the extent of what they do (tokenising line by line vs a concrete syntax tree). TextMate already uses JS Oniguruma regexes, and is significantly slower than Tree-sitter, as well as being unable to handle long lines efficiently. Is there something about the Sublime grammars that avoids this?

I'm not sure where your conviction that Sublime will be faster than Tree-sitter comes from though, .... TextMate already uses JS Oniguruma regexes, and is significantly slower than Tree-sitter, as well as being unable to handle long lines efficiently. Is there something about the Sublime grammars that avoids this?

Sublime's implementation uses two regexp engines. Most of the time, it uses a proprietary engine that supports a regular subset of Oniguruma. The impression I have is that this proprietary engine is implemented with automata (not backtracking) and that it compiles a whole context's worth of regexps into a single automaton. This allows guaranteed linear-time execution (really just a pointer lookup per character, though they probably use a more memory-efficient representation). In order to support non-regular regexps with backreferences and such, Sublime falls back to Oniguruma. This is, of course, slower, but it adds real expressive power.
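A loose JavaScript analogue of "one matcher per context" is to merge all of a context's patterns into a single alternation and ask which alternative matched. This is only a sketch of the idea: JavaScript's RegExp is a backtracking engine, so it does not provide the linear-time guarantee described above, and nothing here reflects how Sublime's engine is actually built. The `compileContext` helper and the rule shape (`match`, `scope`) are assumptions, and the patterns are assumed to contain no numbered backreferences or named groups of their own.

```javascript
// Merge a context's rules into one regexp with a named group per rule.
// The returned function finds the earliest match at or after `from`;
// when several rules match at the same position, the first listed one wins.
function compileContext(rules) {
  const source = rules.map((rule, i) => `(?<r${i}>${rule.match})`).join('|');
  const regex = new RegExp(source, 'g');
  return function findNext(text, from) {
    regex.lastIndex = from;
    const m = regex.exec(text);
    if (!m) return null;
    // Exactly one alternative participated in the match; find out which.
    const index = rules.findIndex((_, i) => m.groups[`r${i}`] !== undefined);
    return { rule: rules[index], start: m.index, end: m.index + m[0].length };
  };
}
```

The point of the design is that the per-character work is done by one compiled matcher instead of one regexp search per rule.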

Of course, an Atom implementation of sublime-syntax probably would not take advantage of these optimizations, and consequently would presumably be slower than tree-sitter.

...or if they are really comparable in the extent of what they do (tokenising line by line vs a concrete syntax tree)

What sublime-syntax does is fundamentally no different from what tree-sitter does. They're just two different paradigms for parsing context-free languages. Tree-sitter is undeniably more complex and sophisticated, so it makes sense that it should be more powerful, but their paradigms are computationally equivalent (context-free grammars are equivalent to pushdown automata). The sole differences are that sublime-syntax on one hand does not support nondeterminism (though it could), but on the other can use non-regular regexp features without jeopardizing its invariants.

A sublime-syntax doesn't look like a "real grammar" because it's not a grammar at all -- it's the kind of automaton that a grammar would be compiled to.
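To show what "the kind of automaton a grammar would be compiled to" can look like, here is a deliberately tiny context-stack runner. It is a sketch, not the proof of concept mentioned in this thread: the `contexts` table, the rule fields (`match`, `scope`, `push`, `pop`), and the scope names are simplified assumptions, and zero-length matches are ignored to keep the loop simple.

```javascript
// Each context is an ordered list of rules; a rule may push another context
// or pop the current one. Sticky (y) regexps match only at lastIndex.
const contexts = {
  main: [
    { match: /"/y, scope: 'punctuation.definition.string.begin', push: 'string' },
    { match: /\d+/y, scope: 'constant.numeric' },
  ],
  string: [
    { match: /\\./y, scope: 'constant.character.escape' },
    { match: /"/y, scope: 'punctuation.definition.string.end', pop: true },
  ],
};

// Tokenize one line starting from `stack` (an array of context names) and
// return the tokens plus the stack to hand to the next line.
function tokenizeLine(text, stack) {
  stack = stack.slice(); // never mutate the caller's state
  const tokens = [];
  let pos = 0;
  while (pos < text.length) {
    const rules = contexts[stack[stack.length - 1]];
    let matched = false;
    for (const rule of rules) {
      rule.match.lastIndex = pos;
      const m = rule.match.exec(text);
      if (m && m[0].length > 0) {
        tokens.push({ text: m[0], scope: rule.scope });
        pos += m[0].length;
        if (rule.push) stack.push(rule.push);
        if (rule.pop) stack.pop();
        matched = true;
        break;
      }
    }
    if (!matched) pos++; // unscoped character, move on
  }
  return { tokens, stack };
}
```

The only state that survives a line is the stack of context names, which is what makes the automaton a pushdown automaton and also what lets it slot into a per-line interface like first-mate's.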

After looking through some of the docs, and after processing the advice in this issue, it looks like the right way to go would be to build an interface modeled after first-mate. The three major points of integration would be:

1. Registering sublime-syntax definitions. (Not sure how Atom finds language definitions.)
2. Loading an embedded syntax (specified by a scope like scope:source.js or a Sublime resource path like Packages/JavaScript/JavaScript.sublime-syntax).
3. Using the "grammar" to highlight a view. TextMateLanguageMode looks like a good model for this. In fact, it looks like that mode doesn't actually care how the grammar works or what the stack entries look like, so we probably could just use TextMateLanguageMode as written.
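For the second integration point, a loader first has to tell the two reference forms apart. A minimal sketch, assuming only the two forms named above plus a fallback for anything else; the `parseSyntaxReference` name and the returned object shape are hypothetical.

```javascript
// Classify a syntax reference string.
// 'scope:source.js'                                 -> look up by scope name
// 'Packages/JavaScript/JavaScript.sublime-syntax'   -> look up by resource path
// anything else                                     -> treated here as a local name
function parseSyntaxReference(ref) {
  if (ref.startsWith('scope:')) {
    return { kind: 'scope', scope: ref.slice('scope:'.length) };
  }
  if (ref.endsWith('.sublime-syntax')) {
    return { kind: 'path', path: ref };
  }
  return { kind: 'local', name: ref };
}
```

A registry could then keep one index by scope name and one by resource path and resolve either kind to the same loaded definition.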

@Thom1729 I've been looking at the grammar repository logic for an unrelated reason, but it's not very extension friendly right now. Basically, if not grammar instanceof TreeSitterGrammar then it's assumed to be a TextMate grammar. So yeah, it's probably easiest to make it look like a TextMate grammar.

Still, the grammar registry looks relatively easy to change, as I believe only a single global one is used (atom.grammars). A community package could try reimplementing the relevant functions, to check if it's a Sublime grammar, do special stuff if so, or run the original method if not.
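The "check if it's a Sublime grammar, otherwise run the original method" idea can be sketched as a generic wrapper. The `createGrammar` method name comes from this thread, but its real signature is not shown here, so the `(grammarPath, params)` arguments, the `patchRegistry` helper, and both callbacks are assumptions; the sketch works on any object with such a method rather than on Atom's actual registry.

```javascript
// Wrap registry.createGrammar so that sublime-syntax definitions are handled
// by a package-supplied factory and everything else falls through unchanged.
// Returns a function that restores the original method.
function patchRegistry(registry, isSublimeSyntax, createSublimeGrammar) {
  const original = registry.createGrammar;
  registry.createGrammar = function (grammarPath, params) {
    if (isSublimeSyntax(grammarPath, params)) {
      return createSublimeGrammar(this, grammarPath, params);
    }
    return original.call(this, grammarPath, params);
  };
  return function restore() {
    registry.createGrammar = original;
  };
}
```

Returning a restore function lets a package undo the patch when it is deactivated, which keeps this kind of monkey-patching from outliving the package.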

I'm at least interested to see what it will be like now. My view is I don't know how useful it would be, but it should at least be possible without excessive hackery or requiring users to compile a modified Atom.

On a related note, I think making a robust, generalised grammar repository would be a great change. All grammars would need a standard interface (for something to default to), but this would allow compatible packages to take advantage of any additional features offered by the engine. Basing the interface on TextMate grammars, like I think Tree-sitter does, should alleviate any backwards compatibility concerns.

I can't do it any time soon, but I'd definitely support a well documented, tested, and thought out PR. Making a community package proof of concept probably comes first though.

In the end (i.e. once there's a working system), it looks like the only change to core would be to that grammar registry, adding a case for sublime-syntax. Even better than that would be slightly refactoring createGrammar so that a package could add its own type. This would be a very small change, and it would not tie the core to sublime-syntax in any way. (I think this is probably what you mean by a generalized grammar repository.)

On a related note, it looks like Atom doesn't actually support tmLanguage, but that tmLanguage files must be converted from XML plists to CSON in order for Atom to use them. Is this correct, or am I missing something? If this is so, then presumably sublime-syntax files would need a similar conversion from YAML to CSON.

The proof-of-concept sublime-syntax implementation transforms the syntax definitions in several ways before parsing so that the parser has as little work to do as possible. The result of these transformations really should be cached on disk. Is there an idiomatic way to do this in an Atom package? If not, then we may as well just use that final representation as the cson.

On a related note, it looks like Atom doesn't actually support tmLanguage, but that tmLanguage files must be converted from XML plists to CSON in order for Atom to use them. Is this correct, or am I missing something? If this is so, then presumably sublime-syntax files would need a similar conversion from YAML to CSON.

XML plists are quite verbose and CSON at the time was the hot new thing... I think it just made more sense from a maintenance standpoint (and maybe trying out something new), but YAML is comparable to CSON in terms of expressiveness. In my opinion it would be more appealing if any Sublime package could be made Atom compatible without having to modify the grammar source files, simply by creating a package.json file, etc.

The proof-of-concept sublime-syntax implementation transforms the syntax definitions in several ways before parsing so that the parser has as little work to do as possible. The result of these transformations really should be cached on disk. Is there an idiomatic way to do this in an Atom package? If not, then we may as well just use that final representation as the cson.

For grammars there doesn't exist anything like that, however a similar system to transpiled source files (from CoffeeScript / Babel to ES5) could be implemented. So basically the mechanisms exist, but perhaps not for this exact purpose yet.
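In the absence of a ready-made mechanism, one simple way to cache the transformed definitions is a content-addressed file cache built on Node's standard library. This is a sketch under assumptions: the `loadCompiled` name, the JSON serialization, and the caller-chosen `cacheDir` are all made up for illustration, and it presumes the transformed definition survives a JSON round trip.

```javascript
const crypto = require('crypto');
const fs = require('fs');
const path = require('path');

// Return the compiled form of `sourceText`, reading it from `cacheDir` when a
// file keyed by the source's hash already exists and compiling otherwise.
// `compile` is whatever (expensive) transformation the package performs.
function loadCompiled(sourceText, compile, cacheDir) {
  const key = crypto.createHash('sha1').update(sourceText).digest('hex');
  const cachePath = path.join(cacheDir, `${key}.json`);
  if (fs.existsSync(cachePath)) {
    return JSON.parse(fs.readFileSync(cachePath, 'utf8'));
  }
  const compiled = compile(sourceText);
  fs.mkdirSync(cacheDir, { recursive: true });
  fs.writeFileSync(cachePath, JSON.stringify(compiled));
  return compiled;
}
```

Keying on a hash of the source means an edited syntax definition is recompiled automatically, while stale cache files are simply never read again.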

