feat: syntax highlighting for .bazelrc

adam-azarchs commented 2 months ago

The main purpose here is really to prevent vscode from autodetecting .bazelrc as a shell file, resulting in complaints about syntax that isn't valid in a bash script, or mistakenly highlighting e.g. variable expansions as if they were actually honored in a .bazelrc.

We can also do a bit better, by actually understanding the semantics of e.g. config specifiers and so on.

This does not attempt to handle full bourne-shell quoting semantics. It will only handle quotes around the entire flag, e.g. "--foo=some value" or around the flag value, e.g. --foo="some value". In practice, I doubt there are many cases where anyone would want to do something other than that.

This does not try to exhaustively match the list of all possible flags, and so it can make mistakes like treating --copt -O3 as if it were two separate flags rather than a flag + value, however that's mostly harmless. Maintaining a list of known flags would be prohibitively difficult to maintain without some kind of automation.

There is special treatment for a couple of flags, including --config and a few flags that expect regex arguments, like --instrumentation_filter.

Closes #259

Choices of scope names were somewhat arbitrary, since the standard scope names don't really match up with the semantics of a .bazelrc file. I mostly tweaked them to work well on this test example.bazelrc and the vscode default dark theme:

Unit testing for syntax grammars is not particularly easy; I was using https://github.com/PanAeon/vscode-tmgrammar-test during development, as a snapshot test with the same example.bazelrc.

vogelsgesang commented 2 months ago

Thanks for this Pull Request! It looks already really good!

I quickly checked this on a file with a couple of edge cases. It seems the following cases are not handled, yet:

multiline commands, i.e. a backslash at the end of the line
escaped whitespaces, i.e. a backslash in front of a whitespace
--no@... custom flags

This does not attempt to handle full bourne-shell quoting semantics. It will only handle quotes around the entire flag, e.g. "--foo=some value" or around the flag value, e.g. --foo="some value". In practice, I doubt there are many cases where anyone would want to do something other than that.

I agree, that seems to be a sensible assumption

adam-azarchs commented 2 months ago

Thanks for this Pull Request! It looks already really good!

Thanks!

I quickly checked this on a file with a couple of edge cases. It seems the following cases are not handled, yet:

multiline commands, i.e. a backslash at the end of the line

escaped whitespaces, i.e. a backslash in front of a whitespace

Yes, I should fix these. Textmate grammars make this a bit tricky to do, however, as they don't support multiline regexes. I won't promise to fix all of these edge cases. Handling of these semantics are difficult in general because there isn't support for a "multi-pass" approach to tokenization. Also, a shell will first generate a sequence of tokens, turning e.g.

blah "blah"'"'"blah""blah\"" blah

into three tokens blah, blah"blahblah", and blah, and only then try to figure out what to do with each token. TextMate grammars are too limited to do that kind of multi-stage processing, so we basically face a choice between giving up on deeper understanding of what each token means or giving up on trying to handle every quoting/escaping situation. I'd rather do the latter.

--no@... custom flags

That's easy enough to fix.

This does not attempt to handle full bourne-shell quoting semantics. It will only handle quotes around the entire flag, e.g. "--foo=some value" or around the flag value, e.g. --foo="some value". In practice, I doubt there are many cases where anyone would want to do something other than that.

I agree, that seems to be a sensible assumption

Basically, since a .bazelrc is not a shell script, there are a lot of things it doesn't support (subshells, functions, redirections, parameter expansion, command substitution...), so most of the reasons why someone would want to use most of those "unusual" quoting patterns go away. While technically valid, they're unnecessarily obfuscating and if someone is doing that, having the syntax highlighting flag it up as questionable is probably for the best.

With the above in mind, going through your listed edge cases:

```
# A \
  multiline \
  comment
```
Fixed. I'll note this one isn't actually handled by the shell syntax highlighting that's used by default currently, either. Because I don't think it's actually valid in a shell. But, it seems it is valid in a .bazelrc, so I can make the syntax highlighting handle it.
```
build -k
```
fixed
```
build: --keep_going
```
fixed
```
buil"d:my"config --keep_going
```
Falls under "unusual" quoting patterns that I said I wasn't going to support.
```
build:my\ config --keep_going
```
fixed
```
build --no@//my/package:bool_flag
```
fixed
Various invalid target-specified flags, I guess I was a bit lazy in not repeating the full regex. Fixed.
Multi-line flag lists: fixed, at least if the line break is between flags. Trickier to get it between a flag and its value, or within a flag value, given TextMate grammar's lack of support for multi-line regex. I'm somewhat disinclined to worry too much about those cases, though I definitely can see a use case for
```
build --flag \
 "very long value that you want to put on a separate line to stay within some length limit"
```
Also, a minor refinement but I've made startup:some_config show as invalid, because according to documentation

Setting startup:config-name --some_startup_option in the .bazelrc will be ignored.

Technically "ignored" isn't the same thing as invalid, but users would probably prefer to have the visual hint that what they're doing won't work the way they probably expect it to.
Widened the regex for non-user-defined-flags to allow capitol letters because --objc_debug_with_GLIBCXX exists as the lone documented non-user-defined flag that isn't only lower-case letters, numbers, and underscores.

cameron-martin commented 2 months ago

I guess we don't have a unit testing framework set up yet, but if we did, would it be possible to add tests for this via the method you mention in the PR description?

cameron-martin commented 2 months ago

I guess we don't even need a unit testing framework to use vscode-tmgrammar-test. It'd be useful to have this running on CI and some snapshots checked into source control.

vogelsgesang commented 2 months ago

TextMate grammars are too limited to do that kind of multi-stage processing, so we basically face a choice between giving up on deeper understanding of what each token means or giving up on trying to handle every quoting/escaping situation. I'd rather do the latter.

Note I have a not-yet-opensourced language server for bazelrc files laying around. In the long run, if we integrate that language server and let it take care of adding the deeper semantic understanding, I guess the grammar should go back to focusing on the tokenization. But that's something for a future discussion and for the current PR I agree with your tradeoff

One last thing I noticed: We now also accept line continuations with whitespaces before the line break. Bazel only continues the line on backslashes immediately before the newline.

Except for that, the functionality looks good to me now, and I would be happy to merge this. Agree with @cameron-martin, though, that adding test cases would be a nice addition! (Please let us know if we are asking too much. I am already very happy with the contribution as is 🙂)

adam-azarchs commented 2 months ago

I guess we don't even need a unit testing framework to use vscode-tmgrammar-test. It'd be useful to have this running on CI and some snapshots checked into source control.

I'd be happy to add that. My main hesitation was that the project was just the first google hit for "vscode syntax grammar unit testing" and doesn't appear to be super-popular. As an xoogler I'm very well aware of how involved the process of adding any kind of third-party dependencies can be. But it's probably still safe enough in a github action with appropriately limited permissions if that would be acceptable; I have basically zero experience with buildkite.

Note I have a not-yet-opensourced language server for bazelrc files laying around. In the long run, if we integrate that language server and let it take care of adding the deeper semantic understanding, I guess the grammar should go back to focusing on the tokenization. But that's something for a future discussion and for the current PR I agree with your tradeoff

A language server seemed a bit bit overkill to me, but then again it could add e.g. links/hovertext for documentation of known flags and whatnot, which would actually be really helpful because the web docs are a bit difficult to scan through.

One last thing I noticed: We now also accept line continuations with whitespaces before the line break. Bazel only continues the line on backslashes immediately before the newline.

Ah, whopse, will fix that, maybe this evening. I'll also take another stab at trying to get mult-line continuation working between a flag and its value; having slept on it a bit, I think it's maybe not so bad.

Except for that, the functionality looks good to me now, and I would be happy to merge this. Agree with @cameron-martin, though, that adding test cases would be a nice addition! (Please let us know if we are asking too much. I am already very happy with the contribution as is 🙂)

Thanks! Let me know if a github action is acceptable for such a CI check, or if you'd prefer I could just contribute the test case file and let someone else figure out how to wire it up.

One other question, I've been developing this grammar in yaml form, because it's a lot more convenient to work with in my opinion. Would you like that in the PR as well, or just the converted-to-json form?

vogelsgesang commented 2 months ago

Let me know if a github action is acceptable for such a CI check, or if you'd prefer I could just contribute the test case file and let someone else figure out how to wire it up.

For the time being, I would prefer if you could integrate those tests such that they can be run as npm run test and then add that npm run test invocation also to the Bazel CI config. (Long-term, based on the discussion in https://github.com/orgs/bazelbuild/discussions/2, it's a bit unclear to me where this project will move regarding CI in the mid-term)

vogelsgesang commented 2 months ago

I've been developing this grammar in yaml form, because it's a lot more convenient to work with in my opinion. Would you like that in the PR as well, or just the converted-to-json form?

What is the "convert-to-json" step? Is it reversible? If it's a reversible step, I would prefer if we only keep the JSON checked in (since that's the established standard), but also add a subsection to https://github.com/bazelbuild/vscode-bazel/blob/master/CONTRIBUTING.md#setting-up-your-development-environment to document how to switch forth and back from/to YAML

adam-azarchs commented 2 months ago

I got a kind of partial fix for line continuation followed by flag value. I think I've had enough of trying to reverse-engineer the undocumented bits of textmate grammars to figure out how to do better than that, so I think this is as good as it's going to get.

For the time being, I would prefer if you could integrate those tests such that they can be run as npm run test and then add that npm run test invocation also to the Bazel CI config. (Long-term, based on the discussion in https://github.com/orgs/bazelbuild/discussions/2, it's a bit unclear to me where this project will move regarding CI in the mid-term)

That requires putting the tool in the dev dependencies in the package.json, but if you're fine with that then I'm fine with it.

I've been developing this grammar in yaml form, because it's a lot more convenient to work with in my opinion. Would you like that in the PR as well, or just the converted-to-json form?

What is the "convert-to-json" step? Is it reversible? If it's a reversible step, I would prefer if we only keep the JSON checked in (since that's the established standard), but also add a subsection to https://github.com/bazelbuild/vscode-bazel/blob/master/CONTRIBUTING.md#setting-up-your-development-environment to document how to switch forth and back from/to YAML

See https://code.visualstudio.com/api/language-extensions/syntax-highlight-guide#developing-a-new-grammar-extension, "Using YAML to write a grammar":

Yaml grammars have the exact same structure as a json based grammar but allow you to use yaml's more concise syntax, along with features such as multi-line strings and comments. VS Code can only load json grammars, so yaml based grammars must be converted to json. The js-yaml package and command-line tool makes this easy.

Converting yaml to json is a well-defined operation. However, because yaml has e.g. several ways to write a string, as well as supporting comments (which json does not), you lose something in the conversion. The main advantage IMO is not needing to escape \ or " in strings (some ways to write strings, anyway), which keeps regexes easier to read. It's probably not worth the trouble of setting up the necessary CI to enforce that they stay in sync, though.

vogelsgesang commented 2 months ago

My 2cts below. Would also be interested in @cameron-martin and / or @jfirebaugh's opinion, though 🙂

That requires putting the tool in the dev dependencies in the package.json, but if you're fine with that then I'm fine with it.

I think it's a reasonable 3pp dependency which we don't want to rewrite. As such, I am fine with adding this as a dev dependency.

See https://code.visualstudio.com/api/language-extensions/syntax-highlight-guide#developing-a-new-grammar-extension, "Using YAML to write a grammar":

Yes, I think it would make sense adding the YAML version, too. Midterm, I think we should then add js-yaml as part of our build system, and delete the JSON file again. But that can be future work, not necessarily part of this review...

cameron-martin commented 2 months ago

I'm fine with adding this extra dependency, and also agree that if YAML is preferred for authoring then that should be in source control, with JSON created during the build (but also fine with not doing that in this PR).

adam-azarchs commented 2 months ago

Ok, I've checked in the yaml instead of the json, and updated scripts/build.sh to do the conversion. I think you'll probably agree that the yaml form is more legible by humans.

I've also added an npm run test command for checking the grammar against the snapshot which I added, and update-snapshot to update that snapshot.

vogelsgesang commented 2 months ago

Thank again for your contribution 🚀

vogelsgesang commented 3 weeks ago

A language server seemed a bit bit overkill to me, but then again it could add e.g. links/hovertext for documentation of known flags and whatnot, which would actually be really helpful because the web docs are a bit difficult to scan through.

I finally found the time to open-source https://github.com/salesforce-misc/bazelrc-lsp

bazelbuild / vscode-bazel

feat: syntax highlighting for .bazelrc #379