biomejs / biome

A toolchain for web projects, aimed to provide functionalities to maintain them. Biome offers formatter and linter, usable via CLI and LSP.
https://biomejs.dev
Apache License 2.0
15.55k stars 485 forks source link

📎 Use `biome_glob` for the Biome config file #4611

Open Conaclos opened 4 days ago

Conaclos commented 4 days ago

Description

Biome currently uses a fork of the glob crate. This crate has known shortcomings that can be considered as bugs by users. See #2421 and #3345. For instance the globs src/** and *.txt are currently interpreted as **/src/** and **/*.txt. Also, we would like in the future to support convenient syntaxes such as #2986.

In the past months, several linter or assist rules required globs. Instead of using our glob fork, I introduced biome_glob::Glob (Previously known as RestrictedGlob). This currently encapsulates the globset crate and provides glob list with exceptions (negated globs).

Our current Biome configuration file uses globs in include and ignore fields. Switching from our glob fork to biome_glob could cause breakage on users. This is why I propose to keep the include and ignore fields bound to our fork of glob, to deprecate them, and to introduce a new field includes (include with a trailing s). includes will use a glob list with exceptions.

Also, to make top-level includes more accessible, to propose to place it directly at the root of the configuration (i.e. includes instead of files.include/files.ignore). See the migration example.

Tasks:

Implementation notes

Our file crawler currently doesn't traverse a directory if the directory is ignored in a ignore file. I introduced a method biome_glob::CandidatePath::matches_directory_with_exceptions to preserve this behavior.

Migration example

The migration tool will have in charge of translating include/ignore patterns into includes. For instance the globs src/** and *.txt have to be translated as **/src/** and **/*.txt.

  {
    "files": {
-     "include": ["src/**"],
-     "ignore": ["src/**/*.test.js"]
    },
+  "includes": [
+     "**/src/**",
+     "!**/src/**/*.test.js"
+  ],
    "linter": {
-     "ignore": ["**/test/**"],
+     "includes": ["!**/test/**"],
    },
    "overrides": [
      {
-       "include": ["test/**"],
+       "includes": ["**/test/**"],
      }
    ]
  }

Discussion

includes and include are really close (just a trailing s creates the difference). This could cause some confusion. An alternative name could be files. Unfortunately this conflicts with the top-level files config. If we want to use files we will have to deprecate the top-level files object and to find an alternative way of configuring the files options. Maybe the confusion is ok because include will be deprecated anyway and eventually removed.

arendjr commented 4 days ago

Good idea!

Switching from our glob fork to biome_glob could cause breakage on users. This is why I propose to keep the include and ignore fields bound to our fork of glob, to deprecate them, and to introduce a new field includes (include with a trailing s).

If we make this change with the introduction of Biome 2.0, do we need to keep the deprecated fields at all? As long as biome migrate handles the translation, I think it should be fine to make a hard cut. Of course, if we want to release this functionality as part of a minor release, then you're right, we might need to keep around for a bit.

ematipico commented 4 days ago

I wouldn't want to keep two fields, it brings too much maintenance burden for us. I would go for a clean cut. That's what breaking changes are for.

What we can try to do as much as we can with the migration command, it's possible we won't be able to do that and it's fine. We will work with users and early adopters to fix possible bugs.

We will explain the entity of the breaking change to users and provide a migration path .

Conaclos commented 4 days ago

If we make this change with the introduction of Biome 2.0, do we need to keep the deprecated fields at all?

This is a good point. I could at least keep the two fields until we implement the migration. This should happen before Biome 2.0, thus we could remove them before releasing Biome 2.0

What we can try to do as much as we can with the migration command, it's possible we won't be able to do that and it's fine. We will work with users and early adopters to fix possible bugs.

I think it is possible to cover most of the cases and emit a diagnostic for cases we cannot convert.

ematipico commented 4 days ago

Best migration path ever :)

cr7pt0gr4ph7 commented 3 days ago

One note of caution: Biome supports .editorconfig which has a slightly different pattern format than globset and even more different RestrictedGlob. If Biome continues to support .editorconfig (which I think it should), then it should probably support it completely so as not to get in the way of it's users unexpectedly.

Because it's hard to change the pattern syntax afterwards (as evidenced by this issue), we should probably make sure to get it right "this time". Especially because there are some subtleties regarding user expectations as well as a lot of corner cases involved (which is the reason why this comment got as long as it is now...).

Here's an overview of the syntax & behavior differences (and similarities) I could find:

(Note: I didn't compare with biome_service::matcher::Pattern)

Special Characters .editorconfig[^1] globset[^2] RestrictedGlob
Case sensitivity Case sensitive Configurable Case sensitive
patterns without / (e.g. *.js) match files in all subdirectories (e.g. $(rootdir)/hello.js, $(rootdir)/foo/world.js) Never automatically expanded to subdirectories
patterns with / (e.g. Bar/*.js) slash makes the pattern not match in subdirectories (Bar/*.js matches $(rootdir)/Bar/hello.js but not $(rootdir)/bat/Bar/world.js) Never automatically expanded to subdirectories
* zero or more characters, except path separators (/)[^4] zero or more characters (except path separators (/) if literal_separator option is enabled) zero or more characters, except path separators /
** any string of characters, allowed anywhere in the pattern
**/pattern files matching pattern in all directories and subdirectories files matching pattern in all directories and subdirectories. `must be delimited by/` or the pattern start/end**.
pattern/** matches all files, subdirectories and files within subdirectories, starting at the directories matched by pattern matches all files, subdirectories and files within subdirectories, starting at the directories matched by pattern
pattern/**/pattern /**/ within the pattern matches zero or more directories /**/ within the pattern matches zero or more directories
e.g. **/foo matches foo and bar/foo but not foo/bar matches foo and bar/foo but not foo/bar
e.g. foo/** matches foo/a and foo/a/b, but not foo or subdir/foo/a/b matches foo/a and foo/a/b, but not foo
e.g. foo/**/bar matches foo/bar, foo/baz/bar and foo/baz/bax/bar, but not subdir/foo/bar/bar matches foo/bar, foo/baz/bar and foo/baz/bax/bar
? any single character, except path separators (/) any single character (except path separators (/) if literal_separator option is enabled)
[seq] any single character in seq any single character in seq Not supported
[!seq] any single character not in seq any single character not in seq Not supported
{s1,s2,s3} any of the strings given (separated by commas, can be nested) (But {s1} only matches {s1} literally.) any of the strings given (nesting is not currently allowed, and {foo} matches foo, but not {foo} literally) Not supported
{num1..num2} any integer numbers between num1 and num2, where num1 and num2 can be either positive or negative Not supported
\X Special characters can be escaped with a backslash so they won't be interpreted as wildcard patterns. When backslash escapes are enabled, a backslash (\) will escape all meta characters in a glob. If it precedes a non-meta character, then the slash is ignored. A \\ will match a literal \\. Note that this mode is only enabled on Unix platforms by default, but can be enabled on any platform via the backslash_escape setting on Glob. \ must be followed by one of *?!{}[]\

Based on all this research, I've come to the following conclusions:

In conclusion, I would recommend making the .editorconfig pattern matching syntax the default going forward, cause it already has a battery of tests associated with it and has figured out a sensible way for subdirectory matching, avoiding problems like this, and just allow all its syntactic constructs for consistency.

I would even go as far as recommending contributing a Rust version of the editorconfig-core C library (there are official versions for C, Python, Ruby, JS... but none for Rust (yet)), and to use that for .editorconfig parsing, and share its glob implementation.

Even if it is decided to not go down that route, it would still be beneficial to:

[^1]: .editorconfig behavior was researched using the .editorconfig spec and the editorconfig-core-test compatibility test suite. [^2]: globset behavior was researched using its docs and its integrated tests. [^4]: The .editorconfig spec) says any string of characters, but is unclear if that means zero or more or one or more characters. I would propose using the official tests for the editorconfig-core library as the authoritative source, which specify * as zero or more characters.

Conaclos commented 3 days ago

One note of caution: Biome supports .editorconfig which has a slightly different pattern format than globset and even more different RestrictedGlob. If Biome continues to support .editorconfig (which I think it should), then it should probably support it completely so as not to get in the way of it's users unexpectedly.

I think we should use a different glob type for .gitignore and .editorconfig because they have dedicated formats that can evolve independently. By the way the current .editorconfig spec under-specifies the supported globs. I opened an issue on their repository.

Also, I plan to add support of ASCII character classes and non-empty non-nested literal alternations {s1,s2} for biome_glob.

The user expectations for file path matching vs. import path matching are subtly different

The difference is very small. I could personally use exactly the same syntax for consistency and simplicity. I am not opposed to treat *.js as **/*.js if we think it is better. I initially chose to not do that to match regular shell globs.

Some months ago I started a glob library to cover .gitignore, .editorconfig, and regular shell globs. I still plan to resume my work at some point. However, we need a glob library for Biome 2.0. This is why I introduced the biome_glob::Glob abstraction to avoid a hard dependency over globset.

cr7pt0gr4ph7 commented 3 days ago

I think we should use a different glob type for .gitignore and .editorconfig because they have dedicated formats that can evolve independently.

Fair point, given the potential for future changes, which I haven't taken into account. .gitignore is already handled via a dedicated external create. The same would probably make sense for .editorconfig then. An editorconfig-core-rust library that also takes care of .editorconfig files in subdirectories would be best.

The user expectations for file path matching vs. import path matching are subtly different

The difference is very small. I could personally use exactly the same syntax for consistency and simplicity. I am not opposed to treat *.js as **/*.js if we think it is better. I initially chose to not do that to match regular shell globs.

Well, then I think it would be best to skip the implicit *.js => **/*.js conversion, and always match against the full string. Either way, the docs are pretty thin-worded regarding the wildcard pattern syntax and its interpretation - it would be great to document at least the basics there :)