📎 Use `biome_glob` for the Biome config file

Description

Biome currently uses a fork of the glob crate. This crate has known shortcomings that can be considered as bugs by users. See #2421 and #3345. For instance the globs src/** and *.txt are currently interpreted as **/src/** and **/*.txt. Also, we would like in the future to support convenient syntaxes such as #2986.

In the past months, several linter or assist rules required globs. Instead of using our glob fork, I introduced biome_glob::Glob (Previously known as RestrictedGlob). This currently encapsulates the globset crate and provides glob list with exceptions (negated globs).

Our current Biome configuration file uses globs in include and ignore fields. Switching from our glob fork to biome_glob could cause breakage on users. This is why I propose to keep the include and ignore fields bound to our fork of glob, to deprecate them, and to introduce a new field includes (include with a trailing s). includes will use a glob list with exceptions.

Also, to make top-level includes more accessible, to propose to place it directly at the root of the configuration (i.e. includes instead of files.include/files.ignore). See the migration example.

Tasks:

[ ] Introduce the new field includes
[ ] Deprecate include/ignore and provide an automated migration (using biome migrate) to ease migration from include/ignore to includes.

Implementation notes

Our file crawler currently doesn't traverse a directory if the directory is ignored in a ignore file. I introduced a method biome_glob::CandidatePath::matches_directory_with_exceptions to preserve this behavior.

Migration example

The migration tool will have in charge of translating include/ignore patterns into includes. For instance the globs src/** and *.txt have to be translated as **/src/** and **/*.txt.

  {
    "files": {
-     "include": ["src/**"],
-     "ignore": ["src/**/*.test.js"]
    },
+  "includes": [
+     "**/src/**",
+     "!**/src/**/*.test.js"
+  ],
    "linter": {
-     "ignore": ["**/test/**"],
+     "includes": ["!**/test/**"],
    },
    "overrides": [
      {
-       "include": ["test/**"],
+       "includes": ["**/test/**"],
      }
    ]
  }

Discussion

includes and include are really close (just a trailing s creates the difference). This could cause some confusion. An alternative name could be files. Unfortunately this conflicts with the top-level files config. If we want to use files we will have to deprecate the top-level files object and to find an alternative way of configuring the files options. Maybe the confusion is ok because include will be deprecated anyway and eventually removed.

Good idea!

Switching from our glob fork to biome_glob could cause breakage on users. This is why I propose to keep the include and ignore fields bound to our fork of glob, to deprecate them, and to introduce a new field includes (include with a trailing s).

If we make this change with the introduction of Biome 2.0, do we need to keep the deprecated fields at all? As long as biome migrate handles the translation, I think it should be fine to make a hard cut. Of course, if we want to release this functionality as part of a minor release, then you're right, we might need to keep around for a bit.

I wouldn't want to keep two fields, it brings too much maintenance burden for us. I would go for a clean cut. That's what breaking changes are for.

What we can try to do as much as we can with the migration command, it's possible we won't be able to do that and it's fine. We will work with users and early adopters to fix possible bugs.

We will explain the entity of the breaking change to users and provide a migration path .

If we make this change with the introduction of Biome 2.0, do we need to keep the deprecated fields at all?

This is a good point. I could at least keep the two fields until we implement the migration. This should happen before Biome 2.0, thus we could remove them before releasing Biome 2.0

What we can try to do as much as we can with the migration command, it's possible we won't be able to do that and it's fine. We will work with users and early adopters to fix possible bugs.

I think it is possible to cover most of the cases and emit a diagnostic for cases we cannot convert.

Best migration path ever :)

One note of caution: Biome supports .editorconfig which has a slightly different pattern format than globset and even more different RestrictedGlob. If Biome continues to support .editorconfig (which I think it should), then it should probably support it completely so as not to get in the way of it's users unexpectedly.

Because it's hard to change the pattern syntax afterwards (as evidenced by this issue), we should probably make sure to get it right "this time". Especially because there are some subtleties regarding user expectations as well as a lot of corner cases involved (which is the reason why this comment got as long as it is now...).

Here's an overview of the syntax & behavior differences (and similarities) I could find:

(Note: I didn't compare with biome_service::matcher::Pattern)

Special Characters	`.editorconfig`[^1]	`globset`[^2]	`RestrictedGlob`
Case sensitivity	Case sensitive	Configurable	Case sensitive
patterns without `/` (e.g. `*.js`)	match files in all subdirectories (e.g. `$(rootdir)/hello.js`, `$(rootdir)/foo/world.js`)	Never automatically expanded to subdirectories
patterns with `/` (e.g. `Bar/*.js`)	slash makes the pattern not match in subdirectories (`Bar/*.js` matches `$(rootdir)/Bar/hello.js` but not `$(rootdir)/bat/Bar/world.js`)	Never automatically expanded to subdirectories
`*`	zero or more characters, except path separators (`/`)[^4]	zero or more characters (except path separators (`/`) if `literal_separator` option is enabled)	zero or more characters, except path separators `/`
`**`	any string of characters, allowed anywhere in the pattern
`**/pattern`	files matching `pattern` in all directories and subdirectories	files matching `pattern` in all directories and subdirectories. ``must be delimited by`/` or the pattern start/end**.
`pattern/**`	matches all files, subdirectories and files within subdirectories, starting at the directories matched by `pattern`	matches all files, subdirectories and files within subdirectories, starting at the directories matched by `pattern`
`pattern/**/pattern`	`/**/` within the pattern matches zero or more directories	`/**/` within the pattern matches zero or more directories
e.g. `**/foo`	matches `foo` and `bar/foo` but not `foo/bar`	matches `foo` and `bar/foo` but not `foo/bar`
e.g. `foo/**`	matches `foo/a` and `foo/a/b`, but not `foo` or `subdir/foo/a/b`	matches `foo/a` and `foo/a/b`, but not `foo`
e.g. `foo/**/bar`	matches `foo/bar`, `foo/baz/bar` and `foo/baz/bax/bar`, but not `subdir/foo/bar/bar`	matches `foo/bar`, `foo/baz/bar` and `foo/baz/bax/bar`
`?`	any single character, except path separators (`/`)	any single character (except path separators (`/`) if `literal_separator` option is enabled)
`[seq]`	any single character in `seq`	any single character in `seq`	Not supported
`[!seq]`	any single character not in `seq`	any single character not in `seq`	Not supported
`{s1,s2,s3}`	any of the strings given (separated by commas, can be nested) (But `{s1}` only matches `{s1}` literally.)	any of the strings given (nesting is not currently allowed, and *`{foo}` matches `foo`, but not* `{foo}` literally**)	Not supported
`{num1..num2}`	any integer numbers between `num1` and `num2`, where `num1` and `num2` can be either positive or negative		Not supported
`\X`	Special characters can be escaped with a backslash so they won't be interpreted as wildcard patterns.	When backslash escapes are enabled, a backslash (`\`) will escape all meta characters in a glob. If it precedes a non-meta character, then the slash is ignored. A `\\` will match a literal `\\`. Note that this mode is only enabled on Unix platforms by default, but can be enabled on any platform via the `backslash_escape` setting on Glob.	`\` must be followed by one of `*?!{}[]\`

Based on all this research, I've come to the following conclusions:

Wildcard patterns are currently only used within Biome for file paths and import paths (which are similar to, but still not quite like, file paths). Everything else is better handled with regex anyway.
The user expectations for file path matching vs. import path matching are subtly different:
- The handling of foo and *.foo should be treated the same w.r.t. subdirectory handling.
- Import path matching attempts to match against the full string
- If I specify test-utils for an import path pattern, it should only match "test-utils", and not "@biomejs/test-utils".
- *.ext should be treated the same.
- File path matching trades consistency against "not having mention directories in your pattern if you don't care about directories" (i.e. users say *.js and almost always mean **/*.js)
- If I specify *.js for a file path pattern, I probably want to match all JS files in all directories. This is a convenience to avoid having to write **/*.js everywhere.
- So for consistency, Dockerfile should also match subdirectory/Dockerfile.
  - I would recommend the .gitignore behavior of using /Dockerfile to indicate a pattern that matches Dockerfile but not subdirectory/Dockerfile. On further inspection, it turns out that .editorconfig handles it the same.
- If a / is present somewhere in the pattern, I probably mean just that directory. Therefore, do not implicitly prepend **/, so foo/bar/** matches foo/bar/baz but not subdir/foo/bar/baz
.editorconfig pattern matching is thought out and should cover all our use cases, and we need to implement it anyway.
It would be best to use a single pattern syntax than multiple ones (though we need a flag for "import path" vs. "file path" that controls the implicit conditional **/ behavior).

In conclusion, I would recommend making the .editorconfig pattern matching syntax the default going forward, cause it already has a battery of tests associated with it and has figured out a sensible way for subdirectory matching, avoiding problems like this, and just allow all its syntactic constructs for consistency.

I would even go as far as recommending contributing a Rust version of the editorconfig-core C library (there are official versions for C, Python, Ruby, JS... but none for Rust (yet)), and to use that for .editorconfig parsing, and share its glob implementation.

Even if it is decided to not go down that route, it would still be beneficial to:

assemble a collection of example patterns and how users expect them to act, possibly based on past GitHub issues, and to
turn the official .editorconfig tests into a format we can use and use it to spot differences

[^1]: .editorconfig behavior was researched using the .editorconfig spec and the editorconfig-core-test compatibility test suite. [^2]: globset behavior was researched using its docs and its integrated tests. [^4]: The .editorconfig spec) says any string of characters, but is unclear if that means zero or more or one or more characters. I would propose using the official tests for the editorconfig-core library as the authoritative source, which specify * as zero or more characters.

One note of caution: Biome supports .editorconfig which has a slightly different pattern format than globset and even more different RestrictedGlob. If Biome continues to support .editorconfig (which I think it should), then it should probably support it completely so as not to get in the way of it's users unexpectedly.

I think we should use a different glob type for .gitignore and .editorconfig because they have dedicated formats that can evolve independently. By the way the current .editorconfig spec under-specifies the supported globs. I opened an issue on their repository.

Also, I plan to add support of ASCII character classes and non-empty non-nested literal alternations {s1,s2} for biome_glob.

The user expectations for file path matching vs. import path matching are subtly different

The difference is very small. I could personally use exactly the same syntax for consistency and simplicity. I am not opposed to treat *.js as **/*.js if we think it is better. I initially chose to not do that to match regular shell globs.

Some months ago I started a glob library to cover .gitignore, .editorconfig, and regular shell globs. I still plan to resume my work at some point. However, we need a glob library for Biome 2.0. This is why I introduced the biome_glob::Glob abstraction to avoid a hard dependency over globset.

I think we should use a different glob type for .gitignore and .editorconfig because they have dedicated formats that can evolve independently.

Fair point, given the potential for future changes, which I haven't taken into account. .gitignore is already handled via a dedicated external create. The same would probably make sense for .editorconfig then. An editorconfig-core-rust library that also takes care of .editorconfig files in subdirectories would be best.

The user expectations for file path matching vs. import path matching are subtly different

The difference is very small. I could personally use exactly the same syntax for consistency and simplicity. I am not opposed to treat *.js as **/*.js if we think it is better. I initially chose to not do that to match regular shell globs.

Well, then I think it would be best to skip the implicit *.js => **/*.js conversion, and always match against the full string. Either way, the docs are pretty thin-worded regarding the wildcard pattern syntax and its interpretation - it would be great to document at least the basics there :)

biomejs / biome