Open Conaclos opened 4 days ago
Good idea!
Switching from our glob fork to biome_glob could cause breakage on users. This is why I propose to keep the
include
andignore
fields bound to our fork ofglob
, to deprecate them, and to introduce a new fieldincludes
(include with a trailing s).
If we make this change with the introduction of Biome 2.0, do we need to keep the deprecated fields at all? As long as biome migrate
handles the translation, I think it should be fine to make a hard cut. Of course, if we want to release this functionality as part of a minor release, then you're right, we might need to keep around for a bit.
I wouldn't want to keep two fields, it brings too much maintenance burden for us. I would go for a clean cut. That's what breaking changes are for.
What we can try to do as much as we can with the migration command, it's possible we won't be able to do that and it's fine. We will work with users and early adopters to fix possible bugs.
We will explain the entity of the breaking change to users and provide a migration path .
If we make this change with the introduction of Biome 2.0, do we need to keep the deprecated fields at all?
This is a good point. I could at least keep the two fields until we implement the migration. This should happen before Biome 2.0, thus we could remove them before releasing Biome 2.0
What we can try to do as much as we can with the migration command, it's possible we won't be able to do that and it's fine. We will work with users and early adopters to fix possible bugs.
I think it is possible to cover most of the cases and emit a diagnostic for cases we cannot convert.
Best migration path ever :)
One note of caution: Biome supports .editorconfig
which has a slightly different pattern format than globset
and even more different RestrictedGlob
. If Biome continues to support .editorconfig
(which I think it should), then it should probably support it completely so as not to get in the way of it's users unexpectedly.
Because it's hard to change the pattern syntax afterwards (as evidenced by this issue), we should probably make sure to get it right "this time". Especially because there are some subtleties regarding user expectations as well as a lot of corner cases involved (which is the reason why this comment got as long as it is now...).
Here's an overview of the syntax & behavior differences (and similarities) I could find:
(Note: I didn't compare with biome_service::matcher::Pattern
)
Special Characters | .editorconfig [^1] |
globset [^2] |
RestrictedGlob |
---|---|---|---|
Case sensitivity | Case sensitive | Configurable | Case sensitive |
patterns without / (e.g. *.js ) |
match files in all subdirectories (e.g. $(rootdir)/hello.js , $(rootdir)/foo/world.js ) |
Never automatically expanded to subdirectories | |
patterns with / (e.g. Bar/*.js ) |
slash makes the pattern not match in subdirectories (Bar/*.js matches $(rootdir)/Bar/hello.js but not $(rootdir)/bat/Bar/world.js ) |
Never automatically expanded to subdirectories | |
* |
zero or more characters, except path separators (/ )[^4] |
zero or more characters (except path separators (/ ) if literal_separator option is enabled) |
zero or more characters, except path separators / |
** |
any string of characters, allowed anywhere in the pattern | ||
**/pattern |
files matching pattern in all directories and subdirectories |
files matching pattern in all directories and subdirectories. `must be delimited by /` or the pattern start/end**. |
|
pattern/** |
matches all files, subdirectories and files within subdirectories, starting at the directories matched by pattern |
matches all files, subdirectories and files within subdirectories, starting at the directories matched by pattern |
|
pattern/**/pattern |
/**/ within the pattern matches zero or more directories |
/**/ within the pattern matches zero or more directories |
|
e.g. **/foo |
matches foo and bar/foo but not foo/bar |
matches foo and bar/foo but not foo/bar |
|
e.g. foo/** |
matches foo/a and foo/a/b , but not foo or subdir/foo/a/b |
matches foo/a and foo/a/b , but not foo |
|
e.g. foo/**/bar |
matches foo/bar , foo/baz/bar and foo/baz/bax/bar , but not subdir/foo/bar/bar |
matches foo/bar , foo/baz/bar and foo/baz/bax/bar |
|
? |
any single character, except path separators (/ ) |
any single character (except path separators (/ ) if literal_separator option is enabled) |
|
[seq] |
any single character in seq |
any single character in seq |
Not supported |
[!seq] |
any single character not in seq |
any single character not in seq |
Not supported |
{s1,s2,s3} |
any of the strings given (separated by commas, can be nested) (But {s1} only matches {s1} literally.) |
any of the strings given (nesting is not currently allowed, and {foo} matches foo , but not {foo} literally) |
Not supported |
{num1..num2} |
any integer numbers between num1 and num2 , where num1 and num2 can be either positive or negative |
Not supported | |
\X |
Special characters can be escaped with a backslash so they won't be interpreted as wildcard patterns. | When backslash escapes are enabled, a backslash (\ ) will escape all meta characters in a glob. If it precedes a non-meta character, then the slash is ignored. A \\ will match a literal \\ . Note that this mode is only enabled on Unix platforms by default, but can be enabled on any platform via the backslash_escape setting on Glob. |
\ must be followed by one of *?!{}[]\ |
Based on all this research, I've come to the following conclusions:
foo
and *.foo
should be treated the same w.r.t. subdirectory handling.test-utils
for an import path pattern, it should only match "test-utils"
, and not "@biomejs/test-utils"
.*.ext
should be treated the same.*.js
and almost always mean **/*.js
)*.js
for a file path pattern, I probably want to match all JS files in all directories. This is a convenience to avoid having to write **/*.js
everywhere.Dockerfile
should also match subdirectory/Dockerfile
.
.gitignore
behavior of using /Dockerfile
to indicate a pattern that matches Dockerfile
but not subdirectory/Dockerfile
. On further inspection, it turns out that .editorconfig
handles it the same./
is present somewhere in the pattern, I probably mean just that directory. Therefore, do not implicitly prepend **/
, so foo/bar/**
matches foo/bar/baz
but not subdir/foo/bar/baz
.editorconfig
pattern matching is thought out and should cover all our use cases, and we need to implement it anyway.**/
behavior).In conclusion, I would recommend making the .editorconfig
pattern matching syntax the default going forward, cause it already has a battery of tests associated with it and has figured out a sensible way for subdirectory matching, avoiding problems like this, and just allow all its syntactic constructs for consistency.
I would even go as far as recommending contributing a Rust version of the editorconfig-core
C library (there are official versions for C, Python, Ruby, JS... but none for Rust (yet)), and to use that for .editorconfig
parsing, and share its glob implementation.
Even if it is decided to not go down that route, it would still be beneficial to:
.editorconfig
tests into a format we can use and use it to spot differences[^1]: .editorconfig
behavior was researched using the .editorconfig spec
and the editorconfig-core-test
compatibility test suite.
[^2]: globset
behavior was researched using its docs and its integrated tests.
[^4]: The .editorconfig
spec) says any string of characters
, but is unclear if that means zero or more
or one or more
characters. I would propose using the official tests for the editorconfig-core
library as the authoritative source, which specify *
as zero or more characters
.
One note of caution: Biome supports .editorconfig which has a slightly different pattern format than globset and even more different RestrictedGlob. If Biome continues to support .editorconfig (which I think it should), then it should probably support it completely so as not to get in the way of it's users unexpectedly.
I think we should use a different glob type for .gitignore
and .editorconfig
because they have dedicated formats that can evolve independently. By the way the current .editorconfig
spec under-specifies the supported globs. I opened an issue on their repository.
Also, I plan to add support of ASCII character classes and non-empty non-nested literal alternations {s1,s2}
for biome_glob
.
The user expectations for file path matching vs. import path matching are subtly different
The difference is very small. I could personally use exactly the same syntax for consistency and simplicity.
I am not opposed to treat *.js
as **/*.js
if we think it is better.
I initially chose to not do that to match regular shell globs.
Some months ago I started a glob library to cover .gitignore
, .editorconfig
, and regular shell globs.
I still plan to resume my work at some point.
However, we need a glob library for Biome 2.0. This is why I introduced the biome_glob::Glob
abstraction to avoid a hard dependency over globset
.
I think we should use a different glob type for
.gitignore
and.editorconfig
because they have dedicated formats that can evolve independently.
Fair point, given the potential for future changes, which I haven't taken into account. .gitignore
is already handled via a dedicated external create. The same would probably make sense for .editorconfig
then. An editorconfig-core-rust
library that also takes care of .editorconfig
files in subdirectories would be best.
The user expectations for file path matching vs. import path matching are subtly different
The difference is very small. I could personally use exactly the same syntax for consistency and simplicity. I am not opposed to treat
*.js
as**/*.js
if we think it is better. I initially chose to not do that to match regular shell globs.
Well, then I think it would be best to skip the implicit *.js => **/*.js
conversion, and always match against the full string. Either way, the docs are pretty thin-worded regarding the wildcard pattern syntax and its interpretation - it would be great to document at least the basics there :)
Description
Biome currently uses a fork of the glob crate. This crate has known shortcomings that can be considered as bugs by users. See #2421 and #3345. For instance the globs
src/**
and*.txt
are currently interpreted as**/src/**
and**/*.txt
. Also, we would like in the future to support convenient syntaxes such as #2986.In the past months, several linter or assist rules required globs. Instead of using our
glob
fork, I introduced biome_glob::Glob (Previously known asRestrictedGlob
). This currently encapsulates theglobset
crate and provides glob list with exceptions (negated globs).Our current Biome configuration file uses globs in
include
andignore
fields. Switching from our glob fork tobiome_glob
could cause breakage on users. This is why I propose to keep theinclude
andignore
fields bound to our fork ofglob
, to deprecate them, and to introduce a new fieldincludes
(include
with a trailing s).includes
will use a glob list with exceptions.Also, to make top-level
includes
more accessible, to propose to place it directly at the root of the configuration (i.e.includes
instead offiles.include
/files.ignore
). See the migration example.Tasks:
includes
include
/ignore
and provide an automated migration (usingbiome migrate
) to ease migration frominclude
/ignore
toincludes
.Implementation notes
Our file crawler currently doesn't traverse a directory if the directory is ignored in a
ignore
file. I introduced a methodbiome_glob::CandidatePath::matches_directory_with_exceptions
to preserve this behavior.Migration example
The migration tool will have in charge of translating
include
/ignore
patterns intoincludes
. For instance the globssrc/**
and*.txt
have to be translated as**/src/**
and**/*.txt
.Discussion
includes
andinclude
are really close (just a trailing s creates the difference). This could cause some confusion. An alternative name could befiles
. Unfortunately this conflicts with the top-levelfiles
config. If we want to usefiles
we will have to deprecate the top-levelfiles
object and to find an alternative way of configuring the files options. Maybe the confusion is ok becauseinclude
will be deprecated anyway and eventually removed.