`vcs.useIgnoreFile` ignores nested ignore files

samhh commented 3 months ago

Environment information

CLI:
  Version:                      1.6.4
  Color support:                true

Platform:
  CPU Architecture:             aarch64
  OS:                           macos

Environment:
  BIOME_LOG_DIR:                unset
  NO_COLOR:                     unset
  TERM:                         "screen"
  JS_RUNTIME_VERSION:           "v20.11.1"
  JS_RUNTIME_NAME:              "node"
  NODE_PACKAGE_MANAGER:         "yarn/4.1.0"

Biome Configuration:
  Status:                       Loaded successfully
  Formatter disabled:           false
  Linter disabled:              false
  Organize imports disabled:    false
  VCS disabled:                 false

Workspace:
  Open Documents:               0

What happened?

vcs.useIgnoreFile respects :/.gitignore but not nested ignore files such as :/packages/foo/.gitignore.

Expected result

vcs.useIgnoreFile should respect nested ignore files like Git does. They're helpful in monorepos.

Code of Conduct

[X] I agree to follow Biome's Code of Conduct

Conaclos commented 3 months ago

This is a known limitation. I am unsure if this is properly documented.

castarco commented 3 months ago

From what I see in the code, the contents from .gitignore files are retrieved via the function retrieve_git_ignore_matches.

I checked the places where this function is called, and (for the relevant cases where this issue applies), we always have an available paths instance that could be used to our advantage.

I think that we could apply the following strategy:

At the call sites, reorder some function calls to ensure that paths is already populated by the time we call retrieve_git_ignore_matches.
Create a "sibling function" for retrieve_git_ignore_matches that basically does the same, but it also accepts a paths parameter (let's call it retrieve_merged_git_ignore_matches for now). We do that because we want to keep the old function for the places where we call it without having access to a paths variable with file paths in it.
Inside retrieve_merged_ignore_matches, transform the param paths into a shorter array, with only deduplicated directories (this is "just" a map + flatmap + reduce), for this description let's call this variable nested_dirs.
We let the "old flow" proceed, but once we obtain the vector as it is computed today, we don't return it yet.
We iterate over nested_dirs and check if we can find .gitignore files in them (without using the auto_search feature). In case we find them, we load them, we obtain their lines... but we pass them through a map function to adapt them (for example, prepending the relevant dir paths to the stated rules).
We merge the nested rules into the "main rules"
We return

EDIT:

Example. If we had a .gitignore file at the root, with the contents:

node_modules/

and then another one at ./nested/path/.gitignore, with the contents:

./cache/
*.log

The "merged" result would be as if the root .ignorefile had:

node_modules/
./nested/path/cache/
./nested/path/**/*.log # in this case we add an ** infix

EDIT 2:

~An extra doubt that came to me once I was looking at the code is if we adapt these ignore rules in any way once they are loaded.~

~I see that we use the auto_search function to look for .gitignore files up to the root of the filesystem if they are not immediately available, but sometimes these rules are relative to the directory where they are placed, and I didn't see any transformation logic for them in the retrieve_git_ignore_matches function (I doubt it is applied later, becase we loose the contextual information telling us at which directory level those rules were placed).~

Nevermind, I see that the function returns a tuple, not just a rules vector.

castarco commented 3 months ago

Well, it seems I was too optimistic. paths is not always "expanded" at that point, so those values wouldn't be enough.

It is also necessary to replicate part of the logic inside of the biome_fs crate, in the os module (functions such as handle_dir and handle_any_file), but with 2 differences:

Without triggering any handler, just to obtain the relevant paths.
Following a slightly more synchronous approach (without calling spawn), because we need to collect the results of that traversal as soon as possible (unless there is an elegant way to do that asynchronously, I'm not an expert in Rust. Using a mutex over a shared data structure?).

@Conaclos Do you think the approach I outlined makes sense? Or would it hurt performance too much because of the synchronous file traversal? (I'm not used to deal with highly optimised code, so I'm not entirely sure).

Conaclos commented 3 months ago

I think @ematipico knows more about this.

ematipico commented 3 months ago

Because of the current approach, it isn't possible to perfectly coordinate and collect the information when we traverse the file system.

We have been talking about refactoring the traversal to make it a bit different and make it more "synchronous" as you suggested. The first step might be to actually not spawning a process when traversing a folder.

Instead, we should first pause and check for relevant files such as .gitignore. After we collected the info we need from these key files, I think it's safe starting to spawn processed for the rest of the files (and ignore the files we already read, e.g. the .gitignore files)

From what I see in the code, the contents from .gitignore files are retrieved via the function retrieve_git_ignore_matches.

I checked the places where this function is called, and (for the relevant cases where this issue applies), we always have an available paths instance that could be used to our advantage.

Are you talking about retrieve_gitignore_matches? What paths are you referring too? Maybe a link to the code would help understanding the context.

castarco commented 3 months ago

Hi @ematipico , I'm sorry, I missed your response.

I have a working branch with a lot of ugly code that illustrates what I was trying to do. I can create a draft PR, so it will be easier to discuss it.

My approach has been to only look for the "relevant" nested .gitignore files, that is, to not explore the whole directories tree, but only the specific directories that could affect the traversal, and create a sort of global merged .gitignore file, to avoid having to switch configurations as we do the traversal.

biomejs / biome