GitoxideLabs / gitoxide

An idiomatic, lean, fast & safe pure Rust implementation of Git
Apache License 2.0
8.91k stars 303 forks source link

`gix clean -xde` deletes whole repo if `.gitignore` lists `*` or `/` #1458

Closed EliahKagan closed 2 months ago

EliahKagan commented 2 months ago

Current behavior 😯

gix clean -xde will delete the entire top-level repository it is operating on, including tracked files and the .git directory itself--thus the whole local history--if a .gitignore file contains * or /. This seems to happen because the repository itself is identified as an untracked nested repository.

Illustration with a real-world example

One approach to writing .gitignore is to list * followed by ! exclusions. In such a repository, running gix clean -xde deletes the entire contents of the repository directory, causing inconvenience and the loss of any unpushed data. For example, on the cargo-update repository:

ek@Glub:~/src$ git clone https://github.com/nabijaczleweli/cargo-update.git
Cloning into 'cargo-update'...
remote: Enumerating objects: 328251, done.
remote: Counting objects: 100% (83271/83271), done.
remote: Compressing objects: 100% (1630/1630), done.
remote: Total 328251 (delta 81523), reused 83220 (delta 81477), pack-reused 244980
Receiving objects: 100% (328251/328251), 110.53 MiB | 30.62 MiB/s, done.
Resolving deltas: 100% (320324/320324), done.
ek@Glub:~/src$ cd cargo-update
ek@Glub:~/src/cargo-update (master=)$ cat .gitignore
*
!.gitignore
!.travis.yml
!gh_rsa.enc
!appveyor.yml
!LICENSE
!Cargo.toml
!rustfmt.toml
!build.rs
!cargo-install-update-manifest.rc
!cargo-install-update.exe.manifest
!*.sublime-project
!*.md
!.github
!.github/**
!src
!src/**
!man
!man/**
!tests
!tests/**
!test-data
!test-data/**
ek@Glub:~/src/cargo-update (master=)$ gix clean -xdn
WOULD remove / ( )

WARNING: would remove repositories hidden inside ignored directories - use --skip-hidden-repositories to skip
ek@Glub:~/src/cargo-update (master=)$ gix clean -xde
removing / ( )
Error: Invalid argument (os error 22)
ek@Glub:~/src/cargo-update[1]$ ls -al
total 8
drwxr-xr-x  2 ek ek 4096 Jul 21 15:11 .
drwxr-xr-x 20 ek ek 4096 Jul 21 15:10 ..

That is on Ubuntu 22.04 LTS.

Windows is affected when outside the directory

Although Windows superficially appears unaffected because open files and directories cannot usually be deleted, the entire repository directory will still be deleted if one runs gix -r cargo-update clean -xde from the parent directory. (This -r is a gix option and should not be confused with gix clean -r.) So really all systems are affected, though to different degrees.

C:\Users\ek\src\cargo-update [master ≡]> gix clean -xde
removing / (🗑️)
Error: The process cannot access the file because it is being used by another process. (os error 32)
C:\Users\ek\src\cargo-update [master ≡]> cd ..
C:\Users\ek\src> gix -r cargo-update clean -xdn
WOULD remove / (🗑️)

WARNING: would remove repositories hidden inside ignored directories - use --skip-hidden-repositories to skip
C:\Users\ek\src> gix -r cargo-update clean -xde
removing / (🗑️)
C:\Users\ek\src> cd cargo-update
Set-Location: Cannot find path 'C:\Users\ek\src\cargo-update' because it does not exist.

The / means the repository

The / is in each case referring to the top-level directory of the repository. It fortunately does not refer to the actual root of the filesystem. Likewise:

However, all local branches and their history (except the tip of any branches checked out in separate worktrees and unmodified), the index, any stashes, and other objects available through the reflog, can all be deleted.

Some relevant code

It looks like, in traversals, paths that are not conceptually part of the repository are pruned, except that when a .git directory would be pruned for this reason but also matches a .gitignore entry, then it is instead retained and given the Ignored status:

https://github.com/Byron/gitoxide/blob/15f1cf76764221d14afa66d03a6528b19b9c30c9/gix-dir/src/entry.rs#L43-L49

But this is not necessarily to say that this aspect of the classification has to be changed. It may be suitable for most repositories found during traversal, just not for the top-level working tree or .git directory, nor for any submodules or their .git files.

Expected behavior 🤔

One never expects cleaning to remove tracked files or cause data loss in the repository's local history, much less loss of the whole history.

Although the behavior described here was not intended, and exposition may not be required to demonstrate that this is a bug, I've nonetheless detailed what I think is the expected behavior below. This is because I think it may be useful identifying current or future cases of the bug, some of which are less obvious than others.

A * entry in a .gitignore has real-world uses (as shown in the example of the cargo-update repository) and should be taken as a pattern that matches all files, which subsequent ! exclusions can then make exceptions to. A / might likewise be used deliberately, though I'm not sure I have seen it outside of testing. Furthermore, the ability to delete nested untracked repositories is a deliberate feature of gix clean when passed some combinations of options, and a very useful feature at that. However:

  1. The top-level directory of a repository, i.e., the entire repository, should not be deleted in a clean. This applies to the case of running gix -r ... clean -xde. This would apply even if the directory were actually empty due to a nonstandard worktree. (This -r is a gix option and should not be confused with -r as a gix clean option as presented below in case 4.)
  2. Tracked files should not be deleted or modified in a clean, regardless of whether they have any (staged or unstaged) changes. This applies to the unexpected deletions other than those of .git.
  3. Neither the current repository's .git directory nor anything inside it should ever be deleted in a clean, even if intentionally cleaning ignored files and even if intentionally including ignored subdirectories that are themselves repositories. Although the current repository's .git directory is "ignored" in the sense that commands like git add . do not stage anything from it, this behavior is separate from, and supersedes, the effect of .gitignore. The .git directory is effectively a void in the working tree.
  4. As a special case of 3, even if .git is specified explicitly in .gitignore and -r is included in the options passed to gix clean, the .git directory should not be deleted. Currently adding -r will make this happen, covered in "Steps to reproduce" below. (This -r is a gix clean option and should not be confused with -r as a gix option as presented above in case 1.)
  5. When the current repository is a submodule, is is expected to have a .git file instead of a .git directory. This likewise should not be deleted when running gix clean in the submodule, irrespective of the options to gix clean or the contents of any .gitignore file. I have not tested this case yet.
  6. When the current repository has submodules--that is, directories corresponding to entries in .gitmodules--those submodules should not be deleted by gix clean, since submodule directories are likewise voids in the superproject's working tree, rather than being ignored due to .gitignore. I have not tested this case yet either.

Cases 2 and 3 are the most important, because they are the most common and they are the most likely to cause data loss, especially case 3.

Although I have not tested cases 5 and 6 (yet), I mention them because it seems like incorrect behavior for them might be easy to bring about by accident when fixing this bug for the other cases.

Notes on -p and -r

One possible implementation approach comes to mind for a fix that I want to recommend against. While gix clean recognizes precious files, I recommend against allowing any of the above to occur even when -p is passed. I think regarding .git as typically ineligible for deletion by automatically considering it a precious directory would still be far from strong enough protection. It also doesn't really fit conceptually: I believe that precious files are conceptually those that should usually not be deleted due to their status as being important for reasons independent of source control.

In addition to the above, the effect of -x and -d on actual nested repositories, especially if they are to continue to delete them under some conditions even in the absence of -r, should be documented explicitly, including in the output of gix help clean. But that could be done separately from the fix for this bug.

Git behavior

No one-to-one comparison...

There is no exact comparison to git clean behavior, because gix clean deliberately deletes ignored nested repositories when -x and -d are passed (provided that -e is passed to allow it do anything at all). It furthermore seems to do so intentionally even without -r, though as examined below in Steps to reproduce, perhaps this is unintentional in the absence of -r.

More broadly, gix clean is not intended to behave exactly the same as git clean, as detailed in #1308.

...but git behavior is relevant

However, it's true that git clean does set strong expectations for what kinds of deletions are within the ambit of cleaning, and no way of using git clean produces this effect.

For example, running git clean -xdf in the cargo-update repository used as an example above does not delete any tracked files or the .git directory.

Steps to reproduce 🕹

To reproduce this, one can carry out the example shown above with the non-toy cargo-update repository, which confirms the practical significance.

Simplified reproducer

One can alternatively run the following commands to reproduce it with a simple repository. These and all commands shown in this section were tested in Ubuntu 22.04 LTS.

git init ignore-star
cd ignore-star
echo '*' >.gitignore
echo '!.gitignore' >>.gitignore
cat .gitignore
git add .
git status
gix clean -xdn
gix clean -xde
ls -al

With full output:

ek@Glub:~/src$ git init ignore-star
Initialized empty Git repository in /home/ek/src/ignore-star/.git/
ek@Glub:~/src$ cd ignore-star
ek@Glub:~/src/ignore-star (main #)$ echo '*' >.gitignore
ek@Glub:~/src/ignore-star (main #)$ echo '!.gitignore' >>.gitignore
ek@Glub:~/src/ignore-star (main #%)$ cat .gitignore
*
!.gitignore
ek@Glub:~/src/ignore-star (main #%)$ git add .
ek@Glub:~/src/ignore-star (main +)$ git status
On branch main

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
        new file:   .gitignore

ek@Glub:~/src/ignore-star (main +)$ gix clean -xdn
WOULD remove / (🗑️)

WARNING: would remove repositories hidden inside ignored directories - use --skip-hidden-repositories to skip
ek@Glub:~/src/ignore-star (main +)$ gix clean -xde
removing / (🗑️)
Error: Invalid argument (os error 22)
ek@Glub:~/src/ignore-star[1]$ ls -al
total 8
drwxr-xr-x  2 ek ek 4096 Jul 21 19:14 .
drwxr-xr-x 20 ek ek 4096 Jul 21 19:14 ..

Demonstration that this relates to nested repository handling

Some of the above commands are not necessary to confirm the bug but illustrate relevant aspects of it. For example, consider this part of the output of the dry-run gix clean -xdn command run before doing the real clean:

WARNING: would remove repositories hidden inside ignored directories - use --skip-hidden-repositories to skip

This strongly suggests that the problem relates to the way code in gix::dirwalk identifies nested repositories.

Variation with /

The above commands can be repeated with a / entry instead of * to confirm that it also happens with that. This works both with and without the second line, !.gitignore, since git itself treats a / entry in .gitignore not to cover .gitignore. Actually I am not sure what a / in .gitignore is supposed to do.

Variation with tracked files that are not special

Although the real-world example with cargo-update, as well as the minimal example where .gitignore lists * and !.gitingore, illustrate that files excluded from being ignored by matching a ! pattern that comes after * are not spared, here's a minimal example focused on that:

git init ignore-tracked
cd ignore-tracked
echo $'*\n!.gitignore\n!a' >.gitignore
cat .gitignore
touch a
git add .
git status
gix clean -xdn
gix clean -xde
ls -al

That produces:

ek@Glub:~/src$ git init ignore-tracked
Initialized empty Git repository in /home/ek/src/ignore-tracked/.git/
ek@Glub:~/src$ cd ignore-tracked
ek@Glub:~/src/ignore-tracked (main #)$ echo $'*\n!.gitignore\n!a' >.gitignore
ek@Glub:~/src/ignore-tracked (main #%)$ cat .gitignore
*
!.gitignore
!a
ek@Glub:~/src/ignore-tracked (main #%)$ touch a
ek@Glub:~/src/ignore-tracked (main #%)$ git add .
ek@Glub:~/src/ignore-tracked (main +)$ git status
On branch main

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
        new file:   .gitignore
        new file:   a

ek@Glub:~/src/ignore-tracked (main +)$ gix clean -xdn
WOULD remove / (🗑️)

WARNING: would remove repositories hidden inside ignored directories - use --skip-hidden-repositories to skip
ek@Glub:~/src/ignore-tracked (main +)$ gix clean -xde
removing / (🗑️)
Error: Invalid argument (os error 22)
ek@Glub:~/src/ignore-tracked[1]$ ls -al
total 8
drwxr-xr-x  2 ek ek 4096 Jul 21 22:05 .
drwxr-xr-x 20 ek ek 4096 Jul 21 22:04 ..

The key seems to be that the top-level directory is taken to match the entry *, causing its contents all to be deleted even if some of them match ! exclusions.

Variation listing .git and passing -r

Making the .gitignore file contain .git does not cause gix clean -xde to delete .git, but it does cause gix clean -xdre to delete .git. (Note that this -r is an option to gix clean and should not be confused with the -r option to gix before a subcommand, which specifies a repository to operate on.) This can be seen by running the commands:

git init ignore-dotgit
cd ignore-dotgit
echo '.git' >.gitignore
cat .gitignore
git add .
gix clean -xdn
gix clean -xde
gix clean -xdrn
gix clean -xdre
ls -al

Here's what that looks like:

ek@Glub:~/src$ git init ignore-dotgit
Initialized empty Git repository in /home/ek/src/ignore-dotgit/.git/
ek@Glub:~/src$ cd ignore-dotgit
ek@Glub:~/src/ignore-dotgit (main #)$ echo '.git' >.gitignore
ek@Glub:~/src/ignore-dotgit (main #%)$ cat .gitignore
.git
ek@Glub:~/src/ignore-dotgit (main #%)$ git add .
ek@Glub:~/src/ignore-dotgit (main +)$ gix clean -xdn
Nothing to clean (Skipped 1 repository - show with -r)
ek@Glub:~/src/ignore-dotgit (main +)$ gix clean -xde
ek@Glub:~/src/ignore-dotgit (main +)$ gix clean -xdrn
WOULD remove repository .git/ (🗑️)
ek@Glub:~/src/ignore-dotgit (main +)$ gix clean -xdre
removing repository .git/ (🗑️)
ek@Glub:~/src/ignore-dotgit$ ls -al
total 12
drwxr-xr-x  2 ek ek 4096 Jul 21 19:51 .
drwxr-xr-x 20 ek ek 4096 Jul 21 19:50 ..
-rw-r--r--  1 ek ek    5 Jul 21 19:50 .gitignore

When should -r affect what happens?

I believe this should not happen even with -r.

In addition, combined with the above results, this raises the question of whether gix clean -xde without -r is actually intended to delete any actually nested repositories:

Other variations with .git and -r

Listing /.git has the same effect as .git at least when the current directory is the top-level directory of the repository.

Listing paths under .git in .gitignore, such as with the line .git/config or /.git/config, fortunately has no effect. It does not appear that a .gitignore entry can cause gix clean with any combination of options to attempt to delete only some files inside .git.

EliahKagan commented 2 months ago

I've edited the description to note that the code I quoted is not necessarily where a change has to be made, since Ignored taking precedence over Pruned does make sense outside of specific cases.

With us being in different time zones, it would have been useful had I contributed some part of the tests or fix between the initial report and now. Unfortunately I was quite distracted and, while I was able to make a bit of progress on other gitoxide-related things, I didn't really come up with anything for this.

Byron commented 2 months ago

Thanks a lot for the incredibly detailed analysis of the problem.

I also feel that ideally, a solution would make such kind of mistakes impossible, but I am also not sure this can happen. More will be known once I dive into the code.

In addition to the above, the effect of -x and -d on actual nested repositories, especially if they are to continue to delete them under some conditions even in the absence of -r, should be documented explicitly, including in the output of gix help clean. But that could be done separately from the fix for this bug.

I'd hope you will be able to submit a PR with the documentation improvements you would like to see - I cannot imagine anyone better suited to make them.

Besides that, I hope that the default messaging around the removal of ignored directories works:

❯ cargo run --bin gix -- -r /Users/byron/dev/github.com/nabijaczleweli/cargo-update clean -xd
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.23s
     Running `target/debug/gix -r /Users/byron/dev/github.com/nabijaczleweli/cargo-update clean -xd`
WOULD remove / (🗑️)

WARNING: would remove repositories hidden inside ignored directories - use --skip-hidden-repositories to skip

Even though the example is taken from the buggy version that is unaware this is the root, it does warn that ignored repositories would be removed. Overall, I like the way it guides the user towards adding flags to change the set of would-be-cleaned files.

Regarding /, I think I should add tests around that to see what happens - it's nothing I really considered which makes it likely to exhibit surprising behaviour. In practice as demonstrated here, it seems to be another way of matching all with * though.

Regarding the !<file> not having any effect, I think the real issue here is that it doesn't care about these at all even though it would probably identify them as tracked. The directory walk probably disables listing these, so it won't see them at all. This leaves the issue at incorrectly classifying the root-directory as something that can be deleted. Additional tests around ignoring .git should probably also be added for good measure.

The key seems to be that the top-level directory is taken to match the entry *, causing its contents all to be deleted even if some of them match ! exclusions.

Agreed.

In addition, combined with the above results, this raises the question of whether gix clean -xde without -r is actually intended to delete any actually nested repositories:

The intended behaviour is that it allows itself to remove ignored git repositories which are contained in ignored directories. That's easy to do as it may not even look deeper unless the --skip-hidden-repositories option is provided.

The root cause is probably that it's possible to declare the top-level directory as ignored, which allows contained repositories to be removed as well.

❯ cargo run --bin gix -- -r /Users/byron/dev/github.com/nabijaczleweli/cargo-update clean -xd --skip-hidden-repositories all -r
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.65s
     Running `target/debug/gix -r /Users/byron/dev/github.com/nabijaczleweli/cargo-update clean -xd --skip-hidden-repositories all -r`
WOULD remove repository .git/ (🗑️)

This invocation shows the behaviour quite nicely, as it now found the 'hidden' repository itself, and flagged it for deletion as requested. Of course, that shouldn't be possible here. It's notable that this also won't be possible for submodules as these should be identified as 'tracked', but the root of the repository is never tracked.

So I'd think this is a very 'local' bug as it really can only apply to the root of the repository, or so I'd think. Of course, I encourage you to prove that wrong.

EliahKagan commented 2 months ago

In addition to the above, the effect of -x and -d on actual nested repositories, especially if they are to continue to delete them under some conditions even in the absence of -r, should be documented explicitly, including in the output of gix help clean. But that could be done separately from the fix for this bug.

I'd hope you will be able to submit a PR with the documentation improvements you would like to see - I cannot imagine anyone better suited to make them.

Besides that, I hope that the default messaging around the removal of ignored directories works:

❯ cargo run --bin gix -- -r /Users/byron/dev/github.com/nabijaczleweli/cargo-update clean -xd
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.23s
     Running `target/debug/gix -r /Users/byron/dev/github.com/nabijaczleweli/cargo-update clean -xd`
WOULD remove / (🗑️)

WARNING: would remove repositories hidden inside ignored directories - use --skip-hidden-repositories to skip

Even though the example is taken from the buggy version that is unaware this is the root, it does warn that ignored repositories would be removed. Overall, I like the way it guides the user towards adding flags to change the set of would-be-cleaned files.

Yes. Assuming it is intended that -r not always be required to allow nested repositories to be removed--which I think is confirmed by your subsequent description of its interaction with --skip-hidden-repositories--that output is good. But then the way -r/--repositories is documented, at least in the output of gix help clean (i.e. the documentation here), is misleading:

  -r, --repositories
          Remove nested repositories

This makes it seem not only that -r causes nested repositories to be removed, which is true, but that, without -r, nested repositories are not removed.

I'll think about how to improve that wording and open a pull request to do so.

Regarding /, I think I should add tests around that to see what happens - it's nothing I really considered which makes it likely to exhibit surprising behaviour. In practice as demonstrated here, it seems to be another way of matching all with * though.

As far as how gitoxide currently treats / as a full path, I think so.

But git does not itself treat / as a full path the same way. For example, in the above experiments, listing / did not cause git to regard .gitignore as ignored--when making the / test repository, I didn't have to have !.gitignore, or any other entry appended after /, to successfully stage .gitignore by running git add .. Repeating them now confirms this, and also files with names like a are not ignored by git due to / in .gitignore. For git, I am not sure / as a full entry in .gitignore has any effect at all.

Regarding the !<file> not having any effect, I think the real issue here is that it doesn't care about these at all even though it would probably identify them as tracked. The directory walk probably disables listing these, so it won't see them at all. This leaves the issue at incorrectly classifying the root-directory as something that can be deleted. Additional tests around ignoring .git should probably also be added for good measure.

That makes sense, because if I ignore foo and unignore bar then foo/bar is still ignored due to being in the foo directory.

So I'd think this is a very 'local' bug as it really can only apply to the root of the repository, or so I'd think. Of course, I encourage you to prove that wrong.

I would not consider this to disprove that, but I have found that when the current repository is a submodule with a .git file and its own .gitignore lists .git, deleting the .git file in a submodule does not require -r.

That pertains to "Case 5" in the Expected behavior section of the issue description. I had not tried it at that time. I have tried it out now, by doing a recursive clone of has-submodule and experimenting with .gitignore.

It's notable that this also won't be possible for submodules as these should be identified as 'tracked'

Yes, I do think case 6 is already not a problem, and the experiments I've just done seem to confirm this.

Byron commented 2 months ago

I'll think about how to improve that wording and open a pull request to do so.

Thank you 🙏

But git does not itself treat / as a full path the same way. For example, in the above experiments, listing / did not cause git to regard .gitignore as ignored--when making the / test repository, I didn't have to have !.gitignore, or any other entry appended after /, to successfully stage .gitignore by running git add .. Repeating them now confirms this, and also files with names like a are not ignored by git due to / in .gitignore. For git, I am not sure / as a full entry in .gitignore has any effect at all.

Thanks for researching this. I will add this to the baseline tests (comparing Git results with gitoxide) so that it's clear these match. Once again, I think this case is currently missing.

That makes sense, because if I ignore foo and unignore bar then foo/bar is still ignored due to being in the foo directory.

I mean, this is how gix clean does it - it only looks one dir-entry at a time and asks for whether it's igonred or precious. And I don't know what would happen if /foo\n!bar is used in .gitignore and someone tries to add foo/bar. Probably foo/bar would then be allowed, at least it should be as per the rules. In case you are interested, this should be verifiable with gix exclude query.

I would not consider this to disprove that, but I have found that when the current repository is a submodule with a .git file and its own .gitignore lists .git, deleting the .git file in a submodule does not require -r.

Thank you, that sounds wrong, too, and I will take a look.

EliahKagan commented 2 months ago

To clarify the part about the .git file, the repository is the submodule, and I mean that deleting the repository's own .git file does not require -r. (I was not trying to describe a situation with nested submodules.)

This is likely what you took me to mean, but I just realized that was extremely ambiguous, so I figured I'd clarify just in case.

EliahKagan commented 2 months ago

I'll think about how to improve that wording and open a pull request to do so.

Thank you 🙏

No problem! I'm holding off for a bit in doing so, since #1464 might affect what ought to be said.