NixOS / nix

Nix, the purely functional package manager
https://nixos.org/
GNU Lesser General Public License v2.1

Garbage collection fails because directory is not empty #11134

Open ejb-11 opened 3 months ago

ejb-11 commented 3 months ago

Description

When I run

    # nix-collect-garbage -d

it fails on a particular store path (/nix/store/c28ybd0ja4i8rh78svzhqvrqpnqzk44f-steam-usr-target) with the following error:

    error: cannot unlink '/nix/store/c28ybd0ja4i8rh78svzhqvrqpnqzk44f-steam-usr-target/share/man/man3': Directory not empty

When I list the contents of this "directory", ls prints nearly 100 errors, most of them like this one:

    ls: cannot access '/nix/store/c28ybd0ja4i8rh78svzhqvrqpnqzk44f-steam-usr-target/share/man/man3/OPENSSL_FILE.3ssl.gz': No such file or directory

It does eventually list some "files", but ls -l shows all of their properties as unknown ('?'), e.g.:

    l????????? ? ? ? ? ? OCSP_sendreq_nbio.3ssl.gz

It also reports a "total" of zero for this phantom directory. All of the phantom files seem to relate to OpenSSL, so maybe that has something to do with it, idk.

Steps To Reproduce

I have no clue, but it appears to be within the Steam FHS environment. This has been preventing garbage collection for a while now.

Expected behavior

It should delete/unlink this Schrödinger's directory that seems to exist and not exist at the same time.

nix-env --version output

nix-env (Nix) 2.18.4

Additional context

I run NixOS. I can rebuild and change my system configuration just fine, only garbage collection is impossible due to this problem.

Sorry if this is the wrong place to report this. Thanks in advance for any help.

roberth commented 3 months ago

> l????????? ? ? ? ? ? OCSP_sendreq_nbio.3ssl.gz

It looks like your filesystem was corrupted, so I'm not sure that this is solvable in Nix. That's beyond Nix's control, and I don't know how usable your filesystem is, given the state it is in.

I do see a slight possibility that Nix could try harder to delete directories. My hypothesis is that this man3 directory reports a size of 2 in stat (for . and ..), but nonetheless returns directory entries when you read it, as evidenced by your ls output. This is arguably a problem in your file system, but maybe it's recoverable. Could you check these two things, in this order?

  1. What size does stat /nix/store/c28ybd0ja4i8rh78svzhqvrqpnqzk44f-steam-usr-target/share/man/man3 report?
  2. If you mount the file system of your store read-write at a different mount point, is it possible to remove the file?
    rm /nix/store/c28ybd0ja4i8rh78svzhqvrqpnqzk44f-steam-usr-target/share/man/man3/OPENSSL_FILE.3ssl.gz

    or the simpler command

    unlink /nix/store/c28ybd0ja4i8rh78svzhqvrqpnqzk44f-steam-usr-target/share/man/man3/OPENSSL_FILE.3ssl.gz
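The hypothesis above is that stat(2) and readdir(3) disagree about man3. A minimal bash sketch of that check, assuming GNU coreutils stat (as shipped on NixOS): it prints the directory's size and link count, then walks the names that readdir returns (bash builds the glob from readdir) and flags any name that neither stat nor lstat can find, which would match the l????????? lines in the report. The function name check_dir is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helper): compare what stat(2) says about a directory
# with the names readdir(3) returns for it. Requires GNU coreutils stat.
# The body runs in a subshell ( ... ) so the shopt changes stay local.
check_dir() (
  dir=$1
  # Size and hard-link count of the directory inode; what the size means
  # depends on the filesystem.
  echo "size: $(stat -c %s -- "$dir") links: $(stat -c %h -- "$dir")"
  shopt -s nullglob dotglob   # empty glob expands to nothing; include dotfiles
  listed=0 phantom=0
  for entry in "$dir"/*; do   # bash builds this list from readdir
    listed=$((listed + 1))
    # -L uses lstat(2), -e uses stat(2); if both fail the name is a "phantom".
    if [ ! -L "$entry" ] && [ ! -e "$entry" ]; then
      phantom=$((phantom + 1))
      echo "phantom: ${entry##*/}"
    fi
  done
  echo "listed: $listed phantom: $phantom"
)
```

For example, check_dir /nix/store/c28ybd0ja4i8rh78svzhqvrqpnqzk44f-steam-usr-target/share/man/man3. On a healthy directory the phantom count should be 0; a dangling symlink still counts as present because lstat finds it.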

ejb-11 commented 3 months ago

Thanks for your help. Running stat on the directory as you suggested gave a size of 1036288. I guess it's corrupted. I tried deleting it, but as you pointed out, it is read-only, and I don't know how to mount it as r/w. I have the store in the same partition as the root, and when I run lsblk it shows them both being on the same partition but mounted separately. How would I mount it for read/write in this scenario?

I am using F2FS, which in hindsight was perhaps unwise, but I wanted to preserve my SSD, which the manufacturer rates for relatively few writes (it is 2TB in size but rated for only 440TB written, so only about 220 writes per storage location). Perhaps F2FS has poor error recovery (or none at all)?

roberth commented 3 months ago

> How would I mount it for read/write in this scenario?

I think generally the way to do it is

mkdir /root/repair
mount /dev/something /root/repair
...
umount /root/repair

The idea is that /dev/something is the device that contains the store file system; if you mount it a second time, alongside the existing mount, the new mount will be read-write.

This doesn't work if the file system itself is read-only, in which case you have to rescue it by booting from a USB stick or similar. I don't think that's the situation here.
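To find the /dev/something behind the store, util-linux's findmnt can report which mount contains /nix/store. On NixOS the store is normally a read-only bind mount of a directory on another filesystem (the root filesystem, in this report), and findmnt prints bind mounts as DEVICE[/subdir], so this bash sketch strips the bracketed part. The helper names are hypothetical and the device name in the comments is a placeholder.

```bash
#!/usr/bin/env bash
# Sketch: locate the block device behind /nix/store so it can be mounted a
# second time, read-write. Assumes util-linux findmnt; helper names are
# hypothetical.

# findmnt shows bind mounts as "DEVICE[/subdir]"; keep only DEVICE.
strip_fsroot() {
  printf '%s\n' "${1%%\[*}"
}

# Print the source device of the mount containing $1 (default /nix/store).
store_device() {
  strip_fsroot "$(findmnt -n -o SOURCE --target "${1:-/nix/store}")"
}

# Usage, as root:
#   dev=$(store_device)          # e.g. /dev/nvme0n1p2 (placeholder)
#   mkdir -p /root/repair
#   mount "$dev" /root/repair    # if the store is on the root fs, it is now
#                                # also visible at /root/repair/nix/store
#   ...                          # repair work
#   umount /root/repair
```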

ejb-11 commented 2 months ago

Thanks a lot. I didn't know the same partition could be mounted in multiple places at once.

I mounted it and tried to remove everything in the folder like so:

    # rm repair/nix/store/c28ybd0ja4i8rh78svzhqvrqpnqzk44f-steam-usr-target/share/man/man3/*

but for each of the "files" supposedly contained therein I got an error like this:

    rm: cannot remove 'repair/nix/store/c28ybd0ja4i8rh78svzhqvrqpnqzk44f-steam-usr-target/share/man/man3/OCSP_sendreq_nbio.3ssl.gz': No such file or directory

Likewise, I tried running

    # rm -rf repair/nix/store/c28ybd0ja4i8rh78svzhqvrqpnqzk44f-steam-usr-target/share/man/man3

but got this in response:

    rm: cannot remove 'repair/nix/store/c28ybd0ja4i8rh78svzhqvrqpnqzk44f-steam-usr-target/share/man/man3': Directory not empty

so even -rf doesn't get the directory deleted.

When I tried

    # unlink repair/nix/store/c28ybd0ja4i8rh78svzhqvrqpnqzk44f-steam-usr-target/share/man/man3/*

I got an odd response:

    unlink: extra operand ‘repair/nix/store/c28ybd0ja4i8rh78svzhqvrqpnqzk44f-steam-usr-target/share/man/man3/OCSP_sendreq_new.3ssl.gz’

So I tried unlinking just one file, which also didn't work:

    # unlink repair/nix/store/c28ybd0ja4i8rh78svzhqvrqpnqzk44f-steam-usr-target/share/man/man3/OPENSSL_FILE.3ssl.gz
    unlink: cannot unlink 'repair/nix/store/c28ybd0ja4i8rh78svzhqvrqpnqzk44f-steam-usr-target/share/man/man3/OPENSSL_FILE.3ssl.gz': No such file or directory
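Two coreutils details explain the last results. unlink(1) accepts exactly one operand, so a glob that expands to many names is rejected with "extra operand"; it has to be called once per name. And rm -r deletes a directory by first removing the entries it can see and then removing the directory itself, which fails with "Directory not empty" (ENOTEMPTY) while the filesystem still considers any entry present; -f only suppresses prompts and errors about nonexistent files, so it cannot force that last step. A minimal bash sketch of the per-name loop, using a hypothetical helper name unlink_each:

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helper): call unlink(1) once per entry, since it takes
# exactly one operand, then try to remove the directory itself.
# Entries that are directories cannot be unlinked and are counted as failed.
# The body runs in a subshell ( ... ) so the shopt changes stay local.
unlink_each() (
  dir=$1
  shopt -s nullglob dotglob   # empty glob expands to nothing; include dotfiles
  ok=0 failed=0
  for entry in "$dir"/*; do
    if unlink "$entry" 2>/dev/null; then
      ok=$((ok + 1))
    else
      failed=$((failed + 1))
      echo "could not unlink: ${entry##*/}" >&2
    fi
  done
  echo "unlinked: $ok failed: $failed"
  # rmdir fails with "Directory not empty" while any entry remains;
  # its exit status becomes the function's exit status.
  rmdir -- "$dir"
)
```

On the corrupted man3 this would most likely just reproduce the "No such file or directory" errors above; it only helps when the entries really exist.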

I might just reformat my drive to use ext4 and rebuild my system from my configuration.nix, as I read up and found that some people had issues with f2fs and NixOS. Thanks again for spending your valuable time to help me troubleshoot this problem.

L-as commented 2 months ago

> I am using F2FS, which in hindsight was perhaps unwise, but I wanted to preserve my SSD, which the manufacturer rates for relatively few writes (it is 2TB in size but rated for only 440TB written, so only about 220 writes per storage location). Perhaps F2FS has poor error recovery (or none at all)?

Did you enable all the checksum stuff when creating the filesystem? I used to have (minorish) corruption issues on hard crashes, but I don't any longer. Are you even fsck-ing it? Any time I've had issues it just failed at the fsck-phase, saying it couldn't unf*ck it.

I've been using f2fs for a very long time on NixOS without issues.
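On the checksum question: as far as I know, F2FS's inode and superblock checksums are features chosen when the filesystem is created (mkfs.f2fs -O extra_attr,inode_checksum,sb_checksum, where inode_checksum needs extra_attr), and a mounted F2FS filesystem lists its enabled features in /sys/fs/f2fs/<disk>/features. A small bash sketch that checks such a feature list for the two checksum flags; the function name f2fs_checksums is hypothetical and nvme0n1p2 in the usage line is a placeholder for your partition.

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helper): report whether an F2FS feature list separated
# by commas, spaces, or newlines (e.g. the contents of
# /sys/fs/f2fs/<disk>/features) contains the inode and superblock checksum
# features. Returns 0 only if both are present.
f2fs_checksums() {
  # Turn all whitespace into commas and wrap in commas so every name can be
  # matched as ",name,".
  local list=",${1//[[:space:]]/,}," feature missing=0
  for feature in inode_checksum sb_checksum; do
    case $list in
      *",$feature,"*) echo "$feature: enabled" ;;
      *) echo "$feature: missing"; missing=1 ;;
    esac
  done
  return "$missing"
}

# Usage (device name is a placeholder):
#   f2fs_checksums "$(cat /sys/fs/f2fs/nvme0n1p2/features)"
```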