Open aciceri opened 2 years ago
@thufschmitt You mentioned this same problem almost two years ago. Could you confirm this is still a bug and it doesn't depend on my particular configuration or derivations, please?
Forgive me for directly pinging you but I need to be sure that this is currently broken.
Moreover, if I repeat nix build several times, in the end I get my derivation built. While waiting for a fix, how bad is this? Does it imply a "rotten" output, or is the output as it should be despite the errors?
I would like to help but this is my first time trying to read Nix's source code and I fear this bug isn't easy to fix. I can't even imagine what is causing it.
@aciceri I have these occasionally (though it might not be the same cause as the original one; it had mostly disappeared until a couple of months ago). But I couldn’t manage to reproduce it in a deterministic-ish way :(
Moreover, if I repeat nix build several times, in the end I get my derivation built. While waiting for a fix, how bad is this? Does it imply a "rotten" output, or is the output as it should be despite the errors?
I don’t think so. I think the error is “just” that some fds get closed too early or too late (in a totally non-deterministic fashion), but things seem to work well when it doesn’t happen.
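To illustrate the kind of failure described above, here is a minimal sketch (in Python rather than Nix's own C++, purely for brevity) of what happens when a descriptor is closed one time too many. It is not taken from the Nix source; the function name is hypothetical and it only demonstrates where a "Bad file descriptor" (EBADF) error comes from.

```python
import errno
import os


def close_twice():
    """Close a pipe's read end twice; return the errno of the second close."""
    read_fd, write_fd = os.pipe()
    os.close(write_fd)
    os.close(read_fd)  # first close succeeds
    try:
        os.close(read_fd)  # the descriptor is no longer open
    except OSError as exc:
        return exc.errno
    return None  # only reached if the number was reused in the meantime


if __name__ == "__main__":
    code = close_twice()
    if code is not None:
        print(errno.errorcode.get(code, code), os.strerror(code))
```

In a multi-threaded program the second close can be worse than an error: if another thread has opened a new descriptor that received the same number, the stale close silently closes that unrelated descriptor instead.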
I'm getting this error quite regularly. I have the ngi0 cache enabled, which I think may be causing this issue to occur more often.
I also see this at one place in the journal:
nix-daemon[3456058]: terminate called after throwing an instance of 'nix::SysError'
nix-daemon[3456058]: what(): error: closing file descriptor 7762532: Bad file descriptor
nix-daemon[3457576]: corrupted double-linked list
nix-daemon[3508170]: corrupted size vs. prev_size while consolidating
Is there a way to enable more debug logging to maybe catch more of what's happening internally? I get this error quite often (especially now that I'm building a lot of things), so I'd like to help with figuring out where the issue lies.
nix --version
nix (Nix) 2.9.0pre20220512_d354fc3
@Mindavi I can't see any connection with the ngi0 cache, to be honest. What do you mean? These errors happen while building derivations, so the more it can fetch from the cache, the lower the chances of these errors occurring.
However, I would really like ca-derivations to work too, and I'm available to work on this, but this is the first time I've put my hands on the nix source itself. @thufschmitt What do you recommend? Which preliminary readings would help me really understand how ca-derivations work at a low level? Which files in the source are really involved? Is there a way to better debug what is happening?
It also seems to happen without the ngi0 cache: https://github.com/helsinki-systems/harmonia/runs/7183636987?check_suite_focus=true
This also seems to happen without using content-addressed derivations.
I can reliably reproduce this when using ca-derivations. Is there any way I can debug this? This is quite annoying.
This now happens to me repeatedly when not using CA derivations (when building stuff) and is really annoying, although probably difficult to debug, as there is no information about what type of file descriptor has this "use-after-free" like problem. Maybe such diagnostic should be added.
Just removing close calls one by one could tell you at least which one it is.
that will just cause nix to quickly run out of file descriptors, I suppose (given it opens thousands of them in mere seconds regularly)
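As an illustration of the diagnostic suggested above (reporting what kind of file descriptor was involved), the following sketch classifies a descriptor with fstat. It is a hypothetical helper written in Python for brevity, not code from Nix; the name describe_fd and its labels are assumptions made for this example.

```python
import errno
import os
import stat


def describe_fd(fd):
    """Return a short label for what fd refers to, or 'invalid' if it is not open."""
    try:
        mode = os.fstat(fd).st_mode
    except OSError as exc:
        if exc.errno == errno.EBADF:
            return "invalid"
        raise
    checks = [
        (stat.S_ISREG, "regular file"),
        (stat.S_ISDIR, "directory"),
        (stat.S_ISFIFO, "pipe"),
        (stat.S_ISSOCK, "socket"),
        (stat.S_ISCHR, "character device"),
        (stat.S_ISBLK, "block device"),
    ]
    for predicate, label in checks:
        if predicate(mode):
            return label
    return "other"


if __name__ == "__main__":
    read_fd, write_fd = os.pipe()
    print(describe_fd(read_fd))
    os.close(read_fd)
    os.close(write_fd)
    print(describe_fd(read_fd))
```

Once a descriptor has been closed, nothing can be learned from it any more, so a diagnostic along these lines would have to record the kind when the descriptor is opened, or just before the first close, to be useful in the error message.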
Note: I've successfully worked around this by doing ulimit -n $((1024 * 1024)). Note that you need to increase the hard limit in your NixOS config like so (IIRC):
{
  systemd.extraConfig = "DefaultLimitNOFILE=1048576";
}
Seems like the default of 1024 is causing issues, but I'm not sure it's worth fixing if just increasing it fixes it.
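For anyone who wants to check what limits a process actually runs with before and after this workaround, here is a small sketch using Python's standard resource module. The helper names are hypothetical; the logic mirrors what ulimit -n does for the soft limit, which can never be raised above the hard limit by an unprivileged process.

```python
import resource


def clamp_soft_limit(target, soft, hard):
    """Pick the soft limit to request: never below the current one, never above hard."""
    infinity = resource.RLIM_INFINITY
    if soft == infinity:
        return soft  # already unlimited, nothing to raise
    wanted = max(soft, target)
    if hard != infinity:
        wanted = min(wanted, hard)
    return wanted


def raise_nofile_limit(target=1024 * 1024):
    """Raise this process's open-file soft limit towards target; return (soft, hard)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_soft = clamp_soft_limit(target, soft, hard)
    if new_soft != soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("before:", resource.getrlimit(resource.RLIMIT_NOFILE))
    print("after: ", raise_nofile_limit())
```

A change made this way only applies to the calling process and its children. The nix-daemon is a separate service with its own limits, which is why the systemd setting above is needed as well.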
@thufschmitt It doesn't seem to solve the problem entirely, unfortunately. Perhaps it needs to be a very high value to avoid it.
Can confirm that even outside of this bug Nix frequently runs into problems with a low ulimit -n
Still seeing this, maybe it helps that I'm now building with ubsan enabled.
I demangled some of the symbols from this and it seems to be pointing to somewhere here:
I have been working on debugging this, but haven't been able to reliably reproduce it. I got the error myself quite a few times when I first enabled CA derivations, but now it's not happening at all. Anyone who is still consistently getting the error: can you make a small example flake that is (somewhat) reliably triggering it? (with nix-store --delete in between invocations, presumably)
This issue has been mentioned on NixOS Discourse. There might be relevant details there:
https://discourse.nixos.org/t/content-addressed-nix-call-for-testers/12881/217
Describe the bug
When I try to build ca derivations I get sporadic errors about "bad file descriptor"s. Sometimes they are built and sometimes not.
Steps To Reproduce
Same command again gives another output
I've built nix from master (the latest commit), but I had the same problem with current nixos-unstable's nix (2.8) and even with nixUnstable on stable (2.5).
Additional context
Sometimes I get "core dumped" in the nix daemon logs but not necessarily.