
nix copy uses too much memory #1681

Open LisannaAtHome opened 6 years ago

LisannaAtHome commented 6 years ago

I'm running nix copy in runInLinuxVM, and for any nontrivial closure the VM runs out of memory during the copying process. I left the VM's memory at the default of 512 megabytes. I could obviously increase the amount of memory the VM is given, but that doesn't scale for copying complex derivations with many dependencies.

I suggest adding an option to only load and copy the contents of the paths one at a time, or even better, a way to specify an upper bound on the memory to be used while copying.
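
As a stopgap, something along these lines keeps only one path in flight at a time (a rough sketch; ssh://destination is a placeholder, and since nix copy copies each path's closure, paths already on the target are skipped, so in practice one new path moves at a time):

# copy the closure of ./result one store path at a time
for p in $(nix-store -qR ./result); do
  nix copy --to ssh://destination "$p"
done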

copumpkin commented 6 years ago

Intuitively it feels like it should be possible for this to run in constant memory. What am I missing?

lheckemann commented 6 years ago

I'm encountering this issue with a single path — nix copy, nix-store --import, and a number of other commands I've tried all fail to import the path. Would be great to know if there's any way at all I can import it…

LisannaAtHome commented 6 years ago

Possibly related to https://github.com/NixOS/nix/issues/1969 ? Looks like some patches have gone in recently that might improve things here: https://github.com/NixOS/nix/commit/48662d151bdf4a38670897beacea9d1bd750376a https://github.com/NixOS/nix/commit/3e6b194d78024373c2320f31f4ba0de3d0658b83

Ralith commented 6 years ago

I see commits purporting to address this for a number of different cases, but none concerning uploads to an S3 bucket. Trying to copy a 2.8 GB store path to an S3 bucket took nearly 4 GB of memory and more than twenty minutes of 100% CPU. Has that been fixed?

andrewchambers commented 6 years ago

Hitting this issue trying to do something like: nixos-rebuild build; nix copy ./result --to ssh://low_ram_machine

@dtzWill will those experimental changes help with ssh copy?

edolstra commented 6 years ago

@Ralith I'm probably not going to make S3BinaryCacheStore do uploads in constant space. It might not even be supported by aws-sdk-cpp.

I assume the 100% CPU is caused by compression, which you can disable.
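
(If I remember the store URI parameters correctly, disabling it would look something like the following; the bucket name is a placeholder:)

# 'compression=none' skips xz on the NARs written to the cache
nix copy --to 's3://my-cache?compression=none' /nix/store/<hash>-some-path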

copumpkin commented 6 years ago

FWIW I too am another big-upload-to-S3 guy using nix copy 😄

It would surprise me if aws-sdk-cpp didn't support it, given that S3 supports almost arbitrarily large objects and multi-part uploads. If someone figured out how to implement it, would you accept the PR?
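
For what it's worth, the aws CLI can already do streaming multipart uploads from a pipe, so the service side clearly allows constant-space uploads. A quick illustration (bucket and path are placeholders):

# stream a NAR to S3 without materializing it on disk or wholly in memory;
# the CLI splits stdin into multipart chunks automatically
nix-store --dump /nix/store/<hash>-some-path | xz | aws s3 cp - s3://my-cache/nar/some-path.nar.xz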

Ralith commented 6 years ago

I assume the 100% CPU is caused by compression, which you can disable.

It seems very strange that it would take twenty minutes on my i7-4980HQ, even so. 2.8GB is big but it's not that big.

edolstra commented 6 years ago

IIRC xz compression can easily take that long.

coretemp commented 6 years ago

This is what I am seeing too:

a...........> copying path '/nix/store/fl3mcaqqk2vg0dmk01dfbs6nbm5skpzc-systemd-237' from 'https://cache.nixos.org'...
a...........> error: out of memory

The main problem I see is that it merely says "out of memory" instead of reporting how much it tried to allocate and how much was available before the allocation. Copying data should run in constant space, as others have already mentioned.

If the compression causes higher memory requirements than needed, that is a problem too, because it raises hosting costs for no reason beyond the initial deployment.

Before the deployment at least 300MB was available on host a.

dtzWill commented 6 years ago

FWIW it looks like they do support streaming at least for fetches:

https://sdk.amazonaws.com/cpp/api/LATEST/index.html

(Near end, look for IOStreams).

Hopefully upload has something similar.

Seconded re: xz compression taking that long. There's an option somewhere to enable parallel xz compression if you have idle cores. IIRC the result will be slightly bigger for the same compression level.
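
(I believe it's the parallel-compression option on the binary cache store, settable via the store URI; a sketch, assuming an S3 cache with a placeholder name:)

nix copy --to 's3://my-cache?parallel-compression=true' /nix/store/<hash>-some-path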

Anyway, if someone tackled the API spelunking, would it be welcome? Or is there a reason it would have problems or be a bad idea?

EDIT: oops, I think we already use the stream thing, although at a glance it looks like we pull it all into a string; that seems resolvable. Anyway, fetching from S3 is probably not as important.

lheckemann commented 6 years ago

As far as I can tell, the fixes in 2.0.1 still don't really fix the issue.

edolstra commented 6 years ago

@lheckemann IIRC we didn't cherry-pick any memory improvements in 2.0.1. You need master for some of the fixes or my experimental branch for the rest.

lheckemann commented 6 years ago

Oh, that would explain it! Any chance they could be included in a 2.0.2 release? There have been so many complaints about this issue on IRC and I've run into it myself more times than I would like as well.

SebastianCallh commented 6 years ago

Does "nixops deploy" use this? I get out of memory during deploy, even though I have several gigabytes free (both on disk and working memory) which is odd. Just wondering if this is addressed here or should be investigated further.

coretemp commented 6 years ago

@SebastianCallh you are not specifying which machine runs out of memory, so I assume you don't know that it's the machine you are deploying to. The solution to this is to use 512 MB of swap.
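
(For example, something like this on the target machine, run as root, assuming a filesystem where fallocate works, such as ext4:)

fallocate -l 512M /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile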

Perhaps I might commit some of my changes to fix this in an AWS environment when t2.nanos are being used, but only if there is interest in them from people with commit access.

SebastianCallh commented 6 years ago

@coretemp That was the machine I was referring to. The machine being deployed to has plenty of both disk and working memory to spare when the error occurs.

nh2 commented 6 years ago

@edolstra Can you post a summary here of which store code paths you have already fixed in master, which ones are fixed on your experimental branch (and which branch that is), and which ones are known not to work yet?

That would help a lot to figure out what exactly to test.

nh2 commented 6 years ago

I'm on the latest nix commit 54b1c596435b0aaf3a2557652ad4bf74d5756514, which includes a couple of memory fixes from the last few days that are not yet in 2.0.4. But it doesn't work for me yet:

nixops deploy to a libvirtd VM (which has said latest nix) still fails with error: out of memory even when I give the VM 2 GB of RAM, during the step copying 414 missing paths (5083.46 MiB) to ‘root@192.168.123.41’..., where it copies the paths up via SSH.

I can see in top/ps aux how the memory usage of nix-store --serve --write grows and grows, up to 50%, and then it crashes.

Here is a gdb dump of where it is while the memory is growing:

Thread 1 (Thread 0x7fc827166000 (LWP 917)):
#0  0x00007fc825750b1d in read () from target:/nix/store/2kcrj1ksd2a14bm5sky182fv2xwfhfap-glibc-2.26-131/lib/libpthread.so.0
#1  0x00007fc8262a1269 in nix::FdSource::readUnbuffered(unsigned char*, unsigned long) ()
   from target:/nix/store/s7fqa57f3z7p2wrimir3mz6wybqc0xfq-nix-2.1pre6148_a4aac7f/lib/libnixutil.so
#2  0x00007fc8262a04dd in nix::BufferedSource::read(unsigned char*, unsigned long) () from target:/nix/store/s7fqa57f3z7p2wrimir3mz6wybqc0xfq-nix-2.1pre6148_a4aac7f/lib/libnixutil.so
#3  0x00007fc8265cbc54 in nix::TeeSource::read(unsigned char*, unsigned long) () from target:/nix/store/s7fqa57f3z7p2wrimir3mz6wybqc0xfq-nix-2.1pre6148_a4aac7f/lib/libnixstore.so
#4  0x00007fc8262a0a88 in nix::Source::operator()(unsigned char*, unsigned long) () from target:/nix/store/s7fqa57f3z7p2wrimir3mz6wybqc0xfq-nix-2.1pre6148_a4aac7f/lib/libnixutil.so
#5  0x00007fc82627d294 in nix::parse(nix::ParseSink&, nix::Source&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from target:/nix/store/s7fqa57f3z7p2wrimir3mz6wybqc0xfq-nix-2.1pre6148_a4aac7f/lib/libnixutil.so
#6  0x00007fc82627db20 in nix::parse(nix::ParseSink&, nix::Source&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from target:/nix/store/s7fqa57f3z7p2wrimir3mz6wybqc0xfq-nix-2.1pre6148_a4aac7f/lib/libnixutil.so
#7  0x00007fc82627db20 in nix::parse(nix::ParseSink&, nix::Source&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from target:/nix/store/s7fqa57f3z7p2wrimir3mz6wybqc0xfq-nix-2.1pre6148_a4aac7f/lib/libnixutil.so
#8  0x00007fc82627e683 in nix::parseDump(nix::ParseSink&, nix::Source&) () from target:/nix/store/s7fqa57f3z7p2wrimir3mz6wybqc0xfq-nix-2.1pre6148_a4aac7f/lib/libnixutil.so
#9  0x00007fc8265ca7c8 in nix::Store::importPaths[abi:cxx11](nix::Source&, std::shared_ptr<nix::FSAccessor>, nix::CheckSigsFlag) ()
   from target:/nix/store/s7fqa57f3z7p2wrimir3mz6wybqc0xfq-nix-2.1pre6148_a4aac7f/lib/libnixstore.so
#10 0x000000000041ecec in opServe(std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >) ()
#11 0x000000000041822a in std::_Function_handler<void (), main::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
#12 0x00007fc8268e09c3 in nix::handleExceptions(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void ()>) ()
   from target:/nix/store/s7fqa57f3z7p2wrimir3mz6wybqc0xfq-nix-2.1pre6148_a4aac7f/lib/libnixmain.so
#13 0x000000000040c49a in main ()

Update: for your reading convenience, here is the same backtrace cleaned up:


#0  read ()
#1  nix::FdSource::readUnbuffered       (unsigned char*, unsigned long)
#2  nix::BufferedSource::read           (unsigned char*, unsigned long)
#3  nix::TeeSource::read                (unsigned char*, unsigned long)
#4  nix::Source::operator()             (unsigned char*, unsigned long)
#5  nix::parse                          (nix::ParseSink&, nix::Source&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
#6  nix::parse                          (nix::ParseSink&, nix::Source&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
#7  nix::parse                          (nix::ParseSink&, nix::Source&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
#8  nix::parseDump                      (nix::ParseSink&, nix::Source&)
#9  nix::Store::importPaths [abi:cxx11] (nix::Source&, std::shared_ptr<nix::FSAccessor>, nix::CheckSigsFlag)
#10 opServe                             (std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >)
#11 std::_Function_handler<void         (), main::{lambda()#1}>::_M_invoke(std::_Any_data const&)
#12 nix::handleExceptions               (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void ()>)
#13 main ()

Maybe this can help figure out whether this code path should already have been moved to a streaming approach?

nh2 commented 6 years ago

I think the issue is here: https://github.com/NixOS/nix/blob/54b1c596435b0aaf3a2557652ad4bf74d5756514/src/libstore/export-import.cc#L72-L98

TeeSink is something that writes a copy of all the input data it reads into a std::string data; parseDump(tee, tee.source) does the reading.

Then all that data is added to the store with addToStore(info, tee.source.data, ...), but in between it is all held in memory.
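
An easy way to see this whole-NAR buffering from the outside is to import a large closure and watch the importer's resident memory climb to roughly the size of the NAR stream (a rough sketch; the store path is a placeholder):

nix-store --export $(nix-store -qR /nix/store/<hash>-big-package) > big.closure
nix-store --import < big.closure &
pid=$!
while kill -0 "$pid" 2>/dev/null; do ps -o rss= -p "$pid"; sleep 1; done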

nh2 commented 6 years ago

PR that should fix it for nix-store --import, reducing memory use from 4 GB to 12 MB for a 2 GB cudatoolkit closure: https://github.com/NixOS/nix/pull/2206

With a convenient way to try the patch out: https://github.com/NixOS/nix/pull/2206#issuecomment-394130511

edolstra commented 6 years ago

@nh2 The master branch now has all fixes except https://github.com/edolstra/nix/commit/c94b4fc7ee0c7b322a5f3c7ee784063b47a11d98 because it's controversial.

domenkozar commented 6 years ago

Hopefully we can include this for Nix 2.0.5 :)

edolstra commented 6 years ago

@domenkozar Probably better to do a 2.1 release.

joepie91 commented 6 years ago

@nh2 I can confirm that that patch solved the problem for me (and has not produced any unforeseen issues, as far as I can tell).

EDIT: Whoops, I meant to post this on NixOS/nixpkgs#38808 instead. NixOps is the context in which I'm seeing this problem.

dtzWill commented 6 years ago

Please don't release too many of the recent memory fixes until we've fixed #2203 (apologies if the proposed changes don't depend on the bits that broke nix log for paths built by Hydra). I just don't want us to accidentally end up with a release containing such a regression :).

coretemp commented 6 years ago

edolstra/nix@c94b4fc is only controversial because, in many cases, it raises the cost of cloud resources without a good reason.

If it simply inspected the machine to check how much storage and/or memory is available, it could use that as the default.

Another solution is to just run until it goes out of memory and then retry with the optimization applied automatically. That way it would always work. For people who want to squeeze out the last bit of performance, you could add variables (like those that already exist, but probably with better names) to control this behavior. With this design everyone would be happy. Similarly, you could have flags that optimize for deployment time (e.g. waste more cloud resources to save developer time).

As a guiding principle, I would like it acknowledged that increased cloud resource cost is weighed heavily in implementation decisions.

In general, even if you don't implement exactly one of the suggestions above, it is likely possible to create something non-controversial. The problem with the existing patch is that the variable one can control is an implementation detail, not a high-level policy.

vcunat commented 6 years ago

The OOM condition is rather hard to handle, as it depends on the host OS. Typically it will let you allocate too much and then invoke an OOM killer later, so you don't have the option to react to the condition nicely.

vaibhavsagar commented 6 years ago

Has this been fixed in Nix 2.1?

coretemp commented 6 years ago

Why is this critical issue not being addressed?

nh2 commented 6 years ago

Has this been fixed in Nix 2.1?

@vaibhavsagar I think so.

Why is this critical issue not being addressed?

@coretemp It was addressed in https://github.com/NixOS/nix/commit/2825e05d21ecabc8b8524836baf0b9b05da993c6.

coretemp commented 6 years ago

@nh2 Yes, I figured that out a few days ago, because I thought it wasn't closed for nothing. I think it's sloppy and slightly rude to close an issue without referring to a commit. It is rude because it says to users (who are almost all software developers) "My time as a developer is more important than the time of N typically highly skilled software developers". It does not compute, I can tell you that. The fact that the software is provided for free does not change the economics. If every developer in the project behaved that way, you can see how efficiency would go down. So, there's my proof that it's rude and inefficient behavior.

In case you are wondering, I also considered sharing the same information you shared, but because of the negative atmosphere I chose not to.

I am using the feature, and while I have not tested the failure scenario myself (a from-scratch deployment used to show the problem), it does feel faster.

I don't really like people using negative emoticons. Clearly this was an important issue for everyone, and if no response is given in 12 days to a question (even from someone else), it doesn't seem as if anyone cares.

Additionally, I provided design feedback, which may or may not have been included in the final version (which seems to have taken that into account). As such, I would like to ask everyone to stop talking in negative emoticons. If you have something to say, just say it.

I can share the expression we are using (which compiles a version with these features from source), but I wasn't able to get the overlay version of it working in 5 minutes, which I imagine is what the rest of you are using.

@nh2 Thank you for giving the right example, though.

domenkozar commented 6 years ago

@coretemp please behave with respect and avoid ad hominem attacks, as they do no good to anyone.

Nix is provided for free and comes with zero obligations from developers. If you'd like professional support, I'd recommend contacting some of the consulting companies: https://nixos.org/nixos/support.html

I'm locking this issue as nothing good can come of this. If there's a problem with the recent fix, please open another issue describing it.

domenkozar commented 5 years ago

coretemp has since been banned, so we can unlock.

nh2 commented 5 years ago

I have backported @edolstra's memory fixes to Nix 2.0.4 (because I'm still using that in one place):

https://github.com/NixOS/nix/compare/2.0.4...nh2:nh2-2.0.4-issue-1681-cherry-pick

Note this fixes the case where the machine that's running nixops runs out of memory.

nh2 commented 5 years ago

I think this issue is solved in Nix 2.2, at least for my use cases (given that my RAM problems in nixops disappear with my backport, including #38808).

But it would make sense to ask around among the subscribers to this issue whether anyone has observed further nix copy or nix-copy-closure related memory problems since these commits landed.

If not, we can probably close this.

(There is still #2774, which is relatively recent and reports the problem on 2.2.)

So, does anybody here still have memory problems with current nix?
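
(If you want to check, GNU time reports peak resident memory; a quick sketch, assuming GNU time is installed at /usr/bin/time, with a placeholder destination and path:)

/usr/bin/time -v nix copy --to ssh://somehost /nix/store/<hash>-some-path 2>&1 | grep 'Maximum resident'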

AleXoundOS commented 5 years ago

So, does anybody here still have memory problems with current nix?

I have. I'm the author of #2774, and I have even slowly started writing my own solution to the problem of downloading a binary cache (using a reasonable amount of RAM). Also, here at my work the lack of a ready-to-use mirroring solution is the main issue currently preventing our company from using NixOS, since no internet connection is possible and everything needs to be downloaded beforehand.

lordcirth commented 5 years ago

I found this issue just now. I just ran nix-env -u on an Ubuntu 18.04 system, and it rose to ~2.4 GiB of memory over a few seconds before exiting without changes. Nix 2.2.1, nixos-unstable channel (I ran nix-channel --update just before). While this system has 16 GiB and can handle this, I have machines with 4 GB of RAM that I'd like to install NixOS on.

zimbatm commented 5 years ago

@lordcirth your comment is off topic; this issue is specifically about nix copy taking a lot of memory. The issue with nix-env -u is that it evaluates all of nixpkgs, which also takes a lot of RAM, but for different reasons.

tazjin commented 5 years ago

So, does anybody here still have memory problems with current nix?

Yes, on Nix 2.2.2 I'm still seeing several GB of memory usage when substituting large paths (e.g. GHC) from a cache (as part of a larger build). This is problematic for running Nixery on something like Cloud Run, where memory is hard-capped at 2 GB.

I haven't yet tried this with 2.3 to see if it makes a difference, but it's on the todo-list.

Edit: I won't be able to test this with 2.3 easily, as it no longer works in gVisor even with my SQLite connection patch. Might get around to more advanced debugging during the weekend ...

nagisa commented 4 years ago

I have observed this when copying a locally built output to an HTTP cache:

nix copy --to 'http://localhost:3000' /nix/store/HASH-NAME-v0.1.0 --option narinfo-cache-negative-ttl 0 --option narinfo-cache-positive-ttl 0

and have observed nix copy consuming approximately the same amount of memory as the data being copied. That is, nix copy reported 8 GB for all the outputs it copied, and I saw the nix copy process consume approximately as much.

The memory usage slowly but surely rises towards that number (and never goes down) as nix copy is compressing outputs.

nagisa commented 4 years ago

I think what happens here is that nix copy stores the compressed result in memory and then sends it all out in one go, rather than streaming the data out as it compresses the nar.xz.

EDIT: nix version 2.3.1

stale[bot] commented 3 years ago

I marked this as stale due to inactivity.

AriFordsham commented 3 years ago

Is there a plan to fix this for nix copy?

Ericson2314 commented 3 years ago

I think it is fixed on master.

AriFordsham commented 3 years ago

@Ericson2314 I mean the specific issue of copying --to file://, as documented in #2774. It doesn't seem to be fixed, even on master; I have recorded my measurements there.

nixos-discourse commented 2 years ago

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixosstag.fcio.net/t/code-of-conduct-or-whatever/2134/21

nixos-discourse commented 2 years ago

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2-nov-2021-discourse-migration-to-flying-circus/15752/7

stale[bot] commented 2 years ago

I marked this as stale due to inactivity.