SaltyKitkat opened this issue 1 year ago
We are aware that Nix evaluation tends to consume significant amounts of memory. Causes and potential causes I'm aware of:
I want to add that the Boehm garbage collector is a conservative collector, which means it does not allow heap compaction.
I was hoping to spark some interest in assessing the mark-region algorithm as a possible new garbage collection algorithm for Nix, because it allows for heap compaction. There are existing implementations in Rust (immix) and C (whippet). In particular, the whippet implementation seems relevant to Nix because it has zero dependencies and a Boehm-compatible API.
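To make the mark-region idea concrete, here is a toy Python sketch of region-level reclamation in the spirit of immix: marking records which heap "lines" live objects touch, and sweeping reclaims whole blocks whose lines are all unmarked, while counting the free lines of partially full blocks as recyclable for future bump allocation. All names and sizes here are illustrative, not whippet's or Nix's actual implementation.

```python
# Toy sketch of mark-region reclamation (in the spirit of immix).
# The heap is divided into blocks, each block into fixed-size lines;
# this models only the line-mark bookkeeping, not real allocation.
LINES_PER_BLOCK = 4

class Heap:
    def __init__(self, num_blocks):
        # line_marks[b][l] is True if a live object touches line l of block b
        self.line_marks = [[False] * LINES_PER_BLOCK for _ in range(num_blocks)]

    def mark_object(self, block, first_line, num_lines):
        # Marking a live object records the lines it occupies.
        for line in range(first_line, first_line + num_lines):
            self.line_marks[block][line] = True

    def sweep(self):
        # Region-level sweep: blocks with no marked lines are freed wholesale;
        # partially marked blocks contribute their unmarked lines for reuse.
        free_blocks, recyclable_lines = [], 0
        for b, lines in enumerate(self.line_marks):
            if not any(lines):
                free_blocks.append(b)
            else:
                recyclable_lines += lines.count(False)
        return free_blocks, recyclable_lines

heap = Heap(num_blocks=3)
heap.mark_object(block=0, first_line=0, num_lines=2)  # one live object
free, recyclable = heap.sweep()
# blocks 1 and 2 are entirely unmarked -> reclaimed wholesale;
# block 0 keeps 2 free lines for future bump allocation
```

The point of sweeping at block/line granularity rather than per object is what gives the collector cheap reclamation while still permitting compaction (by evacuating the few live objects out of mostly-empty blocks), which a conservative collector like Boehm cannot do.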
@jsoo1 Interesting! Would you be interested in giving whippet a try? I've added notes about gc.
@roberth sweet! Yes I would be interested! I was planning on setting aside some time for it if there seemed to be interest from the team.
Let's move the discussion of replacing the GC over to https://github.com/NixOS/nix/issues/8626
Thanks for the summary!
Since there are already memory leaks, I'm wondering whether the GC is working as expected; improving the GC may make no sense if most of the memory usage comes from leaked memory.
I don't expect the GC itself to be broken, and I don't expect many leaks from it being conservative either. It manages to collect an amount about equal to the final heap size in a typical evaluation by ofborg (i.e. half of all allocations are collected). It is hard to know how much it *should* be able to collect, though. So that makes your question a good one, which could perhaps be answered with a combination of profiling and debugging, although we might need custom tooling to really start relating expressions to the heap and GC.
I ran into this while upgrading from NixOS 23.05 to 23.11 on my cloud VM with 2G of RAM. nix-build itself took 1G of that, and also there were some server services running, taking up about 500M, leaving only 500M for the actual derivation builds. Naturally it OOM'd kind of a lot.
I worked around that by taking the derivation file paths from the `these NNN derivations will be built:` output, pasting them into a file, and running `xargs -n1 nix-build < derivations.txt`. Not sure if the `-n1` also helped, but it feels like some gains could be had here by separating the two phases. I will happily be corrected if I'm working off incorrect assumptions, but it appears to me that the memory usage of nix-build is all related to Nix expressions, which at this point in the build process are entirely unneeded, since all the required information exists in the `.drv` files. Maybe the Nix expression evaluation could happen in a separate process that then terminates before nix-build moves on to building the derivations, or the Nix expressions could be allocated in an arena that is freed all at once after evaluation is done, or something like that?
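The separate-evaluator-process idea can be sketched in Python. This is only an illustration of the principle, not Nix's code, and the `.drv` store paths are made up: the child process does the memory-hungry "evaluation", hands back nothing but the small list of derivation paths, and exits, so its entire heap is returned to the OS before any "building" begins in the parent.

```python
import json
import subprocess
import sys

# Stand-in for the evaluator. In real Nix this would be expression
# evaluation; here a large scratch allocation models the transient eval
# heap, and the hypothetical .drv paths are the only thing we keep.
CHILD = r"""
import json, sys
scratch = [object() for _ in range(100_000)]  # big transient "eval" heap
drv_paths = ["/nix/store/aaaa-hello.drv", "/nix/store/bbbb-world.drv"]
json.dump(drv_paths, sys.stdout)  # only the small result leaves the process
"""

def evaluate_in_child():
    # Run the "evaluation" in a child process and return the .drv paths.
    # When the child exits, all memory it allocated goes back to the OS,
    # before the (hypothetical) build phase starts in the parent.
    out = subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)
```

This mirrors what `nix-instantiate` followed by `nix-store --realise` already gives you at the CLI level: evaluation and building happen in separate processes, so the evaluator's memory is fully released before builds start.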
That would not solve the original problem, and looking into a different GC still sounds valuable, but it might make the problem less acute for a portion of affected users.
Regarding freeing the expressions, a starting point would be https://github.com/NixOS/nix/pull/5747#issuecomment-1615939700, but also making sure to destruct `EvalState` and the expression cache.
If you have really small machines to deploy to, you might want to use `nixos-rebuild --target-host`. That will neither build nor evaluate on the target machine.
`nixos-rebuild --target-host` is a good hint and I will take it under consideration. But for what it's worth, that does not solve OOM during auto-upgrades as triggered by `system.autoUpgrade.enable = true;`, as far as I can see.
CC @astro FYI
While learning Nix and nix flakes, this command froze my dear and at that point mostly idle 16 GB laptop, eating >10 GB:

`nix flake show microvm`
shortened output:
```
github:astro/microvm.nix/7bd9255e535c8cbada7f574ddd3bcf3bfa5e1eae
├───apps
│   ├───aarch64-linux
│   │   ├───graphics: app
│   │   ├───qemu-vnc: app
│   │   ├───vm: app
│   │   └───waypipe-client: app
│   └───x86_64-linux
│       ├───graphics: app
│       ├───qemu-vnc: app
│       ├───vm: app
│       └───waypipe-client: app
├───defaultTemplate: template: Flake with MicroVMs
├───hydraJobs
│   ├───aarch64-linux
│   │   ├───cloud-hypervisor-overlay-shutdown-command: derivation 'microvm-test-shutdown-command'
[...SNIP...]
│   │   └───vm-stratovirt-iperf: derivation 'vm-stratovirt-iperf'
error: interrupted by the user
nix flake show microvm  58,38s user 4,46s system 92% cpu 1:07,85 total
```
The output is actually from a run after I found https://github.com/rfjakob/earlyoom. You might want to recommend this nice tool somewhere!
Please don't let me sidetrack this issue. I just thought it might be interesting to mention earlyoom here and to give an example of how to reliably eat a lot of memory.
https://github.com/NixOS/rfcs/pull/163 may reduce memory use for NixOS, by virtue of not having to load service modules that aren't used.
It's one solution among potentially others, such as #9650 for cases like `show microvm`.
Any memory usage improvements are very welcome. My CI runner with 16 GB RAM now also occasionally triggers the OOM killer when evaluating my NixOS configurations.
I seem to be encountering this too. A `nix flake show` in the `microvm` repo consumed a whopping ~24 GB of RAM.
I am encountering this as well. nixpkgs-review, while evaluating, sometimes fills up my whole RAM (16 GiB), whereas the usage before is around 5 GiB, smh.
Just evaluating my NixOS profile takes about 1 GB of RAM. That's kind of too much for me. And when running something like `nixpkgs-review`, Nix (the `nix-env` run by `nixpkgs-review`) will just take more and more and more RAM. Is this by design? Or is there any way I can reduce the memory usage?