NixOS / nixpkgs

Nix Packages collection & NixOS
MIT License

Amazon-init fails due to OOM - t3a.nano ec2 instance with 512 MB RAM #119760

Open fiksn opened 3 years ago

fiksn commented 3 years ago

Describe the bug

I am trying to use amazon-init with t3a.nano AWS EC2 instances. The problem is that the nixos-rebuild switch invoked via amazon-init.nix gets killed by the OOM killer. In user-data I have a configuration.nix file that specifies:

swapDevices = [{ device = "/swapfile"; size = 1024; }];

but there is no way for this to get applied, because the rebuild that would activate the swap is the very step that runs out of memory. Of course I could manually connect to the instance and do something like:

fallocate -l 1G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile

and afterwards invoke nixos-rebuild switch. However, I use an auto-scaling group that automatically provisions a bunch of instances, so manual intervention is not really feasible. Another workaround is probably to use an AMI that I'd build manually (which somehow enables swap first), but I'd rather use the "official" NixOS one.

To Reproduce

Deploy AWS EC2 t3a.nano instance and specify some configuration.nix in user-data. Inspect machine to see that no change was applied. In dmesg you can see a notice about nix-build getting killed.
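
One way to confirm the OOM kill from a shell on the instance is to filter the kernel log. The following is only a minimal sketch: the helper name oom_victims is made up for illustration, and it assumes the kernel reports the kill on a line containing "Killed process <pid> (<name>)", which is the usual wording but may differ between kernel versions.

```shell
# oom_victims: read kernel log text on stdin and print the name of every
# process the OOM killer reported killing, one name per line.
# Assumption: kill lines contain "Killed process <pid> (<name>)".
oom_victims() {
  grep -oE 'Killed process [0-9]+ \([^)]+\)' \
    | sed -E 's/^Killed process [0-9]+ \((.*)\)$/\1/'
}
```

Typical use would be `dmesg | oom_victims`, run as a user who is allowed to read the kernel log.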

Same issue probably occurs in other memory constrained environments.

Should not really be relevant but I use Terraform for provisioning and https://github.com/tweag/terraform-nixos/tree/master/aws_image_nixos to get AMI ids.

Expected behavior

amazon-init should preferably check whether available memory is <= 512 MB and automatically create some temporary swap space, just enough to make nixos-rebuild switch go through. Possibly it could check that no swap is active already and, if that is the case, create a temporary named 1 GB file inside /tmp and remove it when the script terminates.
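
The behaviour described above could be sketched roughly as follows. This is not the amazon-init implementation, just an illustration: the function names (meminfo_kb, needs_temp_swap, with_temp_swap) and the /tmp/amazon-init-swap.XXXXXX file name are hypothetical, the 512 MB threshold and 1 GB size are taken from the description, and it assumes /tmp is on a disk-backed filesystem that supports swap files created with fallocate.

```shell
# meminfo_kb FIELD [FILE]: print the value (in kB) of a /proc/meminfo field.
meminfo_kb() {
  local field=$1 file=${2:-/proc/meminfo}
  awk -v f="$field:" '$1 == f { print $2 }' "$file"
}

# needs_temp_swap [FILE]: succeed when total memory is <= 512 MB
# and no swap is configured yet.
needs_temp_swap() {
  local file=${1:-/proc/meminfo} limit_kb=$((512 * 1024)) mem swap
  mem=$(meminfo_kb MemTotal "$file")
  swap=$(meminfo_kb SwapTotal "$file")
  [ "${mem:-0}" -le "$limit_kb" ] && [ "${swap:-0}" -eq 0 ]
}

# with_temp_swap CMD...: run CMD with a temporary 1 GB swap file enabled,
# then disable and delete the file again. Needs root.
# Assumption: /tmp is not a tmpfs, since swap files cannot live there.
with_temp_swap() {
  local swapfile status=0
  swapfile=$(mktemp /tmp/amazon-init-swap.XXXXXX) || return 1
  chmod 600 "$swapfile"
  fallocate -l 1G "$swapfile" && mkswap "$swapfile" >/dev/null && swapon "$swapfile"
  "$@" || status=$?
  swapoff "$swapfile" 2>/dev/null
  rm -f "$swapfile"
  return "$status"
}
```

The call site would then be something like `if needs_temp_swap; then with_temp_swap nixos-rebuild switch; else nixos-rebuild switch; fi`. If setting up the swap file fails, the command is still attempted without it.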

Notify maintainers

? @urbas

Metadata

Please run nix-shell -p nix-info --run "nix-info -m" and paste the result.

 - system: `"x86_64-linux"`
 - host os: `Linux 5.4.74, NixOS, 20.09.1632.a6a3a368dda (Nightingale)`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.4pre20201102_550e11f`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`

urbas commented 3 years ago

I have a similar use case and have hit the same problem.

I am currently working on a slightly different solution though. I'm working on extending the amazon-init module so that it'll also support shell scripts in user-data.

My current idea is to check whether user-data starts with a shebang (#!) line. If so, amazon-init would just run that script; if not, it'll fall back to the old behaviour.
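
As a rough illustration of that dispatch (not the code from the prototype), the check could look like the sketch below. The function names user_data_kind and run_user_data are hypothetical, and the fallback command passed to run_user_data stands in for whatever the existing configuration.nix code path does.

```shell
# user_data_kind FILE: print "script" if FILE starts with "#!", otherwise
# "nix-config" (the old behaviour of treating it as a NixOS configuration).
user_data_kind() {
  if [ "$(head -c 2 "$1")" = '#!' ]; then
    echo script
  else
    echo nix-config
  fi
}

# run_user_data FILE FALLBACK...: execute FILE if it is a script, otherwise
# pass it to the caller-supplied FALLBACK command (placeholder for the old path).
run_user_data() {
  local file=$1
  shift
  case $(user_data_kind "$file") in
    script) chmod +x "$file" && "$file" ;;
    *)      "$@" "$file" ;;
  esac
}
```

Only the first two bytes are inspected, so user-data that happens to begin with a Nix comment such as ### is still treated as a configuration.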

You can create the swapfile in that script.

Edit: here's the current prototype: https://github.com/urbas/nixpkgs/commit/db5b547b2542d01661ad602b437d88e3c75a8606

fiksn commented 3 years ago

Yeah, that is a good way too. What worries me is just too much complexity in the user-data file. The triple ### is already a hack, but I understand there is no other easy alternative. If I got Nix to execute, it would in principle be possible to execute arbitrary code anyway. The other way around (a shell script that writes configuration.nix, I suppose?) will be more brittle, albeit more generic.

I know this wasn't brought up yet, but just in case somebody mentions it: I'd be more than happy to use t3a.micro or better, but unfortunately my use-case requires (a lot of) nano instances. Using bigger machines will make it cost prohibitive very quickly. (Besides I feel this is a bug since NixOS AMI should be working with any kind of instance type in theory.)

fiksn commented 3 years ago

What I'd do is something like this https://github.com/fiksn/nixpkgs/commit/7cac000e2dddc0e835b5054d9f540c1bc8fa45f8. (Open for improvement suggestions). I feel there should be some test for this, but not sure how to even approach it. Will try to build an AMI using nix-build . -A config.system.build.amazonImage --argstr system x86_64-linux.

Edit: I had some trouble with this. nix-build was segfaulting on my machine, but I was able to work around it with GC_DONT_GC=1, as suggested in https://github.com/NixOS/nix/issues/4246#issuecomment-759300553

urbas commented 3 years ago

I would vote for the more generic solution (supporting a shell script). It happens to also solve my problem, where I use NixOS configuration sources from S3. Here's the user-data I use and would like support for:

#!/usr/bin/env bash

# prepare swap
fallocate -l 1G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile

# apply nixos configuration sourced from S3
# (the ${aws_s3_bucket_object...} references are interpolated by Terraform)
outDir=/var/lib/nixos-config/
mkdir -p "$outDir"
aws s3 cp "s3://${aws_s3_bucket_object.nixos-config-src.bucket}/${aws_s3_bucket_object.nixos-config-src.key}" - | tar -C "$outDir" -xzf -
NIX_PATH=nixos-config=$outDir nixos-rebuild switch

fiksn commented 3 years ago

I don't think the two approaches are mutually exclusive. It's great to have support for shell scripts. But I wouldn't overload somebody who just wants to provision a t3a.nano AWS NixOS instance with "hey, you need to make sure to enable swap since nix needs a lot of memory". It should just work as advertised or people will switch to something different than NixOS. And yes, I believe most people should be quite comfortable writing such a shell script on their own. This is just about the UX. So I'd say your change offers a workaround that I am willing to try out. But it is not a solution to the problem I have mentioned in general.

Of course this illusion breaks down as soon as the system is installed and nixos-rebuild switch fails the next time. But by then it is relatively easy to fix, and more crucially you already have the users committed to NixOS. (Ideally the nix utilities should somehow take care to work on memory-constrained systems too, but I can imagine that they shouldn't really be doing swapon :) ).

I found a place to put my test, https://github.com/NixOS/nixpkgs/blob/master/nixos/tests/ec2.nix, but it seems I can't even run nixos-build ec2.nix. On my machine it fails with "ERROR: cptofs failed. diskSize might be too small for closure". I have around 20 GB of disk space free, and from digging around it seems that qemu uses a 10 GB disk. I know this should probably be a separate issue, but it is related to writing a unit test for the changes here, so it'd be great if you could help me out or at least confirm that nixos-build ec2.nix works for you.

BTW, why do you put the configuration from S3 into /var/lib/nixos-config rather than /etc/nixos?