NixOS / amis

Home for NixOS AMI automation
https://nixos.github.io/amis/
MIT License

[BUG] EC2 AMIs do not start user-data provisioning or amazon-init.service #137

Open hjkatz opened 5 months ago

hjkatz commented 5 months ago

While launching 23.11 AMIs in the us-west-2 region, we observe that not every launched instance starts amazon-init.service or sets the key pair properly for the root user.

We are presented with the following:

Warning: Permanently added '35.95.86.153' (ED25519) to the list of known hosts.
(root@35.95.86.153) Password:

This happens ~70% of the time; other times the instance comes up just fine.

AMI ID: ami-0f1471461676aaa00

This also means that we see no logs in the boot logs in the AWS console, which makes debugging very difficult.

Another note: Rebooting the instance a few times seems to get past the problem.

Observed heuristic: if the ssh keys are not set up within ~30s after boot, then the instance never comes up properly at all.

arianvp commented 5 months ago

First of all, could you check if the same problem occurs on 24.05beta? https://nixos.github.io/amis/

23.11 will be EOL in a few days

> This also means that we see no logs in the boot logs in the AWS console, which makes debugging very difficult.

This part I don't understand. amazon-init.service has nothing to do with boot logs in the AWS console. Those are collected directly from the serial console.

However in my experience those are quite eventually consistent. It can take minutes before they populate (even on non-NixOS).

A better way is to connect directly to the serial console, either in the AWS Console or through the aws cli.

arianvp commented 5 months ago

Note that the ssh key is set up by

https://github.com/NixOS/nixpkgs/blob/95bdd7fc6cb47a43ef1716bf4e9beb30c6ca364b/nixos/modules/virtualisation/ec2-data.nix#L21 not amazon-init.service

Both are ordered after but have no strict requirement https://github.com/NixOS/nixpkgs/blob/95bdd7fc6cb47a43ef1716bf4e9beb30c6ca364b/nixos/modules/virtualisation/amazon-image.nix#L72

So I think what might be happening is that https://github.com/NixOS/nixpkgs/blob/95bdd7fc6cb47a43ef1716bf4e9beb30c6ca364b/nixos/modules/virtualisation/ec2-metadata-fetcher.sh is failing for some reason.

arianvp commented 5 months ago

the output of the ec2-metadata-fetcher.sh is posted to the serial console, so it should be available when connecting to the serial console in EC2 at least.

arianvp commented 5 months ago

To get debug logs programmatically (only works for Nitro instances!), this command can be used:

aws ec2 get-console-output --region us-west-2 --latest --instance-id $id

Output of that command on a failing instance would be very useful.

In the meantime I'm gonna spawn 10 instances and see if I can reproduce

aws ec2 run-instances --region us-west-2 --key-name arian@Arians-MacBook-Pro.local --image-id  ami-0f1471461676aaa00 --instance-type t3.nano --security-groups launch-wizard-1  --count 10 

hjkatz commented 5 months ago

> First of all, could you check if the same problem occurs on 24.05beta? nixos.github.io/amis
>
> 23.11 will be EOL in a few days
>
> > This also means that we see no logs in the boot logs in the AWS console, which makes debugging very difficult.
>
> This part I don't understand. amazon-init.service has nothing to do with boot logs in the AWS console. Those are collected directly from the serial console.
>
> However in my experience those are quite eventually consistent. It can take minutes before they populate (even on non-NixOS).
>
> A better way is to connect directly to the serial console, either in the AWS Console or through the aws cli.

Understood, thank you for explaining.

Yes, I'll give it a try on the new 24.05 beta.

FWIW we did do a test where user-data was this script:

#!/usr/bin/env bash
echo hello

on 23.11 and saw the same issue.

I'll try this out on the newer AMI version 24.05 and see how it behaves.

arianvp commented 5 months ago

I spawned 10 instances in a row and none of them reproduced what you're reporting. Can you provide more info about your deployment?

The commands I used:

```bash
#!/bin/sh
id=$(aws ec2 run-instances --region us-west-2 --key-name arian@Arians-MacBook-Pro.local --image-id ami-0f1471461676aaa00 --instance-type t3.nano --security-groups launch-wizard-1 --query 'Instances[0].InstanceId' --output text)
ip=$(aws ec2 describe-instances --region us-west-2 --instance-ids $id --query 'Reservations[0].Instances[0].PublicIpAddress' --output text)
sleep 30
ssh root@$ip
aws ec2 get-console-output --region us-west-2 --latest --instance-id $id
aws ec2 terminate-instances --region us-west-2 --instance-ids $id
```

Did you set a NixOS config in user-data? And what instance type are you using?

hjkatz commented 5 months ago

@arianvp I was able to reproduce with a minimal user-data just echo hello

We're using instance type: c6id.8xlarge in region us-west-2

arianvp commented 5 months ago

Also with that user data I'm not able to reproduce on a t3.nano instance.

id=$(aws ec2 run-instances --region us-west-2 --user-data file://user-data.sh --key-name arian@Arians-MacBook-Pro.local --image-id  ami-0f1471461676aaa00 --instance-type t3.nano --security-groups launch-wizard-1 --query 'Instances[0].InstanceId' --output text)

Last thing I can think of: did you maybe disable IMDSv2 on your instance and use the legacy IMDSv1?

Otherwise it must be something with that specific instance type. Which is kinda odd.

arianvp commented 5 months ago

So yeh please provide the output of the following command

aws ec2 get-console-output --region us-west-2 --latest --instance-id $id

On a machine that didn't get the ssh key set up properly, shortly after booting up the instance.

And share it here.

It should give us a clue what is failing.

hjkatz commented 5 months ago

@arianvp Here you go!

No. We're requiring IMDSv2 via the launch template.

Output: https://gist.github.com/hjkatz/d454f2ccb0f86e39217b9293d69dadf8

Command: aws ec2 get-console-output --profile ngrok-infra --region us-west-2 --latest --instance-id i-04e2b3d87804e3a73 | jq '.Output' -r

arianvp commented 5 months ago

Are you sure this is the log of a server that failed to set up?

Because all three services succeeded in that run:

[  OK  ] Finished fetch-ec2-metadata.service.  (fetches the IMDS metadata)
[  OK  ] Finished Apply EC2 Data.  (apply-ec2-data.service, which applies the ssh keys)
[  OK  ] Finished Reconfigure the system from EC2 userdata on startup.  (amazon-init.service, the user-data run)

The only thing fishy in those logs is that it took all 3 retries to fetch the IMDSv2 Token. Though it did succeed on the last try.

[   15.741638] fetch-ec2-metadata-start[1399]: (attempt 1/3) getting an EC2 instance metadata service v2 token...
[   15.814428] fetch-ec2-metadata-start[1402]: curl: (7) Failed to connect to 169.254.169.254 port 80 after 0 ms: Couldn't connect to server
[   16.820736] fetch-ec2-metadata-start[1399]: (attempt 2/3) getting an EC2 instance metadata service v2 token...
[   16.824142] fetch-ec2-metadata-start[1405]: curl: (7) Failed to connect to 169.254.169.254 port 80 after 0 ms: Couldn't connect to server
[   17.825681] fetch-ec2-metadata-start[1399]: (attempt 3/3) getting an EC2 instance metadata service v2 token...

I'm wondering if the number of retries is for some reason not sufficient on your instance type and we need to bump it up, and that this is what's failing in ~70% of cases. However, it's super weird that the IMDS is so slow to start up.

E.g. on my t3 instance the IMDS call succeeds immediately:

[   20.686667] fetch-ec2-metadata-start[1003]: (attempt 1/3) getting an EC2 instance metadata service v2 token...
[   21.219427] fetch-ec2-metadata-start[1003]: (attempt 1/10) validating the EC2 instance metadata service v2 token...
[   21.225272] fetch-ec2-metadata-start[1003]: getting EC2 instance metadata...

arianvp commented 5 months ago

IMDS not being available 16 seconds into the boot sounds like the kind of thing I'd open an AWS Support ticket for, to be honest. This is extremely puzzling if this is indeed what is failing for you.

hjkatz commented 5 months ago

Agreed that IMDS not being available immediately is odd and a huge smell.

Actually, after pulling those logs I found out that this server did set up successfully and was launched using my echo hello user-data. So I'm trying again.

Here are the new interesting logs: https://gist.github.com/hjkatz/daa4935f532f8947e0ce48892e3955b9

...
[   28.716698] fetch-ec2-metadata-start[1406]: (attempt 9/10) validating the EC2 instance metadata service v2 token...
[   28.720560] fetch-ec2-metadata-start[1506]: curl: (22) The requested URL returned error: 401
[   29.721970] fetch-ec2-metadata-start[1406]: (attempt 10/10) validating the EC2 instance metadata service v2 token...
[   29.725800] fetch-ec2-metadata-start[1513]: curl: (22) The requested URL returned error: 401
[   30.728080] fetch-ec2-metadata-start[1406]: getting EC2 instance metadata...
[   30.731982] fetch-ec2-metadata-start[1515]: curl: (22) The requested URL returned error: 401
[   30.736337] fetch-ec2-metadata-start[1517]: curl: (22) The requested URL returned error: 401
[   30.740249] fetch-ec2-metadata-start[1518]: curl: (22) The requested URL returned error: 401
[   30.743971] fetch-ec2-metadata-start[1519]: curl: (22) The requested URL returned error: 401
[  OK  ] Finished fetch-ec2-metadata.service.
         Starting Apply EC2 Data...
[  OK  ] Finished Apply EC2 Data.
         Starting SSH Daemon...
[  OK  ] Started SSH Daemon.
         Starting Print SSH Host Key...
-----BEGIN SSH HOST KEY FINGERPRINTS-----
256 SHA256:J5vpSDmh8fXONHxYT7dfKOrWC28zWGglO6RjavRsFRU root@ip-172-31-62-239.us-west-2.compute.internal (ED25519)
4096 SHA256:pRyFHB2EoKabHysZKNptZNSyy+B+zkmfXpITaBC1QWU root@ip-172-31-62-239.us-west-2.compute.internal (RSA)
-----END SSH HOST KEY FINGERPRINTS-----
[  OK  ] Finished Print SSH Host Key.
[  OK  ] Reached target Multi-User System.
         Starting Reconfigure the system from EC2 userdata on startup...
[  OK  ] Finished Reconfigure the system from EC2 userdata on startup.

It does appear to be an IMDS issue. I'll try to follow up with AWS support too.

Though I don't think that NixOS should continue booting through this failure. It says it put the keys there, but in reality we still get a password prompt for root.

Also, just for my case: why not add a backoff and retry for longer? I don't see a huge downside personally.

BTW those keys' fingerprints as loaded/printed do not match the key pair we've uploaded and use for our servers.

arianvp commented 5 months ago

> Also, just for my case: why not add a backoff and retry for longer? I don't see a huge downside personally.

Yep that makes sense. The IMDS fetcher is defined here https://github.com/NixOS/nixpkgs/blob/master/nixos/modules/virtualisation/ec2-metadata-fetcher.sh

So you'll need to modify that and then build and upload a new AMI yourself.

(Side-note: I wonder if the script can be simplified by adding a --retry-connrefused to the curl call instead of doing a while loop.)
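As a sketch of that side-note (a hypothetical helper, not the actual nixpkgs fetcher; the flags are standard curl options and the endpoint/token TTL mirror the IMDSv2 flow):

```shell
#!/usr/bin/env bash
# Hypothetical sketch: let curl's built-in retry handling replace the
# fetcher's manual while loop. --retry-connrefused also retries when the
# connection is refused outright, which is what the loop works around.
get_imds_token() {
  curl --silent --show-error --fail \
    --retry 10 --retry-connrefused --connect-timeout 1 \
    -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"
}
```

Without an explicit --retry-delay, curl backs off exponentially between attempts, which would also address the backoff request quoted above.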

> BTW those keys' fingerprints as loaded/printed do not match the key pair we've uploaded and use for our servers.

Those are the host key fingerprints, not the fingerprints of the client keys you upload. The host key is what you use to authenticate the server (not the other way around), and host keys are generated fresh on startup. The idea is that you can retrieve the fingerprints from the console output and compare them with the fingerprint printed the first time you connect to the instance,

e.g. the

$ ssh root@34.222.101.176
The authenticity of host '34.222.101.176 (34.222.101.176)' can't be established.
ED25519 key fingerprint is SHA256:/cDtKeoOXMDi7CdtH6H1OnhvJCTJa7OgjAFGPRaY1xg.

prompt.

hjkatz commented 5 months ago

Beautiful, thank you for the information and pointers to the script. I'll find time this weekend to open a PR.

arianvp commented 5 months ago

> Though I don't think that NixOS should continue booting through this failure. It says it put the keys there, but in reality we still get a password prompt for root.

Yeh, I agree that fetch-ec2-metadata.service should fail if it can't reach the IMDS. And the dependent services should ideally fail too.

arianvp commented 5 months ago

Hey @mswilson, long shot: does IMDS not being reachable until up to 20 seconds into boot on c6id instances ring a bell to you at all?

mswilson commented 5 months ago

That would be unexpected behavior. Are you sure that networking is getting fully initialized?

Matt

hjkatz commented 5 months ago

cc @arianvp, the PR: https://github.com/NixOS/nixpkgs/pull/314427

I would love any advice for building the AMI to test this. This is my first PR for the nixpkgs ecosystem and I do not run a NixOS machine personally.

arianvp commented 5 months ago

> Are you sure that networking is getting fully initialized?

I am pretty sure the network is properly initialized before we attempt to reach IMDS:

[  OK  ] Started DHCP Client.
[  OK  ] Reached target Network is Online.

is printed before the IMDS fetcher script starts, which means we successfully got a DHCP lease.

I'd love to debug this more but the instance type is really expensive. I'll ask the NixOS infra team if I can use their AWS Credits to debug on this instance type. Otherwise a donation to https://github.com/sponsors/arianvp would be very welcome to cover the costs :')

> I would love any advice for building the AMI to test this. This is my first PR for the nixpkgs ecosystem and I do not run a NixOS machine personally.

Could you join https://app.element.io/#/room/#aws:nixos.org ? I think Matrix is a better place to help you out with this. I also made a PR to add some docs on this: https://github.com/NixOS/nixpkgs/pull/314523/files

arianvp commented 5 months ago

I tried to reproduce on your instance type and I can't. IMDS always responds instantly, and I am always able to SSH.

I spawned 15 instances of type c6id.8xlarge and none of them had the issue you're describing.

All of them retrieved the IMDS data on first try like follows:

[   24.148384] fetch-ec2-metadata-start[1466]: (attempt 1/3) getting an EC2 instance metadata service v2 token...
[   24.850824] fetch-ec2-metadata-start[1466]: (attempt 1/10) validating the EC2 instance metadata service v2 token...
[   24.854566] fetch-ec2-metadata-start[1466]: getting EC2 instance metadata...

At this point I really don't know what is going on in your environment. But I cannot reproduce any of the behaviour you're reporting, I'm afraid.

I suggest contacting AWS support, and I am going to close this issue as "cannot reproduce".

The script that I used:

#!/usr/bin/env bash

set -x

instancetype=c6id.8xlarge
export AWS_REGION=us-west-2

id=$(aws ec2 run-instances --user-data file://user-data.sh --key-name arian@Arians-MacBook-Pro.local --image-id  ami-0f1471461676aaa00 --instance-type "$instancetype" --security-groups launch-wizard-1 --query 'Instances[0].InstanceId' --output text)

terminate_instance() {
  aws ec2 terminate-instances --instance-ids "$id"
  sleep 10
  aws ec2 get-console-output --latest --instance-id "$id" --query 'Output' --output text | tee /dev/stderr > "logs/$id"
}

trap terminate_instance EXIT

ip=$(aws ec2 describe-instances --instance-ids "$id" --query 'Reservations[0].Instances[0].PublicIpAddress' --output text)

aws ec2 wait instance-running --instance-id "$id"

until ssh -oStrictHostKeyChecking=accept-new "root@$ip" echo hello; do sleep 5; done

arianvp commented 5 months ago

(Feel free to reopen if you have more info)

arianvp commented 5 months ago

I have successfully been able to reproduce your issue.

It reproduces as soon as you launch an instance in an ipv6 dual-stack subnet.

The reason why it's happening:

dhcpcd reports ready as soon as it receives any lease (ipv6 or ipv4). The instance receives the ipv6 lease first. This causes network-online.target to become ready, which in turn fires off the metadata fetcher. But at this point there is no ipv4 connectivity yet, and the fetcher fails.

I think the easiest and most correct fix here is that we should adjust the script to try both the IPv6 and IPv4 IMDS address.
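A minimal sketch of what that could look like (a hypothetical helper, not the actual patch; fd00:ec2::254 is AWS's documented IPv6 IMDS endpoint, which additionally has to be enabled in the instance's metadata options):

```shell
#!/usr/bin/env bash
# Sketch: request an IMDSv2 token over IPv6 first, then fall back to IPv4,
# so an instance that so far only has an IPv6 lease can still reach IMDS.
fetch_imds_token() {
  local endpoint token
  for endpoint in 'http://[fd00:ec2::254]' 'http://169.254.169.254'; do
    if token=$(curl --silent --fail --connect-timeout 1 -X PUT \
        "$endpoint/latest/api/token" \
        -H 'X-aws-ec2-metadata-token-ttl-seconds: 21600'); then
      echo "$token"
      return 0
    fi
  done
  return 1
}
```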

hjkatz commented 5 months ago

@arianvp Hooray for us!

We were successfully able to avoid the issue by turning off ipv6 IP assignment on our EC2 instances. We launched ~40 of them and all came up without issues.

Would you like to take a stab at the ipv6 adjustments to the metadata script?

hjkatz commented 5 months ago

Check out the PR here: https://github.com/NixOS/nixpkgs/pull/314427

Happy to make adjustments and work back and forth there.

hjkatz commented 5 months ago

Did not mean to close

commiterate commented 1 month ago

Given the complicated nature of amazon-init and some of the other AWS EC2 scripts that exist in Nixpkgs, I'm wondering if we should just deprecate those in favor of the existing cloud-init module and other modules the Amazon Linux team has open sourced over the years. I don't see anything wrong with cloud-init running the following user data script to apply a NixOS config:

#!/bin/sh

# /bin/sh should be set up by the NixOS image to point to bash-interactive in the Nix store.

# Regenerate the hardware configuration?
nixos-generate-config

# Overwrite the existing configuration.
cat > /etc/nixos/configuration.nix << 'END_OF_FILE'
{ config, pkgs, ... }:

{
  imports = [
    ./hardware-configuration.nix
  ];

  nix = {
    settings = {
      experimental-features = [
        "nix-command"
        "flakes"
      ];
    };
  };

  # ...
}
END_OF_FILE

# Rebuild the system.
nixos-rebuild switch

# Maybe create a flake.nix if they would rather a flake configuration instead.

Alternatively, the user-data script could use wget (assuming it's available in the image) or nix-shell --packages awscli2 --command "aws s3 cp s3://... /etc/nixos/configuration.nix" to download a NixOS configuration from an AWS S3 bucket.

That way, the NixOS AMIs provide:

All of these properly wait for IMDS and are maintained by upstream (i.e. cloud-init + AWS) instead of Nixpkgs maintainers.

Here's the Amazon Linux 2023 package list for reference:

https://docs.aws.amazon.com/linux/al2023/ug/image-comparison.html

The only thing missing would be a systemd oneshot unit that looks for an EC2 key pair from IMDS and registers it with the SSH daemon.
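Such a oneshot's body could plausibly be a few lines of shell (a hypothetical sketch, not an existing unit; public-keys/0/openssh-key is the standard IMDS location for the key pair's public key, and the destination path is a parameter purely for illustration):

```shell
#!/usr/bin/env bash
# Hypothetical oneshot body: fetch the instance's key-pair public key via
# IMDSv2 and append it to an authorized_keys file.
install_ec2_pubkey() {
  local dest=$1 token
  token=$(curl --silent --fail --retry 3 -X PUT \
    http://169.254.169.254/latest/api/token \
    -H 'X-aws-ec2-metadata-token-ttl-seconds: 300') || return 1
  curl --silent --fail -H "X-aws-ec2-metadata-token: $token" \
    http://169.254.169.254/latest/meta-data/public-keys/0/openssh-key \
    >> "$dest"
}

# Usage (as root): install_ec2_pubkey /root/.ssh/authorized_keys
```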

Personally, I don't really view that as high priority, given that users can use SSM Session Manager with the SSM agent installed (it also needs an SSM IAM role on the instance profile, but AWS provides a managed policy for that), or EC2 Instance Connect to push an ephemeral SSH key through IAM-protected APIs.

Long-lived credentials like EC2 key pairs are bad practice to begin with.

The only other optional things I can think of are:

There's also the AWS CodeDeploy Agent, though the deploy pattern for that with mutable environments feels like something most NixOS users won't use.

Instead, I imagine NixOS users are more likely to bake immutable AMIs ahead of time and then use EC2 Instance Refresh to replace ASG instances. I have a separate feature request to the AWS CloudFormation team to allow triggering that feature natively with a CloudFormation stack update.

https://github.com/aws-cloudformation/cloudformation-coverage-roadmap/issues/2119

To support setups that bake a NixOS disk image and upload those to S3 for EC2's ImportImage or ImportSnapshot + RegisterSnapshot workflows, I have another feature request to Image Builder for both CloudFormation support and expanding ImportImage to support AArch64 (also expand EC2 Image Builder's ImportVmImage to encompass it all so we get an ARN-able AMI that can be targeted with Image Builder lifecycle rules).

https://github.com/aws/ec2-image-builder-roadmap/issues/103

commiterate commented 1 month ago

On a separate note for the IPv4 and IPv6 lease race condition, it looks like systemd-networkd has some options around that.

https://github.com/systemd/systemd/issues/16700

https://search.nixos.org/options?channel=unstable&show=systemd.network.wait-online.anyInterface

I think even in IPv6-only subnets, IMDS is always available over IPv4?

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-options.html

> When the IPv6 endpoint is enabled, the IPv4 endpoint remains enabled.

https://docs.aws.amazon.com/vpc/latest/userguide/configure-subnets.html#subnet-ip-address-range

> Resources in IPv6-only subnets are assigned IPv4 link-local addresses from CIDR block 169.254.0.0/16. These addresses are used to communicate with VPC services like Instance Metadata Service (IMDS).

arianvp commented 1 month ago

Ah nice!

We just have to configure networkd to wait on both ipv6 and ipv4 then, I think.

> Resources in IPv6-only subnets are assigned IPv4 link-local addresses from CIDR block 169.254.0.0/16.
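For example, a NixOS fragment along these lines might express that (a sketch assuming networkd replaces dhcpcd; RequiredFamilyForOnline is the systemd.network [Link] option added for exactly this situation, and the interface match pattern is an assumption):

```nix
{
  networking.useNetworkd = true;
  systemd.network.networks."40-ethernet" = {
    matchConfig.Name = "en*";
    networkConfig.DHCP = "yes";
    # Only consider the link online once it has an IPv4 address, so
    # network-online.target doesn't fire on the IPv6 lease alone.
    linkConfig.RequiredFamilyForOnline = "ipv4";
  };
}
```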