awslabs / amazon-eks-ami

Packer configuration for building a custom EKS AMI
https://awslabs.github.io/amazon-eks-ami/
MIT No Attribution
2.41k stars 1.14k forks source link

Slow startup of AL2023 AMI #1751

Open stijndehaes opened 4 months ago

stijndehaes commented 4 months ago

After trying the AL2023 AMI I noticed startup is way slower compared to the AL2 image. After some help from @ndbaker1 I made the following analysis using systemd-analyze.

I added a plot using systemd-analyze, the nodeadm component appears to be in the hoth path. With nodeadm-config.service taking 20s before it allows cloud-init.service (7.2s) to start, and then nodeadm-run.service takes 10.53s, I think it is waiting for containerd-service (8.7s) to start though.

Looking at the logs of nodeadm-config, it appears to take a long time to start before the first log(5s), but also takes 13s between starting to configure kubelet and getting the kubelet version.

Looking at the logs of nodeadm-run, most time apears to be spend looking up the sandbox image. I noticed in code this is done using containerd config dump and a regex, for some reason this takes a long time.

Not sure what can be done to optimize these steps, but I am willing to help :) I can write golang code and make custom AMI's with custom nodeadm if needed to test things out.

plot

nodeadm-config logs, via journalctl -u nodeadm-config

Apr 04 08:56:49 localhost systemd[1]: Starting nodeadm-config.service - EKS Nodeadm Config...
Apr 04 08:56:54 localhost nodeadm[1693]: {"level":"info","ts":1712221014.1338603,"caller":"init/init.go:49","msg":"Checking user is root.."}
Apr 04 08:56:54 localhost nodeadm[1693]: {"level":"info","ts":1712221014.1339903,"caller":"init/init.go:57","msg":"Loading configuration..","configSource":"imds://user-data"}
Apr 04 08:56:54 localhost nodeadm[1693]: {"level":"info","ts":1712221014.2056038,"caller":"init/init.go:66","msg":"Loaded configuration","config":{"metadata":{"creationTimestamp":null},"spec":{"cluster":{"name":"datafy-dp-eks-dev-sdh","apiServerEndpoin>
Apr 04 08:56:54 localhost nodeadm[1693]: {"level":"info","ts":1712221014.2059212,"caller":"init/init.go:68","msg":"Enriching configuration.."}
Apr 04 08:56:54 localhost nodeadm[1693]: {"level":"info","ts":1712221014.2059379,"caller":"init/init.go:148","msg":"Fetching instance details.."}
Apr 04 08:56:54 localhost nodeadm[1693]: SDK 2024/04/04 08:56:54 DEBUG attempting waiter request, attempt count: 1
Apr 04 08:56:55 localhost nodeadm[1693]: {"level":"info","ts":1712221015.7738488,"caller":"init/init.go:161","msg":"Instance details populated","details":{"id":"i-048abcd8a00795a14","region":"eu-west-1","type":"m5.xlarge","availabilityZone":"eu-west-1c>
Apr 04 08:56:55 localhost nodeadm[1693]: {"level":"info","ts":1712221015.7739344,"caller":"init/init.go:162","msg":"Fetching default options..."}
Apr 04 08:56:55 localhost nodeadm[1693]: {"level":"info","ts":1712221015.7748718,"caller":"init/init.go:170","msg":"Default options populated","defaults":{"sandboxImage":"602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.5"}}
Apr 04 08:56:55 localhost nodeadm[1693]: {"level":"info","ts":1712221015.7749023,"caller":"init/init.go:73","msg":"Validating configuration.."}
Apr 04 08:56:55 localhost nodeadm[1693]: {"level":"info","ts":1712221015.7749126,"caller":"init/init.go:78","msg":"Creating daemon manager.."}
Apr 04 08:56:55 localhost nodeadm[1693]: {"level":"info","ts":1712221015.7760847,"caller":"init/init.go:96","msg":"Configuring daemons..."}
Apr 04 08:56:55 localhost nodeadm[1693]: {"level":"info","ts":1712221015.7761178,"caller":"init/init.go:103","msg":"Configuring daemon...","name":"containerd"}
Apr 04 08:56:55 localhost nodeadm[1693]: {"level":"info","ts":1712221015.7761524,"caller":"containerd/config.go:38","msg":"Writing containerd config to file..","path":"/etc/containerd/config.toml"}
Apr 04 08:56:55 localhost nodeadm[1693]: {"level":"info","ts":1712221015.894721,"caller":"init/init.go:107","msg":"Configured daemon","name":"containerd"}
Apr 04 08:56:55 localhost nodeadm[1693]: {"level":"info","ts":1712221015.8947728,"caller":"init/init.go:103","msg":"Configuring daemon...","name":"kubelet"}
Apr 04 08:57:08 localhost nodeadm[1693]: {"level":"info","ts":1712221028.7341733,"caller":"kubelet/config.go:300","msg":"Detected kubelet version","version":"v1.28.5"}
Apr 04 08:57:08 localhost nodeadm[1693]: {"level":"info","ts":1712221028.7360287,"caller":"kubelet/config.go:218","msg":"Setup IP for node","ip":"10.0.3.248"}
Apr 04 08:57:08 localhost nodeadm[1693]: {"level":"info","ts":1712221028.7370236,"caller":"kubelet/config.go:351","msg":"Writing kubelet config to file..","path":"/etc/kubernetes/kubelet/config.json"}
Apr 04 08:57:09 localhost nodeadm[1693]: {"level":"info","ts":1712221029.3084157,"caller":"init/init.go:107","msg":"Configured daemon","name":"kubelet"}
Apr 04 08:57:09 localhost systemd[1]: nodeadm-config.service: Deactivated successfully.
Apr 04 08:57:09 localhost systemd[1]: Finished nodeadm-config.service - EKS Nodeadm Config.

nodeadm-run logs, via journalctl -u nodeadmrun:

Apr 04 08:57:17 ip-10-0-3-248.eu-west-1.compute.internal systemd[1]: Starting nodeadm-run.service - EKS Nodeadm Run...
Apr 04 08:57:17 ip-10-0-3-248.eu-west-1.compute.internal nodeadm[1918]: {"level":"info","ts":1712221037.5944886,"caller":"init/init.go:49","msg":"Checking user is root.."}
Apr 04 08:57:17 ip-10-0-3-248.eu-west-1.compute.internal nodeadm[1918]: {"level":"info","ts":1712221037.5945692,"caller":"init/init.go:57","msg":"Loading configuration..","configSource":"imds://user-data"}
Apr 04 08:57:17 ip-10-0-3-248.eu-west-1.compute.internal nodeadm[1918]: {"level":"info","ts":1712221037.5981271,"caller":"init/init.go:66","msg":"Loaded configuration","config":{"metadata":{"creationTimestamp":null},"spec":{"cluster":{"name":"datafy-dp>
Apr 04 08:57:17 ip-10-0-3-248.eu-west-1.compute.internal nodeadm[1918]: {"level":"info","ts":1712221037.5983663,"caller":"init/init.go:68","msg":"Enriching configuration.."}
Apr 04 08:57:17 ip-10-0-3-248.eu-west-1.compute.internal nodeadm[1918]: {"level":"info","ts":1712221037.5983822,"caller":"init/init.go:148","msg":"Fetching instance details.."}
Apr 04 08:57:17 ip-10-0-3-248.eu-west-1.compute.internal nodeadm[1918]: SDK 2024/04/04 08:57:17 DEBUG attempting waiter request, attempt count: 1
Apr 04 08:57:17 ip-10-0-3-248.eu-west-1.compute.internal nodeadm[1918]: {"level":"info","ts":1712221037.72048,"caller":"init/init.go:161","msg":"Instance details populated","details":{"id":"i-048abcd8a00795a14","region":"eu-west-1","type":"m5.xlarge",">
Apr 04 08:57:17 ip-10-0-3-248.eu-west-1.compute.internal nodeadm[1918]: {"level":"info","ts":1712221037.7205393,"caller":"init/init.go:162","msg":"Fetching default options..."}
Apr 04 08:57:17 ip-10-0-3-248.eu-west-1.compute.internal nodeadm[1918]: {"level":"info","ts":1712221037.7216415,"caller":"init/init.go:170","msg":"Default options populated","defaults":{"sandboxImage":"602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/p>
Apr 04 08:57:17 ip-10-0-3-248.eu-west-1.compute.internal nodeadm[1918]: {"level":"info","ts":1712221037.721685,"caller":"init/init.go:73","msg":"Validating configuration.."}
Apr 04 08:57:17 ip-10-0-3-248.eu-west-1.compute.internal nodeadm[1918]: {"level":"info","ts":1712221037.7216954,"caller":"init/init.go:78","msg":"Creating daemon manager.."}
Apr 04 08:57:17 ip-10-0-3-248.eu-west-1.compute.internal nodeadm[1918]: {"level":"info","ts":1712221037.7227976,"caller":"init/init.go:112","msg":"Setting up system aspects..."}
Apr 04 08:57:17 ip-10-0-3-248.eu-west-1.compute.internal nodeadm[1918]: {"level":"info","ts":1712221037.7228348,"caller":"init/init.go:115","msg":"Setting up system aspect..","name":"local-disk"}
Apr 04 08:57:17 ip-10-0-3-248.eu-west-1.compute.internal nodeadm[1918]: {"level":"info","ts":1712221037.7228491,"caller":"system/local_disk.go:26","msg":"Not configuring local disks!"}
Apr 04 08:57:17 ip-10-0-3-248.eu-west-1.compute.internal nodeadm[1918]: {"level":"info","ts":1712221037.7228668,"caller":"init/init.go:119","msg":"Set up system aspect","name":"local-disk"}
Apr 04 08:57:17 ip-10-0-3-248.eu-west-1.compute.internal nodeadm[1918]: {"level":"info","ts":1712221037.722877,"caller":"init/init.go:115","msg":"Setting up system aspect..","name":"networking"}
Apr 04 08:57:17 ip-10-0-3-248.eu-west-1.compute.internal nodeadm[1918]: {"level":"info","ts":1712221037.722903,"caller":"system/networking.go:79","msg":"writing eks_primary_eni_only network configuration"}
Apr 04 08:57:17 ip-10-0-3-248.eu-west-1.compute.internal nodeadm[1918]: {"level":"info","ts":1712221037.7329752,"caller":"init/init.go:119","msg":"Set up system aspect","name":"networking"}
Apr 04 08:57:17 ip-10-0-3-248.eu-west-1.compute.internal nodeadm[1918]: {"level":"info","ts":1712221037.7330172,"caller":"init/init.go:128","msg":"Ensuring daemon is running..","name":"containerd"}
Apr 04 08:57:17 ip-10-0-3-248.eu-west-1.compute.internal nodeadm[1918]: {"level":"info","ts":1712221037.7368443,"caller":"init/init.go:132","msg":"Daemon is running","name":"containerd"}
Apr 04 08:57:17 ip-10-0-3-248.eu-west-1.compute.internal nodeadm[1918]: {"level":"info","ts":1712221037.7369227,"caller":"init/init.go:134","msg":"Running post-launch tasks..","name":"containerd"}
Apr 04 08:57:17 ip-10-0-3-248.eu-west-1.compute.internal nodeadm[1918]: {"level":"info","ts":1712221037.7369375,"caller":"containerd/sandbox.go:20","msg":"Looking up current sandbox image in containerd config.."}
Apr 04 08:57:25 ip-10-0-3-248.eu-west-1.compute.internal nodeadm[1918]: {"level":"info","ts":1712221045.1754317,"caller":"containerd/sandbox.go:32","msg":"Found sandbox image","image":"602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.5"}
Apr 04 08:57:25 ip-10-0-3-248.eu-west-1.compute.internal nodeadm[1918]: {"level":"info","ts":1712221045.1754868,"caller":"containerd/sandbox.go:34","msg":"Fetching ECR authorization token.."}
Apr 04 08:57:25 ip-10-0-3-248.eu-west-1.compute.internal nodeadm[1918]: {"level":"info","ts":1712221045.2011657,"caller":"containerd/sandbox.go:48","msg":"Pulling sandbox image..","image":"602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.5"}
Apr 04 08:57:25 ip-10-0-3-248.eu-west-1.compute.internal nodeadm[1918]: E0404 08:57:25.202548    1918 remote_image.go:135] PullImage "602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.5" from image service failed: rpc error: code = Unavailable d>
Apr 04 08:57:27 ip-10-0-3-248.eu-west-1.compute.internal nodeadm[1918]: {"level":"info","ts":1712221047.207119,"caller":"containerd/sandbox.go:48","msg":"Pulling sandbox image..","image":"602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.5"}
Apr 04 08:57:28 ip-10-0-3-248.eu-west-1.compute.internal nodeadm[1918]: {"level":"info","ts":1712221048.0419524,"caller":"containerd/sandbox.go:53","msg":"Finished pulling sandbox image","image-ref":"sha256:6996f8da07bd405c6f82a549ef041deda57d1d658ec20>
Apr 04 08:57:28 ip-10-0-3-248.eu-west-1.compute.internal nodeadm[1918]: {"level":"info","ts":1712221048.0420063,"caller":"init/init.go:138","msg":"Finished post-launch tasks","name":"containerd"}
Apr 04 08:57:28 ip-10-0-3-248.eu-west-1.compute.internal nodeadm[1918]: {"level":"info","ts":1712221048.042022,"caller":"init/init.go:128","msg":"Ensuring daemon is running..","name":"kubelet"}
Apr 04 08:57:28 ip-10-0-3-248.eu-west-1.compute.internal nodeadm[1918]: {"level":"info","ts":1712221048.0442576,"caller":"init/init.go:132","msg":"Daemon is running","name":"kubelet"}
Apr 04 08:57:28 ip-10-0-3-248.eu-west-1.compute.internal nodeadm[1918]: {"level":"info","ts":1712221048.0443828,"caller":"init/init.go:134","msg":"Running post-launch tasks..","name":"kubelet"}
Apr 04 08:57:28 ip-10-0-3-248.eu-west-1.compute.internal nodeadm[1918]: {"level":"info","ts":1712221048.044407,"caller":"init/init.go:138","msg":"Finished post-launch tasks","name":"kubelet"}
Apr 04 08:57:28 ip-10-0-3-248.eu-west-1.compute.internal systemd[1]: nodeadm-run.service: Deactivated successfully.
Apr 04 08:57:28 ip-10-0-3-248.eu-west-1.compute.internal systemd[1]: Finished nodeadm-run.service - EKS Nodeadm Run.

Originally posted by @stijndehaes in https://github.com/awslabs/amazon-eks-ami/issues/1696#issuecomment-2036621393

stijndehaes commented 4 months ago

@cartermckinnon suggested it might have something to do with EBS lazy block fetching resulting in slow startup for the executables we need: https://github.com/awslabs/amazon-eks-ami/issues/1696#issuecomment-2037546262

stijndehaes commented 4 months ago

One way we can speed things up is to make sure that kubelet and containerd are loaded earlier in the chain. We could for example:

cartermckinnon commented 4 months ago

We can test the EBS hypothesis by enabling fast-restore on the snapshot: https://docs.aws.amazon.com/ebs/latest/userguide/ebs-fast-snapshot-restore.html

This is pretty expensive so I understand if you don't want to give it a try; but if you can update the PR description with your launch steps we can have consistency across our experiments @stijndehaes

stijndehaes commented 4 months ago

@cartermckinnon I am willing to give it a try. As to your launch steps question, I am not enterily sure I understand what you want. Can you elaborate?

stijndehaes commented 4 months ago

Here is the new result, with fast restore on the snapshot enabled. Sadly I see roughly the same timing, I also see similar delays in the logs.

plot

sh-5.2# journalctl -u nodeadm-run
Apr 08 10:05:55 ip-10-0-3-27.eu-west-1.compute.internal systemd[1]: Starting nodeadm-run.service - EKS Nodeadm Run...
Apr 08 10:05:55 ip-10-0-3-27.eu-west-1.compute.internal nodeadm[1914]: {"level":"info","ts":1712570755.2066824,"caller":"init/init.go:49",">
Apr 08 10:05:55 ip-10-0-3-27.eu-west-1.compute.internal nodeadm[1914]: {"level":"info","ts":1712570755.206749,"caller":"init/init.go:57","m>
Apr 08 10:05:55 ip-10-0-3-27.eu-west-1.compute.internal nodeadm[1914]: {"level":"info","ts":1712570755.2111814,"caller":"init/init.go:66",">
Apr 08 10:05:55 ip-10-0-3-27.eu-west-1.compute.internal nodeadm[1914]: {"level":"info","ts":1712570755.2115018,"caller":"init/init.go:68",">
Apr 08 10:05:55 ip-10-0-3-27.eu-west-1.compute.internal nodeadm[1914]: {"level":"info","ts":1712570755.211519,"caller":"init/init.go:148",">
Apr 08 10:05:55 ip-10-0-3-27.eu-west-1.compute.internal nodeadm[1914]: SDK 2024/04/08 10:05:55 DEBUG attempting waiter request, attempt cou>
Apr 08 10:05:55 ip-10-0-3-27.eu-west-1.compute.internal nodeadm[1914]: {"level":"info","ts":1712570755.3542361,"caller":"init/init.go:161",>
Apr 08 10:05:55 ip-10-0-3-27.eu-west-1.compute.internal nodeadm[1914]: {"level":"info","ts":1712570755.3543026,"caller":"init/init.go:162",>
Apr 08 10:05:55 ip-10-0-3-27.eu-west-1.compute.internal nodeadm[1914]: {"level":"info","ts":1712570755.3551373,"caller":"init/init.go:170",>
Apr 08 10:05:55 ip-10-0-3-27.eu-west-1.compute.internal nodeadm[1914]: {"level":"info","ts":1712570755.3551688,"caller":"init/init.go:73",">
Apr 08 10:05:55 ip-10-0-3-27.eu-west-1.compute.internal nodeadm[1914]: {"level":"info","ts":1712570755.3551803,"caller":"init/init.go:78",">
Apr 08 10:05:55 ip-10-0-3-27.eu-west-1.compute.internal nodeadm[1914]: {"level":"info","ts":1712570755.3563209,"caller":"init/init.go:112",>
Apr 08 10:05:55 ip-10-0-3-27.eu-west-1.compute.internal nodeadm[1914]: {"level":"info","ts":1712570755.356349,"caller":"init/init.go:115",">
Apr 08 10:05:55 ip-10-0-3-27.eu-west-1.compute.internal nodeadm[1914]: {"level":"info","ts":1712570755.356361,"caller":"system/local_disk.g>
Apr 08 10:05:55 ip-10-0-3-27.eu-west-1.compute.internal nodeadm[1914]: {"level":"info","ts":1712570755.356374,"caller":"init/init.go:119",">
Apr 08 10:05:55 ip-10-0-3-27.eu-west-1.compute.internal nodeadm[1914]: {"level":"info","ts":1712570755.3563807,"caller":"init/init.go:115",>
Apr 08 10:05:55 ip-10-0-3-27.eu-west-1.compute.internal nodeadm[1914]: {"level":"info","ts":1712570755.3564048,"caller":"system/networking.>
Apr 08 10:05:55 ip-10-0-3-27.eu-west-1.compute.internal nodeadm[1914]: {"level":"info","ts":1712570755.3663833,"caller":"init/init.go:119",>
Apr 08 10:05:55 ip-10-0-3-27.eu-west-1.compute.internal nodeadm[1914]: {"level":"info","ts":1712570755.3664064,"caller":"init/init.go:128",>
Apr 08 10:05:55 ip-10-0-3-27.eu-west-1.compute.internal nodeadm[1914]: {"level":"info","ts":1712570755.3696218,"caller":"init/init.go:132",>
Apr 08 10:05:55 ip-10-0-3-27.eu-west-1.compute.internal nodeadm[1914]: {"level":"info","ts":1712570755.3697126,"caller":"init/init.go:134",>
Apr 08 10:05:55 ip-10-0-3-27.eu-west-1.compute.internal nodeadm[1914]: {"level":"info","ts":1712570755.3697586,"caller":"containerd/sandbox>
Apr 08 10:06:00 ip-10-0-3-27.eu-west-1.compute.internal nodeadm[1914]: {"level":"info","ts":1712570760.477823,"caller":"containerd/sandbox.>
Apr 08 10:06:00 ip-10-0-3-27.eu-west-1.compute.internal nodeadm[1914]: {"level":"info","ts":1712570760.4778726,"caller":"containerd/sandbox>
Apr 08 10:06:00 ip-10-0-3-27.eu-west-1.compute.internal nodeadm[1914]: {"level":"info","ts":1712570760.516175,"caller":"containerd/sandbox.>
Apr 08 10:06:00 ip-10-0-3-27.eu-west-1.compute.internal nodeadm[1914]: E0408 10:06:00.530553    1914 remote_image.go:135] PullImage "602401>
Apr 08 10:06:02 ip-10-0-3-27.eu-west-1.compute.internal nodeadm[1914]: {"level":"info","ts":1712570762.5316176,"caller":"containerd/sandbox>
Apr 08 10:06:03 ip-10-0-3-27.eu-west-1.compute.internal nodeadm[1914]: {"level":"info","ts":1712570763.0406716,"caller":"containerd/sandbox>
Apr 08 10:06:03 ip-10-0-3-27.eu-west-1.compute.internal nodeadm[1914]: {"level":"info","ts":1712570763.0407143,"caller":"init/init.go:138",>
Apr 08 10:06:03 ip-10-0-3-27.eu-west-1.compute.internal nodeadm[1914]: {"level":"info","ts":1712570763.0407345,"caller":"init/init.go:128",>
Apr 08 10:06:03 ip-10-0-3-27.eu-west-1.compute.internal nodeadm[1914]: {"level":"info","ts":1712570763.0431368,"caller":"init/init.go:132",>
Apr 08 10:06:03 ip-10-0-3-27.eu-west-1.compute.internal nodeadm[1914]: {"level":"info","ts":1712570763.0432801,"caller":"init/init.go:134",>
Apr 08 10:06:03 ip-10-0-3-27.eu-west-1.compute.internal nodeadm[1914]: {"level":"info","ts":1712570763.0433705,"caller":"init/init.go:138",>
Apr 08 10:06:03 ip-10-0-3-27.eu-west-1.compute.internal systemd[1]: nodeadm-run.service: Deactivated successfully.
Apr 08 10:06:03 ip-10-0-3-27.eu-west-1.compute.internal systemd[1]: Finished nodeadm-run.service - EKS Nodeadm Run.
sh-5.2# journalctl -u nodeadm-config
Apr 08 10:05:33 localhost systemd[1]: Starting nodeadm-config.service - EKS Nodeadm Config...
Apr 08 10:05:37 localhost nodeadm[1687]: {"level":"info","ts":1712570737.4740438,"caller":"init/init.go:49","msg":"Checking user is root.."}
Apr 08 10:05:37 localhost nodeadm[1687]: {"level":"info","ts":1712570737.4741585,"caller":"init/init.go:57","msg":"Loading configuration..">
Apr 08 10:05:37 localhost nodeadm[1687]: {"level":"info","ts":1712570737.5820394,"caller":"init/init.go:66","msg":"Loaded configuration","c>
Apr 08 10:05:37 localhost nodeadm[1687]: {"level":"info","ts":1712570737.5823655,"caller":"init/init.go:68","msg":"Enriching configuration.>
Apr 08 10:05:37 localhost nodeadm[1687]: {"level":"info","ts":1712570737.5824058,"caller":"init/init.go:148","msg":"Fetching instance detai>
Apr 08 10:05:38 localhost nodeadm[1687]: SDK 2024/04/08 10:05:38 DEBUG attempting waiter request, attempt count: 1
Apr 08 10:05:38 localhost nodeadm[1687]: {"level":"info","ts":1712570738.696702,"caller":"init/init.go:161","msg":"Instance details populat>
Apr 08 10:05:38 localhost nodeadm[1687]: {"level":"info","ts":1712570738.6967807,"caller":"init/init.go:162","msg":"Fetching default option>
Apr 08 10:05:38 localhost nodeadm[1687]: {"level":"info","ts":1712570738.6998506,"caller":"init/init.go:170","msg":"Default options populat>
Apr 08 10:05:38 localhost nodeadm[1687]: {"level":"info","ts":1712570738.699901,"caller":"init/init.go:73","msg":"Validating configuration.>
Apr 08 10:05:38 localhost nodeadm[1687]: {"level":"info","ts":1712570738.6999104,"caller":"init/init.go:78","msg":"Creating daemon manager.>
Apr 08 10:05:38 localhost nodeadm[1687]: {"level":"info","ts":1712570738.7011337,"caller":"init/init.go:96","msg":"Configuring daemons..."}
Apr 08 10:05:38 localhost nodeadm[1687]: {"level":"info","ts":1712570738.7011614,"caller":"init/init.go:103","msg":"Configuring daemon...",>
Apr 08 10:05:38 localhost nodeadm[1687]: {"level":"info","ts":1712570738.701194,"caller":"containerd/config.go:38","msg":"Writing container>
Apr 08 10:05:38 localhost nodeadm[1687]: {"level":"info","ts":1712570738.746515,"caller":"init/init.go:107","msg":"Configured daemon","name>
Apr 08 10:05:38 localhost nodeadm[1687]: {"level":"info","ts":1712570738.74656,"caller":"init/init.go:103","msg":"Configuring daemon...","n>
Apr 08 10:05:47 localhost nodeadm[1687]: {"level":"info","ts":1712570747.596923,"caller":"kubelet/config.go:300","msg":"Detected kubelet ve>
Apr 08 10:05:47 localhost nodeadm[1687]: {"level":"info","ts":1712570747.5988646,"caller":"kubelet/config.go:218","msg":"Setup IP for node">
Apr 08 10:05:47 localhost nodeadm[1687]: {"level":"info","ts":1712570747.5999613,"caller":"kubelet/config.go:351","msg":"Writing kubelet co>
Apr 08 10:05:48 localhost nodeadm[1687]: {"level":"info","ts":1712570748.362136,"caller":"init/init.go:107","msg":"Configured daemon","name>
Apr 08 10:05:48 localhost systemd[1]: nodeadm-config.service: Deactivated successfully.
Apr 08 10:05:48 localhost systemd[1]: Finished nodeadm-config.service - EKS Nodeadm Config.
dims commented 4 months ago

@stijndehaes the output for journalctl -u nodeadm-run is truncated (only 140 characters per line...) can you please capture the full output? (like you did in your issue text right at the top?

from the initial log, it seems like containerd config dump could be taking time? https://github.com/awslabs/amazon-eks-ami/blob/f55411cc8c643fd2a61270c599e9f23c2ca202d9/nodeadm/internal/containerd/sandbox.go#L20-L32

stijndehaes commented 3 months ago

@dims Sorry it took so long to get back to you. I have restried with fast restore and got some better results. I takes around 50s with fast restore (still slower than the 40ish seconds I got with al2 without fast restore) and 100s without fast restore. So that appears to have an influence.

Here are the logs of the services with and without fast restore:

nodeadm-run with fast-restore nodeadm-config with fast-restore nodeadm-config without fast-restore nodeadm-run without fast-restore

cartermckinnon commented 3 months ago

Thanks for testing this out, my findings concur. I'll see what I can do!

stijndehaes commented 3 months ago

Thanks for testing this out, my findings concur. I'll see what I can do!

@cartermckinnon thank you very much, if I can help with anything please reach out :) At the moment I have a good flow of testing things out ;)

stijndehaes commented 3 weeks ago

Just retried it, the result are still similar. The biggest issue appears to be the call to get the version of kubelet, if we could get the version another way I think that would help a lot. But I am not sure what the best way is to store the kubelet version. One thing we could do it store is edit the service file and store the version in the nodeadm-config.service file:

Environment="KUBELET_VERSION=1.2.3"

I will run a try out to see if this helps

dims commented 3 weeks ago

@stijndehaes you mean kubelet --version is taking time?

( guessing from https://github.com/awslabs/amazon-eks-ami/blob/main/nodeadm/internal/kubelet/version.go#L18 )

stijndehaes commented 3 weeks ago

@dims correct! That is the reason for the big timeout between these 2 log lines:

May 22 14:03:48.312560 localhost nodeadm[1689]: {"level":"info","ts":1716386628.3125567,"caller":"init/init.go:103","msg":"Configuring daemon...","name":"kubelet"}
May 22 14:04:01.246690 localhost nodeadm[1689]: {"level":"info","ts":1716386641.2466207,"caller":"kubelet/config.go:300","msg":"Detected kubelet version","version":"v1.29.3"}
stijndehaes commented 3 weeks ago

I did the following experiment. I change the GetKubeletVersion function to the following:

func GetKubeletVersion() (string, error) {
    if kubeletVersion := os.Getenv("KUBELET_VERSION"); kubeletVersion != "" {
        zap.L().Info("Found kubelet version in environment")
        return kubeletVersion, nil
    }
    rawVersion, err := GetKubeletVersionRaw()
    if err != nil {
        return "", err
    }
    version := parseSemVer(*rawVersion)
    return version, nil
}

And changed the systemd config to the following:

[Unit]
Description=EKS Nodeadm Config
Documentation=https://github.com/awslabs/amazon-eks-ami
# run before cloud-init, then user can still execute their
# own workflows from ec2 userdata cloud-init scripts
Before=cloud-init.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nodeadm init --skip run
Environment="KUBELET_VERSION=v1.29.3"

[Install]
WantedBy=multi-user.target

Obviously the hard coded KUBELET_VERSION needs to be changed.

This resulted in the following (result from systemd-analyze) Before change: Startup finished in 698ms (kernel) + 7.574s (initrd) + 1min 2.208s (userspace) = 1min 10.481s graphical.target reached after 1min 2.146s in userspace. After change: Startup finished in 445ms (kernel) + 5.489s (initrd) + 34.253s (userspace) = 40.188s graphical.target reached after 34.189s in userspace.

The nodeadm-config.service went from 21s to 5.12s shaving of around 15s in the hot startup path:

Before: plot

After: plot4

Do you guys think it makes sense to include this optimisation? Obviously we still need to fill in the correct kubelet version. And sadly it is still slower than the AL2 AMI

stijndehaes commented 3 weeks ago

As an FYI here is the systemd-analyze plot from an AL2 Image. Startup finished in 1.355s (kernel) + 1.379s (initrd) + 18.876s (userspace) = 21.611s

As you can see the 45s is still far removed from the 22s it used to be, ofcourse certain changes are because of the new AL2023 base image

plot-al2

dims commented 3 weeks ago

@stijndehaes looking at the graphs for al2023 above, what happens if you disable the update-motd.service? (the before picture shows significant contribution from that one)

dims commented 3 weeks ago

@stijndehaes another one, what happens if you switch the order of getting kubelet version first (caching it) and THEN configuring the kubelet service?

stijndehaes commented 3 weeks ago

@stijndehaes looking at the graphs for al2023 above, what happens if you disable the update-motd.service? (the before picture shows significant contribution from that one)

Just disable update-motd.service results in: Startup finished in 454ms (kernel) + 3.842s (initrd) + 38.513s (userspace) = 42.810s graphical.target reached after 38.055s in userspace.

So actually a great improvement!

no-motd

stijndehaes commented 3 weeks ago

@stijndehaes another one, what happens if you switch the order of getting kubelet version first (caching it) and THEN configuring the kubelet service?

I am not exactly sure if I understand what you mean by this. Can you eleborate?

dims commented 3 weeks ago

@stijndehaes so please do file a PR for disabling the motd and let the team review it.

on the other one, i was thinking of something like this diff where:

kubelet-version.diff.txt

stijndehaes commented 3 weeks ago

@dims made a PR for disabling motd. https://github.com/awslabs/amazon-eks-ami/pull/1910

As to your suggestion to cache the kubelet version the biggest issue is that we are still using the kubelet executable and that needs to be fetched when that part of the EBS is not initialised yet. So it will still be slow.

That was the reason for my suggestion with the Environment variable set during build. There is no dependency anymore on the kubelet executable.