aws / eks-anywhere

Run Amazon EKS on your own infrastructure 🚀
https://anywhere.eks.amazonaws.com
Apache License 2.0

Cannot start docker management cluster inside vagrant environment #6068

Closed · rgl closed this issue 1 year ago

rgl commented 1 year ago

What happened:

I'm trying out eks-anywhere for the first time inside a vagrant environment at https://github.com/rgl/eks-anywhere-vagrant by following the docker guide at https://anywhere.eks.amazonaws.com/docs/getting-started/docker/, but it's failing to start for a reason I need your help to troubleshoot.

I've placed the details at https://github.com/rgl/eks-anywhere-vagrant, including the support bundles:

support-bundle-2023-06-21T08_55_14.tar.gz and support-bundle-2023-06-21T08_55_26.tar.gz

What you expected to happen:

I expected it to start without any errors.

This was unexpected, because the vagrant environment is starting a vanilla Ubuntu 22.04 with Docker 24.0.2.

The only thing I did not do was disable cgroups v2, mainly because the Kubernetes 1.27.x that gets launched by default is supposed to support it, but maybe that's the cause?

And also, "cgroups" does not seem to be mentioned anymore in the referenced troubleshooting guide.

How to reproduce it (as minimally and precisely as possible):

Please see the Usage section at https://github.com/rgl/eks-anywhere-vagrant.

In particular, the management cluster is created at https://github.com/rgl/eks-anywhere-vagrant/blob/main/provision-management-cluster.sh
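At its core that script follows the docker getting-started guide linked above, roughly like this (a sketch only; the actual script may use different names and flags):

    # Generate a docker-provider cluster config and create the management cluster.
    # "mgmt" is just an example cluster name.
    eksctl anywhere generate clusterconfig mgmt --provider docker > mgmt.yaml
    eksctl anywhere create cluster -f mgmt.yaml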

Anything else we need to know?:

Environment: Ubuntu 22.04 inside a VM managed by Vagrant.

jonathanmeier5 commented 1 year ago

@rgl we currently only document support for Ubuntu 20.04, not 22.04.

That said, I don't think that's your issue. Based on the support bundle, the problem shows up in the capd-controller logs:

    Failed to create control group inotify object: Too many open files
    Failed to allocate manager object: Too many open files
    [!!!!!!] Failed to allocate manager object.

...
E0621 08:25:39.919677       1 controller.go:329] "Reconciler error" err="failed to exec DockerMachine bootstrap: failed to run cloud config: stdout:  stderr: : error creating container exec: Error response from daemon: Container 6a764cc2250fed64081d43a89bccc199377dbf9c09f0b5bfd6129d350ab9b528 is not running" controller="dockermachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="DockerMachine" DockerMachine="eksa-system/mgmt-md-0-1687335868846-vrfcf" namespace="eksa-system" name="mgmt-md-0-1687335868846-vrfcf" reconcileID=0a00eb9d-1651-4a18-9b36-6c380664c9b9
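If it helps to confirm on your side, the current inotify limits on the host can be read with sysctl (illustrative, not taken from your support bundle):

    # Stock Ubuntu defaults are often too low for the number of containers kind/CAPD start.
    sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances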

There are known kind issues with running out of inotify resources, described in the kind known-issues documentation. We don't test on Vagrant so I can't say definitively, but tweaking those settings might get things working.
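As a sketch, the workaround commonly recommended for kind looks like this (the values are the ones kind suggests; the drop-in file name is just an example):

    # Raise the limits for the running system...
    sudo sysctl fs.inotify.max_user_watches=524288
    sudo sysctl fs.inotify.max_user_instances=512

    # ...and persist them across reboots.
    echo 'fs.inotify.max_user_watches = 524288' | sudo tee /etc/sysctl.d/99-inotify.conf
    echo 'fs.inotify.max_user_instances = 512' | sudo tee -a /etc/sysctl.d/99-inotify.conf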

rgl commented 1 year ago

That was it! It's working now.

Thank You!