2i2c-org / infrastructure

Infrastructure for configuring and deploying our community JupyterHubs.
https://infrastructure.2i2c.org
BSD 3-Clause "New" or "Revised" License

AWS cost attribution: put each hub in jupyter-meets-the-earth into its own user nodegroup #5101

Closed GeorgianaElena closed 6 days ago

sgibson91 commented 1 week ago

Having difficulty creating a couple of nodegroups on jmte

2024-11-14 15:49:51 [ℹ]  1 error(s) occurred and nodegroups haven't been created properly, you may wish to check CloudFormation console
2024-11-14 15:49:51 [ℹ]  to cleanup resources, run 'eksctl delete nodegroup --region=us-west-2 --cluster=jupyter-meets-the-earth --name=<name>' for each of the failed nodegroup
2024-11-14 15:49:51 [✖]  waiter state transitioned to Failure
Error: failed to create nodegroups for cluster "jupyter-meets-the-earth"
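
The cleanup hint in the log above (one `eksctl delete nodegroup` call per failed nodegroup) could be scripted roughly as follows. This is only a sketch: the region and cluster come from the log, but the nodegroup names in the demo are hypothetical placeholders, and the script only prints the commands rather than running them.

```python
import shlex


def cleanup_commands(failed_nodegroups,
                     cluster="jupyter-meets-the-earth",
                     region="us-west-2"):
    """Build one `eksctl delete nodegroup` argv list per failed nodegroup."""
    return [
        [
            "eksctl", "delete", "nodegroup",
            f"--region={region}",
            f"--cluster={cluster}",
            f"--name={name}",
        ]
        for name in failed_nodegroups
    ]


if __name__ == "__main__":
    # Hypothetical names; take the real ones from the CloudFormation console.
    for cmd in cleanup_commands(["example-gpu-a", "example-gpu-b"]):
        # Print for review instead of executing anything against the cluster.
        print(shlex.join(cmd))
```

Printing first makes it easy to review the list before pasting anything into a terminal.
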
sgibson91 commented 1 week ago

It's failing to create the nodegroups for the two largest GPU instances, so could it be a quota issue?

sgibson91 commented 1 week ago

https://us-west-2.console.aws.amazon.com/cloudformation/home?region=us-west-2#/stacks?filteringText=&filteringStatus=active&viewNested=true

sgibson91 commented 1 week ago

I tried to request a quota increase, but I think it was denied.

sgibson91 commented 1 week ago

@consideRatio do you mind taking a look when you have 5 mins?

consideRatio commented 1 week ago

Looking at the AWS console, under CloudFormation -> Stacks, I find one stack representing the node group that failed to be created. The error I spot in its events from the creation attempt says:

Resource handler returned message: "The maximum number of rules per security group has been reached. (Service: Ec2, Status Code: 400, Request ID: a30358b7-aba4-4cd4-a0c7-76eb1d89618c)" (RequestToken: 92fc5773-752c-49b4-bedd-aed51481749e, HandlerErrorCode: ServiceLimitExceeded)

I'm not sure what these security group rules relate to, but I imagine it's related to having very many separate node groups in a k8s cluster: more and more rules are needed because of that, and this broke things. I figure the next step is to google that error message and try to figure out what it's really about. Axel demands my attention currently though, so I'm dropping the ball here for now.
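
One way to test the "too many rules" hypothesis would be to count the rules on each security group and compare against the quota. The sketch below works on plain dicts shaped like the `SecurityGroups` list returned by EC2's `DescribeSecurityGroups` call, so it runs without AWS access. The group IDs in the demo are made up, and the default `limit=60` is an assumption about the per-group quota; the real value should be checked in Service Quotas. The count is also approximate, since AWS applies its own accounting (for example for prefix lists) that this does not reproduce.

```python
SOURCE_KEYS = ("IpRanges", "Ipv6Ranges", "PrefixListIds", "UserIdGroupPairs")


def count_rules(permissions):
    """Count rules in an IpPermissions-style list.

    Each CIDR, prefix list, or referenced security group in a permission
    entry is counted as one rule (an approximation of AWS's accounting).
    """
    return sum(
        len(perm.get(key, []))
        for perm in permissions
        for key in SOURCE_KEYS
    )


def groups_near_limit(security_groups, limit=60):
    """Return (group_id, inbound, outbound) for groups at or over `limit`.

    `limit` is an assumed quota value, not read from the account.
    """
    flagged = []
    for group in security_groups:
        inbound = count_rules(group.get("IpPermissions", []))
        outbound = count_rules(group.get("IpPermissionsEgress", []))
        if max(inbound, outbound) >= limit:
            flagged.append((group["GroupId"], inbound, outbound))
    return flagged


if __name__ == "__main__":
    # Hypothetical data: one group whose inbound rules reference many
    # other security groups, as could happen with many nodegroups.
    demo = [
        {
            "GroupId": "sg-example-shared",
            "IpPermissions": [
                {"IpProtocol": "-1",
                 "UserIdGroupPairs": [{"GroupId": f"sg-ng-{i}"} for i in range(60)]},
            ],
            "IpPermissionsEgress": [
                {"IpProtocol": "-1", "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
            ],
        },
    ]
    for group_id, inbound, outbound in groups_near_limit(demo):
        print(f"{group_id}: {inbound} inbound, {outbound} outbound")
```

With real data, the input would be the `SecurityGroups` value from a `describe_security_groups` response for the cluster's VPC.
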

sgibson91 commented 1 week ago

Thanks @consideRatio - tbh, I was expecting an answer tomorrow 😅 appreciated!

sgibson91 commented 6 days ago

So I went back to this today and added node-purpose tags, and suddenly no more errors 🤷🏻‍♀️