clusterinthecloud / support

If you need help with Cluster in the Cloud, this is the right place

Catch errors when trying to provision an instance type that is not available. #26

Open willprice opened 3 years ago

willprice commented 3 years ago

When I set an instance type in limits.yaml that my account is not able to launch (here the vCPU limit for it is 0), the following occurs:

citc $ cat limits.yaml
g3s.xlarge: 2
citc $ finish

user $ srun --pty -c 2  -I bash

citc $ tail /var/log/slurm/elastic.log
2020-12-11 10:57:04,484 startnode  ERROR     problem launching instance: An error occurred (VcpuLimitExceeded) when calling the RunInstances operation: You have requested more vCPU capacity than your current vCPU limit of 0 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.

From the user's perspective, srun simply hangs and the error stays hidden in elastic.log. It would be better if srun failed fast instead of hanging until a timeout, or if limits.yaml were checked when finish is executed to determine whether such an instance can actually be launched.
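
To make the second suggestion concrete, here is a rough sketch of what a check at finish time could look like. It is only an illustration, not how Cluster in the Cloud is implemented: `check_limits`, the `Probe` type and `fake_probe` are hypothetical names, and the probe is a stand-in for whatever cloud-side check (quota lookup, trial launch, etc.) turns out to be appropriate.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical probe signature: given an instance type, return None if it
# can be launched, otherwise a short human-readable reason.
Probe = Callable[[str], Optional[str]]


def check_limits(limits: Dict[str, int], probe: Probe) -> List[str]:
    """Return one message per instance type in limits that cannot be launched."""
    problems = []
    for instance_type, count in limits.items():
        if count <= 0:
            # Nothing was requested for this type, so there is nothing to check.
            continue
        reason = probe(instance_type)
        if reason is not None:
            problems.append(f"{instance_type}: {reason}")
    return problems


def fake_probe(instance_type: str) -> Optional[str]:
    """Stand-in for a real cloud query; the numbers here are made up."""
    vcpu_limit = {"g3s.xlarge": 0, "t3.micro": 32}
    if vcpu_limit.get(instance_type, 0) == 0:
        return "vCPU limit is 0 (VcpuLimitExceeded)"
    return None


# Parsed contents of limits.yaml (assumed to be a mapping of type to count).
problems = check_limits({"g3s.xlarge": 2, "t3.micro": 4}, fake_probe)
for problem in problems:
    print("limits.yaml:", problem)
```

With something like this, finish could refuse to continue (or at least print a warning) when the list of problems is not empty, so the misconfiguration shows up before anyone runs srun.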