Prepare starting/stopping script for VM for training

praiskup commented 1 week ago

Hmm, I'm getting: botocore.exceptions.ClientError: An error occurred (VcpuLimitExceeded) when calling the RunInstances operation: You have requested more vCPU capacity than your current vCPU limit of 64 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.
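For the start/stop script, it may help to tell this quota error apart from other failures. A minimal sketch, assuming the exception is a botocore `ClientError`, which carries the error code in `exc.response["Error"]["Code"]`; the helper names `aws_error_code` and `is_vcpu_limit_error` are hypothetical, not part of any existing script:

```python
def aws_error_code(exc):
    """Return the AWS error code of a botocore-style exception, or None.

    botocore's ClientError exposes the parsed error in `exc.response`;
    any other exception type simply yields None here.
    """
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code")


def is_vcpu_limit_error(exc):
    """True when the failure is the vCPU quota error quoted above."""
    return aws_error_code(exc) == "VcpuLimitExceeded"
```

In the script this would sit in an `except botocore.exceptions.ClientError as exc:` branch around the `RunInstances` call, so that a quota problem can be reported differently from, say, a permission problem.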

praiskup commented 1 week ago

I think I don't want to experiment with this: https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.1/html/installing/installing_on_aws#converting_image_to_ami

@TomasTomecek wdyt? Is it enough to start with F40 and move later? If we have an AMI to use, restarting the machine will be trivial.
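A rough sketch of what the start/stop part could look like once an instance exists. It assumes a boto3 EC2 client is passed in as `ec2`; `start_instances` and `stop_instances` are real boto3 client methods, while the function names `start_vm` and `stop_vm` and the instance ID in the usage note are placeholders:

```python
def start_vm(ec2, instance_id):
    """Ask EC2 to start a stopped instance; return the state it reports."""
    result = ec2.start_instances(InstanceIds=[instance_id])
    return result["StartingInstances"][0]["CurrentState"]["Name"]


def stop_vm(ec2, instance_id):
    """Ask EC2 to stop a running instance; return the state it reports."""
    result = ec2.stop_instances(InstanceIds=[instance_id])
    return result["StoppingInstances"][0]["CurrentState"]["Name"]
```

Usage would be along the lines of `ec2 = boto3.client("ec2", region_name="eu-west-1")` followed by `start_vm(ec2, "i-0123456789abcdef0")`, where the instance ID is made up. Taking the client as an argument keeps the functions easy to try against a fake client without touching AWS.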

TomasTomecek commented 1 week ago

> I think I don't want to experiment with this: https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.1/html/installing/installing_on_aws#converting_image_to_ami
>
> @TomasTomecek wdyt? Is it enough to start with F40 and move later? If we have ami to use, restarting the machine will be trivial.

Nice, you found it yourself, I just got here to post the same link :facepalm:

Agreed, let's go with F40 for now (though the 01 VM has F39 so that we can have the old GCC; I don't know when NVIDIA will release new drivers that work on F40, maybe they already have).

We can switch to RHEL AI in the future.

xsuchy commented 5 days ago

> don't know when nvidia releases new drivers that will work in F40, maybe they already do

@tt can you check it? Otherwise I would go with F39.

praiskup commented 5 days ago

> This issue is being transferred. Timeline may not be complete until it finishes.

Hehe, I think this is never going to be migrated :-) so I'm closing the original one.

praiskup commented 5 days ago

@xsuchy I need your help with this, per this infra announcement: https://lists.fedoraproject.org/archives/list/infrastructure@lists.fedoraproject.org/thread/4ZZQBIJ5XS7HSP44EXMD4OKGXDUPBV34/

praiskup commented 4 days ago

@jpodivin @TomasTomecek do you think we need a machine with 64+ vCPUs? (I'm not sure how useful that is, or whether we should be hunting for more powerful graphics instead.)

xsuchy commented 4 days ago

Case ID 172891961400818 opened, with this content:

Limit increase request 1
Service: EC2 Instances
Primary Instance Type: All G instances
Region: EU (Ireland)
Limit name: Instance Limit
New limit value: 270
------------
Use case description: We want to spawn g5.48xlarge in Ireland. It has 192 VCPUs.
When we tried to spawn it we got:
 botocore.exceptions.ClientError: An error occurred (VcpuLimitExceeded) when calling the RunInstances operation: You have requested more vCPU capacity than your current vCPU limit of 64 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.

So I am opening this request.

TomasTomecek commented 4 days ago

I filtered all instances with 192GB GPU memory in Ireland:

We can see they are pretty different, though all still super-powerful. I guess that's how Amazon builds these powerful machines - ton of RAM and vCPU.

One that stands out as cheaper, inf2.24xlarge (INF2 24xlarge), is unknown to me: it has an Amazon GPU that I have never heard of, and I fear our software stack wouldn't work with it.

Back to your question: p2.16xl has fewer vCPUs and is cheaper by 10%; that one should work for us as well.

praiskup commented 4 days ago

I started a test p2.16xl instance, but with F40 (it needs to be F39 because of the CUDA drivers), and I'm not able to drop it (I need @xsuchy's help).

praiskup commented 2 days ago

@xsuchy I still see:

An error occurred (VcpuLimitExceeded) when calling the RunInstances operation: You have requested more vCPU capacity than your current vCPU limit of 128 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.

:shrug: seems like the process in the customer ticket failed.
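One possible reading, assuming the request was still for g5.48xlarge (192 vCPUs, per the support case above): the limit did go up from 64 to 128, but that is still below what this instance type needs. A small sketch of the arithmetic; `vcpus_missing` is a hypothetical helper, and `in_use` stands for vCPUs of instances already running in the same bucket, which also count against the limit:

```python
def vcpus_missing(limit, requested, in_use=0):
    """How many vCPUs the bucket limit falls short by (0 if the request fits)."""
    return max(0, in_use + requested - limit)


G5_48XLARGE_VCPUS = 192  # figure taken from the support case text

print(vcpus_missing(64, G5_48XLARGE_VCPUS))   # 128 short with the original limit
print(vcpus_missing(128, G5_48XLARGE_VCPUS))  # 64 short with the current limit
print(vcpus_missing(270, G5_48XLARGE_VCPUS))  # 0, fits with the requested limit
```

If that reading is right, the ticket was only partially granted rather than failed outright.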

praiskup commented 2 days ago

Also: You are not authorized to perform this operation. User: arn:aws:iam::125523088429:user/logdetective is not authorized to perform: ec2:AssignIpv6Addresses on resource:

xsuchy commented 2 days ago

I added ec2:AssignIpv6Addresses to the allow list in IAM.
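For reference, a sketch of what such an allow statement typically looks like as an IAM policy document. The actual policy attached to the logdetective user is not shown in this thread, so the `Resource` value of `"*"` and the helper name `allow_actions_policy` are assumptions made only for illustration:

```python
import json


def allow_actions_policy(actions, resource="*"):
    """Build a minimal IAM policy document that allows the given actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(actions),
                "Resource": resource,  # assumption: not scoped to specific resources
            }
        ],
    }


policy = allow_actions_policy(["ec2:AssignIpv6Addresses"])
print(json.dumps(policy, indent=2))
```

Other actions the script turns out to need could be appended to the same `Action` list as they show up in "not authorized" errors.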

fedora-copr / logdetective

Prepare starting/stopping script for VM for training #77