Open · karkir0003 opened this issue 1 year ago
This GitHub issue discusses how Fargate currently doesn't support configuring GPU instances.
Steps (write these in Terraform and deploy only what you change, NOT everything):
deep-learning-playground-kernels
ECS cluster: remove Fargate as the basis and select the "EC2" option. Set the EC2 instance type to the instance created in step 1, and add the autoscaling group you created in step 2 to the ECS cluster configuration. Everything else should mostly remain the same. @noah-iversen @thomaschin35 FYI
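The cluster change described above could be sketched in Terraform roughly as follows. This is a hypothetical outline, not the actual DLP configuration: resource names, the target capacity, and the autoscaling group reference are placeholders.

```hcl
# Hypothetical sketch: ECS cluster backed by an EC2 capacity provider
# instead of Fargate. Names and values are placeholders.
resource "aws_ecs_cluster" "training" {
  name = "dlp-training"
}

resource "aws_ecs_capacity_provider" "training_gpu" {
  name = "dlp-training-gpu"

  auto_scaling_group_provider {
    # ASG assumed to be defined elsewhere in the Terraform config
    auto_scaling_group_arn = aws_autoscaling_group.training_gpu.arn

    managed_scaling {
      status          = "ENABLED"
      target_capacity = 100
    }
  }
}

resource "aws_ecs_cluster_capacity_providers" "training" {
  cluster_name       = aws_ecs_cluster.training.name
  capacity_providers = [aws_ecs_capacity_provider.training_gpu.name]
}
```

With a capacity provider attached this way, ECS can scale the EC2 instances in the group up and down as tasks are scheduled, which replaces what Fargate was doing implicitly.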
Remember: Terraform only knows about the services you create through it.
Describe the solution you'd like
Problem: We want to be able to run DLP training tasks on a compute instance (EC2) with a GPU. However, when we configure Fargate to provision the instances on our behalf, there doesn't seem to be an option to configure a GPU (just vCPU and memory). This isn't going to scale well for DLP.
Solution: change the training cluster's configuration to use the EC2 launch type instead of "Fargate". Create an EC2 instance with GPU support (don't go for a super hefty one yet) and use an autoscaling group that scales with queue size.
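A minimal Terraform sketch of the GPU instance and its autoscaling group might look like this. Everything here is an assumption for illustration: the `g4dn.xlarge` type (a modest entry-level GPU instance, matching the "not super hefty" guidance), the AMI and subnet variables, and the size bounds are all placeholders to be replaced with DLP's real values.

```hcl
# Hypothetical sketch: a modest GPU instance behind an autoscaling group.
# AMI, subnets, instance type, and names are placeholders.
resource "aws_launch_template" "training_gpu" {
  name_prefix   = "dlp-training-gpu-"
  image_id      = var.ecs_gpu_ami_id   # an ECS GPU-optimized AMI
  instance_type = "g4dn.xlarge"        # entry-level GPU, not "super hefty"
}

resource "aws_autoscaling_group" "training_gpu" {
  name                = "dlp-training-gpu"
  min_size            = 0              # scale to zero when the queue is empty
  max_size            = 2
  vpc_zone_identifier = var.private_subnet_ids

  launch_template {
    id      = aws_launch_template.training_gpu.id
    version = "$Latest"
  }
}
```

Scaling on queue size would then be a separate scaling policy (for example, one driven by the queue-depth metric of whatever queue DLP uses), attached to this autoscaling group.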
Update the Terraform configuration for the training cluster accordingly and redeploy.
All EC2 instances and the autoscaling group must be written in Terraform.
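To honor the "deploy only what you change" note above, one option is a targeted plan and apply. The resource addresses below are hypothetical; substitute the actual ones from the DLP Terraform configuration.

```
# Sketch: deploy only the changed resources (addresses are placeholders)
terraform plan  -target=aws_ecs_cluster.training -target=aws_autoscaling_group.training_gpu
terraform apply -target=aws_ecs_cluster.training -target=aws_autoscaling_group.training_gpu
```

Targeted applies are meant for surgical changes like this; a full `terraform apply` should still be run eventually so the state converges.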
@thomaschin35 is a great point of contact for Terraform work.
Additional context
Terraform is very important!
Setup Instructions (what branch to work off of)
Run the following commands:
FYI: If you are not able to immediately run `git checkout nextjs`, make sure you commit your changes on the current branch or run `git stash`, and then execute the above commands.
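The stash-then-checkout recovery described above can be sketched as follows. The demo sets up a throwaway repo so it runs anywhere; in DLP you would only run the last three commands in your existing clone.

```shell
# Demo of recovering when `git checkout nextjs` is blocked by local changes.
# The throwaway repo below is just scaffolding for the example.
cd "$(mktemp -d)"
git init -q demo && cd demo
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m init
git branch nextjs                          # stand-in for the real nextjs branch
echo wip > notes.txt && git add notes.txt  # uncommitted work in progress

git stash -q            # shelve the uncommitted changes
git checkout -q nextjs  # the checkout now succeeds
git stash pop -q        # restore the shelved changes on nextjs
git branch --show-current
```

The final command prints `nextjs`, confirming the switch, and `notes.txt` is back in the working tree after the pop.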