DSGT-DLP / Deep-Learning-Playground

Web Application where people new to Deep Learning can input a dataset and toy around with basic Pytorch modules without writing any code
MIT License

Training Container - Transition off of Fargate #779

Open karkir0003 opened 1 year ago

karkir0003 commented 1 year ago

Describe the solution you'd like

Problem: We want to run DLP training tasks on a GPU-backed compute instance (EC2). However, when Fargate provisions the compute on our behalf, there doesn't seem to be any option to configure a GPU, only vCPU and memory. This won't scale well for DLP.

Solution: Change the training container's configuration to use EC2 mode instead of "Fargate" mode. Create an EC2 instance with GPU support (don't pick a super hefty one yet) and use an autoscaling group that scales with the queue size.
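For reference, here is a rough Terraform sketch of what "autoscaling that depends on queue size" could look like. It assumes the training queue is an SQS queue and that the Auto Scaling group already exists; neither is confirmed in this issue. The variable names, resource names, and the threshold are hypothetical placeholders.

```hcl
# Assumptions (not from the issue): the training queue is an SQS queue, and the
# Auto Scaling group already exists. Both names are passed in as variables.
variable "training_queue_name" {
  type        = string
  description = "Name of the (assumed) SQS queue holding pending training jobs"
}

variable "training_asg_name" {
  type        = string
  description = "Name of the Auto Scaling group that backs the training cluster"
}

# Add one GPU instance each time the alarm below fires.
resource "aws_autoscaling_policy" "scale_out_on_backlog" {
  name                   = "training-scale-out"
  autoscaling_group_name = var.training_asg_name
  policy_type            = "SimpleScaling"
  adjustment_type        = "ChangeInCapacity"
  scaling_adjustment     = 1
  cooldown               = 300
}

# Fire when more than `threshold` messages have been waiting for two minutes.
resource "aws_cloudwatch_metric_alarm" "queue_backlog" {
  alarm_name          = "training-queue-backlog"
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateNumberOfMessagesVisible"
  dimensions          = { QueueName = var.training_queue_name }
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 2
  comparison_operator = "GreaterThanThreshold"
  threshold           = 5 # placeholder; tune to real job durations
  alarm_actions       = [aws_autoscaling_policy.scale_out_on_backlog.arn]
}
```

A matching scale-in policy (negative `scaling_adjustment`, alarm on an empty queue) would be needed as well so idle GPU instances get terminated.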

Update the terraform configuration for training cluster accordingly and redeploy.

All EC2 instances + autoscale group must be written in terraform

@thomaschin35 is a great point of contact for Terraform work.

Additional context

Terraform is very important!

Setup Instructions (what branch to work off of)

Run the following commands:

git checkout nextjs
git pull origin nextjs
git checkout -b ecs-training-container-no-fargate

FYI: If you can't immediately run git checkout nextjs, commit your changes on the current branch or run git stash first, then execute the commands above.

karkir0003 commented 1 year ago

https://github.com/aws/containers-roadmap/issues/88

karkir0003 commented 1 year ago

This GitHub issue explains that Fargate currently doesn't support GPU instances.

karkir0003 commented 1 year ago

Steps: (write this stuff in terraform and deploy only what you change, NOT everything)

  1. Create an EC2 instance that supports GPU provisioning.
  2. Configure an autoscaling group that launches that EC2 instance.
  3. Add CloudWatch logging support for the created EC2 instance.
  4. In the deep-learning-playground-kernels ECS cluster, remove Fargate as the basis and select the "EC2" option. Set the EC2 instance type to the instance created in step 1 and add the autoscaling group from step 2 to the ECS cluster configuration. Almost everything else should stay the same.
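The four steps above could be sketched in Terraform roughly as follows. This is a starting point under assumptions, not the project's actual configuration: the instance type, subnet and instance-profile variables, log group name, and all resource names are hypothetical. Only the cluster name comes from this issue.

```hcl
# Hypothetical inputs: none of these values come from the issue.
variable "subnet_ids" {
  type = list(string)
}

variable "instance_profile_name" {
  type        = string
  description = "Instance profile with the ECS container-instance role attached"
}

# Latest ECS-optimized Amazon Linux 2 AMI with GPU drivers, published by AWS.
data "aws_ssm_parameter" "ecs_gpu_ami" {
  name = "/aws/service/ecs/optimized-ami/amazon-linux-2/gpu/recommended/image_id"
}

# Step 1: a modest GPU instance definition.
resource "aws_launch_template" "training" {
  name_prefix   = "dlp-training-"
  image_id      = data.aws_ssm_parameter.ecs_gpu_ami.value
  instance_type = "g4dn.xlarge" # assumed small GPU type

  iam_instance_profile {
    name = var.instance_profile_name
  }

  # Register the instance with the existing cluster on boot.
  user_data = base64encode(<<-EOT
    #!/bin/bash
    echo "ECS_CLUSTER=deep-learning-playground-kernels" >> /etc/ecs/ecs.config
  EOT
  )
}

# Step 2: Auto Scaling group built from the launch template.
resource "aws_autoscaling_group" "training" {
  name_prefix         = "dlp-training-"
  min_size            = 0
  max_size            = 2 # placeholder ceiling
  vpc_zone_identifier = var.subnet_ids

  launch_template {
    id      = aws_launch_template.training.id
    version = "$Latest"
  }
}

# Step 3: log group that the training container can write to.
resource "aws_cloudwatch_log_group" "training" {
  name              = "/dlp/training"
  retention_in_days = 14
}

# Step 4: expose the group to the cluster as an EC2 capacity provider.
resource "aws_ecs_capacity_provider" "training" {
  name = "dlp-training-ec2"

  auto_scaling_group_provider {
    auto_scaling_group_arn = aws_autoscaling_group.training.arn
  }
}

resource "aws_ecs_cluster_capacity_providers" "kernels" {
  cluster_name       = "deep-learning-playground-kernels"
  capacity_providers = [aws_ecs_capacity_provider.training.name]
}
```

Note that `aws_ecs_cluster_capacity_providers` manages the full list of providers for the cluster, so listing only the EC2 provider is what drops Fargate. The task definition would presumably also need its launch type switched to EC2, a GPU resource requirement on the container, and the `awslogs` driver pointed at the log group; those parts are not shown here.
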

karkir0003 commented 1 year ago

@noah-iversen @thomaschin35 FYI

karkir0003 commented 1 year ago

Terraform only knows about the resources you create through it (or explicitly import into its state).
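If the cluster turns out to have been created by hand in the console, one possible way to bring it under Terraform is an `import` block. The sketch below assumes Terraform 1.5 or later; the resource address `kernels` is a hypothetical name, and the cluster name is the one mentioned earlier in this thread.

```hcl
# Requires Terraform 1.5+. "kernels" is a hypothetical resource address.
import {
  to = aws_ecs_cluster.kernels
  id = "deep-learning-playground-kernels" # ECS clusters are imported by name
}

resource "aws_ecs_cluster" "kernels" {
  name = "deep-learning-playground-kernels"
}
```

Running `terraform plan` afterwards should show the import and any drift between this block and the real cluster before anything is changed.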