DSGT-DLP / Deep-Learning-Playground

Web application where people new to deep learning can input a dataset and toy around with basic PyTorch modules without writing any code
MIT License

Migrate training into Celery and upload results to S3 #1157

Closed andrewpeng02 closed 6 months ago

andrewpeng02 commented 7 months ago

Migrate training into Celery and upload results to S3

GitHub Issue Number Here: #1136

What user problem are we solving?

Currently, we do the training during the HTTP request. I plan to change the train HTTP endpoints to schedule a training job via Celery and return the job id in the response. This offers several advantages:

- Long training tasks (>2 min) shouldn't be done inside an HTTP request; scheduling the training job lets the endpoint return quickly (a minimal sketch of this pattern follows below).
- Eventually, we can decouple the backend from the training, so that we can use cheaper EC2 instances for the Django backend and GPU instances for the actual training.
- Notifying the user will be done via websockets in this issue (https://github.com/DSGT-DLP/Deep-Learning-Playground/issues/920#event-11765625412); for now, I'll create an HTTP endpoint that the user can poll to retrieve the training results.
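
To make the flow concrete, here is a minimal sketch of the enqueue-and-return pattern described above. The task name `train_model`, the view `start_training`, and the request field name are hypothetical, not the PR's actual code:

```python
# Hypothetical sketch: schedule training as a Celery job and return the
# job id immediately, instead of training inside the HTTP request.
from celery import Celery
from django.http import JsonResponse

app = Celery("training")  # broker configuration omitted here (see the SQS sketch below)

@app.task
def train_model(trainspace_id: str) -> None:
    # The long-running training loop lives here, on a worker process.
    ...

def start_training(request):
    trainspace_id = request.POST["trainspaceId"]  # field name is an assumption
    result = train_model.delay(trainspace_id)     # enqueue; returns an AsyncResult
    return JsonResponse({"job_id": result.id})    # respond quickly with the job id
```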

What solution does this PR provide?

Training jobs are executed by a Celery worker, which polls the queue set up in AWS SQS. Currently you have to run the Celery worker locally, but eventually the workers will run on g4dn.xlarge EC2 instances. In this PR, I added Celery endpoints that the backend will call. The frontend then requests the training results from GET /api/training/results/{trainspaceId} and displays the data.
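
As a rough illustration of the worker/queue setup, Celery can be pointed at SQS via an `sqs://` broker URL, and the results endpoint can serve whatever the worker uploaded to S3. The region, bucket name, and key layout below are assumptions, not the PR's actual configuration:

```python
# Illustrative only: point Celery at SQS and serve results uploaded to S3.
import json

import boto3
from celery import Celery
from django.http import JsonResponse

app = Celery("training", broker="sqs://")  # credentials come from the AWS profile/env
app.conf.broker_transport_options = {"region": "us-east-1"}  # region is an assumption

s3 = boto3.client("s3")

def training_results(request, trainspace_id: str):
    # Backs GET /api/training/results/{trainspaceId}: fetch whatever the
    # worker uploaded for this trainspace and return it as JSON.
    obj = s3.get_object(Bucket="dlp-training-results", Key=f"{trainspace_id}.json")
    return JsonResponse(json.loads(obj["Body"].read()))
```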

I also moved some files into /celery to make it clearer that the Celery worker operates there (though it will also access files in the Django app).

Testing Methodology

1. Install the new backend dependencies
2. Start the frontend using `dlp-cli frontend start`
3. From the training/ folder, start the backend using `AWS_PROFILE=dlp docker compose up --build`
4. Start SST locally, using `AWS_PROFILE=sst`
5. Test training on tabular regression and classification data (a minimal polling sketch follows below)
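
For step 5, something like the following hypothetical snippet shows how a client could poll the new results endpoint until the job finishes; the base URL, the "not ready yet" status handling, and the 5-second interval are all assumptions:

```python
# Hypothetical client-side polling of the new results endpoint.
import time

import requests

def wait_for_results(trainspace_id: str, base_url: str = "http://localhost:8000") -> dict:
    while True:
        resp = requests.get(f"{base_url}/api/training/results/{trainspace_id}")
        if resp.status_code == 200:
            return resp.json()  # job finished; results uploaded and served
        time.sleep(5)  # job likely still running; try again shortly
```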

https://github.com/DSGT-DLP/Deep-Learning-Playground/assets/47485510/0fb5a43e-0948-4029-b8d7-575a288997ad

[Screenshot: 2024-04-09 at 3:47 PM]

Any other considerations

sonarcloud[bot] commented 7 months ago

Quality Gate passed

Issues
3 New issues
54 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

karkir0003 commented 7 months ago

@andrewpeng02 Question RE testing video:

1. After you submitted the training requests, were the results shown just dummy data for now, or were they actual results?
karkir0003 commented 6 months ago

AWESOME PR @andrewpeng02! I left a few nits in the PR, but it looks really good.

Quick question: Will there be any documentation to understand the development process for say adding a new training type now that we have a celery based structure?

andrewpeng02 commented 6 months ago

It shouldn't be difficult to support new training types; you should be able to figure it out by referencing the existing tabular and image training jobs.
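
By way of illustration only (this is not project documentation), a new training type would presumably follow the same shape as the existing jobs: another Celery task that loads the trainspace config, trains, and uploads results. All names below are hypothetical:

```python
# Speculative sketch of adding a new training type, mirroring the
# existing tabular/image jobs.
from celery import shared_task

@shared_task
def train_audio(trainspace_id: str) -> None:
    # 1. Load the trainspace config and dataset for this job id.
    # 2. Build the PyTorch model and run the training loop.
    # 3. Upload metrics and artifacts to S3 under the trainspace id.
    ...
```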

sonarcloud[bot] commented 6 months ago

Quality Gate passed

Issues
2 New issues
42 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.8% Duplication on New Code

See analysis details on SonarCloud