[FEATURE]: Get Training Backend to Update User on Training Progress with AWS Appsync

DSGT-DLP / Deep-Learning-Playground

Web Application where people new to Deep Learning can input a dataset and toy around with basic Pytorch modules without writing any code

MIT License


[FEATURE]: Get Training Backend to Update User on Training Progress with AWS Appsync #920

Open dwu359 opened 1 year ago

dwu359 commented 1 year ago

Feature Name

Get Training Backend to Update User on Training Progress with AWS Appsync

Your Name

Daniel Wu

Description

The HTTP protocol we use between the frontend and backend is request-response: the backend can only send data in reply to a request from the frontend. To report training progress, the backend needs to send multiple messages to the frontend after the initial training request. Luckily, AWS AppSync handles that for us with its WebSocket pub/sub APIs, which allow bidirectional communication between the frontend and backend. More specifically, both the frontend and backend can listen on AppSync's WebSocket endpoint for messages on a particular channel ID, and both can send messages to that channel ID by making GraphQL API requests to AppSync.

Use AWS AppSync to update the user on training progress for a particular training request. For now, let's say that training progress means the number of epochs completed.
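As a rough illustration of the "send a message with a channel ID" half of this idea, here is a minimal sketch of how the training backend might build the GraphQL request that publishes the number of epochs completed. Everything schema-related is an assumption made for the example: the `publishProgress` mutation, its `channelId` and `epochsCompleted` arguments, and the use of API key auth (the `x-api-key` header) are hypothetical and would have to match whatever AppSync schema and auth mode the project actually defines. The function only builds the request; it does not send it.

```python
import json
import urllib.request

# Hypothetical mutation. The real name and fields depend on the AppSync
# schema the project defines; nothing here comes from the existing codebase.
PUBLISH_PROGRESS_MUTATION = """
mutation PublishProgress($channelId: ID!, $epochsCompleted: Int!) {
  publishProgress(channelId: $channelId, epochsCompleted: $epochsCompleted) {
    channelId
    epochsCompleted
  }
}
"""


def build_progress_request(endpoint_url, api_key, channel_id, epochs_completed):
    """Build (but do not send) a GraphQL POST that reports training progress."""
    body = {
        "query": PUBLISH_PROGRESS_MUTATION,
        "variables": {
            "channelId": channel_id,
            "epochsCompleted": epochs_completed,
        },
    }
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumes API key auth; IAM or Cognito auth would need different headers.
            "x-api-key": api_key,
        },
        method="POST",
    )
```

The backend would call this once per epoch and send the result with `urllib.request.urlopen(request)`, while the frontend subscribes to the same channel ID over AppSync's WebSocket endpoint.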

github-actions[bot] commented 1 year ago

Hello @dwu359! Thank you for submitting the Feature Request Form. We appreciate your contribution. :wave:

We will look into it and provide a response as soon as possible.

To work on this feature request, you can follow these branch setup instructions:

Check out the nextjs branch:
```
git checkout nextjs
```
Pull the latest changes from the remote nextjs branch:
```
git pull origin nextjs
```
Create a new branch specific to this feature request using the issue number:
```
git checkout -b feature-920
```
Feel free to make the necessary changes in this branch and submit a pull request when you're ready.

Best regards, Deep Learning Playground (DLP) Team

karkir0003 commented 1 year ago

@dwu359 can you provide more detail?

karkir0003 commented 1 year ago

can you provide more detail here?

andrewpeng02 commented 7 months ago

Why AWS AppSync instead of websockets?

karkir0003 commented 7 months ago

@dwu359

dwu359 commented 7 months ago

AppSync seems to handle the websocket work for us, but if you can find a way to implement it directly with websockets, go for it. I will say, though, that I looked into implementing it with websockets before, and the library support for websockets isn't as good as for REST APIs.

andrewpeng02 commented 7 months ago

I just don't see the need to use another service, and it'll also complicate development (we'd have to deploy to some staging env every time we want to test something?). I'll look into libraries

andrewpeng02 commented 7 months ago

What other uses of websockets do you think we'd want to add in the future?

andrewpeng02 commented 7 months ago

Django Channels seems to be the accepted library for websockets, and the implementation won't be too bad. The one thing is we'd probably have to port our training methods to new websocket consumers and also handle authentication a bit differently, in a middleware. So, it seems like the options are:

1. Port the entire train endpoints into websockets. Ninja schemas and stuff may not be supported?
2. Define a websocket to just check on the current training epoch, will require 2 separate requests and figuring out how to connect the two will be annoying but it'll involve less refactoring (we can likely just create a job uuid on the client side and pass it to the endpoint and websocket)
3. Retain the original HTTP training endpoints so we don't have to create new authentication and we have schema support for the input, but instead of doing the training in the endpoint, create a task via Celery and return the job id to the user (this is better for long-running tasks too). Then, the user will open a websocket with Django Channels and the Celery task will update the websocket group periodically with the progress and eventually return the result. Long-term, using Celery tasks would be best if we're planning on having long running train times especially with image data.

In terms of effort, 2 < 1 = 3
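To make the flow of the third option concrete, here is a minimal sketch that uses only the standard library. It is not Celery or Django Channels code: `ProgressBroker` is a hypothetical in-memory stand-in for a Channels group keyed by job id, and the `train` coroutine stands in for the background task. All names are made up for the example. It only shows the shape of the idea: create a job id, subscribe to that job's group, and receive one update per epoch followed by a final message.

```python
import asyncio
import uuid
from collections import defaultdict


class ProgressBroker:
    """In-memory stand-in for a websocket group layer: job id -> subscriber queues."""

    def __init__(self):
        self._groups = defaultdict(list)

    def subscribe(self, job_id):
        # Each subscriber gets its own queue, like one websocket connection.
        queue = asyncio.Queue()
        self._groups[job_id].append(queue)
        return queue

    async def publish(self, job_id, message):
        # Fan the message out to every subscriber of this job.
        for queue in self._groups[job_id]:
            await queue.put(message)


async def train(broker, job_id, num_epochs):
    """Stand-in for the background training task: reports after every epoch."""
    for epoch in range(1, num_epochs + 1):
        await asyncio.sleep(0)  # placeholder for one epoch of real training
        await broker.publish(job_id, {"type": "progress", "epochs_completed": epoch})
    await broker.publish(job_id, {"type": "done", "epochs_completed": num_epochs})


async def run_job(num_epochs):
    broker = ProgressBroker()
    job_id = str(uuid.uuid4())        # the HTTP endpoint would return this id
    queue = broker.subscribe(job_id)  # the client would open its websocket with it
    task = asyncio.create_task(train(broker, job_id, num_epochs))
    received = []
    while True:
        message = await queue.get()
        received.append(message)
        if message["type"] == "done":
            break
    await task
    return received
```

Running `asyncio.run(run_job(3))` collects three progress messages and one final `done` message. Note that the sketch subscribes before the task starts; in a real deployment a client that connects late would miss earlier updates unless the latest progress is also stored somewhere it can be read on connect.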

karkir0003 commented 7 months ago

@dwu359 ?

dwu359 commented 7 months ago

Django Channels seems like a good start. Keep in mind you will need to find some way to host the websocket server (likely through EC2) and access it (likely through API Gateway or something else). I'm sorry I can't help much further; I'm no longer a direct contributor to this project, and it seems like at this point you know more about websockets than I do.