DSGT-DLP / Deep-Learning-Playground

Web Application where people new to Deep Learning can input a dataset and toy around with basic Pytorch modules without writing any code
MIT License

[FEATURE]: Create Train and Test Datasets from User-Uploaded Dataset in S3 for /training #913

Open dwu359 opened 1 year ago

dwu359 commented 1 year ago

Feature Name

Create Train and Test Datasets from S3 for /training

Your Name

Daniel Wu

Description

Currently, the training backend can handle only the default datasets for /tabular. To allow user-uploaded datasets to be used for tabular training, implement a dataset creator in training/dataset.py so that the /tabular endpoint route can read a file from S3 given its filename and split it into train and test datasets.

Datasets are currently stored in S3 in the dlp-upload-bucket bucket at the location {uid}/{trainspace_type}/{filename}.
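As a rough illustration of the key layout above, here is a minimal sketch of how a file might be fetched with boto3. The helper names `build_dataset_key` and `read_dataset_bytes` are hypothetical and not part of the repository, and the sketch assumes AWS credentials with read access to the bucket are already configured.

```python
def build_dataset_key(uid: str, trainspace_type: str, filename: str) -> str:
    """Build the S3 object key using the {uid}/{trainspace_type}/{filename} layout."""
    return f"{uid}/{trainspace_type}/{filename}"


def read_dataset_bytes(
    uid: str,
    trainspace_type: str,
    filename: str,
    bucket: str = "dlp-upload-bucket",
) -> bytes:
    """Download the raw file contents from S3 (hypothetical helper)."""
    # Imported lazily so the key helper above works without boto3 installed.
    import boto3

    s3 = boto3.client("s3")
    response = s3.get_object(
        Bucket=bucket,
        Key=build_dataset_key(uid, trainspace_type, filename),
    )
    # "Body" is a streaming object; read() returns the full payload as bytes.
    return response["Body"].read()
```

The key builder is kept separate from the download so the path convention can be checked without touching AWS.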

You can upload files to the bucket with the SST prod endpoint https://em9iri9g4j.execute-api.us-west-2.amazonaws.com/ and the /datasets/user/{type}/{filename}/presigned_upload_url route. EDIT: The above statement is not true; see below.

You will also need a bearer token, which can be obtained using the backend CLI. For more info, run cd training && poetry run python cli.py --help.

github-actions[bot] commented 1 year ago

Hello @dwu359! Thank you for submitting the Feature Request Form. We appreciate your contribution. :wave:

We will look into it and provide a response as soon as possible.

To work on this feature request, you can follow these branch setup instructions:

  1. Checkout the main branch:

     git checkout nextjs
  2. Pull the latest changes from the remote main branch:

     git pull origin nextjs
  3. Create a new branch specific to this feature request using the issue number:

     git checkout -b feature-913

Feel free to make the necessary changes in this branch and submit a pull request when you're ready.

Best regards, Deep Learning Playground (DLP) Team

karkir0003 commented 1 year ago

@NMBridges you're doing this task

dwu359 commented 1 year ago

@NMBridges My bad, this task should deal with reading the dataset files from S3 into training, not writing files to S3.

karkir0003 commented 1 year ago

https://github.com/DSGT-DLP/Deep-Learning-Playground/blob/nextjs/training/training/core/dataset.py

should be the file to implement this endpoint in @NMBridges

karkir0003 commented 1 year ago

@NMBridges also, assume this use case is scoped to tabular data (so reading a CSV from S3 and then building the train/test datasets). See the example dataset creator class in the linked file.
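For reference, the CSV-to-train/test step could look roughly like the sketch below. It uses only the standard library and is not the dataset creator class from the linked file; the function name `split_csv_rows`, the `test_size` and `seed` parameters, and the assumption that the CSV has a header row and is UTF-8 encoded are all illustrative.

```python
import csv
import io
import random


def split_csv_rows(csv_bytes: bytes, test_size: float = 0.2, seed: int = 0):
    """Parse CSV bytes and return (header, train_rows, test_rows).

    Hypothetical helper: assumes a UTF-8 CSV whose first row is a header.
    """
    reader = csv.reader(io.StringIO(csv_bytes.decode("utf-8")))
    header = next(reader)
    # Skip blank lines, which csv.reader yields as empty lists.
    rows = [row for row in reader if row]
    # Shuffle with a seeded generator so the split is reproducible.
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_size)
    return header, rows[n_test:], rows[:n_test]


if __name__ == "__main__":
    sample = b"x,y\n1,0\n2,1\n3,0\n4,1\n5,0\n"
    header, train_rows, test_rows = split_csv_rows(sample, test_size=0.2)
    print(header, len(train_rows), len(test_rows))
```

The bytes passed in would come from the S3 download; the resulting rows would still need to be converted to tensors before being wrapped in a PyTorch dataset.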