EpistasisLab / Aliro

Aliro: AI-Driven Data Science
https://epistasislab.github.io/Aliro
GNU General Public License v3.0

Handling the large size of CVP, CVR, and QQNR in the machine learning backend. #585

Closed HyunjunA closed 1 year ago

HyunjunA commented 1 year ago

When the size of the CVP, CVR, and QQNR objects exceeds the size limit in the machine learning backend, the experiment variable in React.js does not receive the corresponding values, which causes an error in the front end.

Data name: 1201 BNG BreastTumor.tsv

Error message: (screenshot attached)
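For context, MongoDB caps a single BSON document at 16 MB, which is why oversized CVP/CVR/QQNR payloads fail to store. A minimal sketch of a pre-flight size check; the payload shape and field names here are illustrative, not Aliro's actual schema:

```python
import json

# MongoDB rejects any single BSON document larger than 16 MB.
MAX_BSON_BYTES = 16 * 1024 * 1024

def fits_in_one_document(payload: dict) -> bool:
    """Estimate the serialized size of a result payload before trying to
    store it as a single document. JSON size only approximates BSON size,
    but it is close enough to flag multi-megabyte experiment results."""
    size = len(json.dumps(payload).encode("utf-8"))
    return size <= MAX_BSON_BYTES

# A small result fits; one with millions of values does not.
small = {"experiment_id": 1, "cvp": [0.1, 0.2, 0.3]}
huge = {"experiment_id": 2, "cvp": [0.123456789] * 3_000_000}
print(fits_in_one_document(small))  # True
print(fits_in_one_document(huge))   # False
```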

jay-m-dev commented 1 year ago

We have two potential solutions:

  1. Save this data as a file in the file system.
  2. Break the data for these objects into multiple MongoDB documents.
     2.1 Explore working with BSON documents directly.
     2.2 Explore using GridFS to save this data as a file instead.
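For option 2.2, pymongo ships a `gridfs` module that chunks large files automatically (255 KB chunks by default), bypassing the 16 MB document cap. A hedged sketch; the function and file names are illustrative, and the stub stands in for a real `gridfs.GridFS(db)` so it runs without a MongoDB server:

```python
def save_large_result(fs, filename: str, data: bytes):
    """Store a large result blob via GridFS, which splits it into chunks
    under the hood. `fs` is expected to expose the gridfs.GridFS
    interface, e.g.:
        from pymongo import MongoClient
        import gridfs
        fs = gridfs.GridFS(MongoClient().aliro_db)
    """
    return fs.put(data, filename=filename)

# Stub with the same `put` signature, so the flow can be exercised
# without a running MongoDB instance.
class FakeGridFS:
    def __init__(self):
        self.files = {}
    def put(self, data, filename=None):
        self.files[filename] = data
        return filename  # a real GridFS.put returns an ObjectId

fs = FakeGridFS()
save_large_result(fs, "pca.json", b'{"components": []}')
print("pca.json" in fs.files)  # True
```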
nickotto commented 1 year ago

I think solution 1 is easier to build out and debug, whereas solution 2 would involve more moving parts (i.e., keeping track of where all the documents are and how many shards we need to split the data into).

jay-m-dev commented 1 year ago

Actually, solution 1 would also require keeping track of where documents are, which can lead to orphaned files. GridFS handles this for us automatically by saving files as chunks. I'm leaning more towards solution 2.2. I've traced the flow that saves these files:

  1. Result files (png, json, etc.) are generated in a temporary directory.
  2. A file watcher picks up these files and uploads them to the DB. If a file is .json, it saves the results directly as a BSON document; otherwise the file is sent to GridFS.

The fact that json files are stored as BSON documents is what causes the error. So my proposed solution is to re-route the pca and tsne json files to GridFS. This should be easy to do.
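The proposed re-routing could be a small predicate in the file watcher. A sketch under stated assumptions; the helper name and the watcher internals are hypothetical, not Aliro's actual code:

```python
import os

# json results that are known to grow past the 16 MB BSON limit.
GRIDFS_JSON_PREFIXES = ("pca", "tsne")

def route_result_file(path: str) -> str:
    """Decide where the file watcher should send a generated result file:
    - pca/tsne json files go to GridFS (they can exceed 16 MB),
    - other json files stay as BSON documents,
    - everything else (png, etc.) already goes to GridFS."""
    name = os.path.basename(path).lower()
    if name.endswith(".json"):
        if name.startswith(GRIDFS_JSON_PREFIXES):
            return "gridfs"
        return "bson"
    return "gridfs"

print(route_result_file("/tmp/results/pca_result.json"))  # gridfs
print(route_result_file("/tmp/results/metrics.json"))     # bson
print(route_result_file("/tmp/results/roc_curve.png"))    # gridfs
```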

The second part of this solution is retrieval: these files are fetched by the frontend to be rendered alongside the corresponding png files, so I would just need to retrieve them from GridFS instead. This also should not be hard to implement; I just need to review the retrieval flow.
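Retrieval could mirror the save path. A sketch assuming pymongo's GridFS API (`get_last_version` is a real `gridfs.GridFS` method); the endpoint wiring is hypothetical, and the stubs let it run without a MongoDB server:

```python
def load_result_json(fs, filename: str) -> str:
    """Fetch a json result previously stored in GridFS so the frontend
    can render it alongside the corresponding png. With pymongo this
    would be gridfs.GridFS(db).get_last_version(filename).read()."""
    return fs.get_last_version(filename).read().decode("utf-8")

# Stubs mimicking the GridFS read interface for a local check.
class FakeGridOut:
    def __init__(self, data):
        self._data = data
    def read(self):
        return self._data

class FakeGridFSRead:
    def __init__(self, files):
        self._files = files
    def get_last_version(self, filename):
        return FakeGridOut(self._files[filename])

fs = FakeGridFSRead({"tsne.json": b'{"embedding": []}'})
print(load_result_json(fs, "tsne.json"))  # {"embedding": []}
```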

jay-m-dev commented 1 year ago

Re-opening this issue. Tested, but the fix is unstable. After uploading a dataset, Aliro returns to the landing page without showing the card for the uploaded dataset. If I try to upload the dataset again, I get an error saying it has already been registered. Steps to reproduce:

  1. docker system prune -a -f
  2. docker volume prune -a -f
  3. docker compose up
  4. Go to localhost:5080
  5. Upload a dataset
  6. The landing page will be shown without the uploaded dataset card
jay-m-dev commented 1 year ago

Reviewed updates with @HyunjunA. Confirmed new updates are stable.