DouglasNeuroInformatics / DataBank

An open-source, web-based platform for managing, versioning, and sharing tabular datasets
https://databank.douglasneuroinformatics.ca/
GNU Affero General Public License v3.0

Performance bottleneck investigation for large datasets #101

Open Flyinchicken opened 2 months ago

Flyinchicken commented 2 months ago

Large dataset upload

  1. It currently takes about 20 minutes to upload the latest merge of the PEPP data.
  2. Why is uploading slow? The CSV file is parsed into a Polars data frame, and the final data structure stored in the database has to be computed by looping over the entire data frame column by column. See the createFromSeries() function in api/src/columns/columns.service.ts
  3. Currently, the frontend shows a spinner and the user cannot perform any other actions while waiting for the upload. It would be nice to have a separate thread on the server to handle the file and data frame parsing.
  4. The dataset should have a 'PROCESSING' state in which users are not allowed to view it. When processing is done, the state of the new dataset should be set to either 'FAILED' or 'SUCCESS'.
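
A rough sketch of the state handling described in points 3 and 4, not the actual DataBank code. The names `datasets`, `startUpload`, and `canView` are hypothetical, and the `Map` stands in for the database. Note that running the work in a promise only frees the request handler; the CPU-heavy column loop would still block the event loop unless `processFile` hands it to a worker thread or a job queue.

```typescript
type DatasetStatus = "PROCESSING" | "SUCCESS" | "FAILED";

interface DatasetRecord {
  id: string;
  status: DatasetStatus;
  error?: string;
}

// Hypothetical in-memory store standing in for the database table.
const datasets = new Map<string, DatasetRecord>();

// Marks the dataset as PROCESSING right away, then runs the (assumed)
// parsing callback and records the outcome as SUCCESS or FAILED.
// The returned promise resolves once the final state has been written.
function startUpload(id: string, processFile: () => Promise<void>): Promise<void> {
  datasets.set(id, { id, status: "PROCESSING" });
  return Promise.resolve()
    .then(processFile)
    .then(() => {
      datasets.set(id, { id, status: "SUCCESS" });
    })
    .catch((err: unknown) => {
      const message = err instanceof Error ? err.message : String(err);
      datasets.set(id, { id, status: "FAILED", error: message });
    });
}

// Only successfully processed datasets may be viewed.
function canView(id: string): boolean {
  return datasets.get(id)?.status === "SUCCESS";
}
```

With this shape the upload endpoint could respond as soon as `startUpload` has been called, and the frontend could poll the status instead of blocking on a spinner.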

Large dataset viewing and management

  1. It takes a long time to compute the view of a dataset.
  2. Changing the page also triggers a recomputation, which is slow.
  3. Plausible solution: cache the different views of each dataset using the CacheManager provided in NestJS. If possible, preload the cache to avoid a slow first access.
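
To make the caching idea concrete, here is a minimal sketch that uses a plain in-memory `Map` instead of the NestJS cache manager, so it runs on its own. `DatasetViewCache`, `ViewKey`, and the key layout (dataset id, page, page size) are assumptions for illustration, not existing DataBank code. The lookup is synchronous for brevity; a real view computation would most likely be async.

```typescript
interface ViewKey {
  datasetId: string;
  page: number;
  pageSize: number;
}

class DatasetViewCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // `now` is injectable so that expiry can be tested without waiting.
  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns the cached view for this page if it is still fresh,
  // otherwise computes it once and stores it with a new expiry time.
  getOrCompute(key: ViewKey, compute: () => V): V {
    const cacheKey = `${key.datasetId}:${key.page}:${key.pageSize}`;
    const hit = this.entries.get(cacheKey);
    if (hit !== undefined && hit.expiresAt > this.now()) {
      return hit.value;
    }
    const value = compute();
    this.entries.set(cacheKey, { value, expiresAt: this.now() + this.ttlMs });
    return value;
  }

  // Drops every cached page of one dataset, e.g. after it is edited.
  // Assumes dataset ids do not contain ':'.
  invalidate(datasetId: string): void {
    for (const cacheKey of Array.from(this.entries.keys())) {
      if (cacheKey.startsWith(`${datasetId}:`)) {
        this.entries.delete(cacheKey);
      }
    }
  }
}
```

Preloading could then be as simple as calling `getOrCompute` for the first page or two once a dataset finishes processing, so the first user request hits a warm cache.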