Large dataset upload
It currently takes about 20 minutes to upload the latest merge of the PEPP data.
Why is uploading slow? The CSV file is parsed into a Polars data frame. The final data structure of the dataset stored in the database then has to be computed by looping over the entire data frame, column by column. See the createFromSeries() function in api/src/columns/columns.service.ts.
Currently, the frontend shows a spinner and the user cannot perform any other actions while waiting for the upload. It would be nice to have a separate thread on the server to handle the file and data frame parsing.
The dataset should have a 'PROCESSING' state which does not allow users to view it. When processing is done, the state of the new dataset should be set to either 'FAILED' or 'SUCCESS'.
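The state handling described above could look roughly like the following. This is only a sketch: DatasetStatus, DatasetRecord and UploadTracker are hypothetical names, not part of the current code base, and the `work` callback stands in for the CSV parsing and the createFromSeries() step. Note that running `work` without awaiting it frees the HTTP request but does not move CPU-bound parsing off the main thread; a worker thread or job queue would still be needed for that.

```typescript
// Hypothetical names throughout; nothing here exists in the code base yet.
type DatasetStatus = 'PROCESSING' | 'SUCCESS' | 'FAILED';

interface DatasetRecord {
  id: number;
  status: DatasetStatus;
  error?: string;
}

class UploadTracker {
  private readonly datasets = new Map<number, DatasetRecord>();
  private nextId = 1;

  // Registers the dataset as PROCESSING and returns immediately.
  // `work` stands in for parsing the CSV and computing the columns.
  start(work: () => Promise<void>): { id: number; done: Promise<void> } {
    const record: DatasetRecord = { id: this.nextId++, status: 'PROCESSING' };
    this.datasets.set(record.id, record);

    // Promise.resolve().then(work) defers the work and also turns a
    // synchronous throw inside `work` into a rejection.
    const done = Promise.resolve()
      .then(work)
      .then(
        () => {
          record.status = 'SUCCESS';
        },
        (err: unknown) => {
          record.status = 'FAILED';
          record.error = err instanceof Error ? err.message : String(err);
        },
      );

    return { id: record.id, done };
  }

  status(id: number): DatasetStatus | undefined {
    return this.datasets.get(id)?.status;
  }

  // Only successfully processed datasets may be viewed.
  canView(id: number): boolean {
    return this.status(id) === 'SUCCESS';
  }
}
```

With something like this, the upload endpoint could return the new dataset id right away, and the frontend could poll the status instead of blocking on a spinner.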
Large dataset view and manage
It takes a long time to compute the view of a dataset.
Changing the page also triggers a recomputation, which is slow.
Possible solution: cache the different views of each dataset using the CacheManager provided by NestJS. If possible, preload the cache to avoid a slow first access.
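A minimal sketch of the caching idea, to make the proposal concrete. All names here (ViewCache, InMemoryViewCache, DatasetView, CachedViewService) are hypothetical. The ViewCache interface is only loosely modelled on the get/set style of the NestJS cache manager, whose exact signatures depend on the installed version; the in-memory implementation is there so the example runs on its own.

```typescript
// Hypothetical cache shape; a real service would inject the NestJS cache instead.
interface ViewCache<T> {
  get(key: string): Promise<T | undefined>;
  set(key: string, value: T): Promise<void>;
}

class InMemoryViewCache<T> implements ViewCache<T> {
  private readonly store = new Map<string, T>();

  async get(key: string): Promise<T | undefined> {
    return this.store.get(key);
  }

  async set(key: string, value: T): Promise<void> {
    this.store.set(key, value);
  }
}

// Placeholder for whatever a computed page of a dataset looks like.
type DatasetView = { datasetId: number; page: number; rows: string[] };

type ComputeView = (datasetId: number, page: number) => Promise<DatasetView>;

class CachedViewService {
  private readonly cache: ViewCache<DatasetView>;
  private readonly compute: ComputeView;

  constructor(cache: ViewCache<DatasetView>, compute: ComputeView) {
    this.cache = cache;
    this.compute = compute;
  }

  // One cache entry per dataset and page.
  private key(datasetId: number, page: number): string {
    return `dataset:${datasetId}:page:${page}`;
  }

  // Returns the cached view if present; otherwise computes and stores it.
  async getView(datasetId: number, page: number): Promise<DatasetView> {
    const key = this.key(datasetId, page);
    const cached = await this.cache.get(key);
    if (cached !== undefined) {
      return cached;
    }
    const view = await this.compute(datasetId, page);
    await this.cache.set(key, view);
    return view;
  }

  // Warms the cache, e.g. right after an upload finishes processing.
  async preload(datasetId: number, pages: number[]): Promise<void> {
    for (const page of pages) {
      await this.getView(datasetId, page);
    }
  }
}
```

Cached entries would need to be invalidated whenever a dataset is edited or deleted; that part is left out of the sketch.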