OHDSI / Andromeda

AsynchroNous Disk-based Representation of MassivE DAta: An R package aimed at replacing ff for storing large data objects.
https://ohdsi.github.io/Andromeda/

Parallel batchApply #6

Closed tomseinen closed 3 years ago

tomseinen commented 3 years ago

Hello

I was wondering if it would be possible to parallelize the batchApply function, similar to, for example, the future.apply::future_apply() function (overview).

It would then be possible to apply a function to multiple batches in parallel, speeding up the operation.

Example: Currently, the batchApply function is used by the PatientLevelPrediction package to convert a covariateData object into a sparse matrix. [link to code] For a large covariateData object, this operation can take quite a while on a single CPU.

Would the architecture of Andromeda allow for this apply function to be parallelized? One practical solution could be to use the foreach and doParallel packages to split up the batch operations.
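A minimal sketch of the idea, using base R's parallel package (a foreach/doParallel version would look similar): split the data into batches and apply a function to each batch on worker processes. Here mtcars stands in for an Andromeda table and processBatch is a hypothetical callback; the sketch assumes all batches fit in memory at once, which is not how batchApply streams data.

```r
library(parallel)

# Hypothetical per-batch operation (stand-in for real work on a batch)
processBatch <- function(batch) {
  nrow(batch)
}

# mtcars split into 4 chunks stands in for the batches streamed by batchApply
batches <- split(mtcars, rep(1:4, length.out = nrow(mtcars)))

# Dispatch the batches to two worker processes
cl <- makeCluster(2)
results <- parLapply(cl, batches, processBatch)
stopCluster(cl)

sum(unlist(results))  # total rows processed across all batches
```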

I am happy to try and implement this, but I am wondering if maybe someone has other thoughts or ideas on this.

Thanks, Tom

schuemie commented 3 years ago

This would be non-trivial, since each batch would need to be sent to the processing node right after reading (we can't read all batches and then distribute them, because they might not fit in memory). OHDSI's framework for parallel processing is ParallelLogger, and we'd need to take apart the clusterApply function to make that happen.

But I'm actually pretty sure the reason that function in PLP is slow is simply the repeated appending to the same data object in memory. For every 100,000 sparse rows in the data, a new, ever-larger copy of the large matrix needs to be created, and allocating the memory for that object takes a long time, every time. To create a giant data frame from batch calls, you would create all the small data frames and then do a single call to do.call(rbind, dfs) or dplyr::bind_rows(dfs). I guess there's something similar you could do for Matrix::sparseMatrix objects?
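The contrast described above can be sketched with plain data frames (illustrative only, not the PLP code): growing one object per batch copies the accumulated result every iteration, while collecting the pieces and combining them once does a single allocation.

```r
batches <- split(mtcars, rep(1:4, length.out = nrow(mtcars)))

# Slow pattern: each rbind reallocates and copies the ever-growing result
grown <- mtcars[0, ]
for (b in batches) {
  grown <- rbind(grown, b)
}

# Fast pattern: collect all pieces first, then combine in a single call
combined <- do.call(rbind, batches)
# (or dplyr::bind_rows(batches))
```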

(Also, the global assignment operator <<- really should not be used in package functions.)

tomseinen commented 3 years ago

Thank you for your explanation, Martijn. I see what you mean; you are right. I was looking at the wrong function to solve my issue: the PLP code was what made it slow.

I solved it by skipping Andromeda's batchApply altogether and editing PLP to put the covariateData directly into a sparseMatrix object. The time went down from 2 hours to 1 minute...
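For reference, the single-call construction looks roughly like this (hypothetical triplet vectors, not the actual covariateData fields or the real PLP code):

```r
library(Matrix)

# Hypothetical row/column/value triplets as extracted from covariateData
rowIds <- c(1, 1, 2, 3)
covariateIds <- c(1, 2, 2, 3)
values <- c(1, 1, 1, 1)

# One sparseMatrix() call instead of batch-wise appending
m <- sparseMatrix(i = rowIds, j = covariateIds, x = values)
dim(m)  # 3 rows (e.g. subjects) by 3 columns (e.g. covariates)
```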

I will take this to the PLP team.

Thank you!