federicomarini / quantiseqr

https://federicomarini.github.io/quantiseqr/
GNU General Public License v3.0

Parallelize when possible? #10

Open federicomarini opened 3 years ago

federicomarini commented 3 years ago

Samples are currently processed one by one, so parallelizing that step could shorten runtimes significantly, especially when many samples are run at once.

BiocParallel might provide a nice and convenient way to do so.
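A minimal sketch of what per-sample parallelization with BiocParallel could look like. `deconvolve_one()` and `run_per_sample()` are hypothetical stand-ins, not functions from quantiseqr; only `bplapply()` and `SerialParam()` are actual BiocParallel calls.

```r
library(BiocParallel)

# Hypothetical stand-in for the per-sample deconvolution step;
# the real computation would go here.
deconvolve_one <- function(expr_vec) {
  c(total = sum(expr_vec), detected = sum(expr_vec > 0))
}

# Apply the per-sample function to each column of a genes x samples matrix.
# The matrix and the function are passed explicitly so that they are also
# available on workers that do not share the calling session's environment.
run_per_sample <- function(mix, BPPARAM = SerialParam()) {
  res <- bplapply(
    seq_len(ncol(mix)),
    function(j, mix, fun) fun(mix[, j]),
    mix = mix,
    fun = deconvolve_one,
    BPPARAM = BPPARAM
  )
  out <- do.call(rbind, res)
  rownames(out) <- colnames(mix)
  out
}

# Toy input: 3 genes x 2 samples (values filled column by column)
mix <- matrix(
  c(1, 0, 3, 4, 5, 0),
  nrow = 3,
  dimnames = list(c("g1", "g2", "g3"), c("s1", "s2"))
)
result <- run_per_sample(mix)
```

Swapping `SerialParam()` for a parallel backend such as `SnowParam(workers = 2)` would distribute the samples across workers without changing the rest of the code. Whether that pays off depends on how expensive each sample is relative to the overhead of starting workers.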

federicomarini commented 3 years ago

Related to this: I profiled the main function that runs quantiseq and noticed that the bottleneck is actually upstream of it, namely in the mapGenes function. After some in-depth debugging, I think the solution in https://github.com/federicomarini/quantiseqr/commit/e0a87313e74f4b4d163a343f699c58e6af4a63e5 should be robust enough. It may be worth porting to the current state of immunedeconv, so I am pinging @grst on this 😉

A further improvement would be to run the aggregation only on the rows that have duplicated row names, which should speed it up "massively enough" that we won't really need to parallelize. Happy to wrap up a tiny PR if you're all good on this!

grst commented 3 years ago

A dplyr group_by(gene_symbol) %>% summarise_all(sum) should be considerably faster than base R.
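As a rough illustration of the dplyr approach, assuming the expression data is a data frame with a `gene_symbol` column and one numeric column per sample (the toy data and column names below are made up):

```r
library(dplyr)

# Toy expression table: gene "A" appears twice
expr <- data.frame(
  gene_symbol = c("A", "B", "A", "C"),
  s1 = c(1, 2, 3, 4),
  s2 = c(10, 20, 30, 40)
)

# Sum all sample columns within each gene symbol
collapsed <- expr %>%
  group_by(gene_symbol) %>%
  summarise_all(sum) %>%
  as.data.frame()
```

In current dplyr, `summarise_all()` is superseded; `summarise(across(everything(), sum))` is the equivalent newer idiom.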

Happy to include the parallelized version into immunedeconv, but probably it's easiest to wait until this package is more or less ready and then port immunedeconv to use it as a dependency.

federicomarini commented 3 years ago

As of now no parallelization is done, just a conditional check - from my understanding, this aggregation needs to be done only if any rownames are duplicated.
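A sketch of that idea in base R, under the assumption that duplicated rows are collapsed by summing (as in the dplyr suggestion above). `collapse_duplicates()` is a hypothetical helper for illustration, not the code in the linked commit:

```r
# Collapse rows that share a row name, touching only the duplicated ones.
collapse_duplicates <- function(mat) {
  ids <- rownames(mat)

  # Conditional check: nothing to aggregate if all row names are unique
  if (anyDuplicated(ids) == 0) {
    return(mat)
  }

  # Row names that occur more than once
  dup_ids <- unique(ids[duplicated(ids)])
  is_dup <- ids %in% dup_ids

  # Aggregate only the duplicated rows; rowsum() sums rows within each group
  summed <- rowsum(mat[is_dup, , drop = FALSE], group = ids[is_dup])

  # Unique rows are kept as they are, followed by the aggregated rows
  rbind(mat[!is_dup, , drop = FALSE], summed)
}

# Toy input: gene "A" appears twice
mat <- matrix(
  c(1, 2, 3, 4, 10, 20, 30, 40),
  nrow = 4,
  dimnames = list(c("A", "B", "A", "C"), c("s1", "s2"))
)
collapsed_mat <- collapse_duplicates(mat)
```

Note that this changes the row order (unique rows first, then the aggregated ones), so any downstream code that relies on the original ordering would need to reorder the result.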

But as you said, it is probably best to give it time to settle here and then just use it in Imports.