LieberInstitute / spatialLIBD

Code for the spatialLIBD R/Bioconductor package and shiny app
http://LieberInstitute.github.io/spatialLIBD/

[Feature Request] Enable BiocParallel-based execution of registration_wrapper steps #49

Closed berniejmulvey closed 11 months ago

berniejmulvey commented 1 year ago

Related to issue #48, the runtime for duplicateCorrelation can be on the order of 10+ hours for especially large datasets with dozens of samples and dozens of clusters. Splitting this process out to support a BiocParallel-compatible implementation, if possible, would make spatial registration against gold-standard but atlas-complexity datasets (e.g., the Allen Atlas whole mouse brain and human brain datasets preprinted in 2023) much more tractable.

lcolladotor commented 11 months ago

duplicate correlation

duplicateCorrelation() is a limma function, and I don't think we can parallelize it. While I know it can take a while to run, I'm a bit surprised that you have a use case where it takes 10 hours. Something like that happened (up to 3 days) with https://github.com/LieberInstitute/brainseq_phase2/blob/127135696e216061588307e14520003fc410fcbe/development/limma_dev.R#L140-L150, but that dataset had ~900 samples and thousands of exon-exon junctions, transcripts, etc. With ~10k genes and, say, 10 clusters across 20-30 samples (200 to 300 pseudo-bulk columns), I wouldn't expect it to take longer than an hour.
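For context, this is the standard limma pattern where duplicateCorrelation() sits: it estimates a single consensus correlation for a blocking variable, which is then reused in lmFit(). A minimal sketch, with illustrative object names (`logcounts`, `mod`, `sample_id` are placeholders, not the exact objects in registration_wrapper):

```r
## Sketch of the limma workflow where duplicateCorrelation() is the
## serial bottleneck. 'logcounts' is a genes x pseudo-bulk-samples
## matrix, 'mod' a design matrix, and 'sample_id' the blocking
## variable (e.g., one level per donor).
library(limma)

corfit <- duplicateCorrelation(logcounts, design = mod, block = sample_id)

## The single consensus correlation is then reused in the model fit:
fit <- lmFit(
    logcounts,
    design = mod,
    block = sample_id,
    correlation = corfit$consensus.correlation
)
fit <- eBayes(fit)
```

Because the correlation is estimated by iterating over every gene internally, the runtime grows with both the number of genes and the number of columns, which is why very large pseudo-bulk matrices hurt here.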

pseudo-bulking

In our experience (cc'ing @lahuuki here), the step that takes the longest to run is scuttle::aggregateAcrossCells() https://github.com/LieberInstitute/spatialLIBD/blob/be5be0e0354f02c5e3c349822fc466f625fc382b/R/registration_pseudobulk.R#L88. That function is not parallelized by default. Parallelizing it could be done, but it could blow up the memory usage.
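The pseudo-bulking call in question looks roughly like the following; `sce` and the colData column names are placeholders here, not the exact arguments registration_pseudobulk() uses:

```r
## Hypothetical pseudo-bulking step: sum counts within each
## sample x cluster combination of a SingleCellExperiment 'sce'.
library(scuttle)

sce_pseudo <- aggregateAcrossCells(
    sce,
    ids = colData(sce)[, c("sample_id", "cluster")]
)
## Result: one pseudo-bulk column per sample x cluster combination,
## with summed counts and collapsed colData.
```

Summing runs over the full cells x genes sparse matrix once per grouping, so memory rather than CPU tends to be the limiting resource when splitting this across workers.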

It seems like in the new tidy universe by Stefano Mangiola @stemangiola and company, their function has comparable performance https://x.com/steman_research/status/1695025208229548063?s=20. We've also heard in talks from Gabriel Hoffman @GabrielHoffman that he has been working on a fast pseudo-bulking function.

GabrielHoffman commented 11 months ago

Hi Leo, et al.

1) duplicateCorrelation() runs in cubic time in the number of samples, and makes the very strong assumption that the contribution of the random effect is the same across all genes. See Hoffman et al. 2021, Bioinformatics, describing the dream() method in the variancePartition package. dream() is substantially faster and relaxes this strong assumption. It is compatible with the limma workflow.
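A minimal sketch of the dream() workflow, for comparison with the duplicateCorrelation() approach; variable names (`counts`, `meta`, the `cluster` coefficient) are illustrative, not from spatialLIBD:

```r
## dream() fits a gene-level mixed model instead of a single
## consensus correlation; the random effect is written directly
## into the formula with lme4-style syntax.
library(variancePartition)
library(edgeR)

## 'counts' is a genes x samples matrix; 'meta' a data.frame of
## sample-level covariates, including the grouping factor 'sample_id'.
form <- ~ cluster + (1 | sample_id)

dge <- DGEList(counts)
dge <- calcNormFactors(dge)

## Precision weights estimated under the mixed model (parallelized
## internally via BiocParallel):
vobj <- voomWithDreamWeights(dge, form, meta)
fit  <- dream(vobj, form, meta)
fit  <- eBayes(fit)
```

After eBayes(), the fit object works with the usual limma downstream functions such as topTable(), which is what makes it a drop-in for the limma workflow.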

2) I have extended dream() to single-cell data in the dreamlet package. See Hoffman et al., 2023, bioRxiv, currently in revision.

This includes a fast, low-memory method to compute pseudobulk from an H5AD file using on-disk memory.
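A sketch of the dreamlet pseudo-bulking and model-fitting steps; `sce` and the colData column names (`cluster`, `sample_id`, `group`) are assumptions for illustration:

```r
## Pseudo-bulk a SingleCellExperiment by cluster and sample, then
## fit the mixed model separately within each cluster.
library(dreamlet)

pb <- aggregateToPseudoBulk(
    sce,
    assay      = "counts",
    cluster_id = "cluster",
    sample_id  = "sample_id"
)

## Normalize and filter each cluster-level assay, then fit:
res <- processAssays(pb, ~ (1 | sample_id) + group)
fit <- dreamlet(res, ~ (1 | sample_id) + group)
```

aggregateToPseudoBulk() is the fast pseudo-bulking step referenced above; it can also operate on on-disk-backed matrices, which is what keeps memory usage low for H5AD inputs.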

I designed these packages to be used by the broader community, so let me know if you run into any issues.

Best, Gabriel

berniejmulvey commented 11 months ago

@GabrielHoffman @lcolladotor I used dreamlet for this exact purpose and it worked flawlessly and very quickly--definitely would recommend!

(The dataset I was trying to analyze had brain tissue from 90-some different mice and a few hundred thousand cells across as many as 500-some clusters, depending on how deep in their cluster hierarchy I was looking.)