caokai1073 / UnionCom

The Software of UnionCom Algorithm
MIT License

Run time and dimensionality reduction #3

Open sroyyors opened 3 years ago

sroyyors commented 3 years ago

hi, I am running UnionCom to integrate two datasets, one with 5k cells and one with 50k cells. I gave it the full feature matrices and it was going slowly, so I decided to reduce the dimensionality to 10 and 15, respectively. But now it is even slower and not really making any progress. I am wondering if you have an idea of the expected runtime, and whether you recommend doing any dimensionality reduction. I used NMF to reduce the dimensionality.
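
For concreteness, here is a minimal sketch of the setup I'm describing (random matrices stand in for my real data; the `UnionCom(...).fit_transform(dataset=[...])` call follows this repo's README, and the NMF ranks match the 10/15 reduction above):

```python
import numpy as np
from sklearn.decomposition import NMF
from unioncom.UnionCom import UnionCom  # import path per the repo README

# Stand-ins for my real count matrices (cells x features, non-negative).
rng = np.random.default_rng(0)
X1 = rng.random((5_000, 2_000))   # dataset 1: 5k cells
X2 = rng.random((50_000, 3_000))  # dataset 2: 50k cells

# NMF to 10 and 15 dimensions, as described above.
X1_red = NMF(n_components=10, init="nndsvda", max_iter=300).fit_transform(X1)
X2_red = NMF(n_components=15, init="nndsvda", max_iter=300).fit_transform(X2)

uc = UnionCom()
integrated = uc.fit_transform(dataset=[X1_red, X2_red])
```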

sroyyors commented 3 years ago

hi, sorry, to clarify: the case where I saw the algorithm making some progress had 2k and 20k cells, but with all the features. When I went to 5k and 50k cells, things stopped moving forward. Just wanted to clarify that dimensionality reduction on the 5k and 50k datasets did not help either.

caokai1073 commented 3 years ago

Hi, for distance-based or kernel-based algorithms, it is difficult to scale to very large datasets with up to ~10^6 cells, because the computational complexity depends on the number of samples. We are still working on this. But I think 5k or 50k cells can be handled by UnionCom.
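
To make the scaling concrete, some rough arithmetic (assuming a dense float64 pairwise distance matrix; this is back-of-the-envelope, not a profile of UnionCom itself):

```python
# Rough memory cost of a dense n x n float64 pairwise distance matrix.
for n in (5_000, 50_000):
    gib = n * n * 8 / 2**30   # 8 bytes per float64
    print(f"{n} cells -> {gib:.1f} GiB")
# 5000 cells -> 0.2 GiB
# 50000 cells -> 18.6 GiB
```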

Here are some ideas (a combined sketch of ideas 2-4 follows the list):

  1. Because UnionCom involves a lot of matrix operations, it can be accelerated by an efficient GPU device. If you have a GPU, you can give it a try.
  2. You can set the parameter "log_pd" to "1". This forces the program to print its progress at every step it runs.
  3. You can first randomly sample some cells (e.g., 1k cells), run UnionCom on the subsample, and see how efficient it is.
  4. Besides, have you tried other dimensionality reduction methods such as PCA?
  5. We recently developed a new framework named Pamona for single-cell multi-omics integration, which is based on optimal transport. Pamona can be computed efficiently on a CPU. If you are interested, you can give it a try, too. (https://github.com/caokai1073/Pamona)
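
Putting ideas 2-4 together, a minimal sketch (random matrices stand in for the real data; passing `log_pd` to the `UnionCom` constructor follows the current package, but treat the exact parameter names as something to check against your installed version; idea 1 needs no code change here if UnionCom's PyTorch backend detects your GPU automatically):

```python
import numpy as np
from sklearn.decomposition import PCA
from unioncom.UnionCom import UnionCom

rng = np.random.default_rng(0)
X1 = rng.random((5_000, 2_000))   # stand-ins for the real matrices
X2 = rng.random((50_000, 3_000))

# Idea 3: randomly subsample ~1k cells per dataset to gauge runtime first.
idx1 = rng.choice(X1.shape[0], size=1_000, replace=False)
idx2 = rng.choice(X2.shape[0], size=1_000, replace=False)

# Idea 4: PCA instead of NMF for the dimensionality reduction.
X1_pca = PCA(n_components=30).fit_transform(X1[idx1])
X2_pca = PCA(n_components=30).fit_transform(X2[idx2])

# Idea 2: log_pd=1 prints progress at every step, so you can see it move.
uc = UnionCom(log_pd=1)
integrated = uc.fit_transform(dataset=[X1_pca, X2_pca])
```

If the 1k-cell subsample runs in reasonable time, you can scale the sample size up gradually to estimate how the runtime grows before committing to the full 5k/50k run.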