Open LucasESBS opened 6 years ago
Thanks for the encouragement Lucas!
I didn’t encounter this problem during my tests, which looks odd... It is definitely not related to the number of HVGs, though: in my case I used >5000 HVGs and no error occurred. This looks more like a Python-internal issue; maybe it has something to do with the Numba JIT system. I will look into this and keep you posted.
Regarding time consumption: if you input all the genes (g) but subset with HVGs (h), the algorithm effectively runs twice, costing roughly m*n*g + m*n*h (where m and n are the cell counts of batch 1 and batch 2), because the whole gene space needs to be projected using the MNNs calculated on h. Computing the MNNs uses only h, which is fast; when h is small, most of the time is spent on the all-gene computation, so changing h will not affect the running time very much.
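A back-of-the-envelope sketch of that cost model (`pair_cost` is a hypothetical helper for illustration, not part of mnnpy; the batch sizes are taken from the AnnData objects mentioned later in this thread):

```python
# Rough cost model from the explanation above: correcting all g genes while
# computing MNNs on h HVGs costs ~ m*n*h + m*n*g pair operations, so
# shrinking h barely helps when g >> h.
def pair_cost(m, n, g, h=None):
    """Approximate work: MNN search on h genes plus correction on g genes."""
    if h is None:
        h = g  # no var_subset: MNN search and correction use the same genes
    return m * n * h + m * n * g

m, n = 15992, 13998                        # example batch sizes
full = pair_cost(m, n, g=41861, h=500)     # all genes, 500-HVG subset
hvg_only = pair_cost(m, n, g=500)          # genes pre-subset to 500 HVGs
print(full / hvg_only)                     # ~42x more work keeping all genes
```

The ratio depends only on the gene counts, (h + g) / (2h), so with ~42k genes and 500 HVGs the all-gene run does roughly 42 times the work of a pre-subset run.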
If you want to speed it up, a good plan is to subset the genes to the HVGs first and then run the correction with no subsetting, since the other genes are typically unnecessary in later steps (PCA, clustering) anyway... In that case the cost is m*n*h, which I think is the most efficient option.
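A minimal sketch of that pre-subsetting idea, using plain NumPy arrays in place of AnnData objects (the variance-based HVG selection here is a toy stand-in for however you actually picked your HVGs):

```python
import numpy as np

rng = np.random.default_rng(0)
batch1 = rng.random((100, 2000))   # cells x genes, batch 1
batch2 = rng.random((80, 2000))    # cells x genes, batch 2

# Toy HVG selection: take the 500 highest-variance genes of batch 1.
hvg_idx = np.argsort(batch1.var(axis=0))[-500:]

# Subset both batches to the same HVG columns *before* correction,
# then run the correction on these matrices with no further subsetting.
batch1_hvg = batch1[:, hvg_idx]
batch2_hvg = batch2[:, hvg_idx]
print(batch1_hvg.shape, batch2_hvg.shape)   # (100, 500) (80, 500)
```

With AnnData objects the same idea is a column subset (e.g. `adata[:, hvg_mask]`) applied to every batch before calling mnn_correct, instead of passing the full objects with a var_subset.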
Actually, what are the specs of your machine? It’s possible that your problem has something to do with them.
Hi Lucas, could you update to 0.1.9.3 and check if the error occurs? Thank you!
Thank you for your reply! Sorry for not getting back to you earlier. I tried updating to the newest version, but the error is unfortunately still there. The specs of my machine are:
Linux x86_64 GNU/Linux
MemTotal: 527818508 kB
Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz (56 cores)
Hope that helps, Thanks again for your help!
I ran into this as well. It appears to be a multiprocessing issue. If I set mnnpy.settings.normalization = "seq", things proceed past the cosine normalization step.
I have 3 AnnData objects:
AnnData object with n_obs × n_vars = 15992 × 41861
AnnData object with n_obs × n_vars = 13998 × 41861
AnnData object with n_obs × n_vars = 13325 × 41861
I can get it to run in parallel if I subset the AnnData objects:
corrected = mnnpy.mnn_correct(a2data[0:1000], b2data[0:1000], b3data[0:1000], var_subset=hvg, batch_categories=["a2", "b2", "c2"])
Hello,
First, thank you for this awesome implementation of MNN in Python, great work! I was looking forward to trying it and comparing it with Seurat's CCA on my 5 batches, but got the following error when running the mnn_correct function:
I used 500 highly variable genes; is that too many? Could there be another reason? In general, how does the computational time scale with the number of genes selected? It would be nice to have an idea of how many HVGs to include for a decent running time without losing too much information.
Thank you, and once again congrats for your work!