chriscainx / mnnpy

A Python implementation of MNN (Mutual Nearest Neighbors) batch correction.
BSD 3-Clause "New" or "Revised" License

No response at "Performing cosine normalization..." #9

Open zacharylau10 opened 6 years ago

zacharylau10 commented 6 years ago

Hi, thank you for this awesome implementation of MNN in Python, it's great work! When I run MNN on my data (~30,000 genes × ~60,000 cells), with either all genes or a specified set of HVGs, it seems to get stuck for some reason, without any hint as to why. At the start of the run it spawns as many worker processes as my CPU has cores, but only 2 or 3 of them are ever active, and memory usage climbs to 300 GB. After that, all processes sleep with no CPU usage, and it stays that way overnight (>12 h). Even when I downsample to ~15,000 cells, with all genes or with HVGs (~5,000 genes), the memory cost is still huge and it stalls at "Performing cosine normalization".

1. Could you give some suggestions for solving these problems? 2. Could you share the script behind the claim in the README, "Finishes correcting ~50000 cells/19 batches * ~30000 genes in ~12h on a 16 core 32GB mem server"? I want to make sure my script is correct.

chriscainx commented 6 years ago

Thank you! The hang at cosine normalization has been reported repeatedly, possibly due to Python's multiprocessing. I will release a Cython-optimized version, hopefully this weekend, to solve it. Meanwhile, could you try mnnpy.settings.normalization = 'seq' to switch to sequential normalization and see if the problem remains? Regarding 2: I used the exact script in the README, only with more adatas:

```python
corrected = mnnpy.mnn_correct(sample1, sample2, sample3, var_subset=hvgs, batch_categories=["1", "2", "3"])
adata = corrected[0]
```

Since the scaled genes other than the HVGs are usually not needed in the subsequent steps, you could do

```python
sample1 = sample1[:, hvgs]  # likewise for sample2 and sample3
corrected = mnnpy.mnn_correct(sample1, sample2, sample3, batch_categories=["1", "2", "3"])
adata = corrected[0]
```

to significantly reduce the amount of computation.

zacharylau10 commented 6 years ago

I set mnnpy.settings.normalization = 'seq', but it looks the same. When I use HVGs (~2,000 genes) mnnpy works well, but when I increase to ~5,000 genes, mnnpy still gets stuck at cosine normalization.

Now I am preparing to re-run my data with the Intel Python Distribution you suggested.

zacharylau10 commented 6 years ago

Hi, I think I may have figured it out. Because the dataset is large, scanpy stores the data as a sparse matrix, and the cosine normalization can't handle sparse matrices, which leads to the huge memory cost and the stall at this step. So I just revised mnnpy/mnnpy/utils.py line 33 from

```python
datas = [data.astype(np.float32) for data in datas]
```

to

```python
datas = [data.toarray().astype(np.float32) for data in datas]
```

This seems to have solved the problem; maybe you could test it.
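As a sanity check outside of mnnpy, the effect of the fix can be sketched as follows. This is a minimal, hypothetical illustration using NumPy/SciPy: `densify` and `cosine_normalize` are stand-ins for the logic around utils.py line 33, not mnnpy's actual API.

```python
import numpy as np
from scipy.sparse import csr_matrix

def densify(mat):
    """Return a dense float32 ndarray whether the input is sparse or dense.

    scipy sparse matrices expose .toarray(); the one-line patch above
    adds exactly this conversion before casting to float32.
    """
    if hasattr(mat, "toarray"):
        mat = mat.toarray()
    return np.asarray(mat, dtype=np.float32)

def cosine_normalize(mat):
    """Scale each row (cell) to unit L2 norm, as cosine normalization does."""
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged
    return mat / norms

# A sparse expression matrix now passes through normalization cleanly:
x = csr_matrix(np.array([[3.0, 4.0], [0.0, 0.0]]))
normed = cosine_normalize(densify(x))
```

Densifying up front lets the row-wise normalization run on a plain ndarray; the trade-off is the memory needed to hold the dense matrix, which may explain why this path was costly for very large datasets.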

zacharylau10 commented 6 years ago

I just saw that you're at Peking University. Could we connect on WeChat? I'm doing my PhD at Tongji University.

chriscainx commented 6 years ago

Haha, nice! My WeChat is 17600716991.