KChen-lab / MEDALT

Inference of Minimal Event Distance Aneuploidy Lineage Tree based on single cell copy number profile
MIT License

Support for 10X / high N #12

Open anderswe opened 1 year ago

anderswe commented 1 year ago

Hi Fang and Qihan,

Very grateful for your work with MEDALT! Excited to try this out.

Do you have any recommendations for running it using 10X single cell data? i.e. datasets with high N and low read depth?

So far, I'm running out of memory (currently 180 GB on our institution's cluster) with anything larger than 2k cells or so.

Thanks! Anders

jpark27 commented 4 months ago

Hi, I have come across the same issue: 3 TB of memory is not enough to run on > 7k cells. Any chance @anderswe found a solution? Or could @jinzhuangdou comment on this memory issue?

jinzhuangdou commented 4 months ago

Are you using the latest version? We have made some optimizations for the memory issue.

jpark27 commented 4 months ago

Hi, @jinzhuangdou

I cloned the master branch a few weeks ago. Is there another version?

jpark27 commented 4 months ago

@jinzhuangdou Do you think it would help with troubleshooting if I shared the input file (e.g., infercnv output)?

Input file (17K cells; infercnvpy output): https://drive.google.com/file/d/1Osksu94leVSzvXjlLl1btnfr74mYAhsK/view?usp=drive_link

To read it: I had to change line 24 of dataTransfer.R to data=read.csv(inputfile,sep="\t",row.names=1) so that the file loads without error.
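Before a long run, it can help to confirm that an input file really parses as a tab-separated matrix with a header row and row labels in the first column, which is what the read.csv change above expects. The following is a minimal sketch using only the Python standard library; `matrix_shape` is a hypothetical helper name, not part of MEDALT, and it makes no assumption about whether rows are genes or cells.

```python
import csv


def matrix_shape(path, delimiter="\t"):
    """Return (n_rows, n_value_columns) for a delimited matrix file.

    Assumes the first line is a header and the first field of every
    following line is a row label. Raises ValueError if the data rows
    do not all have the same number of fields.
    """
    n_rows = 0
    n_cols = None
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        next(reader, None)  # skip the header line
        for line_no, row in enumerate(reader, start=2):
            if not row:
                continue  # ignore blank lines
            width = len(row) - 1  # exclude the row-label field
            if n_cols is None:
                n_cols = width
            elif width != n_cols:
                raise ValueError(
                    f"line {line_no}: expected {n_cols} values, found {width}"
                )
            n_rows += 1
    return n_rows, (n_cols or 0)
```

Calling `matrix_shape("my_input.txt")` (a placeholder file name) returns the number of data rows and value columns, so a file that was split on the wrong delimiter shows up as a single value column or as a ValueError.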

jinzhuangdou commented 3 months ago

Received, thank you. We are testing memory usage with your input data and will let you know once we have some ideas. Thanks

jinzhuangdou commented 3 months ago

Hi @jpark27 , could you test the new version that supports Python 3 to assess its memory usage? The main script is SC1_py_sctree.py

python3 SC1_py_sctree.py -P ./ -I ./example/scDNA.CNV.txt -D D -G hg19 -O ./example/outputDNA

jpark27 commented 3 months ago

Hi, @jinzhuangdou! Thank you so much for the suggestion. I tried the following command with an input file of similar size on LSF with 60 cores and 3 TB of memory. However, even after 24 hrs, it is stuck at step 2/3, as follows. Do you think it is normal for it to take this long (cf. the example scRNA.CNV.txt, which took < 1 min with the same setup)? Or is something wrong with the current input file structure?

python3 SC1_py_sctree.py -P ./ -I ~/BB18_indexed.txt -D R -G hg38 -O ~/outputRNA -W 200

[image attachment not preserved]

jinzhuangdou commented 3 months ago

Hi @jpark27, thank you for the update. Processing over 10K cells may require a large amount of memory, especially given the iterative construction of the MST across all cells. Could you use hierarchical clustering to identify different branches and then use MEDALT to build the local branch tree within each cluster? This strategy has the potential to significantly reduce memory usage while maintaining the integrity of the analysis.

jpark27 commented 3 months ago

Hi, @jinzhuangdou! It makes sense that this needs a lot of memory (I have currently set the maximum on our LSF, so I will leave it running and check on it over the next few days).

That's a good idea. I will split the dataset into small chunks (cluster by cluster) and run MEDALT on each. As I am not super bioinformatics-savvy, would you recommend any specific tool or Python package for doing such hierarchical clustering before running MEDALT?
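One commonly used option for this step is SciPy's `scipy.cluster.hierarchy` module. The sketch below is only an illustration of the split-then-run idea discussed in this thread, not something taken from MEDALT: it assumes the copy number matrix is already loaded with one row per cell (transpose first if the file stores cells as columns), and `split_cells_by_cluster` and `n_clusters` are hypothetical names. Ward linkage is an arbitrary choice here; other linkage methods may suit the data better.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def split_cells_by_cluster(cnv, cell_ids, n_clusters):
    """Group cells by hierarchical clustering of their copy number profiles.

    cnv        : array-like of shape (n_cells, n_features), one row per cell
                 (assumed orientation).
    cell_ids   : sequence of n_cells labels, in the same order as the rows.
    n_clusters : maximum number of flat clusters to cut the tree into.

    Returns a dict mapping cluster label -> list of cell ids.
    """
    profiles = np.asarray(cnv, dtype=float)
    # Build the dendrogram with Ward linkage on Euclidean distances.
    tree = linkage(profiles, method="ward")
    # Cut the dendrogram so that at most n_clusters flat clusters remain.
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    groups = {}
    for cell, label in zip(cell_ids, labels):
        groups.setdefault(int(label), []).append(cell)
    return groups
```

Each list of cell ids can then be used to subset the original matrix into one smaller input file per cluster, and MEDALT can be run on each file separately. Note that `linkage` itself stores pairwise distances between cells, so its memory use also grows quadratically with the number of cells.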

jpark27 commented 3 months ago

Hi, @jinzhuangdou! Hope you have been well. I have tried subsetting the input file* into hierarchical clusters as suggested and re-running the analysis, but it still seems stuck at the same step, even after a few days with large memory. Could there be a systemic issue with MEDALT handling >16K genes?

#####################################################

now running SC2_RR_dataTransfer.R

#####################################################

16146/16452 genes matched in ref_seq. saved file: 2_BB45_cluster5_bin_200.csv


Input file (17K cells; 16K genes infercnvpy output): https://drive.google.com/file/d/1Osksu94leVSzvXjlLl1btnfr74mYAhsK/view?usp=drive_link

Input file2 (0.3K cells; 16K genes infercnvpy output): https://drive.google.com/file/d/1arTjMpyZuj3s_NBnwglKv3lJj5xoaLsn/view?usp=drive_link