broadinstitute / infercnv

Inferring CNV from Single-Cell RNA-Seq

[Memory Shortage] Using random trees on ST dataset (>640 GB not enough) #615


jpark27 commented 11 months ago

Dear @GeorgescuC

Hi, George! Thank you so much for sharing this great tool (infercnv) and for your constant support and feedback to users. I am running it on a 10x Visium (15K spots) dataset with the random trees option (v1.17.0) as shown below, but it consistently crashes at step 7 on LSF with 640 GB / 60 CPUs.

```r
infercnv::run(bb4_infcnv_obj,
              cutoff = 0.1,
              out_dir = out_dir,
              cluster_by_groups = F, # or T
              HMM = T,
              denoise = T,
              plot_steps = F,
              analysis_mode = 'subclusters',
              tumor_subcluster_partition_method = 'random_trees',
              useRaster = F,
              num_threads = 60)
```

I tried suggestions like:

[1] `system("ulimit -s unlimited", intern=TRUE)` and `system("ulimit -a", intern=TRUE)`

[2] `useRaster = F` in the run command

[3] `options(expressions=500000)` before running the singularity image on LSF

but it still fails with a memory shortage. The end of the log shows:

```
Warning messages:
1: In asMethod(object) : sparse->dense coercion: allocating vector of size 1.1 GiB
2: In asMethod(object) : sparse->dense coercion: allocating vector of size 1.1 GiB
Execution halted
Terminated
```

Could you have a look at the error/log messages and give some feedback on how I can resolve this? If it helps, I could share the data through a private channel (e.g., Google Drive) so we can look at it together. (Perhaps the dataset needs to be split, another infercnv version used, etc.) Any comments would be appreciated.

log.txt command.txt

best wishes, Jun

GeorgescuC commented 10 months ago

Hi @jpark27 ,

I see in the log that you are running the random trees subclustering using 60 threads, which is likely the reason why the memory usage is so high. Unless you have a strong reason to use the random trees method for subclustering rather than the new default that uses Leiden, I would simply remove the `tumor_subcluster_partition_method='random_trees'` argument. That would require far less memory, run much faster, and probably be more accurate as well. In that case, the argument you will most likely want to tune is `leiden_resolution`.
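A minimal sketch of the suggested Leiden-based run, reusing the object and arguments from the original command (the `leiden_resolution` value shown is only an illustrative starting point, not a recommended setting):

```r
# With analysis_mode='subclusters' and no tumor_subcluster_partition_method,
# infercnv falls back to its default Leiden subclustering.
infercnv::run(bb4_infcnv_obj,
              cutoff = 0.1,
              out_dir = out_dir,
              cluster_by_groups = F,
              HMM = T,
              denoise = T,
              plot_steps = F,
              analysis_mode = 'subclusters',
              leiden_resolution = 0.05,  # assumption: tune per dataset
              num_threads = 60)
```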

If, however, you have a strong reason to use the random trees method (since I see your script mentions uphyplot2), I would try using fewer threads to reduce the number of copies of the data stored in memory during the process. Also, keep in mind that this method has a tendency to split the cells into 4 clusters even when there is more diversity.
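If random trees must be kept, a hedged sketch of the reduced-thread variant (the thread count of 8 here is only an example, not a tested recommendation):

```r
# Same call as before, but with fewer worker threads so fewer
# dense copies of the expression matrix are held in memory at once.
infercnv::run(bb4_infcnv_obj,
              cutoff = 0.1,
              out_dir = out_dir,
              cluster_by_groups = F,
              HMM = T,
              denoise = T,
              plot_steps = F,
              analysis_mode = 'subclusters',
              tumor_subcluster_partition_method = 'random_trees',
              useRaster = F,
              num_threads = 8)  # assumption: lower from 60; trade runtime for memory
```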

Regards, Christophe.