broadinstitute / infercnv

Inferring CNV from Single-Cell RNA-Seq

Is there a way to roughly predict the CPU hours for infercnv_run? #214

Closed: pbrazda closed this issue 2 years ago

pbrazda commented 4 years ago

Hi!

I have a question about CPU hours needed to run inferCNV.

I managed to set up my pipeline with a downsampled version of my dataset (1,000 cells). That worked fine and the run completed in under 30 minutes.

In my "complete" infercnv object I have ~50k cells from 10x libraries. I am sending now this infercnv_run as a job on our HPC with 4 threads:

infercnv_run = infercnv::run(infercnv_object,
                             cutoff=0.1,
                             analysis_mode="subclusters",
                             out_dir="CNV_out",
                             cluster_by_groups=F,
                             window_length=101,
                             plot_steps=T,
                             denoise=T,
                             no_prelim_plot=F,
                             k_obs_groups=2,
                             HMM=T)

First, I requested 20 hours and got to the end of step 4; the job was killed during the ward.D2 clustering.

Next, I requested 48 hours. This run managed to pick up from the backups where the previous one stopped and got down to step 7, _define_signif_tumor_subclusters(), tumor: allobservations. Those 48 hours are now gone as well, so it got killed again.

Is there a way to roughly predict the hours for this job? Thanks!

GeorgescuC commented 4 years ago

Hi @pbrazda ,

It is hard to predict run time because it varies greatly with which options you use and with what your data looks like. In general, one of the most time-consuming steps is defining the subclusters (followed by the Bayesian filtering), which takes longer the more diverse your data is (on top of the normal increase with data size), and that is the step you are currently on. We are exploring ways of improving the subclustering accuracy and speed, but we don't know when that will be available.

One thing I would recommend, since you are using "plot_steps=T", is to use the very latest version of the code (master branch, or "docker pull trinityctat/infercnv:1.3.5"), as it greatly improves the speed of plotting at every step after the clustering is done (there was an issue where the clustering was recalculated for the plot even when it was already stored in the object).
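For reference, a minimal sketch of grabbing the master branch from within R (this assumes the remotes package is installed; the Docker image above works just as well):

    # install the development (master branch) version of infercnv from GitHub
    # assumes the 'remotes' package is available
    remotes::install_github("broadinstitute/infercnv")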

If each step does take less than 48h and that is your request limit, rather than waiting the full time and losing some progress every time it stops, you could use the "up_to_step" option with values ranging from your current step up to the last one (21) to run one step at a time, then quit and submit the next step as a new job.
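As a rough sketch (based on your call above, not a tested recipe): keep the same infercnv_object and out_dir so each job resumes from the saved backups, and raise up_to_step with every submission, for example:

    # one step per HPC job; the value of up_to_step below is only an example,
    # set it to the next step you want the job to stop after
    infercnv_run = infercnv::run(infercnv_object,
                                 cutoff=0.1,
                                 analysis_mode="subclusters",
                                 out_dir="CNV_out",   # same out_dir so the backups are reused
                                 cluster_by_groups=F,
                                 window_length=101,
                                 plot_steps=T,
                                 denoise=T,
                                 no_prelim_plot=F,
                                 k_obs_groups=2,
                                 HMM=T,
                                 up_to_step=8)        # stop after step 8; the next job uses 9, and so on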

Regards, Christophe.

pbrazda commented 4 years ago

Thank you, @GeorgescuC !

Actually, plot_steps=T is not crucial, as I only need the k=2 division. Is the final Infercnv.png still generated if I set plot_steps=F?

GeorgescuC commented 4 years ago

Hi @pbrazda ,

Yes, the final plot will always be generated unless you set "no_plot=TRUE".
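For example, a sketch of your call above with the per-step plotting disabled (no_plot keeps its default of FALSE, so the final figure is still written to out_dir):

    infercnv_run = infercnv::run(infercnv_object,
                                 cutoff=0.1,
                                 analysis_mode="subclusters",
                                 out_dir="CNV_out",
                                 cluster_by_groups=F,
                                 window_length=101,
                                 k_obs_groups=2,
                                 plot_steps=F,   # skip the intermediate plots
                                 no_plot=F,      # default; the final heatmap is still generated
                                 denoise=T,
                                 HMM=T)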

Regards, Christophe

Zifeng-L commented 4 years ago


Hi @pbrazda, you said that you submitted this infercnv_run as a job on your HPC with 4 threads. Can it actually be processed with multithreading?
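For reference, a minimal sketch: infercnv::run exposes a num_threads argument for its parallelized steps, so the call could request the same number of cores as the HPC job, for example:

    # sketch: ask for 4 worker threads for the parallelized steps
    infercnv_run = infercnv::run(infercnv_object,
                                 cutoff=0.1,
                                 analysis_mode="subclusters",
                                 out_dir="CNV_out",
                                 num_threads=4,
                                 HMM=T)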