Open Dzhan4 opened 2 years ago
Hi @Dzhan4 ,
The save/reload mechanism built into infercnv automatically checks for saved results from a previous run in the output folder and compares them with your current run's settings and input data (a checksum of the expression matrix). If they match, the backup is used as a starting point and the steps that were already processed are skipped automatically. It does not take the results of a given step as input and continue from there, because there is no internal tracking of which steps the process has finished. This ensures that the correct steps and settings have been run/used for the early steps (for example, setting HMM to TRUE requires spike-in data to be generated as early as step 3, even though the HMM itself is only run at step 17).
The log will still list the steps that are skipped; however, when a usable backup is found you should see a message at the beginning of the process informing you that the reload is happening: `using backup from step i`.
The object you provide as input to the run() method should always be the non-processed, newly created object (the one saved as `01_incoming_data.infercnv_obj`), because if no backup is found, the raw matrix needs to be present and used for the process.
What seems to be the issue in your case is that you have set `out_dir = tempfile()`, which is likely to change from one run to the next, so the backups that infercnv wants to reload and compare your current run against are not found, and the process starts over from the beginning (using the wrong expression matrix, since the one you provide is already partially processed and no longer contains the raw counts). Setting out_dir to a fixed path should solve the problem, but make sure to restart the run with a freshly created infercnv object, as your current backups may be in an inconsistent state from rerunning the same steps over.
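To illustrate the intended pattern, here is a minimal sketch of a batched run under that advice; the input file names, reference group name, and the run() arguments other than out_dir and up_to_step are placeholders/assumptions, not values taken from this thread:

```r
# Recreate the infercnv object from the raw counts in every job
# (input file names below are placeholders for your own data).
infercnv_obj <- infercnv::CreateInfercnvObject(
    raw_counts_matrix = "raw_counts.matrix",      # raw, unprocessed counts
    annotations_file  = "cell_annotations.txt",
    gene_order_file   = "gene_order.txt",
    delim             = "\t",
    ref_group_names   = c("normal")
)

# Keep out_dir pointing at the same fixed path in every job so the
# backups written by the previous job can be found and reloaded.
out_dir <- "infercnv_output"

# Job 1: run up to step 7.
infercnv_obj <- infercnv::run(infercnv_obj,
                              cutoff     = 0.1,
                              out_dir    = out_dir,
                              HMM        = TRUE,
                              up_to_step = 7)

# Job 2 (a later batch job): recreate the object exactly as above and
# call run() again with a higher up_to_step; steps already completed
# are skipped automatically when the saved backups match.
# infercnv_obj <- infercnv::run(infercnv_obj, cutoff = 0.1, out_dir = out_dir,
#                               HMM = TRUE, up_to_step = 11)
```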
Regards, Christophe.
Hello,
Thank you for maintaining such a valuable tool. I am running infercnv on a large scRNA dataset and I keep running out of time on my batch jobs. I read an earlier post that recommended using the up_to_step option to run a few steps at a time.
I've tried saving the infercnv object after completing a few steps and then running the next few steps in another job (for example, saving the object after step 7, then reloading it and setting up_to_step to 11). However, when I try to run the next steps, the process restarts from the beginning. Is there a way to skip the steps that have already been run? I'm currently using infercnv_1.6.0. Code below: