broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Clear guidelines needed for GermlineCNVCaller expected runtime, CPU usage and memory usage #6166

Open tfenne opened 5 years ago

tfenne commented 5 years ago

Documentation request

Tool(s) or class(es) involved

GermlineCNVCaller

Description

I'm trying to get a pipeline running to call germline CNVs on small cohorts (20-40) of PCR-free whole-genome samples sequenced to ~45X depth. I'm running into problems figuring out how widely to scatter the analysis and how to allocate resources.

It would be incredibly helpful to have some very clear guidelines about how the number of samples and the number of intervals within each scatter affect both runtime and memory usage. Here's what I've been able to infer from the WDL pipelines, tool docs and experimentation (though I suspect some of it is wrong):

  1. Memory usage is approximately proportional to the number of samples, number of intervals, number of bias covariates and max copy number. What the docs don't say is what the default number of bias covariates is, or how to take these numbers and project an approximate memory usage.

  2. It would appear that GermlineCNVCaller will, by default, attempt to use all CPU cores available on the machine. From the WDL I see that setting the environment variables MKL_NUM_THREADS and OMP_NUM_THREADS seems to control the parallelism? It would be nice if GermlineCNVCaller took a --threads option and set these itself before spawning the python process. (A sketch of how I'm setting these manually is at the end of this comment.)

  3. Runtime? It would be really nice to have some guidelines here, as I get wildly varying results depending on how I'm running. My experimentation is with a) 20 45X WGS samples, b) bin size = 500bp, c) running on a 96-core general purpose machine at AWS with 384GB of memory. My first attempt a) scattered the genome into 48 shards of approximately 115k bins each, representing ~50 Mb of genome, and b) ran 24 jobs concurrently but failed to set the environment variables to control parallelism. In that attempt the first wave of jobs was still running after 24 hours, only getting close to finishing the initial de-noising epoch, with 3/24 having failed due to memory allocation failures. My second attempt, now running, scattered the genome into 150 shards and is running 12 jobs at a time with 8 cores each and the environment variables set. On the second attempt it looks like the jobs will finish the first denoising epoch in < 1 hour each. That's far faster than the roughly 6x reduction in runtime you might expect if a) runtime is linear in the number of bins and b) runtime is proportional to 1/CPUs used.

Without doing a lot more experiments it's hard to tell whether the better runtime is due to less contention for resources (I can imagine 24 jobs each running 96 threads could degrade performance) or because runtime is super-linear in the number of bins.

I'm not asking for total precision, but the current docs are not really enough for anyone outside the GATK team to get the CNV caller up and running in an efficient manner.
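
For reference, here's a minimal sketch of how I'm now launching each scattered job with the environment variables set (the arguments below are placeholders rather than my full command line):

  # Pin the thread count used by the Python/MKL back end before launching GATK
  export MKL_NUM_THREADS=8
  export OMP_NUM_THREADS=8

  # Placeholder invocation; each job gets its own shard's interval list
  gatk GermlineCNVCaller \
      --run-mode COHORT \
      -L shard_0001.interval_list \
      -I sample1.counts.hdf5 \
      --contig-ploidy-calls ploidy-calls-dir \
      --output gcnv_shard_0001 \
      --output-prefix shard_0001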

droazen commented 5 years ago

@samuelklee ^^

samuelklee commented 5 years ago

Thanks for these questions, @tfenne, and glad you are experimenting with the workflow. Several Broad-internal groups are running various WGS and WES analyses and are seeing encouraging performance, so I’m looking forward to hearing feedback from you as well.

However, I am currently indisposed and may be out for the next month or so, but @mwalker174 (who has now taken over from me as CNV tech lead, with a focus on the germline workflows) and @asmirnov239 should be able to point you in the right direction and respond to you in more detail. In the meantime, you may find some pointers in various forum posts or issues here on GitHub.

We’re looking into improving documentation processes for runtime/requirements GATK-wide, which should help with this sort of thing in the future.

tfenne commented 5 years ago

Thanks @samuelklee. I have read through most of the forum posts I could find that reference GermlineCNVCaller or germline CNV calling in general. I've also read through the tutorial doc and poked around in the published WDL.

I've heard good things about the results of the GATK germline CNV caller, and thus would really like to get it performing well in my own hands. I think at this point I have it so that I should be able to generate calls for my 20 samples in 15-20 hours on a 96-core machine. If the results look better than the alternatives then I'll probably end up spending more time trying to figure out how best to run things to minimize runtime/cost while maintaining sensitivity. I'll keep you posted.

ldgauthier commented 5 years ago

@tfenne a Terra workspace prototype I started ran the cohort caller in 39 shards and took 2.5 hrs, which ended up being $3.25. This was on 1000G exomes where at least 80% of targets were covered to at least 20X. I'm happy to share it with you if you have a Firecloud/Terra account.

mwalker174 commented 5 years ago

@tfenne Most of your assumptions about scaling sound right, but we usually scatter across VMs, so I'm not sure how it will scale running on a single VM. Here are the parameters we have been using internally for WGS at a 2kbp bin size. For WGS we have not seen much benefit from using bias factors, so they are disabled, which should help speed things up considerably.

  "TrainGCNV.num_intervals_per_scatter": "5000",
  "TrainGCNV.do_explicit_gc_correction": "true",
  "TrainGCNV.gcnv_enable_bias_factors": "false",

  "TrainGCNV.mem_gb_for_determine_germline_contig_ploidy": "15",
  "TrainGCNV.mem_gb_for_gcnv_cohort_caller": "15",
  "TrainGCNV.cpu_for_gcnv_cohort_caller": "4",

  "TrainGCNV.cpu_for_determine_germline_contig_ploidy": "4",
  "TrainGCNV.gcnv_caller_update_convergence_threshold": "0.000001",
  "TrainGCNV.gcnv_class_coherence_length": "1000",
  "TrainGCNV.gcnv_convergence_snr_trigger_threshold": "0.2",
  "TrainGCNV.gcnv_interval_psi_scale": "0.000001",
  "TrainGCNV.gcnv_log_mean_bias_standard_deviation": "0.01",
  "TrainGCNV.gcnv_max_bias_factors": "1",
  "TrainGCNV.gcnv_max_calling_iters": "20",
  "TrainGCNV.ploidy_global_psi_scale": "0.05",
  "TrainGCNV.ploidy_mean_bias_standard_deviation": "1",

  "TrainGCNV.gcnv_depth_correction_tau": "10000",
  "TrainGCNV.gcnv_log_emission_sampling_median_rel_error": "0.001",
  "TrainGCNV.postprocessing_mem_gb": "15",

  "TrainGCNV.gatk_docker" : "broadinstitute/gatk:4.1.2.0",

  "TrainGCNV.gcnv_model_learning_rate" : 0.03,
  "TrainGCNV.gcnv_model_num_thermal_advi_iters" : 2500,
  "TrainGCNV.gcnv_model_max_advi_iter_first_epoch" : 5000,
  "TrainGCNV.gcnv_model_max_advi_iter_subsequent_epochs" : 200,
  "TrainGCNV.gcnv_model_max_training_epochs" : 50,
  "TrainGCNV.gcnv_model_min_training_epochs" : 5,
  "TrainGCNV.gcnv_model_convergence_snr_averaging_window" : 500,
  "TrainGCNV.gcnv_model_convergence_snr_countdown_window" : 10,
  "TrainGCNV.gcnv_model_cnv_coherence_length" : 1000,
  "TrainGCNV.gcnv_copy_number_posterior_expectation_mode" : "EXACT",

  "TrainGCNV.gcnv_learning_rate" : 0.03,
  "TrainGCNV.gcnv_num_thermal_advi_iters" : 250,
  "TrainGCNV.gcnv_max_advi_iter_first_epoch" : 5000,
  "TrainGCNV.gcnv_max_advi_iter_subsequent_epochs" : 200,
  "TrainGCNV.gcnv_max_training_epochs" : 50,
  "TrainGCNV.gcnv_min_training_epochs" : 5,
  "TrainGCNV.gcnv_convergence_snr_averaging_window" : 100,
  "TrainGCNV.gcnv_convergence_snr_countdown_window" : 10,
  "TrainGCNV.gcnv_cnv_coherence_length" : 1000,
  "TrainGCNV.gcnv_copy_number_posterior_expectation_mode" : "EXACT",

  "TrainGCNV.gcnv_log_emission_sampling_rounds" : 20,
  "TrainGCNV.gcnv_p_alt" : 0.000001,
  "TrainGCNV.gcnv_sample_psi_scale" : 0.000001,
  "TrainGCNV.ploidy_sample_psi_scale" : 0.001,
Jolleboll commented 2 years ago

Hello, I'd like to add that I've been running GermlineCNVCaller for several days now on a 256-core machine:

singularity exec image/gatk.sif gatk GermlineCNVCaller \
            --run-mode COHORT \
            --interval-merging-rule OVERLAPPING_ONLY \
            etc.

Should it take this long? Why? It is using 300 GB of memory, and intermittently uses all 256 cores. There are 30 WGS samples, and each sample's BAM file is around 50 GB in size, so coverage is normal.

Grateful for help!

asmirnov239 commented 2 years ago

Hi @Jolleboll, that depends on the number of intervals in your count matrix.

In general we ask that these types of questions be submitted to our community forum: https://gatk.broadinstitute.org/hc/en-us/community/topics. Please post your question there with the corresponding details about the number of intervals and the tool's standard output. Thank you!

samuelklee commented 2 years ago

@Jolleboll I suspect that you may not be sharding runs of gCNV in cohort mode over the genome, which we typically recommend. Perhaps you can check whether this is the case? See e.g. https://gatk.broadinstitute.org/hc/en-us/articles/360035531152--How-to-Call-common-and-rare-germline-copy-number-variants. As Andrey suggests, please feel free to post on the forums so that others in the community can benefit from the discussion!
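
Very roughly, the sharding looks something like the sketch below (paths, the shard count, and the exact set of arguments are placeholders; the tutorial linked above has the full invocations, and the IntervalListTools output naming may differ between versions):

  # Split the preprocessed/filtered interval list into shards without subdividing bins
  gatk IntervalListTools \
      --INPUT cohort.filtered.interval_list \
      --SUBDIVISION_MODE INTERVAL_COUNT \
      --SCATTER_COUNT 48 \
      --OUTPUT scatter_dir

  # Run GermlineCNVCaller in COHORT mode once per shard, restricted to that shard's intervals
  gatk GermlineCNVCaller \
      --run-mode COHORT \
      -L scatter_dir/temp_0001_of_48/scattered.interval_list \
      --interval-merging-rule OVERLAPPING_ONLY \
      -I sample1.counts.hdf5 \
      -I sample2.counts.hdf5 \
      --contig-ploidy-calls ploidy-calls-dir \
      --output cohort_calls_shard_0001 \
      --output-prefix shard_0001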

Jolleboll commented 2 years ago

Hello @asmirnov239 @samuelklee and thank you very much for replying so quickly!

It just finished minutes ago; it took six days. As you suspected, I did not split the intervals file into chunks, as I had no idea that was possible. The intervals were 1kb. Perhaps this explains it, though it was much, much faster for the WES data.

Thanks again!