Sydney-Informatics-Hub / Somatic-shortV-nf

A Nextflow workflow for calling somatic short variants using GATK
GNU General Public License v3.0

Benchmark with human T/N samples #4

Open georgiesamaha opened 8 months ago

georgiesamaha commented 8 months ago

Description: Document workflow performance benchmarks in README. Execute workflow with a full 60x/30x human T/N pair at NCI Gadi.

Activities

Requirements

nandan75 commented 7 months ago

Takeaways from discussion with Cali Willet about benchmarking steps

I think the first thing would be to run each step at the settings Tracy last had them at, and then try a tweak here and there, e.g. if she had 1 CPU and 4 GB mem for a GATK task, try:

Different numbers of intervals to run that setting over
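
As a concrete starting point, the baseline request for such a task on Gadi might look like the PBS directives below. This is a sketch only: the queue, walltime, and project placeholder are assumptions, and a real Gadi script would also carry storage flags specific to the run.

```bash
#!/bin/bash
# Hypothetical baseline request mirroring the 1 CPU / 4 GB example above.
# Bump ncpus/mem one notch at a time between runs and compare walltimes.
#PBS -q normal
#PBS -P <project>
#PBS -l ncpus=1
#PBS -l mem=4GB
#PBS -l walltime=02:00:00
```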

Does your workflow check the samples for coverage, either via a required field in the sample inputs file or via a pre-processing step as part of the workflow? Definitely relevant. Having the user provide it in the config is one option, but this can be a pain for the user. You can deduce it very easily by a number of methods, for example BAM coverage tools, raw estimates from FastQC, etc.
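
For instance, a quick pre-processing estimate can come straight from the BAM. A minimal sketch with samtools (the file name is a placeholder; it takes a length-weighted average of the per-contig mean depths that `samtools coverage` reports):

```bash
# samtools coverage prints one row per contig; column 7 is meandepth.
# Weight each contig's mean depth by its length for a genome-wide figure.
samtools coverage tumour.bam \
  | awk 'NR > 1 { len = $3 - $2 + 1; sum += $7 * len; tot += len }
         END { printf "%.1fx\n", sum / tot }'
```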

Most T/N datasets are coming in at 30x/60x, but some are 45x/90x, 30x/90x, etc. And this may change (will change!) as sequencing gets even cheaper, so making it a feature now will save trouble later. The coverage value will mainly impact the requested walltime and memory, and of course if the workflow is to be run on the cloud, having the coverage both benchmarked and incorporated into the workflow will help the user budget.
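
Once a benchmark point exists, the budgeting step is simple linear scaling. A back-of-envelope sketch (the baseline numbers here are placeholders, not measured values, and linear scaling is itself an assumption these benchmarks would verify):

```bash
# Hypothetical benchmark point: a 60x tumour measured at 4 walltime hours.
# Assume roughly linear scaling with coverage when budgeting a 90x run.
baseline_cov=60
baseline_hours=4
target_cov=90
echo "scale=1; $baseline_hours * $target_cov / $baseline_cov" | bc   # prints 6.0
```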

Gadi usage script

If you add this line to your `.bashrc`, you can run it simply:

`alias usage='perl /g/data/er01/HPC_usage_reports/gadi_usage_report_v1.1.pl'`

(Teams thread: https://teams.microsoft.com/l/message/19:0dc09393-cd81-4909-9ee7-2070f4a3b48a_bbddd589-9092-40df-ad27-03047644fcbf@unq.gbl.spaces/1705623618471?context=%7B%22contextType%22%3A%22chat%22%7D)
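
With the alias in place, a new shell (or a re-sourced `.bashrc`) makes the report a one-word command. A sketch only: whether the script expects any arguments is defined inside the perl file itself.

```bash
source ~/.bashrc   # pick up the new alias
usage              # invokes gadi_usage_report_v1.1.pl
```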

Interval sizes

Back to the intervals question: 3200 was a good value for making our NCMAS applications look cool. You don't need to stick with that, but I wouldn't go any smaller than that, as you noted re the minimum Mb size. It may be pertinent to revisit the interval number, take a look at which job and which interval is the slowest, and decide if you think a decrease is warranted. I would not want to see any single scatter-gather step with a walltime > 1 hour.
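
For reference, the interval count is typically set when the scatter intervals are generated. A minimal sketch with GATK's SplitIntervals (reference and output paths are placeholders); the slowest interval can then be identified from the per-task logs or the Nextflow trace:

```bash
# Split the reference into 3200 scatter chunks for the scatter-gather steps.
gatk SplitIntervals \
  -R reference.fasta \
  --scatter-count 3200 \
  -O intervals_dir/
```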

To summarise all the talk on benchmarking, there are a few overarching goals:

1) Determine the optimal resources to request for a job or task given the nature of the data

nandan75 commented 7 months ago

Description: Document workflow performance benchmarks in README. Execute workflow with a full 60x/30x human T/N pair at NCI Gadi.

Activities

  • [x] Index sample BAMs
  • [x] Execute workflow using Gadi script

  • A few runs were executed, changing the number of intervals (there are no multi-threaded steps in the Somatic-shortV pipeline, so the number of intervals is one of the main parameters to tweak).

  • Used the script perl /g/data/er01/HPC_usage_reports/gadi_usage_report_v1.1.pl, but this gives an overall (not task-specific) view.

  • [ ] Collect benchmarks at the task level using gadi_nfcore_report.sh (see the trace sketch after this list)

  • I will use this script.

  • I am not sure at this time how we can view all the scattered Mutect2 tasks together as a composite resource view; will check with someone. A rough aggregation approach is sketched after this list.

  • [ ] Add benchmarks in table format to README. Structure as per existing SIH Gadi workflows

  • [ ] Add description of the compute environment it was run in as per existing SIH Gadi workflows
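
For the task-level collection and the composite Mutect2 view flagged above, Nextflow's built-in trace file is one option alongside gadi_nfcore_report.sh. A sketch under assumptions: the profile name is illustrative, and the column positions assume Nextflow's default trace fields.

```bash
# Capture per-task metrics (name, duration, %cpu, peak_rss, ...) in trace.txt
nextflow run main.nf -profile gadi -with-trace trace.txt

# Composite view of the scattered Mutect2 tasks. With the default trace
# columns: name is field 4, realtime 9, %cpu 10, peak_rss 11.
grep 'Mutect2' trace.txt | cut -f4,9,10,11
```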

Requirements

  • [ ] Resolve generation of own intervals
  • [ ] Workflow README complete