georgiesamaha opened 8 months ago
Takeaways from discussion with Cali Willet about benchmarking steps
I think the first thing would be to run each step at the settings Tracy last had them at, and then try a tweak here and there. For example, if she had 1 CPU and 4 GB mem for a GATK task, try running that setting over different interval counts.
Does your workflow check the samples for coverage, either via a required field in the sample inputs file or via a pre-processing step within the workflow? Definitely relevant. Having the user provide it in the config is one option, but this can be a pain for the user. You can deduce it very easily in a number of ways, for example with BAM coverage tools, raw estimates from FastQC, etc.
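To illustrate the "deduce it easily" option: a minimal sketch of estimating mean coverage from read counts instead of requiring a config field. The function name and all numbers here are illustrative, not part of the workflow.

```python
# Sketch: deduce approximate mean coverage from mapped read counts
# (e.g. taken from samtools idxstats/flagstat output) rather than
# asking the user to supply it in the config. Illustrative only.

def estimate_coverage(mapped_reads: int, read_length: int, genome_size: int) -> float:
    """Rough mean coverage: total mapped bases / genome size."""
    return mapped_reads * read_length / genome_size

# e.g. ~600M 150 bp reads over a ~3.1 Gb human genome is roughly 29x
cov = estimate_coverage(600_000_000, 150, 3_100_000_000)
print(round(cov))  # 29
```

In practice a dedicated coverage tool would be more accurate, but a crude estimate like this is enough to pick a walltime/mem bracket.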
Most T/N datasets are coming in at 30x/60x, but some are 45x/90x, 30x/90x, etc. This may change (will change!) as sequencing gets even cheaper, so making it a feature now will save trouble later. The coverage value will mainly impact the walltime and memory requested, and of course, if the workflow is to be run on cloud, having the coverage both benchmarked and incorporated into the workflow will help the user budget.
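As a rough illustration of folding coverage into the resource requests: a sketch assuming walltime scales linearly with coverage. The baseline numbers and headroom factor are made up, not measured benchmarks.

```python
# Sketch: scale a benchmarked walltime request by sample coverage,
# assuming walltime grows roughly linearly with coverage.
# BASELINE values are placeholders, not real benchmark results.

BASELINE = {"coverage": 30, "walltime_hr": 2.0, "mem_gb": 4}

def scaled_request(coverage: int, headroom: float = 1.2) -> dict:
    factor = coverage / BASELINE["coverage"]
    return {
        "walltime_hr": round(BASELINE["walltime_hr"] * factor * headroom, 1),
        "mem_gb": BASELINE["mem_gb"],  # mem is often flat in coverage; benchmark to confirm
    }

print(scaled_request(60))  # {'walltime_hr': 4.8, 'mem_gb': 4}
```

Whether memory really stays flat (and whether walltime is truly linear) is exactly what the benchmarking runs should establish per task.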
Gadi usage script

If you add this line to your `~/.bashrc`, you can run it simply:

```
alias usage='perl /g/data/er01/HPC_usage_reports/gadi_usage_report_v1.1.pl'
```

https://teams.microsoft.com/l/message/19:0dc09393-cd81-4909-9ee7-2070f4a3b48a_bbddd589-9092-40df-ad27-03047644fcbf@unq.gbl.spaces/1705623618471?context=%7B%22contextType%22%3A%22chat%22%7D
Interval sizes: back to the intervals question - 3200 was a good value for making our NCMAS applications look cool. You don't need to stick with that, but I wouldn't go any smaller than that, as you noted re the minimum Mb size. It may be pertinent to revisit the interval number: take a look at which job and which interval is the slowest, and decide whether a decrease is warranted. I would not want to see any single scatter-gather step with a walltime > 1 hour.
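A back-of-envelope check on the interval count, assuming roughly equal intervals over a ~3200 Mb genome and walltime linear in interval size. The per-Mb rate is a placeholder you'd take from your own benchmarking, not a measured value.

```python
# Sketch: sanity-check a scatter interval count against the one-hour
# per-interval walltime ceiling mentioned above. Assumes roughly equal
# intervals and walltime linear in Mb; the rate below is a placeholder.

import math

GENOME_MB = 3200  # approximate human genome size in Mb

def interval_size_mb(n_intervals: int) -> float:
    return GENOME_MB / n_intervals

def min_intervals(walltime_hr_per_mb: float, ceiling_hr: float = 1.0) -> int:
    # smallest scatter count that keeps each interval under the ceiling
    return math.ceil(GENOME_MB * walltime_hr_per_mb / ceiling_hr)

print(interval_size_mb(3200))  # 1.0 Mb per interval at the current setting
print(min_intervals(0.5))      # 0.5 hr/Mb would need at least 1600 intervals
```

In practice the slowest interval (not the average) is what matters, so you'd take the rate from the worst-performing interval in the benchmark runs.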
To summarise all the talk on benchmarking, there are a few overarching goals:
1) Determine the optimal resources to request for a job or task given the nature of the data
Description: Document workflow performance benchmarks in README. Execute workflow with a full 60x/30x human T/N pair at NCI Gadi.
Activities
- [x] Index sample BAMs (reference: `/g/data/er01/SIH-HPC-WGS/Reference/hs38DH.fasta`, generated using nf-core/sarek). Indexes generated in `/scratch/er01/ndes8648/pipeline_work/nextflow/INFRA-83-Somatic-ShortV/Somatic-shortV-nf_noEmit/Somatic-shortV-nf_start_2024_main/big_bams_for_benchmarking/`
- [x] Execute workflow using Gadi script. A few runs were executed by changing the number of intervals (there are no multi-threading steps in the Somatic-shortV pipeline, so the number of intervals is one of the parameters to be tweaked). Used the script `perl /g/data/er01/HPC_usage_reports/gadi_usage_report_v1.1.pl`, but this gives an overall (not task-specific) view.
- [ ] Collect benchmarks at the task level using `gadi_nfcore_report.sh`. I will use this script. I am not too sure at this time how we view all Mutect2 tasks (scattered) together as a composite resource view - will check with someone.
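One way to get a composite view of the scattered Mutect2 tasks would be to group the Nextflow trace rows by process name. This is only a sketch: the `realtime_s` column name and the `NAME (tag)` format are assumptions about how the trace is configured, and the demo data is fabricated.

```python
# Sketch: roll scattered Mutect2 tasks into one composite resource view
# by grouping Nextflow trace rows on process name. Column names and the
# seconds-based duration column are assumptions; adjust to your trace.

import csv
from collections import defaultdict
from io import StringIO

def summarise_trace(trace_tsv: str) -> dict:
    """Sum walltime per process, collapsing scattered tasks like
    'MUTECT2 (chr1)' and 'MUTECT2 (chr2)' under 'MUTECT2'."""
    totals = defaultdict(float)
    for row in csv.DictReader(StringIO(trace_tsv), delimiter="\t"):
        process = row["name"].split(" (")[0]        # strip the scatter tag
        totals[process] += float(row["realtime_s"])  # assumed seconds column
    return dict(totals)

demo = (
    "name\trealtime_s\n"
    "MUTECT2 (chr1)\t120\n"
    "MUTECT2 (chr2)\t180\n"
    "LEARN_ORIENTATION\t60\n"
)
print(summarise_trace(demo))  # {'MUTECT2': 300.0, 'LEARN_ORIENTATION': 60.0}
```

The same grouping could be extended to peak memory (max instead of sum) to decide the composite mem request for the scattered step.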
- [ ] Add benchmarks in table format to README. Structure as per existing SIH Gadi workflows
- [ ] Add description of the compute environment it was run in, as per existing SIH Gadi workflows
Requirements
- [ ] Resolved generation of own intervals
- [ ] Workflow README complete