georgiesamaha opened 8 months ago
Takeaways from discussion with Cali Willet about benchmarking steps
I think the first thing would be to run each step at the settings Tracy last had them at, and then try a tweak here and there. For example, if she had 1 CPU and 4 GB mem for a GATK task, try running that setting over different interval counts.
Does your workflow check the samples for coverage, either via a required field in the sample inputs file or via a pre-processing step within the workflow? Definitely relevant. Having the user provide it in the config is one option, but this can be a pain for the user. You can deduce it very easily in a number of ways, for example with BAM coverage tools, raw estimates from FastQC, etc.
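To illustrate the "deduce it easily" option: a minimal sketch of estimating mean coverage from read counts instead of requiring a config field. The function name and all numbers here are illustrative, not part of the workflow.

```python
# Sketch: deduce approximate mean coverage from mapped read counts
# (e.g. taken from samtools idxstats/flagstat output) rather than
# asking the user to supply it in the config. Illustrative only.

def estimate_coverage(mapped_reads: int, read_length: int, genome_size: int) -> float:
    """Rough mean coverage: total mapped bases / genome size."""
    return mapped_reads * read_length / genome_size

# e.g. ~600M 150 bp reads over a ~3.1 Gb human genome is roughly 29x
cov = estimate_coverage(600_000_000, 150, 3_100_000_000)
print(round(cov))  # 29
```

In practice a dedicated coverage tool would be more accurate, but a crude estimate like this is enough to pick a walltime/mem bracket.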
Most T/N datasets are coming in at 30x/60x, but some are 45x/90x, 30x/90x, etc. This may change (will change!) as sequencing gets even cheaper, so making it a feature now will save trouble later. The coverage value will mainly impact the walltime and memory requested, and of course, if the workflow is to be run on cloud, having the coverage both benchmarked and incorporated into the workflow will help the user budget.
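As a rough illustration of folding coverage into the resource requests: a sketch assuming walltime scales linearly with coverage. The baseline numbers and headroom factor are made up, not measured benchmarks.

```python
# Sketch: scale a benchmarked walltime request by sample coverage,
# assuming walltime grows roughly linearly with coverage.
# BASELINE values are placeholders, not real benchmark results.

BASELINE = {"coverage": 30, "walltime_hr": 2.0, "mem_gb": 4}

def scaled_request(coverage: int, headroom: float = 1.2) -> dict:
    factor = coverage / BASELINE["coverage"]
    return {
        "walltime_hr": round(BASELINE["walltime_hr"] * factor * headroom, 1),
        "mem_gb": BASELINE["mem_gb"],  # mem is often flat in coverage; benchmark to confirm
    }

print(scaled_request(60))  # {'walltime_hr': 4.8, 'mem_gb': 4}
```

Whether memory really stays flat (and whether walltime is truly linear) is exactly what the benchmarking runs should establish per task.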
Gadi usage script

If you add this line to your `~/.bashrc`, you can run it simply:

```
alias usage='perl /g/data/er01/HPC_usage_reports/gadi_usage_report_v1.1.pl'
```

https://teams.microsoft.com/l/message/19:0dc09393-cd81-4909-9ee7-2070f4a3b48a_bbddd589-9092-40df-ad27-03047644fcbf@unq.gbl.spaces/1705623618471?context=%7B%22contextType%22%3A%22chat%22%7D
Interval sizes: back to the intervals question - 3200 was a good value for making our NCMAS applications look cool. You don't need to stick with that, but I wouldn't go any smaller than that, as you noted re the minimum Mb size. It may be pertinent to revisit the interval number: take a look at which job and which interval is the slowest, and decide whether a decrease is warranted. I would not want to see any single scatter-gather step with a walltime > 1 hour.
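A back-of-envelope check on the interval count, assuming roughly equal intervals over a ~3200 Mb genome and walltime linear in interval size. The per-Mb rate is a placeholder you'd take from your own benchmarking, not a measured value.

```python
# Sketch: sanity-check a scatter interval count against the one-hour
# per-interval walltime ceiling mentioned above. Assumes roughly equal
# intervals and walltime linear in Mb; the rate below is a placeholder.

import math

GENOME_MB = 3200  # approximate human genome size in Mb

def interval_size_mb(n_intervals: int) -> float:
    return GENOME_MB / n_intervals

def min_intervals(walltime_hr_per_mb: float, ceiling_hr: float = 1.0) -> int:
    # smallest scatter count that keeps each interval under the ceiling
    return math.ceil(GENOME_MB * walltime_hr_per_mb / ceiling_hr)

print(interval_size_mb(3200))  # 1.0 Mb per interval at the current setting
print(min_intervals(0.5))      # 0.5 hr/Mb would need at least 1600 intervals
```

In practice the slowest interval (not the average) is what matters, so you'd take the rate from the worst-performing interval in the benchmark runs.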
To summarise all the talk on benchmarking, there are a few overarching goals:
1) Determine the optimal resources to request for a job or task given the nature of the data
Description: Document workflow performance benchmarks in README. Execute workflow with a full 60x/30x human T/N pair at NCI Gadi.
Activities
- [x] Index sample BAMs (reference: `/g/data/er01/SIH-HPC-WGS/Reference/hs38DH.fasta`, generated using nf-core/sarek). Indexes generated in `/scratch/er01/ndes8648/pipeline_work/nextflow/INFRA-83-Somatic-ShortV/Somatic-shortV-nf_noEmit/Somatic-shortV-nf_start_2024_main/big_bams_for_benchmarking/`
- [x] Execute workflow using Gadi script. A few runs were executed by changing the number of intervals (there are no multi-threading steps in the Somatic-shortV pipeline, so the number of intervals is one of the parameters to be tweaked). Used the script `perl /g/data/er01/HPC_usage_reports/gadi_usage_report_v1.1.pl`, but this gives an overall (not task-specific) view.
- [ ] Collect benchmarks at the task level using `gadi_nfcore_report.sh`. I will use this script. I am not too sure at this time how we view all Mutect2 tasks (scattered) together as a composite resource view - will check with someone.
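One way to get a composite view of the scattered Mutect2 tasks would be to group the Nextflow trace rows by process name. This is only a sketch: the `realtime_s` column name and the `NAME (tag)` format are assumptions about how the trace is configured, and the demo data is fabricated.

```python
# Sketch: roll scattered Mutect2 tasks into one composite resource view
# by grouping Nextflow trace rows on process name. Column names and the
# seconds-based duration column are assumptions; adjust to your trace.

import csv
from collections import defaultdict
from io import StringIO

def summarise_trace(trace_tsv: str) -> dict:
    """Sum walltime per process, collapsing scattered tasks like
    'MUTECT2 (chr1)' and 'MUTECT2 (chr2)' under 'MUTECT2'."""
    totals = defaultdict(float)
    for row in csv.DictReader(StringIO(trace_tsv), delimiter="\t"):
        process = row["name"].split(" (")[0]        # strip the scatter tag
        totals[process] += float(row["realtime_s"])  # assumed seconds column
    return dict(totals)

demo = (
    "name\trealtime_s\n"
    "MUTECT2 (chr1)\t120\n"
    "MUTECT2 (chr2)\t180\n"
    "LEARN_ORIENTATION\t60\n"
)
print(summarise_trace(demo))  # {'MUTECT2': 300.0, 'LEARN_ORIENTATION': 60.0}
```

The same grouping could be extended to peak memory (max instead of sum) to decide the composite mem request for the scattered step.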
- [ ] Add benchmarks in table format to README. Structure as per existing SIH Gadi workflows
- [ ] Add description of the compute environment it was run in, as per existing SIH Gadi workflows
Requirements
- [ ] Resolved generation of own intervals
- [ ] Workflow README complete