bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

Very high peak CPU load in bcbio RNAseq pipeline, probably GATK4 splitNCigarReads #2459

Closed by WimSpee 4 years ago

WimSpee commented 5 years ago

Hi,

During a bcbio RNAseq analysis, the peak load per CPU shot up to 600 on multiple machines, where a load of 60 per CPU should be the maximum. Nothing else was running on these machines.

# Template for  RNA-seq using Illumina prepared samples
---
details:
  - analysis: RNA-seq
    genome_build: my_reference
    algorithm:
      aligner: star
      strandedness: unstranded
      transcript_assembler: stringtie
      variantcaller: gatk-haplotype
      jointcaller: gatk-haplotype-joint
upload:
   dir: ../final

This is during a step before the GATK4 HaplotypeCaller command. According to the time-stamps in the monitor tool and in the bcbio log I think it is GATK4 splitNCigarReads.

A manual run of GATK4 splitNCigarReads shows that by default it uses between 1 and 2 CPUs (i.e. 100% to 200% CPU). I am not sure this is enough to explain the 10x too high load on the machines (600 instead of 60). No other jobs were running on those machines.

See the monitoring tool screenshot: gatk4_rnaseq_peak_load_crop

Could you please have a look at the resources requested for GATK4 splitNCigarReads? I guess this is currently 1 CPU and should be 2 CPUs.

I am not sure how to change the resources that bcbio requests for GATK4 splitNCigarReads or for other tools before it in the RNAseq pipeline. My guess is that this can be done in bcbio_system.yaml or somewhere in the code. Could you point me to the place where I can change the resource requests? Then I can test whether the peak load disappears.

Thank you.

chapmanb commented 5 years ago

Wim; Thanks for the testing and feedback on this. We're not intentionally allocating multiple cores for SplitNCigarReads, so I'm not sure why you're seeing such high load. My guess is that this is an IO intensive step and the load might be coming from other processes like the shared filesystem -- is that a possibility? Would you be able to look at per-process usage to see if it's something we can control within bcbio?

Practically, you can set core allocation for these steps using gatk in a resources block:

https://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#resources

if you need to adjust it on your system to match what is happening practically. Hope this helps with managing your shared resources.
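As a rough sketch of what such a block might look like in bcbio_system.yaml, assuming the `gatk` key under `resources` with the `cores` and `jvm_opts` settings described in the linked configuration docs. The values here are illustrative assumptions, not tested recommendations:

```yaml
# bcbio_system.yaml (fragment); values are assumptions for illustration only
resources:
  gatk:
    # Request 2 cores for GATK steps, matching the 100% to 200% CPU
    # observed in the manual splitNCigarReads run
    cores: 2
    # Optional JVM heap settings; adjust to the memory available per core
    jvm_opts: ["-Xms500m", "-Xmx3500m"]
```

Whether this removes the load peak would still need to be checked against per-process usage on the affected machines, since the load may come from IO rather than CPU.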

naumenko-sa commented 5 years ago

Hi @WimSpee !

Please note that the false positive rate is pretty high in GATK4 RNA-seq variant calling. I switched back to GATK 3.8. https://github.com/bcbio/bcbio-nextgen/issues/2410

SN

WimSpee commented 5 years ago

Hi @naumenko-sa . Thank you for the information. I noticed the #2410 issue and the increased FP rate you found. I also noticed myself that there are a lot of false positive indels.

Maybe it makes sense to forward this ticket to the GATK issue board, or cc someone from the GATK team, e.g. Geraldine Van der Auwera.

I can't switch back to GATK 3.8 for licensing reasons. I hope the GATK team will be able to improve the RNAseq-based variant calling capability of GATK4. Then we will probably reprocess our data.