Closed WimSpee closed 4 years ago
Wim; Thanks for the testing and feedback on this. We're intentionally not allocating multiple cores for SplitNCigarReads, so I'm not sure why you're seeing such high load. My guess is that this is an IO-intensive step and the load might be coming from other processes, like the shared filesystem. Is that a possibility? Would you be able to look at per-process usage to see if it's something we can control within bcbio?
Practically, you can set core allocation for these steps using `gatk` in a resources block (https://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#resources) if you need to adjust it on your system to match what is happening in practice. Hope this helps with managing your shared resources.
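As a rough sketch of what that could look like in `bcbio_system.yaml`, the fragment below requests two cores for `gatk`. The values (2 cores and the JVM memory settings) are illustrative assumptions, not tested recommendations, and whether this changes the load seen during SplitNCigarReads would still need to be checked on your cluster.

```yaml
# bcbio_system.yaml (fragment): illustrative values only
resources:
  gatk:
    # Assumed value: request 2 cores, matching the 100-200% CPU
    # reported for a manual SplitNCigarReads run.
    cores: 2
    # Assumed JVM heap settings; adjust to the memory available per core.
    jvm_opts: ["-Xms500m", "-Xmx3500m"]
```

After changing the file, re-running the same sample and comparing the monitoring graph is probably the simplest way to see whether the peak goes away.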
Hi @WimSpee !
Please note that false positive rate is pretty high in GATK4 RNA-seq variant calling. I switched back to gatk3.8. https://github.com/bcbio/bcbio-nextgen/issues/2410
SN
Hi @naumenko-sa. Thank you for the information. I noticed the #2410 issue and the increased FP rate you found. I also noticed myself that there are a lot of false positive indels.
Maybe it makes sense to forward this ticket to the GATK issue board, or cc someone from the GATK team, e.g. Geraldine Van der Auwera.
I can't switch back to GATK3.8 because of license reasons. I hope the GATK team will be able to improve the RNAseq based variant calling capability of GATK4. Then we will probably reprocess our data.
Hi,
During a bcbio RNAseq analysis the peak load per cpu shot up to 600 on multiple machines, where a load of 60 per CPU should be the max. Nothing else was running on these machines.
This is during a step before the `GATK4 HaplotypeCaller` command. According to the time-stamps in the monitor tool and in the bcbio log, I think it is `GATK4 splitNCigarReads`.

A manual run of `GATK4 splitNCigarReads` shows that by default it uses between 1 and 2 CPUs (i.e. 100% to 200% CPU). I am not sure this is enough to explain the 10X too high load on the machines (600 instead of 60). No other jobs were running on those machines.

See the monitoring tool screenshot:

![gatk4_rnaseq_peak_load_crop](https://user-images.githubusercontent.com/34706930/43001785-cbd8f00a-8c26-11e8-82a6-07f1cf27c52a.png)
Could you please have a look at the resources requested for `GATK4 splitNCigarReads`? I guess this currently is 1 CPU and should be 2 CPU.

I am not sure how to change the resources that bcbio requests for `GATK4 splitNCigarReads` or other tools before it in the RNAseq pipeline. My guess is this can be done in bcbio_system.yaml or somewhere in the code. Could you point me to the place where I can change the resource requests? Then I can test if the peak load disappears.

Thank you.