broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

GenotypeGVCFs takes a long time on small interval #4512

Closed chandrans closed 4 years ago

chandrans commented 6 years ago

Bug Report

Affected tool(s)

GenotypeGVCFs

Affected version(s)

4.0.2.0

Description

After running GenomicsDBImport, which takes a short time, GenotypeGVCFs takes a really long time to genotype a short interval. It should not take so long.

This issue was generated from your forums

chandrans commented 6 years ago

@lbergelson @droazen If you guys can assign someone to this or want to look into it yourself, I can send you the file location and commands.

droazen commented 6 years ago

@chandrans Someone will have to take a look in a profiler to see what's going on with these particular samples. Can you add a runnable test case that reproduces the issue here (and confirm yourself that you can reproduce it)?

chandrans commented 6 years ago

Hey David. Yes, I have a test case described in https://github.com/broadinstitute/dsde-docs/issues/2985 that has files and commands that reproduce this. I reproduced it myself last night with 4.0.2.0. I hope that is okay. If you need me to try 4.0.2.1, I can.

droazen commented 6 years ago

Great, thanks. @jonn-smith will have a look within the next few weeks.

ldgauthier commented 6 years ago

In the meantime, the user can try the -new-qual argument. "A few hundred samples" is a lot of samples, and the classic QUAL calculation algorithm doesn't scale as well as the new one. GenotypeGVCFs will scale to "at least a few thousand samples", as the user desires: we've run 20K in production, but we did it with the -new-qual argument, as in this workflow: https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/master/joint-discovery-gatk4.wdl
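
For anyone landing here, a rough sketch of what that invocation might look like on a 4.0.x release. The reference, GenomicsDB workspace, interval, and output names below are placeholders and not the files from the user's report; only the -new-qual flag comes from this thread. The snippet just prints the command so it can be checked before running.

```shell
# Placeholder inputs (hypothetical; substitute your own reference, workspace, and interval).
REF="reference.fasta"
WORKSPACE="my_genomicsdb"          # directory written by GenomicsDBImport
INTERVAL="chr20:1000000-1100000"   # the small interval being genotyped
OUT="genotyped.vcf.gz"

# Assemble the GenotypeGVCFs call with the new QUAL model switched on via -new-qual.
GATK_CMD="gatk GenotypeGVCFs -R $REF -V gendb://$WORKSPACE -L $INTERVAL -new-qual -O $OUT"

# Dry run: print the command for review. To execute it, run: eval "$GATK_CMD"
echo "$GATK_CMD"
```

Timing the same interval with and without -new-qual (for example by prefixing the command with `time`) should show whether the classic QUAL calculation is what dominates the runtime.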

davidbenjamin commented 4 years ago

New qual.