broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

GenotypeGVCFs Poor Performance on Whole Genome Sequencing #8637

Closed DarioS closed 9 months ago

DarioS commented 9 months ago

I have imported into GenomicsDB and am using 3200 intervals to parallelise across hg38, but most intervals don't finish within four hours. It worked decently for ten to fifteen samples, but not now that I have 108 samples. Can you test it on data of that scale to reproduce?
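For reference, the scatter pattern is roughly the following (placeholder paths and shard names, shown as a dry run via echo; not the exact submission scripts): one GenotypeGVCFs invocation per interval shard, all reading the same GenomicsDB workspace, submitted as separate PBS jobs.

```shell
# Sketch of the scatter setup (placeholder paths, echo = dry run).
# The real run submits 3200 such shards, one PBS job each.
WORKSPACE="gendb:///scratch/hm82/genomicsdb_workspace"
REFERENCE="hg38.fasta"
for SHARD in 0001 0002 0003; do
  echo gatk GenotypeGVCFs \
    -R "$REFERENCE" \
    -V "$WORKSPACE" \
    -L "intervals/${SHARD}-scattered.interval_list" \
    -O "genotyped/${SHARD}.vcf.gz"
done
```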

mlathara commented 9 months ago

Can you tell us more about how you imported the data?

DarioS commented 9 months ago

CPU utilisation does not improve once variants begin processing, after half an hour spent preparing the traversal.

                                %CPU  WallTime  Time Lim     RSS    mem memlim cpus
normal-exe = open&run
105581211 R ds6924 hm82 genotype   4  00:42:34  02:00:00 1200GB 1200GB 3072GB   768
                                %CPU  WallTime  Time Lim     RSS    mem memlim cpus
normal-exe = open&run
105381052 R ds6924 hm82 genotype  61  00:19:55  10:00:00 1487MB 1487MB 4096MB     1

09:17:51.114 INFO  ProgressMeter -      chr10:106687146              1.2                  1000            822.3
09:18:01.308 INFO  ProgressMeter -      chr10:106710146              1.4                 24000          17315.6
09:18:21.691 INFO  ProgressMeter -      chr10:106721171              1.7                 35000          20281.0
09:18:31.944 INFO  ProgressMeter -      chr10:106742172              1.9                 56000          29526.0

Intervals take about fifteen minutes each if run serially, instead of about seven hours. Outputting results to the $PBS_JOBFS folder on the compute node, instead of directly to the project folder, did not improve performance at all.

nalinigans commented 9 months ago

Not sure what your GenotypeGVCFs command was, but did you use the --genomicsdb-shared-posixfs-optimizations option? This option is available for the import too and may improve your performance.

--genomicsdb-shared-posixfs-optimizations <Boolean>
                              Allow for optimizations to improve the usability and performance for shared Posix
                              Filesystems(e.g. NFS, Lustre). If set, file level locking is disabled and file system
                              writes are minimized.  Default value: false. Possible values: {true, false} 
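
For example, on the genotyping side the option is simply appended to the usual command (placeholder paths, shown here as a dry run):

```shell
# Dry-run sketch (echo, placeholder paths): GenotypeGVCFs reading a
# GenomicsDB workspace on a shared POSIX filesystem (NFS, Lustre)
# with the optimizations enabled.
GENOTYPE_CMD="gatk GenotypeGVCFs \
  -R hg38.fasta \
  -V gendb:///scratch/hm82/genomicsdb_workspace \
  --genomicsdb-shared-posixfs-optimizations true \
  -O output.vcf.gz"
echo "$GENOTYPE_CMD"
```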
mlathara commented 9 months ago

As @nalinigans suggested, the --genomicsdb-shared-posixfs-optimizations option should help, though probably mostly for import. Similarly, I would highly recommend --bypass-feature-reader for the import as well.
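
An import invocation with both options might look like this (workspace path, sample map, and interval list are placeholders, shown as a dry run):

```shell
# Dry-run sketch (echo, placeholder paths): GenomicsDBImport with both
# suggested options for shared POSIX filesystems such as Lustre.
IMPORT_CMD="gatk GenomicsDBImport \
  --genomicsdb-workspace-path /scratch/hm82/genomicsdb_workspace \
  --sample-name-map cohort.sample_map \
  -L intervals/0001-scattered.interval_list \
  --genomicsdb-shared-posixfs-optimizations true \
  --bypass-feature-reader"
echo "$IMPORT_CMD"
```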

As I mentioned before, reblocking will help both import and query, mainly because it reduces the input GVCF size by 5x-8x. It shouldn't be necessary for the number of samples you indicate, but it becomes more important as the number of samples scales up (and it does help at any number of samples, I should add).
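
For completeness, reblocking is run once per input GVCF before import; a single invocation looks roughly like this (file names are placeholders, shown as a dry run):

```shell
# Dry-run sketch (echo, placeholder file names): shrink one input GVCF
# with ReblockGVCF before GenomicsDBImport; the reblocked GVCFs then
# replace the originals in the sample map.
REBLOCK_CMD="gatk ReblockGVCF \
  -R hg38.fasta \
  -V sample1.g.vcf.gz \
  -O sample1.reblocked.g.vcf.gz"
echo "$REBLOCK_CMD"
```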

That doesn't seem to be the crux of your problem though... you note that running serially does better than trying to parallelise across many cores. I don't have a lot of insight into Lustre specifically, but do you have any metrics on how the IOPS look for the Lustre FS in each case? Also, about the first set of variants taking a while: does that time look different when running serially versus in parallel?

One experiment to consider: try copying the workspace to the $PBS_JOBFS folder on the compute node before running GenotypeGVCFs. I'm not sure it is feasible in terms of the amount of storage, etc., but it would at least rule out possible Lustre issues.
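
Something along these lines (placeholder paths, shown as a dry run): stage the workspace onto node-local storage, then point GenotypeGVCFs at the local copy.

```shell
# Dry-run sketch (echo, placeholder paths): copy the GenomicsDB workspace
# to node-local scratch, then read it locally instead of over Lustre.
LOCAL_WS="${PBS_JOBFS:-/tmp}/genomicsdb_workspace"
STAGE_CMD="cp -r /scratch/hm82/genomicsdb_workspace $LOCAL_WS"
RUN_CMD="gatk GenotypeGVCFs \
  -R hg38.fasta \
  -V gendb://$LOCAL_WS \
  -O genotyped/0001.vcf.gz"
echo "$STAGE_CMD"
echo "$RUN_CMD"
```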

DarioS commented 9 months ago

I copied the GenomicsDB workspace to the compute nodes rather than reading it from the /scratch/hm82/ Lustre filesystem, and voila! Good guess.

                                %CPU  WallTime  Time Lim     RSS    mem memlim cpus
normal-exe = open&run
105643164 R ds6924 hm82 genotype  60  00:10:03  02:00:00 2266GB 2266GB 3072GB   768