GenotypeGVCFs Poor Performace on Whole Genome Sequencing

DarioS commented 9 months ago

I have imported into GenomicDB and am using 3200 intervals to paralelise across hg38 but most intervals don't finish within four hours. It worked decently for ten to fifteen samples but not now that I have 108 samples. Can you test it out on such data to reproduce?

mlathara commented 9 months ago

Can you tell us more about how you imported the data?

Did you reblock the gvcfs.? This should help a lot with reducing the data size/memory requirements https://gatk.broadinstitute.org/hc/en-us/articles/13832696945307-ReblockGVCF
How did you partition into 3200 intervals?
Can you share what options you used for import?
What sort of memory/core count did each import job have available?
Did the import jobs eventually error out or finish?

DarioS commented 9 months ago

No reblocking.
Approximately equal-width of about 1 million bases intervals across human genome.
Import command used (university bioinformatics core facility's pipeline, not mine).

1 core and 4 GB RAM per task, but tasks seem to be using only about 1 GB RAM per task. 768 tasks (16 nodes) in total.

                            %CPU  WallTime  Time Lim     RSS    mem memlim cpus
normal-exe = open&run
105581211 R ds6924 hm82 genotype   4  00:18:25  02:00:00 1064GB 1064GB 3072GB   768

Jobs eventually finish if not running out of allocated time.

Takes a long time to begin processing the first set of variants.

13:51:37.925 INFO  GenotypeGVCFs - ------------------------------------------------------------
13:51:39.736 INFO  GenotypeGVCFs - Done initializing engine
13:51:39.923 INFO  ProgressMeter - Starting traversal
13:51:39.923 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
14:23:57.323 WARN  ReferenceConfidenceVariantContextMerger - Detected invalid annotations: When trying to merge variant contexts at location chr17:18363145 the annotation AS_RAW_MQ=64800.000|50400.000|0.000 was not a numerical value and was ignored
14:23:57.346 WARN  ReferenceConfidenceVariantContextMerger - Reducible annotation 'AS_RAW_MQ' detected, add -G Standard -G AS_Standard to the command to annotate in the final VC with this annotation.
14:23:58.180 INFO  ProgressMeter -       chr17:18363854             32.3                  1000             31.0
14:24:13.258 INFO  ProgressMeter -       chr17:18376854             32.6                 14000            430.0
14:24:58.358 INFO  ProgressMeter -       chr17:18382854             33.3                 20000            600.5
14:32:49.287 INFO  ProgressMeter -       chr17:18393855             41.2                 31000            753.2
14:33:39.240 INFO  ProgressMeter -       chr17:18405856             42.0                 43000           1024.1
14:33:49.493 INFO  ProgressMeter -       chr17:18411856             42.2                 49000           1162.3
14:34:17.285 INFO  ProgressMeter -       chr17:18425856             42.6                 63000           1478.1

CPU utilisation does not improve after the variants begin processing after half an hour preparing traversal.

                                %CPU  WallTime  Time Lim     RSS    mem memlim cpus
normal-exe = open&run
105581211 R ds6924 hm82 genotype   4  00:42:34  02:00:00 1200GB 1200GB 3072GB   768

Excellent CPU efficiency if running serially (but defeats the purpose of a H.P.C. with Lustre).

                                %CPU  WallTime  Time Lim     RSS    mem memlim cpus
normal-exe = open&run
105381052 R ds6924 hm82 genotype  61  00:19:55  10:00:00 1487MB 1487MB 4096MB     1

09:17:51.114 INFO  ProgressMeter -      chr10:106687146              1.2                  1000            822.3
09:18:01.308 INFO  ProgressMeter -      chr10:106710146              1.4                 24000          17315.6
09:18:21.691 INFO  ProgressMeter -      chr10:106721171              1.7                 35000          20281.0
09:18:31.944 INFO  ProgressMeter -      chr10:106742172              1.9                 56000          29526.0

Intervals take about fifteen minutes each instead of about seven hours if running serially. Outputting results to $PBS_JOBFS folder on compute node instead of directly to project folder did not improve performance at all.

nalinigans commented 9 months ago

Not sure what your GenotypeGVCFs command was, but did you use the --genomicsdb-shared-posixfs-optimizations option? This option is available for the import too and may improve your performance.

--genomicsdb-shared-posixfs-optimizations <Boolean>
                              Allow for optimizations to improve the usability and performance for shared Posix
                              Filesystems(e.g. NFS, Lustre). If set, file level locking is disabled and file system
                              writes are minimized.  Default value: false. Possible values: {true, false}

mlathara commented 9 months ago

As @nalinigans suggested, the --genomicsdb-shared-posixfs-optimizations should help, though probably mostly for import. Similarly, I would highly recommend --bypass-feature-reader link for the import as well.

As I mentioned before, reblocking will help import and query - mainly because it reduces the input GVCF size by 5x-8x. Shouldn't be necessary for the number of samples you indicate, but will become more important as number of samples scales up (and does help at any number of samples, I should add).

That doesn't seem to the crux of your problem though...you note that running serially does better than trying to parallelize across many cores. I don't have a lot of insight into Lustre specifically, but do you have any metrics on how the IOPS looks for the Lustre FS in each case? Also, the bit about the the first set of variants taking a while - does that time look different when running serially versus in parallel?

One experiment to consider - maybe try to copy the workspace to the $PBS_JOBFS folder on the compute node before running GenotypeGVCFs. Not sure it is feasible in terms of amount of storage, etc but it would at least rule out possible Lustre issues.

DarioS commented 9 months ago

I copied the Genomics DB to the compute nodes rather than reading it from /scratch/hm82/ Lustre and voila! Good guess.

                                %CPU  WallTime  Time Lim     RSS    mem memlim cpus
normal-exe = open&run
105643164 R ds6924 hm82 genotype  60  00:10:03  02:00:00 2266GB 2266GB 3072GB   768

broadinstitute / gatk

GenotypeGVCFs Poor Performace on Whole Genome Sequencing #8637