Closed DarioS closed 9 months ago
Can you tell us more about how you imported the data?
                                  %CPU  WallTime  Time Lim     RSS     mem  memlim  cpus
normal-exe = open&run
105581211 R ds6924 hm82 genotype     4  00:18:25  02:00:00  1064GB  1064GB  3072GB   768
13:51:37.925 INFO GenotypeGVCFs - ------------------------------------------------------------
13:51:39.736 INFO GenotypeGVCFs - Done initializing engine
13:51:39.923 INFO ProgressMeter - Starting traversal
13:51:39.923 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
14:23:57.323 WARN ReferenceConfidenceVariantContextMerger - Detected invalid annotations: When trying to merge variant contexts at location chr17:18363145 the annotation AS_RAW_MQ=64800.000|50400.000|0.000 was not a numerical value and was ignored
14:23:57.346 WARN ReferenceConfidenceVariantContextMerger - Reducible annotation 'AS_RAW_MQ' detected, add -G Standard -G AS_Standard to the command to annotate in the final VC with this annotation.
14:23:58.180 INFO ProgressMeter - chr17:18363854 32.3 1000 31.0
14:24:13.258 INFO ProgressMeter - chr17:18376854 32.6 14000 430.0
14:24:58.358 INFO ProgressMeter - chr17:18382854 33.3 20000 600.5
14:32:49.287 INFO ProgressMeter - chr17:18393855 41.2 31000 753.2
14:33:39.240 INFO ProgressMeter - chr17:18405856 42.0 43000 1024.1
14:33:49.493 INFO ProgressMeter - chr17:18411856 42.2 49000 1162.3
14:34:17.285 INFO ProgressMeter - chr17:18425856 42.6 63000 1478.1
CPU utilisation does not improve once variants begin processing, which only happens after about half an hour of preparing the traversal.
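The length of that preparation phase can be read off the log timestamps. A minimal Python sketch, assuming the `HH:MM:SS.mmm` prefix and field layout shown in the log above and a run that does not cross midnight; the function name is made up for illustration:

```python
from datetime import datetime

def startup_delay_seconds(log_lines):
    """Seconds between 'Starting traversal' and the first per-locus progress line."""
    fmt = "%H:%M:%S.%f"
    start = None
    for line in log_lines:
        fields = line.split()
        # Only ProgressMeter lines are of interest: time, level, source, "-", payload...
        if len(fields) < 5 or fields[2] != "ProgressMeter":
            continue
        stamp = datetime.strptime(fields[0], fmt)
        if "Starting traversal" in line:
            start = stamp
        elif start is not None and ":" in fields[4]:
            # First line whose payload starts with a contig:position locus.
            return (stamp - start).total_seconds()
    return None

log = [
    "13:51:39.923 INFO ProgressMeter - Starting traversal",
    "13:51:39.923 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute",
    "14:23:58.180 INFO ProgressMeter - chr17:18363854 32.3 1000 31.0",
]
print(startup_delay_seconds(log) / 60)  # delay in minutes
```

Running the same function on the serial and the parallel logs gives a like-for-like comparison of the startup cost.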
                                  %CPU  WallTime  Time Lim     RSS     mem  memlim  cpus
normal-exe = open&run
105581211 R ds6924 hm82 genotype     4  00:42:34  02:00:00  1200GB  1200GB  3072GB   768
                                  %CPU  WallTime  Time Lim     RSS     mem  memlim  cpus
normal-exe = open&run
105381052 R ds6924 hm82 genotype    61  00:19:55  10:00:00  1487MB  1487MB  4096MB     1
09:17:51.114 INFO ProgressMeter - chr10:106687146 1.2 1000 822.3
09:18:01.308 INFO ProgressMeter - chr10:106710146 1.4 24000 17315.6
09:18:21.691 INFO ProgressMeter - chr10:106721171 1.7 35000 20281.0
09:18:31.944 INFO ProgressMeter - chr10:106742172 1.9 56000 29526.0
When run serially, intervals take about fifteen minutes each instead of about seven hours. Writing results to the $PBS_JOBFS folder on the compute node instead of directly to the project folder did not improve performance at all.
Not sure what your GenotypeGVCFs command was, but did you use the --genomicsdb-shared-posixfs-optimizations option? This option is available for the import too and may improve your performance.
--genomicsdb-shared-posixfs-optimizations <Boolean>
Allow for optimizations to improve the usability and performance for shared Posix
Filesystems(e.g. NFS, Lustre). If set, file level locking is disabled and file system
writes are minimized. Default value: false. Possible values: {true, false}
As @nalinigans suggested, the --genomicsdb-shared-posixfs-optimizations option should help, though probably mostly for import. Similarly, I would highly recommend --bypass-feature-reader for the import as well.
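For reference, both options are passed like any other GATK argument. A minimal Python sketch that assembles (but does not run) the import and genotyping commands; the file names, workspace path, and interval are placeholders, and the helper names are made up for illustration:

```python
import shlex

# Option suggested in this thread for shared POSIX filesystems such as Lustre.
POSIXFS = ["--genomicsdb-shared-posixfs-optimizations", "true"]

def import_cmd(workspace, sample_map, interval):
    """argv for GenomicsDBImport with both suggested import options."""
    return [
        "gatk", "GenomicsDBImport",
        "--genomicsdb-workspace-path", workspace,
        "--sample-name-map", sample_map,
        "-L", interval,
        "--bypass-feature-reader", "true",
        *POSIXFS,
    ]

def genotype_cmd(reference, workspace, interval, output):
    """argv for GenotypeGVCFs reading from the GenomicsDB workspace."""
    return [
        "gatk", "GenotypeGVCFs",
        "-R", reference,
        "-V", f"gendb://{workspace}",
        "-L", interval,
        "-O", output,
        *POSIXFS,
    ]

# Placeholder inputs; the commands are printed rather than executed.
print(shlex.join(import_cmd("/path/to/workspace", "samples.map", "chr17")))
print(shlex.join(genotype_cmd("ref.fasta", "/path/to/workspace", "chr17", "chr17.vcf.gz")))
```

Once the placeholders are replaced, each list can be handed to `subprocess.run(cmd, check=True)`.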
As I mentioned before, reblocking will help import and query - mainly because it reduces the input GVCF size by 5x-8x. It shouldn't be necessary for the number of samples you indicate, but it will become more important as the number of samples scales up (and it does help at any number of samples, I should add).
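Reblocking is done per sample with GATK's ReblockGVCF tool before import. A rough Python sketch that builds one command per input GVCF; the `.rb.g.vcf.gz` output naming and the function name are assumptions made here, not GATK conventions:

```python
from pathlib import Path

def reblock_cmds(reference, gvcfs, out_dir):
    """Return one ReblockGVCF argv list per input GVCF."""
    suffix = ".g.vcf.gz"
    cmds = []
    for gvcf in map(Path, gvcfs):
        name = gvcf.name
        # Strip the usual GVCF suffix to get a sample label for the output file.
        sample = name[: -len(suffix)] if name.endswith(suffix) else gvcf.stem
        out = Path(out_dir) / f"{sample}.rb.g.vcf.gz"
        cmds.append(
            ["gatk", "ReblockGVCF", "-R", reference, "-V", str(gvcf), "-O", str(out)]
        )
    return cmds

# Placeholder inputs; the commands are printed rather than executed.
for cmd in reblock_cmds("ref.fasta", ["in/sample1.g.vcf.gz"], "reblocked"):
    print(" ".join(cmd))
```

The reblocked files would then be the ones listed in the sample map used for the import.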
That doesn't seem to be the crux of your problem though... you note that running serially does better than trying to parallelize across many cores. I don't have a lot of insight into Lustre specifically, but do you have any metrics on how the IOPS looks for the Lustre FS in each case? Also, the bit about the first set of variants taking a while - does that time look different when running serially versus in parallel?
One experiment to consider - maybe try to copy the workspace to the $PBS_JOBFS folder on the compute node before running GenotypeGVCFs. Not sure it is feasible in terms of amount of storage, etc., but it would at least rule out possible Lustre issues.
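The staging step could look roughly like the following Python sketch. It assumes the job exposes its node-local directory through the `PBS_JOBFS` environment variable mentioned above; the function name and the free-space check are illustrative additions:

```python
import os
import shutil
from pathlib import Path

def stage_workspace(workspace, local_root=None):
    """Copy a GenomicsDB workspace directory to node-local storage; return the new path."""
    local_root = Path(local_root or os.environ["PBS_JOBFS"])
    source = Path(workspace)
    # Check feasibility first: total workspace size versus free local space.
    needed = sum(f.stat().st_size for f in source.rglob("*") if f.is_file())
    free = shutil.disk_usage(local_root).free
    if needed > free:
        raise OSError(f"{source} needs {needed} bytes but {local_root} has {free} free")
    return Path(shutil.copytree(source, local_root / source.name))
```

GenotypeGVCFs would then be pointed at the returned path (as `gendb://<path>`) instead of the copy on the shared filesystem.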
I copied the GenomicsDB workspace to the compute nodes rather than reading it from the /scratch/hm82/ Lustre filesystem, and voilà! Good guess.
                                  %CPU  WallTime  Time Lim     RSS     mem  memlim  cpus
normal-exe = open&run
105643164 R ds6924 hm82 genotype    60  00:10:03  02:00:00  2266GB  2266GB  3072GB   768
I have imported into GenomicsDB and am using 3200 intervals to parallelise across hg38, but most intervals don't finish within four hours. It worked decently for ten to fifteen samples but not now that I have 108 samples. Can you test it out on such data to reproduce?
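For anyone trying to reproduce this, an interval list of that kind can be generated by cutting each contig into fixed-size windows. This is only a sketch of one possible scheme, not necessarily how the 3200 intervals above were produced; the contig name and length in the example are toy values:

```python
def fixed_windows(contig_lengths, window):
    """Yield 1-based inclusive 'contig:start-end' strings covering each contig."""
    for contig, length in contig_lengths.items():
        for start in range(1, length + 1, window):
            # The last window of a contig is truncated at the contig end.
            end = min(start + window - 1, length)
            yield f"{contig}:{start}-{end}"

# Toy contig of 25 bases split into windows of 10.
print(list(fixed_windows({"chrA": 25}, 10)))
# ['chrA:1-10', 'chrA:11-20', 'chrA:21-25']
```

Each string can be passed to a separate job as its `-L` argument.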