Bug Report
Affected tool(s) or class(es)
GenotypeGVCFs with -all-sites
Affected version(s)
Description
We tried to run GenotypeGVCFs from GATK 4.5 with -all-sites on a dataset of 120 samples, using GRCh37 as the reference. Each run was limited to a single chromosome. All of them failed after consuming 3 TB of memory. Subsequently, I tried a smaller subset of 8 samples with memory limited to 32 GB, and all the runs failed after 3-10 Mbp, depending on the chromosome.
Finally, I randomly picked chromosome 9 and ran GATK versions 4.1 to 4.6; only 4.1 did not show the problem. It finished the whole chromosome (141 Mbp) with a maximum memory usage of around 8 GB. All the others failed after 3-6 Mbp. (Sorry, I used different memory settings for 4.5, so I did not include it.)
Time is in seconds, memory is in MB.
If I run the same command without -all-sites, the maximum memory usage is around 1.6 GB.
Steps to reproduce
GenomicsDB was created using the corresponding GATK version as:
GenotypeGVCFs was run as:
All runs were performed under resource_monitor, which was instructed to kill the process if it consumed more than 14000 MB of memory. Thus, at least 2 GB was allocated for reading GenomicsDB. The size of the GenomicsDB on disk is around 3.1 GB for versions >=4.2 and 3.0 GB for version 4.1.
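For anyone who wants to reproduce the memory cap without resource_monitor, here is a minimal sketch of the same idea in Python. It is not how resource_monitor works internally. The names run_with_memory_cap and parse_vm_rss_mb are hypothetical helpers I made up for illustration. The sketch assumes Linux (it reads /proc/<pid>/status) and only measures the top-level process, not its children.

```python
import subprocess
import time


def parse_vm_rss_mb(status_text):
    """Return the VmRSS value from /proc/<pid>/status text in MB, or None if absent."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:     123456 kB".
            return int(line.split()[1]) / 1024.0
    return None


def run_with_memory_cap(cmd, limit_mb, poll_seconds=0.5):
    """Run cmd and kill it once its resident memory exceeds limit_mb.

    Returns (exit_code, peak_rss_mb, killed). Hypothetical helper: only the
    top-level process is sampled, and short spikes between polls are missed.
    """
    proc = subprocess.Popen(cmd)
    peak_mb = 0.0
    killed = False
    while proc.poll() is None:
        try:
            with open(f"/proc/{proc.pid}/status") as fh:
                rss_mb = parse_vm_rss_mb(fh.read())
        except OSError:
            rss_mb = None  # process already exited, or /proc is unavailable
        if rss_mb is not None:
            peak_mb = max(peak_mb, rss_mb)
            if rss_mb > limit_mb:
                proc.kill()  # same effect as the cap: stop the runaway job
                killed = True
                break
        time.sleep(poll_seconds)
    proc.wait()
    return proc.returncode, peak_mb, killed
```

Calling run_with_memory_cap(cmd, limit_mb=14000), with cmd set to the GenotypeGVCFs command line as a list of arguments, would correspond to the 14000 MB cap used in the runs above.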