Extremely high (>250GB) memory usage from GenotypeGVCFs w/ moderate-sized GenomicsDb input

broadinstitute / gatk

Official code repository for GATK versions 4 and up

https://software.broadinstitute.org/gatk

Other

1.68k stars 587 forks source link

Extremely high (>250GB) memory usage from GenotypeGVCFs w/ moderate-sized GenomicsDb input #7968

Open bbimber opened 2 years ago

bbimber commented 2 years ago

I am running GATK GenotypeGVCFs, v4.2.6.1. I am trying to call Genotypes on a GenomicsDB workspace with about 500 WGS samples. Note, this is the macaque MMul10 genome, so it has 2,939 contigs (including unplaced). We've run commands like this quite a lot before, though we periodically do have issues like this. We can consolidate on this workspace prior to running this (using a standalone tool @nalinigans provided on #7674). As you can see we ran java with relatively low RAM, but left ~150G for the C++ layer. I'm surprised this isnt good enough.

I'm going to try to interactively inspect this, but the error is from slurm killing my job, not a java memory error, which I believe means the -Xmx 92G isnt getting exceeded. I could be mistaken though.

You'll also see: 1) I'm using --force-output-intervals, 2) I'm giving it -XL to excluded repetitive regions (and therefore also skipping some of the more gnarly and memory-intensive sites), and 3) I'm giving it a fairly small -L interval list (this is split into 750 jobs/genome).

java8 \
-Djava.io.tmpdir=<folder> \
 -Xmx92g -Xms92g -Xss2m \
-jar GenomeAnalysisTK4.jar \
GenotypeGVCFs \
-R 128_Mmul_10.fasta \
--variant gendb:///<path>/WGS_v2_db03_500.gdb \
-O WGS_v2_db03_500.temp.vcf.gz \
--annotate-with-num-discovered-alleles \
-stand-call-conf 30 \
-XL NCBI_Mmul_10.softmask.bed \
--max-alternate-alleles 6 \
--genomicsdb-max-alternate-alleles 9 \
--force-output-intervals mmul10.WGS-WXS.whitelist.v2.3.sort.merge.bed \
-L 14:37234750-41196525 \
--only-output-calls-starting-in-intervals \
--genomicsdb-shared-posixfs-optimizations

Each job gets about 250K to 800K variants into the data, and then they pretty consistently start to exceed memory and get killed.

Does anyone have suggestions on debugging or troubleshooting steps? Thanks in advance for any help.

nalinigans commented 2 years ago

As you can see we ran java with relatively low RAM, but left ~150G for the C++ layer. I'm surprised this isnt good enough.

Joint genotyping runs on jvm and does require sufficient RAM to complete unlike GenomicsDBImport. We may need to balance the memory between jvm and native allocations. Can you run SelectVariants with this memory envelope?

bbimber commented 2 years ago

@nalinigans Yes, it's been surprising me quite a bit too. When you say 'can you run SelectVariants', do you mean simply trying to select from the source GenomicsDB workspace as a test to see if java has enough resources? I can try this.

jjfarrell commented 2 years ago

Does your pipeline reblock the gVCFs before merging into genomicsDB? I found this helped quite a bit with memory issues. The reblock decreases the size of the gVCF quite and the memory required for processing downstream.

bbimber commented 2 years ago

That's an interesting idea. The process of aggregating 2000 per sample gvcfs into workspaces (we can't get these to successfully combine into one workspace) is a lot of computation just by itself.

Is there a reblock that can be executed on the already combined genomics db workspace?

bbimber commented 2 years ago

@nalinigans, to your question about SelectVariants: it was better than GenotypeGVCFs. It worked once, but died with memory errors (killed by our slurm scheduler) a second time. It is also painfully slow. I ran a basic SelectVariants using the workspace with 500 samples. This workspace was processed with the standalone consolidate tool. It's running on an interval set that's only ~2m sites. The output is like this:

13:00:09.764 INFO  GenomicsDBLibLoader - GenomicsDB native library version : 1.4.3-6069e4a
13:00:15.139 info  NativeGenomicsDB - pid=144146 tid=144147 No valid combination operation found for INFO field InbreedingCoeff  - the field will NOT be part of INFO fields in the generated VCF records
13:00:15.145 info  NativeGenomicsDB - pid=144146 tid=144147 No valid combination operation found for INFO field MLEAC  - the field will NOT be part of INFO fields in the generated VCF records
13:00:15.145 info  NativeGenomicsDB - pid=144146 tid=144147 No valid combination operation found for INFO field MLEAF  - the field will NOT be part of INFO fields in the generated VCF records
13:00:17.976 INFO  FeatureManager - Using codec BEDCodec to read file file:///home/groups/prime-seq/production/Shared/@files/.referenceLibraries/128/tracks/NCBI_Mmul_10.softmask.bed
13:00:28.734 INFO  IntervalArgumentCollection - Initial include intervals span 3961776 loci; exclude intervals span 1586664325 loci
13:00:28.738 INFO  IntervalArgumentCollection - Excluding 2060069 loci from original intervals (52.00% reduction)
13:00:28.738 INFO  IntervalArgumentCollection - Processing 1901707 bp from intervals
13:00:28.816 INFO  SelectVariants - Done initializing engine
13:00:28.816 WARN  SelectVariants - ***************************************************************************************************************************
13:00:28.816 WARN  SelectVariants - * Detected unsorted genotype fields on input.                                                                             *
13:00:28.816 WARN  SelectVariants - * SelectVariants will sort the genotypes on output which could result in slow traversal as it involves genotype parsing.  *
13:00:28.816 WARN  SelectVariants - ***************************************************************************************************************************
13:00:28.941 INFO  ProgressMeter - Starting traversal
13:00:28.941 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.21497791400000002,Cpu time(s),0.113811361
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.9714307110000004,Cpu time(s),0.8294423339999996
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.018746290999999998,Cpu time(s),0.018747005
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.04312575600000001,Cpu time(s),0.04312843799999999
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.029108712999999998,Cpu time(s),0.029110260000000002
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.0073808319999999995,Cpu time(s),0.007382536
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.029078561,Cpu time(s),0.029079955999999997
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.006109087,Cpu time(s),0.006077208000000001
13:25:54.636 INFO  ProgressMeter -           20:7039750             25.4                  1000             39.3
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.3064205629999998,Cpu time(s),0.30639567500000026
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.016820958,Cpu time(s),0.016806184

So you'll see it's progressing, but ~38 variants/min if I read this right. A few other things to note:

FWIW, this is using a GATK JAR I built locally using #7962, which has some minor changes to side-step a bug in GenotypeGVCFs. Those changes only touch two annotation classes.
GenomicsDB 1.4.4 mentions memory improvements - any reason to think trying that would make a difference?
One other thing to mention is that the MMul10 genome has ~2900 contigs. I dont understand precisely why this is a problem for GenomicsDB, but that has come up. Since we're only working on one contig (and usually a fraction of a contig) per job, could I subset my workspace to coax GenomicsDB to think it only has one contig? I believe I could just copy the contig folder and touch up the metadata JSON? I realize this isnt a great solution, but we're completely blocked here in terms of genotyping our data.
If SelectVariants actually worked here, could I run SelectVariants on the GenomicsDB workspace to create a combined gVCF for my ~2m site interval, and then run GenotypeGVCFs against this subset? It's not especially efficient, but if SelectVariants can pass and produce an output that's valid for GenotypeGVCFs that would actually be quite useful.

nalinigans commented 2 years ago

@bbimber, not sure what is happening here - the total of the GenomicsDB timers is 0.113811361+0.8294423339999996+0.018747005+0.04312843799999999+0.029110260000000002+0.007382536+0.029079955999999997+0.006077208000000001+0.30639567500000026+0.016806184 seconds which is minuscule compared to the 25 minutes. Any chance of running this with a java profiler to see where the time is being spent JVM(gatk) versus JNI(GenomicsDB)?

bbimber commented 2 years ago

@nalinigans I've used jprofiler locally for profiling, but in this instance I'd need to execute that remotely on the cluster and save the result to a file for local inspection. is there a tool you'd recommend for this?

Note: it seems like I might be able to use the native java one? Something like:

/home/exacloud/gscratch/prime-seq/java/java8/bin/java \
    -Xmx96g -Xms96g -Xss2m \
    -Xrunhprof:heap=dump,cpu=samples,format=b \

nalinigans commented 2 years ago

@droazen, @lbergelson, any pointers for remote java profiling for @bbimber?

bbimber commented 2 years ago

@nalinigans on a related perf question: there are posts about workspaces with lots of small contigs being a problem. There are some recommendations out there about creating multiple workspaces where each has one contig or a subset of contigs. Can you say any more about where that overhead comes from?

Given we have an existing multi-contig workspace, and aggregating this many samples into a workspace is pretty big task, are there any ways to separate the existing workspace into a bunch of single-contig workspaces? The only metadata that I see referring to contigs is vidmap.json. For example, subsetting a workspace could be something simple like this:

# DB refers to the original, and LOCAL_DB is a copy with just one of the contigs:
mkdir $LOCAL_DB
cp ${DB}/__tiledb_workspace.tdb ${LOCAL_DB}/
cp ${DB}/callset.json ${LOCAL_DB}/
cp ${DB}/vcfheader.vcf ${LOCAL_DB}/
cp ${DB}/vidmap.json ${LOCAL_DB}/
ln -s ${DB}/20\$1\$77137495 ${LOCAL_DB}/

Using this subset workspace seems to execute just fine as an input for SelectVariants.

bbimber commented 2 years ago

@nalinigans I was looking over the contents of this workspace and thought I'd pass along a couple observations. This is focusing on chromosome 20. This workspace has 573 WGS samples. When I inspect the contents of 20$1$77137495, there is one sub-folder with a GUID-based name. This make sense b/c we previously ran the standalone consolidate tool on it. Within this folder, a lot of these .tdb files are 10GB or larger. The biggest is PL_var.tdb (34G). END is 15GB, GQ is 15, etc. I dont really now how GenomicsDB/GATK handles reads, but do those sizes stand out to you?

mlathara commented 2 years ago

I've never used these, but https://github.com/jvm-profiling-tools could potentially be a source for java profilers. From reading a bit about hprof, it seems to add a lot of overhead and has questionable accuracy.

About workspaces with lots of contigs/smaller contigs -- the performance issue there is mostly during import. In your experiment above to subset the workspace, did the subsetted workspace return faster for SelectVariants? Or use less memory? I'm a bit surprised if so since your query is restricted to just that single array anyway.

Regarding @jjfarrell's suggestion of ReblockGVCFs -- I can't speak to any loss of precision, etc there but I would be curious to see if you could run some of your input GVCFs through that, just to see how much smaller they seem to get.

bbimber commented 2 years ago

@mlathara I'm not able to get this workspace to work at all for any of the permutations. GATK more or less dies with any operation that tries to use it.

Again, one difference is that this is the first time I've used the standalone consolidate tool from @nalinigans. I wonder if that has actually backfired from the perspective to reading this for GenotypeGVCFs or SelectVariants?

mlathara commented 2 years ago

OK -- got it. I wasn't sure if the below comment was implying any better performance.

Using this subset workspace seems to execute just fine as an input for SelectVariants.

Sounds like you were just saying it worked, which is expected. As I said, I wouldn't expect the query to work any faster with such a single contig.

I think some sort of profiling run, and trying ReBlockGVCFs on a few examples inputs are probably the best next steps.

bbimber commented 2 years ago

@mlathara Apologies, my comment was confusing. See replies below:

I should have clarified when I said that my workspace where I simply removed the extra contigs "executes just fine for SelectVariants". The job start progressing (slowly) without extreme/overt errors.
Regarding ReBlockGVCFs: the problem here is that I'm basically starting at step zero. It's not trivial to process >2000 gVCFs into genomicDb workspaces. Re-making all those gVCFs and then restarting the entire import is a huge hit. We already decided to only make workspaces with 250-500 samples (since it just wasnt working to go higher), and even that's a lot of computation time.

I gotta be honest, I'm pretty close to abandoning GenomicsDB and looking at other solutions.

bbimber commented 2 years ago

To be clear: I really appreciate the help from the GATK and GenomicsDB teams. There has just been a lot of problem and issues trying to make GenomicsDB work for this dataset

nalinigans commented 2 years ago

Would it still be possible to try the profiler with this workspace for some insight?

bbimber commented 2 years ago

@nalinigans I will see how feasible that is on our cluster.

Another question: I'm still baffled at the sort of issues we keep having if GenomicsDB is really used that widely. One question: I have been viewing the aggregated workspace as a semi-permanent store (more like a database). Rather than that, do most users just make the workspace on-the-fly, use it immediately, and then discard? I was thinking overnight about this, and I'm wondering if we should simply drop the idea of even trying to make workspaces with whole chromosomes. I think we could scatter 1000 jobs for the genome, give each a coordinate set, then import the 2000 gVCGs into a workspace of only 2m sites or so, do GenotypeGVCFs, and discard that workspace, and then merge all those VCFs.

I thought in the past I read guidance that the GenomicsDB workspace needed to import intact contigs. However, if the only downstream application is to run GenotypeGVCFs on a the same targeted region, is there any reason that woudlnt work? I would hope that running GenomicsDbImport with -L would import any gVCF variant overlapping that interval, and therefore I dont think subsetting to a partial chromosome would matter. Any comments on this would be appreciated.

mlathara commented 2 years ago

I'm not sure what proportion of users leverage the incremental import functionality...it wasn't available when GenomicsDBImport was first made available, but has been around for ~3 years now.

As for workspaces with whole chromosomes -- there is no requirement or performance benefits to using whole chromosomes. As you say, subsetting a chromosome to smaller regions will work and make the import and query parallelizable. (if you remember where the advice about whole chromosomes came from, let us know. That might be something that needs to be updated/clarified). Many small contigs does add overhead to import though and, till recently, multiple contigs couldn't be imported together (i.e., each contig would have it's own folder under the GenomicsDB workspace - which gets inefficient with many small contigs).

For WGS, probably the best way to create the GenomicsDBImport interval list is to split based on where there are consecutive N's in the reference genome (maybe using Picard) and/or regions that you are blacklisting. I think you suggested that some of the blacklisted regions were especially gnarly - maybe ploidy or high alternate allele count? - depending on the frequency of those, we may save a bit on space/memory requirements. That may address your concern about overlap between variants and import intervals. In general, any variant that starts in a specified import interval will show up in a query to that workspace. I'm not sure if the blacklist regions contain any variants that start within but extend beyond the blacklist -- those may not show up if the regions are split up in this way.

bbimber commented 2 years ago

@mlathara One additional observation. I made a single-sample GenomicsDB workspace. I imported one gVCF, and using a single interval of ~7m BP. I then tried to run a basic SelectVariants on this. This is the log (note the timestamps):

15:23:03.633 INFO  GenomicsDBLibLoader - GenomicsDB native library version : 1.4.3-6069e4a
15:23:04.814 info  NativeGenomicsDB - pid=4658 tid=4659 No valid combination operation found for INFO field InbreedingCoeff  - the field will NOT be part of INFO fields in the generated VCF records
15:23:04.814 info  NativeGenomicsDB - pid=4658 tid=4659 No valid combination operation found for INFO field MLEAC  - the field will NOT be part of INFO fields in the generated VCF records
15:23:04.814 info  NativeGenomicsDB - pid=4658 tid=4659 No valid combination operation found for INFO field MLEAF  - the field will NOT be part of INFO fields in the generated VCF records
15:23:06.221 INFO  IntervalArgumentCollection - Processing 7099472 bp from intervals
15:23:06.263 INFO  SelectVariants - Done initializing engine
15:23:06.400 INFO  ProgressMeter - Starting traversal
15:23:06.400 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
15:32:48.365 INFO  ProgressMeter -              20:7684              9.7                  1000            103.1
15:33:02.930 INFO  ProgressMeter -           20:1915103              9.9                392000          39428.0
15:33:13.624 INFO  ProgressMeter -           20:2537472             10.1                518000          51183.7
15:33:23.627 INFO  ProgressMeter -           20:3588818             10.3                724000          70379.4

You'll see it took nearly 10 minutes between when it first logs the 'starting traversal', and when it actually begins to traverse and report process. Is there some massive initialization step required by genomics DB? Just to reiterate, the input workspace in this case is tiny: one gVCF sample and only importing ~7m BP on one contig.

mlathara commented 2 years ago

That is interesting -- for comparison @nalinigans had done some SelectVariants runs on our servers with ~500 samples, for an interval with 51M base pairs (these were WES though, fwiw) and the period between starting traversal and the first progressmeter output was ~1min. I'm not sure why your example would take so much longer. How large was this workspace? Can you share it?

And would it be any easier to run the profiler with this smaller case...maybe you don't need to submit it as a remote job?

mlathara commented 2 years ago

I don't have much/any experience with ReblockGVCF but did want to note one piece of anecdotal evidence in it's favor -- I tried it on a 1000g gvcf that was 6.1G in size. It took about 55 mins and the resulting gvcf was 1.5G in size. That amount of reduction would certainly help reduce GenomicsDB import/query runtimes and memory requirements.

bbimber commented 2 years ago

@mlathara and @nalinigans A couple quick updates:

ReblockGVCFs reduced gVCF size by 5-8x as advertised. I re-ran this on our ~2000 gVCFs, which is possibly one of the main reasons for improvement below.
This meant we needed to scrap all existing workspaces. As a side comment, the poor tools around manipulation of GenomicsDB workspaces is a pretty major disadvantage. Your guidance seems to suggest they are designed as a quasi-permanent store of gVCF data. Maybe I'm missing something, but this doesnt seem very workable anymore. Any need to modify any sample that went into the workspace means the whole thing needs to be re-created. For example, we also plan to re-generate some older gVCFs with the newer HaplotypeCaller at some point in the future, and doing this would also mean we need to scrap any existing workspaces.
For this round, I started with the 2000 gVCFs, and ran scatter jobs where each has ~1/750th of the genome, split more or less evenly (i.e. no attempt yet to intelligently design borders). Unlike before, each job creates the workspace on-the-fly, and then immediately uses it for GenotypeGVCFs. The workspace is basically a throw-away intermediate file. As far as computational time, this is not that bad (at least for very small intervals/job). I also did not bother running consolidate on these, and imported with a batchSize of 50.
With the limited interval GenomicsDB workspaces, GenotypeGVCFs runs reasonably well.

So some open questions:

It's unclear why running GenotypeGVCFs with a GenomicsDB workspace that has intact chromosomes, even when using -L over a small interval, fails to run or runs painfully slowly with extremely high memory. I will try to find time for actual profiling, but this is a little cumbersome since I'm not sure I can run this on my windows dev machine. As noted above, given how awkward maintaining genomicsdb workspaces is, I'm currently thinking that we should view these as transient stores and not bother saving them after one use.
The scatter method (i.e. many workspaces where each has a very small region) seems like a huge improvement for creating the workspaces. As far as designing intervals: I understand the guidance around quasi-manually defining a genome-specific interval set that puts the borders within SNP-poor and NNNN regions. This said, I wonder if we could simply create the workspace where we take the intervals and pad them by like 1kb or so? This would make the workspaces marginally bigger and duplicate those data, but in the grand scheme of things probably doesnt make much computational difference? One thing I need to verify (and would be great if you know this upfront), is whether using GenomicsDbImport with -L would include any gVCF variant that spans the intervals or whether it just includes gVCF variants that start in those intervals. If this is the former, when it seems like simply padding when creating the workspace and then running GenotypeGVCFs with the argument for "only-output-genotypes-starting-in-intervals" would be a genome-agnostic method that accomplishes the same thing?

Anyway, thanks for your continued help on this.

mlathara commented 2 years ago

Glad to hear you were able to make progress. We're open to suggestions around improving the tooling for this. For instance, you mentioned wanting to redo samples -- we already have support in GenomicsDB for querying by sample. We should be able to expose that at the GATK level. As long as you're okay with renaming the sample when you re-generate the gVCFs that should work.

Technically we could expose support to modify existing samples, but that get's a bit hairy because of the way data is retrieved.

I'm not sure why the queries for intact chromosomes take so much longer. Since you were able to replicate with a single sample, ~7m interval is there any chance you can share just that bit (workspace, or even better that portion of the gvcf) and we can take a deeper look?

To your question about whether GenomicsDBImport includes variants that span the specified import interval: it will definitely include variants that start in those intervals, but it won't always store variants that start before the import interval. For deletions, we have some special handling for variants that start before the interval - they should show up represented by the star allele, but I don't think this is the case for insertions starting before the import interval.

bbimber commented 2 years ago

@mlathara Thanks for the reply. To add to that:

The use case around adding/removing/replacing samples can include any of:

A workspace was created, and subsequently a sample failed QC for some reason that wasnt apparently at first
A sample's gVCF data was generated with an older HaplotypeCaller version, and/or there is some new feature (like ReblockGVCFs) we want to take advantage of. In this situation renaming a sample is definitely non-optimal. Removing the older to-be-replaced sample is definitely preferable.

I'll look into sharing the workspace, but it's quite large.

As far as spanning the genotyping interval: My thinking is that the gVCF can potentially have large blocks, where the start might be far upstream of the genotyping interval, but the end is within the interval. When one does SelectVariants with -L, I am pretty sure (but need to double-check this), that any spanning variant would get included in the output. A less optimal but probably effective approach might be to do SelectVariants on the input gVCFs with the interval(s), and then import these subset gVCFs (which ought to contain any spanning variants, not just any variants starting within the interval). Do you have thoughts on how this should operate from the GenomicsDbImport perspective? Again, while I appreciate the rationale around special-design for your intervals, it seems like including those overlapping records also gets at this problem?

bbimber commented 2 years ago

Regarding GenomicsDbImport with intervals, here is a quick test. Again, the thing I'm trying to evaluate is whether it matters how I chunk the genome for GenomicsDBImport->GenotypeGVCFs. Downstream of this, I would pass the workspace to GenotypeGVCFs, with --only-output-calls-starting-in-intervals. The concern is whether we have variants spanning the intervals of two jobs, and whether separating the jobs would impact calls. In this example, GenotypeGVCFs would run over 1:1050-1150. For example, if we had a multi-NT variant that spanned 1148-1052, we'd want that called correctly no matter what intervals were used for the jobs.

I tried using running GenomicsDBImport with -L over a small region, or I ran SelectVariants on the gVCF first (which behaves a little differently), and then used that subset gVCF as input to GenomicsDBImport, where GenomicsDBImport is given the entire contig as the interval. The resulting workspaces will be slightly different, with the latter containing information over a wider region (GenomicsDBIport truncates start/end of the input records to just the target interval).

So if either of these workspaces is passed to GenotypeGVCFs, using --only-output-calls-starting-in-intervals and -L 1:1050-1150:

I think any upstream padding doesnt matter. If you have a multi-nucleotide polymorphism that starts upstream of 1050 but spans 1050, this job wouldnt be responsible for calling that. The prior job, which has an interval set upstream of this one should call it. I think GenomicsDbImport's behavior is fine here.

If you have a multi-NT variant that starts within 1050-1150, but extends outside (i.e. deletion or insertion starting at 1148), this could be a problem. The GenomicsDB workspace created with the interval 1:1050-1150 lacks the information to score that, right? The workspace created using the more permissive SelectVariants->GenomicsDBImport contains that downstream information and presumably would make the same call as if GenotypeGVCFs was given the intact chromosome as input, right?

However, it seems that if I simply create the workspace with a reasonably padded interval (adding 1kb should be more than enough for Illumina, right?), and then run GenotypeGVCFs with the original, unpassed interval, then the resulting workspace should contain all available information and GenotypeGVCFs should be able to make the same call as if it was given a whole-chromosome workspace as input.

Does that logic seem right?

# The Input gVCF
1   1040    .   A   <NON_REF>   .   .   END=1046    GT:DP:GQ:MIN_DP:PL  0/0:15:24:14:0,24,360
1   1047    .   T   <NON_REF>   .   .   END=1047    GT:DP:GQ:MIN_DP:PL  0/0:14:4:14:0,4,418
1   1048    .   G   <NON_REF>   .   .   END=1141    GT:DP:GQ:MIN_DP:PL  0/0:19:26:12:0,26,411
1   1142    .   C   T,<NON_REF> 115.64  .   BaseQRankSum=-2.237;DP=19;MQRankSum=-2.312;RAW_GT_COUNT=0,1,0;RAW_MQandDP=43640,19;ReadPosRankSum=0.851 GT:AD:DP:GQ:PGT:PID:PL:PS:SB    0|1:15,4,0:19:99:0|1:1142_C_T:123,0,551,168,563,731:1142:9,6,2,2
1   1143    .   G   <NON_REF>   .   .   END=1168    GT:DP:GQ:MIN_DP:PL  0/0:17:37:16:0,37,475
1   1169    .   G   A,<NON_REF> 123.64  .   BaseQRankSum=-1.808;DP=18;MQRankSum=-1.313;RAW_GT_COUNT=0,1,0;RAW_MQandDP=30190,18;ReadPosRankSum=1.331 GT:AD:DP:GQ:PGT:PID:PL:PS:SB    0|1:14,4,0:18:99:0|1:1142_C_T:131,0,455,168,467,635:1142:7,7,2,2
1   1170    .   C   <NON_REF>   .   .   END=1191    GT:DP:GQ:MIN_DP:PL  0/0:15:27:14:0,27,405
1   1192    .   C   G,<NON_REF> 130.64  .   BaseQRankSum=-1.811;DP=14;MQRankSum=-1.193;RAW_GT_COUNT=0,1,0;RAW_MQandDP=21790,14;ReadPosRankSum=-0.072    GT:AD:DP:GQ:PGT:PID:PL:PS:SB    0|1:10,4,0:14:99:0|1:1142_C_T:138,0,408,168,420,588:1142:5,5,2,2
1   1193    .   A   <NON_REF>   .   .   END=1196    GT:DP:GQ:MIN_DP:PL  0/0:12:25:12:0,25,371
1   1197    .   C   T,<NON_REF> 37.64   .   BaseQRankSum=2.528;DP=15;MQRankSum=-0.932;RAW_GT_COUNT=0,1,0;RAW_MQandDP=22682,15;ReadPosRankSum=1.804  GT:AD:DP:GQ:PGT:PID:PL:PS:SB    0|1:13,2,0:15:45:0|1:1142_C_T:45,0,540,84,546,630:1142:5,8,0,2
1   1198    .   C   T,<NON_REF> 37.64   .   BaseQRankSum=2.737;DP=15;MQRankSum=-0.932;RAW_GT_COUNT=0,1,0;RAW_MQandDP=22682,15;ReadPosRankSum=1.795  GT:AD:DP:GQ:PGT:PID:PL:PS:SB    0|1:13,2,0:15:45:0|1:1142_C_T:45,0,540,84,546,630:1142:5,8,0,2

# The output when running GenomicsDBImport with -L 1:1050-1150. It captures overlapping sites even if they dont start within the interval (good).
# GenomicsDBImport does truncate the start/end to stay within the interval. 
1   1050    .   G   <NON_REF>   .   .   END=1141    GT:GQ:MIN_DP:PL:DP  ./.:26:12:0,26,411:19
1   1142    .   C   T,<NON_REF> .   .   BaseQRankSum=-2.237;DP=19;MQRankSum=-2.312;RAW_GT_COUNT=0,1,0;RAW_MQandDP=43640,19;ReadPosRankSum=0.851 GT:AD:GQ:PGT:PID:PL:PS:SB:DP    .|.:15,4,0:99:0|1:1142_C_T:123,0,551,168,563,731:1142:9,6,2,2:19
1   1143    .   G   <NON_REF>   .   .   END=1150    GT:GQ:MIN_DP:PL:DP  ./.:37:16:0,37,475:17

# As a comparison, this is running SelectVariants with the same intervals, -L 1:1050-1150. It picks up the spanning variants, but does not truncate their start/end:
1   1048    .   G   <NON_REF>   .   .   END=1141    GT:DP:GQ:MIN_DP:PL  0/0:19:26:12:0,26,411
1   1142    .   C   T,<NON_REF> 115.64  .   BaseQRankSum=-2.237;DP=19;MQRankSum=-2.312;RAW_GT_COUNT=0,1,0;RAW_MQandDP=43640,19;ReadPosRankSum=0.851 GT:AD:DP:GQ:PGT:PID:PL:PS:SB    0|1:15,4,0:19:99:0|1:1142_C_T:123,0,551,168,563,731:1142:9,6,2,2
1   1143    .   G   <NON_REF>   .   .   END=1168    GT:DP:GQ:MIN_DP:PL  0/0:17:37:16:0,37,475

# I used the SelectVariants output from above as input to GenomicsDbImport, and gave it the entire contig as the input interval. As one might expect, this workspace has the original start/ends:
1   1048    .   G   <NON_REF>   .   .   END=1141    GT:GQ:MIN_DP:PL:DP  ./.:26:12:0,26,411:19
1   1142    .   C   T,<NON_REF> .   .   BaseQRankSum=-2.237;DP=19;MQRankSum=-2.312;RAW_GT_COUNT=0,1,0;RAW_MQandDP=43640,19;ReadPosRankSum=0.851 GT:AD:GQ:PGT:PID:PL:PS:SB:DP    .|.:15,4,0:99:0|1:1142_C_T:123,0,551,168,563,731:1142:9,6,2,2:19
1   1143    .   G   <NON_REF>   .   .   END=1168    GT:GQ:MIN_DP:PL:DP  ./.:37:16:0,37,475:17

nalinigans commented 2 years ago

If you have a multi-NT variant that starts within 1050-1150, but extends outside (i.e. deletion or insertion starting at 1148), this could be a problem. The GenomicsDB workspace created with the interval 1:1050-1150 lacks the information to score that, right?

GenomicsDB does store the END as a separate attribute for the interval, so the information is present even if the GenomicsDB array region does not span that far. The other questions I will leave it to @droazen and/or @mlathara to answer. Hopefully, you are able to make progress.

jjfarrell commented 1 year ago

Here is a script I ran to run the import on 3500 unblocked gvcfs. The script imports one chromosome per workspace. As the chromosomes get larger --more and more memory is needed. chr4 through 22 ran fine. The chr3 (see log below) ends without an error BUT with the callset.json NOT being written out. I could split the chr1-3 at the centromere to try it again. Any other suggestions? Would increasing -Xmx150g to 240g help?

For chromosome 1, which is still running, top indicates is using about 240g (after importing the 65 batches).

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM  TIME+ COMMAND
21698 farrell   20   0      443.7g 240.3g   1416 S  86.7 95.5   7398:14 java

#!/bin/bash -l
#$ -l mem_total=251
#$ -P casa
#$ -pe omp 32
#$ -l h_rt=240:00:00
module load miniconda/4.9.2
module load gatk/[4.2.6.1](http://4.2.6.1/)
conda activate /share/pkg.7/gatk/[4.2.6.1/install/gatk-4.2.6.1](http://4.2.6.1/install/gatk-4.2.6.1)

CHR=$1
DB="genomicsDB.rb.chr$CHR"
rm -rf $DB
# mkdir -p $DB
# mkdir tmp
echo "Processing chr$CHR"
echo "NSLOTS: $NSLOTS"
# head sample_map.chr$CHR.reblock.list
head sample_map.chr$CHR
wc   sample_map.chr$CHR
gatk --java-options "-Xmx150g -Xms16g" \
       GenomicsDBImport \
       --sample-name-map sample_map.chr$CHR \
       --genomicsdb-workspace-path $DB \
       --genomicsdb-shared-posixfs-optimizations True\
       --tmp-dir tmp \
       --L chr$CHR\
       --batch-size 50 \
       --bypass-feature-reader\
       --reader-threads 5\
       --merge-input-intervals \
       --overwrite-existing-genomicsdb-workspace\
       --consolidate

End of log on chr3

07:19:44.855 INFO  GenomicsDBImport - Done importing batch 38/65
08:05:11.651 INFO  GenomicsDBImport - Done importing batch 39/65
08:49:12.112 INFO  GenomicsDBImport - Done importing batch 40/65
09:32:39.526 INFO  GenomicsDBImport - Done importing batch 41/65
10:23:36.849 INFO  GenomicsDBImport - Done importing batch 42/65
11:24:50.566 INFO  GenomicsDBImport - Done importing batch 43/65
12:17:11.236 INFO  GenomicsDBImport - Done importing batch 44/65
13:11:10.869 INFO  GenomicsDBImport - Done importing batch 45/65
13:56:22.927 INFO  GenomicsDBImport - Done importing batch 46/65
14:45:02.333 INFO  GenomicsDBImport - Done importing batch 47/65
15:35:20.713 INFO  GenomicsDBImport - Done importing batch 48/65
16:32:30.162 INFO  GenomicsDBImport - Done importing batch 49/65
17:18:42.636 INFO  GenomicsDBImport - Done importing batch 50/65
18:09:09.353 INFO  GenomicsDBImport - Done importing batch 51/65
18:53:04.283 INFO  GenomicsDBImport - Done importing batch 52/65
19:36:40.808 INFO  GenomicsDBImport - Done importing batch 53/65
20:18:42.274 INFO  GenomicsDBImport - Done importing batch 54/65
21:01:51.304 INFO  GenomicsDBImport - Done importing batch 55/65
21:36:00.458 INFO  GenomicsDBImport - Done importing batch 56/65
22:08:38.587 INFO  GenomicsDBImport - Done importing batch 57/65
22:40:44.082 INFO  GenomicsDBImport - Done importing batch 58/65
23:14:11.202 INFO  GenomicsDBImport - Done importing batch 59/65
23:48:23.805 INFO  GenomicsDBImport - Done importing batch 60/65
00:20:35.869 INFO  GenomicsDBImport - Done importing batch 61/65
00:51:47.408 INFO  GenomicsDBImport - Done importing batch 62/65
01:25:23.587 INFO  GenomicsDBImport - Done importing batch 63/65
01:59:03.103 INFO  GenomicsDBImport - Done importing batch 64/65
Using GATK jar /share/pkg.7/gatk/[4.2.6.1/install/gatk-4.2.6.1/gatk-package-4.2.6.1-local.jar](http://4.2.6.1/install/gatk-4.2.6.1/gatk-package-4.2.6.1-local.jar) defined in environment variable GATK_LOCAL_JAR
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx150g -Xms16g -jar /share/pkg.7/gatk/[4.2.6.1/install/gatk-4.2.6.1/gatk-package-4.2.6.1-local.jar](http://4.2.6.1/install/gatk-4.2.6.1/gatk-package-4.2.6.1-local.jar) GenomicsDBImport --sample-name-map sample_map.chr3 --genomicsdb-workspace-path genomicsDB.rb.chr3 --genomicsdb-shared-posixfs-optimizations True --tmp-dir tmp --L chr3 --batch-size 50 --bypass-feature-reader --reader-threads 5 --merge-input-intervals --overwrite-existing-genomicsdb-workspace --consolidate
[farrell@scc-hadoop genomicsdb]$ ls genomicsDB.rb.chr3
__tiledb_workspace.tdb  chr3$1$198295559  vcfheader.vcf  vidmap.json

It never indicates that it imported batch 65/65. No error and the callset.json is missing which we found in chr4 to chr22. ls genomicsDB.rb.chr4

__tiledb_workspace.tdb callset.json chr4$1$190214555 vcfheader.vcf vidmap.json

jjfarrell commented 1 year ago

@bbimber @mlathara Here is a pretty good article for optimizing the GenomicsDBImport [https://gatk.broadinstitute.org/hc/en-us/articles/360056138571-GDBI-usage-and-performance-guidelines] There is some advice about handling many small contigs that may be useful.

To troubleshoot the GenomicsDBImport high memory issue my script have, I reran the script on chr1 to narrow down the source of the high memory issue. These are running on reblocked gvcfs.

Without --bypass-feature-reader and -consolidate
With --bypass-feature-reader
With --consolidate without --bypass-feature-reader (This ended up on a node with 384gb.) The other ran on 256GB nodes.

Test 2 ran the fastest with the lowest memory requirements (Wall clock 76 hours) Test 1 ran slower and required more memory 40-50% of 256GB (Wall Clock 94 hours) Test 3 ran initially faster with less memory than test 1 but by batch 65 it was using 75% of 384 GB. This job has not finished and appears stuck on importing batch 65. So the consolidate option appears to have a memory leak or using just requiring too much memory.

The -consolidate option was the culprit.

So rerunning chr1-3 with just the --bypass-feature-reader option (test2) ran fine without lots of memory being used. Below is the time output from chr1. The output shows the Maximum resident set size (kbytes): 2630440

Using GATK jar /share/pkg.7/gatk/4.2.6.1/install/gatk-4.2.6.1/gatk-package-4.2.6.1-local.jar defined in environment variable GATK_LOCAL_JAR

Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx200g -Xms16g -jar /share/pkg.7/gatk/4.2.6.1/install/gatk-4.2.6.1/gatk-package-4.2.6.1-local.jar GenomicsDBImport --sample-name-map sample_map.chr1 --genomicsdb-workspace-path genomicsDB.rb.bypass.time.chr1 --genomicsdb-shared-posixfs-optimizations True --tmp-dir tmp --bypass-feature-reader --L chr1 --batch-size 50 --reader-threads 4 --overwrite-existing-genomicsdb-workspace
        Command being timed: "gatk --java-options -Xmx200g -Xms16g GenomicsDBImport --sample-name-map sample_map.chr1 --genomicsdb-workspace-path genomicsDB.rb.bypass.time.chr1 --genomicsdb-shared-posixfs-optimizations True --tmp-dir tmp --bypass-feature-reader --L chr1 --batch-size 50 --reader-threads 4 --overwrite-existing-genomicsdb-workspace"
        User time (seconds): 270716.45
        System time (seconds): 1723.34
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 76:08:24
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 2630440
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 5
        Minor (reclaiming a frame) page faults: 206030721
        Voluntary context switches: 11129822
        Involuntary context switches: 176522
        Swaps: 0
        File system inputs: 627981312
        File system outputs: 466730160
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

So using the import on reblocked gvcfs using --bypass-feature-reader was the fastest way to import our 3500 gVCFs and minimize memory.

mlathara commented 1 year ago

@jjfarrell Glad you found that article useful!

In general, --consolidate will be memory and time intensive. It's not intuitive, but as you already figured out if --consolidate is enabled, we do it on the very last batch. If you only have on the order of a few hundred batches total, not having specified consolidate shouldn't affect read performance much.

The only other thing that would help scale here would be to break up your intervals so that larger contigs are split up into multiple regions. Less memory required and you can throw more cores at it (if you have them).

What sort of performance did you see on GenotypeGVCFs or SelectVariants? That could be the other issue with these large intervals.

bbimber commented 1 year ago

@jjfarrell and @mlathara: thanks for running these tests and sorry I havent been able to do anything yet with our data. I'm under some grant deadlines until into Oct, but I do hope to add to this. A couple comments:

1) that broadly matches my experience

2) My sense is that we were caught between and rock and a hard place with GenomicsDb and GenotypeGVCFs. Our workflow until this summer involved creating a workspace (running per-contig), which involved importing >1500 animals at first. This would execute OK when using a reasonable --batch-size on GenomicsDbImport. However, when we had large workspaces that were imported in lots of batches, GenotypeGVCFs (which we execute scattered, where each job works on a small interval) tended to perform badly and was a bottleneck (i.e. would effectively stall). Therefore we began to --consolidate the workspaces using GenomicsDBImport during the append process. Initially --consolidate worked; however, as @jjfarrell noted, that's memory intensive and once our workspace was a certain size, this basically died again. Therefore we even worked with @nalinigans to their the standalone GenomicsDB consolidate tool. This was a viable way to consolidate the workspaces and we successfully aggregated and consolidated all our data (which took a while). However, these massive, consolidated workspaces seem to choke GenotypeGVCFs. Therefore this process is still basically dead.

3) As I noted above, I'm currently giving up on trying to maintain permanent data in genomicsDB. There's so many advantages to not doing so, and letting the gVCFs exist as the permanent store. Notably, there are many reasons we would want/need to remake a gVCF (like the introduction of reblocking). Whenever any one of the source gVCFs changes, the workspace is basically worthless anyway (which is a massive waste of computation time). We've had great success running each GenotypeGVCFs scattered, where each job runs GenomicsDbImport on-the-fly, to make a transient workspace. I havent heard a GATK reply, but I believe that giving each workspace a sufficient amount of downstream padding (we're using 1000bp) should ensure any variant that begins within the job's interval can be called properly. It does add non-trivial additional computation time to each GenotypeGVCFs job, but we were wasting all sorts of computation time pre-aggregating GenomicsDbImport workspaces.

What we're seeing in option 3 is consistent with some kind of problem in GenotypeGVCFs/SelectVariants when trying to read from GenomicsDB workspaces that have large chromosomes with highly consolidated data. In those workspaces, I was seeing single files with size of >30GB (like PL_var.tdb). I dont know the read pattern of GATK/GenomicsDB, but maybe over-consolidating is deleterious?

mlathara commented 1 year ago

I do agree that if the source gVCFS are being remade often there isn't much use of keeping genomicsdb as a permanent store. If it is just a few samples here and there, we could add some tooling to ignore and/or rename samples which should save you a lot of compute. But as you say, with something like reblocking the whole store effectively needs to be remade.

jjfarrell commented 1 year ago

@bbimber @mlathara @nalinigans
Just saw this cell article in which the Decode group describe how they were able to run GATK on 150k samples.

The sequences of 150,119 genomes in the UK Biobank.](https://pubmed.ncbi.nlm.nih.gov/35859178/ Halldorsson BV, et al. Nature. 2022 Jul;607(7920):732-740. doi: 10.1038/s41586-022-04965-x. Epub 2022 Jul 20.PMID: 35859178

On page 69+ of this pdf, they describe the problem and how they cleverly worked around it.

https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-022-04965-x/MediaObjects/41586_2022_4965_MOESM1_ESM.pdf

It should be noted that running GATK out of the box will cause every job to read the entire gVCF index file (.tbi) for each of the 150,119 samples. The average size of the index files is 4.15MB, so each job would have to read 4.15150,126 = 623GB of data on top of the actual gVCF slice data. For 60,000 jobs, this would amount to 623GB60,000 = 37PB or 25.2GB/sec of additional read overhead if the jobs are run on 20,000 cores in 17 days. This read overhead will definitely prevent 20,000 cores from being used simultaneously. However, this problem was avoided by pre-processing the .tbi files and modifying the software reading the gVCF files from the central storage in a similar fashion as we did for GraphTyper and the CRAM index files (.crai).

This explains why chr1 requires more memory than chr22 despite running on the same number of samples. The larger chr1 tbi index is the source of the memory problem. The Decode solution is too limit the reading of the tbi index to the part that indexes the scattered region. There is a long pause at the beginning of the running GenotypeGVCFs which I never understood. GATK must be the reading of all the sample's gvcfs tbi into memory during that pause. So the reblocking of the gvcfs above reduced the memory foot print by decreasing the tbi size. Decode reduced it by chopping up the index so for each scattered region, GATK could only read a small subset of the index needed for that region. The combination of reblocking and chopping up the tbi would help with the memory requirements even more. However, it is clear that GATK's present reading of the full tbi is not scalable given the memory requirements.

Han-Cao commented 11 months ago

Hi @jjfarrell ,

Thanks for the great explanation. Do you know how to chop the index into scattered regions? I searched for the manual of tabix and bcftools but cannot find a way to do that.