TileDB-Inc / TileDB-VCF

Efficient variant-call data storage and retrieval library using the TileDB storage library.
https://tiledb-inc.github.io/TileDB-VCF/
MIT License

Very high RAM usage when storing plant variant data from GVCFs #647

Closed: aberthel closed this issue 6 months ago

aberthel commented 6 months ago

Hello!

I have been using TileDB-VCF to store variant data from several dozen sorghum lines. After several attempts to run tiledbvcf store resulted in the process being killed, I noticed that it was using upwards of 400 GB of RAM (more than my machine had free at the time). Even trying to store variant data from a single sample required more than 350 GB of RAM. Each sorghum GVCF is about 1-2 GB uncompressed and contains 6-8 million records, so that seemed rather high! I had similar results trying to store maize genomes as well.
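For reference, the workflow was essentially the following (a minimal sketch; the dataset URI and sample paths are placeholders, not my actual names):

# placeholder URI and file names; GVCFs must be bgzipped and indexed
tiledbvcf create --uri sorghum_dataset
tiledbvcf store --uri sorghum_dataset sample1.g.vcf.gz sample2.g.vcf.gz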

Original conditions were as follows:

OS: Rocky Linux 9.0
tiledb: 2.17.4
tiledbvcf: 0.25

I also tested this on a Mac with an M1 chip running macOS 12.5 (same tiledb/tiledbvcf versions). That machine only had 64 GB of RAM to begin with, so the process ran out of free memory very quickly!

I have tracked down the source of the problem to the large deletions present in my GVCFs. The GVCFs were generated from whole-genome alignments of assemblies using the AnchorWave tool, so they capture large insertions and deletions, some of which are tens of Mbp long. It was most convenient for me to include the full sequence of the insertions and deletions rather than using symbolic <INS> or <DEL> alleles, so that sequence could be reconstructed with only the reference fasta. However, the large deletions specifically seem to be the cause of the high RAM usage: removing all deletions longer than 100 bp, or replacing deletions with symbolic <DEL> alleles, dropped peak RAM usage to a much more workable 50 GB with the default 10 samples per batch. Large insertions do not appear to have much effect on RAM usage.
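For anyone hitting the same problem, the deletion filtering can be done with bcftools, roughly like this (a sketch; ILEN is bcftools' indel length, negative for deletions, and the 100 bp cutoff is the one I used; file names are placeholders):

# drop records containing a deletion allele longer than 100 bp
bcftools view -e 'ILEN < -100' sample1.g.vcf.gz -Oz -o sample1.filtered.g.vcf.gz
bcftools index sample1.filtered.g.vcf.gz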

That said, I'm unclear as to why the deletions are such a problem in the first place. Do you have any idea what would be eating up that much RAM? Is it a bug, or just a logical consequence of how deletions are handled? Any insight you have would be appreciated.

gspowley commented 6 months ago

Hi @aberthel!

You are right, the long deletions are using a lot of memory during ingestion. This happens because, by default, we create an anchor record every 1000 bp across the region a record spans (more details here).
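Rough arithmetic to illustrate (assuming one anchor per 1000 bp of spanned reference, i.e. the default gap): a single 20 Mbp deletion expands to about 20,000,000 / 1,000 = 20,000 anchor records, so even a few dozen such deletions per sample add on the order of a million extra records that all have to be buffered during ingestion.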

Please try increasing the anchor gap when creating the array to avoid this issue:

tiledbvcf create --anchor-gap 1000000 ...
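One thing to keep in mind (the usual trade-off with anchors): anchors are what let a region query find records that start before the queried range, so a larger anchor gap means readers scan a correspondingly wider window ahead of each query region. For data dominated by very long alleles, that read-side cost is generally a good trade for the ingestion memory savings.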

FYI, we have seen a similar issue with CNV VCFs, and increasing the anchor gap provided good results there as well.

aberthel commented 6 months ago

Sure enough! I increased --anchor-gap as you suggested and now it's working quite well. Thanks for the help!