TimD1 / vcfdist

vcfdist: Accurately benchmarking phased variant calls
GNU General Public License v3.0

Very high memory usage when realigning #27

Closed mbhall88 closed 4 months ago

mbhall88 commented 4 months ago

With v2.5.1, I am getting extremely high memory usage when I use the --realign-truth and --realign-query options. This is a bacterial sample, so the genome is ~4 Mbp and the VCF is of negligible size. So far, all my jobs have failed even when requesting 64 GB of memory on our cluster. This seems much too high, right?

TimD1 commented 4 months ago

Right now, the main limitation of vcfdist is memory usage. It isn't related to genome or VCF size, but scales quadratically with the size of the largest cluster. What is the largest variant in the VCF? If it's pretty large, I would recommend limiting the maximum variant size with -l 5000 or even -l 1000. If that doesn't work, could you send me the VCF so I can investigate?
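To make the quadratic scaling concrete, here is a rough back-of-the-envelope sketch. It assumes a square matrix with one fixed-size cell per pair of positions in the largest cluster; the helper name `estimate_matrix_bytes` and the 4-byte default cell size are illustrative assumptions, not vcfdist's actual memory layout.

```shell
# Hypothetical helper (not part of vcfdist): bytes needed for an n-by-n matrix.
# $1: cluster length in bases; $2: bytes per cell (assumed default: 4)
estimate_matrix_bytes() {
  echo $(( $1 * $1 * ${2:-4} ))
}

# A 10x longer cluster needs 100x the memory under this assumption.
for n in 1000 10000 100000; do
  printf '%s\t%s\n' "$n" "$(estimate_matrix_bytes "$n")"
done
```

Under these assumptions a 1,000-base cluster needs about 4 MB, while a 100,000-base cluster needs about 40 GB, which is why the size of the largest cluster matters far more than the genome or VCF size.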

mbhall88 commented 4 months ago

Currently I am using --largest-variant 50

Here is an example of a failing sample (to be honest, every sample is failing).

test_data.tar.gz

I ran vcfdist from the v2.5.1 Docker image with the following command:

echo "Calculating maximum QUAL score..." 1>&2
MAX_QUAL=$(bgzip -dc BPH2947__202310.50x.bcftools.filter.vcf.gz | grep -v '^#' | cut -f 6 | sort -gr | sed -n '1p')
echo "MAX_QUAL=$MAX_QUAL" 1>&2
echo "Running vcfdist..." 1>&2
vcfdist BPH2947__202310.50x.bcftools.filter.vcf.gz truth.vcf.gz mutreference.fna --largest-variant 50 \
  --credit-threshold 1.0 -d --realign-truth --realign-query -p BPH2947__202310/BPH2947__202310. \
  -b BPH2947__202310.bed -mx "$MAX_QUAL"

Let me know if you need anything else.

I should add that with v2.3.3, and without the realign flags, this used no more than 500 MB of memory.
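As a side note on the MAX_QUAL step in the command above, the same value can be found in a single awk pass instead of the grep/cut/sort/sed chain. This is only a sketch: the function name `max_qual` is made up for illustration, and it assumes a tab-separated VCF on stdin with QUAL in column 6, where "." marks a missing value.

```shell
# Hypothetical helper: print the largest QUAL (column 6) of a VCF read from stdin.
# Header lines and records with a missing QUAL (".") are skipped.
# Prints nothing if no record has a numeric QUAL.
max_qual() {
  awk -F '\t' '
    /^#/ || $6 == "." { next }
    !seen || $6 + 0 > max + 0 { max = $6; seen = 1 }
    END { if (seen) print max }
  '
}
```

It could be used as `MAX_QUAL=$(bgzip -dc calls.vcf.gz | max_qual)`, where `calls.vcf.gz` is a placeholder file name. The value is printed exactly as it appears in the VCF, so no reformatting of large or fractional scores takes place.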

TimD1 commented 4 months ago

Thanks for sending over the test data! I was able to reproduce your error with large RAM usage, but found the culprit to actually be the optional --distance (-d) calculation.

I'm looking into it now.

TimD1 commented 4 months ago

This was caused by an indexing error in the --distance calculation, and should now be fixed in the new release, v2.5.2.

mbhall88 commented 4 months ago

Amazing! Thanks for the quick turnaround. Gotta love bug hunting (I do at least).