Closed mbhall88 closed 4 months ago
Right now, the main limitation of vcfdist is memory usage. It isn't related to genome or VCF size, but scales quadratically with the size of the largest cluster. What is the largest variant in the VCF? If it's pretty large, I would recommend limiting the max size with -l 5000 or even -l 1000. If that doesn't work, could you send me the VCF and I can investigate?
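One rough way to answer the "largest variant" question is to scan the REF and ALT columns for the longest allele. The sketch below assumes that allele string length is a reasonable proxy for variant size (vcfdist may measure size differently), skips symbolic and missing ALT alleles, and uses a hypothetical helper name, largest_allele:

```shell
# Print the length of the longest REF or ALT allele in a VCF read from stdin.
# Assumption: allele string length stands in for "variant size".
largest_allele() {
  awk -F'\t' '
    /^#/ { next }                          # skip header lines
    {
      if (length($4) > max) max = length($4)
      n = split($5, alts, ",")             # multi-allelic sites are comma-separated
      for (i = 1; i <= n; i++) {
        # skip symbolic alleles such as <DEL>, overlapping deletions (*), and missing (.)
        if (alts[i] ~ /^</ || alts[i] == "*" || alts[i] == ".") continue
        if (length(alts[i]) > max) max = length(alts[i])
      }
    }
    END { print max + 0 }                  # prints 0 if there are no records
  '
}

# Usage (calls.vcf.gz is a placeholder file name):
# bgzip -dc calls.vcf.gz | largest_allele
```

If the printed value is far above the limit passed with -l, the largest variants are the likely candidates for driving cluster size.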
Currently I am using --largest-variant 50
Here is an example of a failing sample (to be honest, every sample is failing).
I ran vcfdist from the v2.5.1 docker image with the following command:
echo "Calculated maximum QUAL score..." 1>&2
MAX_QUAL=$(bgzip -dc BPH2947__202310.50x.bcftools.filter.vcf.gz | grep -v '^#' | cut -f 6 | sort -gr | sed -n '1p')
echo "MAX_QUAL=$MAX_QUAL" 1>&2
echo "Running vcfdist..." 1>&2
vcfdist BPH2947__202310.50x.bcftools.filter.vcf.gz truth.vcf.gz mutreference.fna --largest-variant 50 \
--credit-threshold 1.0 -d --realign-truth --realign-query -p BPH2947__202310/BPH2947__202310. \
-b BPH2947__202310.bed -mx $MAX_QUAL
Let me know if you need anything else.
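For reference, the MAX_QUAL step in the command above can also be done in a single awk pass that ignores records with a missing QUAL ("."). This is only a sketch of an alternative, not what was run here; the helper name max_qual is hypothetical:

```shell
# Print the largest QUAL value (column 6) from a VCF read from stdin.
# Records whose QUAL is "." (missing) are ignored; nothing is printed if no QUAL is found.
max_qual() {
  awk -F'\t' '
    /^#/ { next }                                   # skip header lines
    $6 != "." && (!seen || $6 + 0 > max) {
      max = $6 + 0                                  # numeric value used for comparison
      raw = $6                                      # original text, printed unchanged
      seen = 1
    }
    END { if (seen) print raw }
  '
}

# Usage with the query VCF from the command above:
# MAX_QUAL=$(bgzip -dc BPH2947__202310.50x.bcftools.filter.vcf.gz | max_qual)
```

Printing the original field text rather than the parsed number avoids awk reformatting large values in scientific notation.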
I should say, without the realign flags (and v2.3.3) this used no more than 500MB of memory
Thanks for sending over the test data! I was able to reproduce your error with large RAM usage, but found the culprit to actually be the optional --distance (-d) calculation.
I'm looking into it now.
This was caused by an indexing error in the --distance calculation, and should now be fixed in the new release: v2.5.2.
Amazing! Thanks for the quick turnaround. Gotta love bug hunting (I do at least).
Using v2.5.1, when I use the --realign-truth --realign-query options I am getting crazy high memory usage. This is for a bacterial sample, so the genome is ~4MB and the VCF is of negligible size. So far I have had all my jobs fail when requesting 64GB of memory on our cluster. This seems much too high, right?