TimD1 / vcfdist

vcfdist: Accurately benchmarking phased variant calls
GNU General Public License v3.0

Contig not in reference FASTA or position out of range (generate_str) #28

Closed: mbhall88 closed this issue 4 months ago

mbhall88 commented 6 months ago

I have hit this error

https://github.com/TimD1/vcfdist/blob/92db7b547a8ebbfacaa87f29896045bcdac2410b/src/dist.cpp#L131

Specifically, I get

[ERROR vcfdist 12:14:06] Contig 'plasmid_2' not in reference FASTA or position out of range (generate_str)

plasmid_2 is definitely in the reference FASTA, so my only assumption is that there is some out-of-range/indexing problem.

Here is a tarball with the files used. They were run with v2.5.2 with the command

vcfdist BPH2947__202310.10x.bcftools.filter.vcf.gz truth.vcf.gz mutreference.fna --largest-variant 50 --credit-threshold 1.0 --realign-truth --realign-query -p BPH2947__202310.without_repetitive_regions. -b BPH2947__202310.unique_regions.bed -mx 234.985

test_data.tar.gz

TimD1 commented 4 months ago

Thanks for raising an issue and providing test data!

Sorry for the month-long delay in getting around to this. In the meantime, I finished my PhD requirements, got engaged, and started a new job :)

The error was caused by a variant at position 17 on plasmid_2. When clustering variants, vcfdist tried searching for nearby dependent variants and went off the end of the contig. A simple boundary check fixed it. I hadn't run into this yet since I'm working with human genomes (which don't report variants near chromosome ends due to the telomeres).
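For anyone hitting something similar, here is a minimal sketch of the kind of boundary check described above. It is not the actual vcfdist patch: `Window`, `clamp_window`, `window_seq`, and the `reach` parameter are hypothetical names made up for illustration, and positions are assumed to be 0-based (so VCF position 17 becomes 16).

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical type, not taken from vcfdist: a half-open window [beg, end)
// of 0-based reference positions.
struct Window {
    int beg;
    int end;
};

// Clamp the search window around a variant so it never leaves the contig.
//   pos:        0-based variant position (assumed to lie on the contig)
//   reach:      how far to look on either side for dependent variants
//   contig_len: length of the contig in the reference FASTA
Window clamp_window(int pos, int reach, int contig_len) {
    assert(contig_len > 0 && pos >= 0 && pos < contig_len && reach >= 0);
    int beg = std::max(0, pos - reach);               // don't run off the start
    int end = std::min(contig_len, pos + reach + 1);  // don't run off the end
    return {beg, end};
}

// Extract the reference sequence covered by the clamped window.
std::string window_seq(const std::string& contig, int pos, int reach) {
    Window w = clamp_window(pos, reach, static_cast<int>(contig.size()));
    return contig.substr(w.beg, w.end - w.beg);
}
```

With a made-up reach of 50 bases, a variant at 0-based position 16 gets a window starting at 0 instead of -34, so the lookup stays inside the contig. Short contigs such as plasmids are where an unclamped window is most likely to fall outside the sequence.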

mbhall88 commented 4 months ago

Oh wow triple congratulations! That's wonderful news. And absolutely no need to apologise.

Oh cool, glad it was a good bug to find. And thanks for the quick fix. I will test it out in the next week or two.