WGS-standards-and-analysis / datasets

Benchmark datasets for WGS analysis

Mismatch between VCF and reference FASTA in Salmonella enterica 1203NYJAP-1 simulated dataset #11

Open tabwalsh opened 6 years ago

tabwalsh commented 6 years ago

There appears to be a data mismatch in the Salmonella enterica 1203NYJAP-1 simulated dataset: the reference alleles reported at variant positions in the VCF do not match the bases at those positions in the reference sequence.

For example, the first record in the VCF reports a reference allele T in the first contig at position 27086, but this position contains a G in the reference:

$ samtools faidx GCA_000439415.1_ASM43941v1_genomic.fna CP006053.1:27086-27086
>CP006053.1:27086-27086
G

However, there is a T in the base position immediately before this:

$ samtools faidx GCA_000439415.1_ASM43941v1_genomic.fna CP006053.1:27085-27085
>CP006053.1:27085-27085
T

This seems to be the case for every variant in this dataset.
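
To check this across the whole file rather than one position at a time, something like the following could be used. It is only a minimal sketch in plain Python: the function names `read_fasta` and `check_ref_alleles` and the `shift` parameter are made up for illustration, the toy contig and records are invented, and it assumes an uncompressed, tab-delimited VCF. It counts how many REF alleles match the reference at POS, and again at POS - 1.

```python
import io

def read_fasta(handle):
    """Return {contig_id: sequence} from a FASTA file handle."""
    seqs, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            # Keep only the ID, i.e. the text before the first whitespace.
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs

def check_ref_alleles(vcf_handle, seqs, shift=0):
    """Return (matches, total): records whose REF equals the reference at POS + shift."""
    matches = total = 0
    for line in vcf_handle:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, _id, ref = line.rstrip("\n").split("\t")[:4]
        start = int(pos) - 1 + shift  # VCF POS is 1-based; Python indexing is 0-based
        total += 1
        if start >= 0 and seqs[chrom][start:start + len(ref)].upper() == ref.upper():
            matches += 1
    return matches, total

# Toy data (invented): both records are reported one base too far to the right.
fasta_text = ">c1 toy contig\nACGT\nTGCA\n"
vcf_text = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\n"
    "c1\t4\t.\tG\tA\n"
    "c1\t8\t.\tC\tT\n"
)
seqs = read_fasta(io.StringIO(fasta_text))
print(check_ref_alleles(io.StringIO(vcf_text), seqs, shift=0))
print(check_ref_alleles(io.StringIO(vcf_text), seqs, shift=-1))
```

On the real files, the same two calls would take open handles to GCA_000439415.1_ASM43941v1_genomic.fna and to the dataset's VCF. If every record matches only with `shift=-1`, that would point to a consistent off-by-one in the positions.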

stevendavis commented 6 years ago

I'm seeing the same problem. There is an off-by-one error in the position numbers in the VCF file. For example, the first SNP should be at position 27085, not 27086.
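
As a stopgap until the dataset is fixed, the positions could be shifted locally. This is just a rough sketch: `shift_vcf_positions` is a hypothetical helper, not part of any tool mentioned here, and it assumes that every record is off by exactly one and that the VCF is uncompressed and tab-delimited. The ALT allele in the example record is a placeholder, not taken from the real file.

```python
def shift_vcf_positions(lines, offset=-1):
    """Yield VCF lines with POS moved by `offset`; header and blank lines pass through."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        fields[1] = str(int(fields[1]) + offset)  # POS is the second column
        yield "\t".join(fields) + "\n"

# Example record based on the first SNP discussed above (ALT "C" is a placeholder).
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "CP006053.1\t27086\t.\tT\tC\n",
]
corrected = list(shift_vcf_positions(example))
print(corrected[2], end="")
```

It would be worth re-checking the REF alleles against the reference after shifting, since a blanket offset would not repair any other problem in the file.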

lskatz commented 6 years ago

Thank you for figuring that out, @stevendavis! It will be on my to-do list...

tseemann commented 6 years ago

I think there was/is a bug in TreeToReads where the VCF and the CSV don't agree. @willpitchers also found a VCF wrapping bug.