Bioconductor / VariantAnnotation

Annotation of Genetic Variants
https://bioconductor.org/packages/VariantAnnotation
23 stars 20 forks source link

Not all VCF records loaded. #19

Closed d-cameron closed 5 years ago

d-cameron commented 5 years ago

Cross-post of https://support.bioconductor.org/p/116002/ so I can attach the offending file.

length(readVcf("repo.vcf")) should be 2 not 1.

Looks like a buffer related bug as the preceeding record is suspiciously close to 4KiB

repo.vcf.txt

vobencha commented 5 years ago

Thanks @d-cameron for reporting this. Now fixed in this commit.

Currently the change is only in devel - if all looks good I'll port it to release.

vobencha commented 5 years ago

Fix has been merged into the RELEASE_3_8 branch.

vobencha commented 5 years ago

Need to re-open this one. See https://stat.ethz.ch/pipermail/bioc-devel/2019-January/014539.html

d-cameron commented 5 years ago

Sorry - I should have done a more exhaustive test of my full VCFs before indicating the fix worked. Here is a minimal VCF that now fails:

testvcf = readVcf("failed.vcf", "")
names(testvcf)

failed.zip

d-cameron commented 5 years ago

It looks like the processing of the record aborts early and the parser attempts to read the rest of the record as a new record.

vobencha commented 5 years ago

@d-cameron I can't reproduce the problem with failed.zip. I see 5 records in the file and all 5 are read in. This is with VariantAnnotation 1.29.14:

> vcf = readVcf("failed.vcf")
> names(vcf)
[1] "gridss0_14980h"  "gridss8_54959h"  "gridss9_551870b" "gridss9_58481o" 
[5] "gridss9_58482o" 

I can reproduce the problem with repo.vcf.txt and will continue to work with that file for troubleshooting.

vobencha commented 5 years ago

@d-cameron @fasterius @kevinrue This is fixed in 1.28.10 in release and 1.29.16 in devel.

When reading records from a file, the logic checks both the length of the buffer and for the presence of an end of line character. If the buffer is full but end of line is not found, we reallocate and go back to the file to get the remainder of the line. This was the case for the vcf files in TVTB and seqCAT, the records were longer than the length of the allocated buffer.

However, record "gridss9_272646b" in repo.vcf.txt exactly filled the buffer. This revealed a bug in testing for the end of line character (used || instead of &&). As previously written, the test always evaluated to TRUE which triggered buffer reallocation and going back to the file to get the remainder of the record. In this case, going back for the remainder resulted in reading in the next record ("gridss9_31852o") which was then appended to "gridss9_272646b". The appended record was trimmed off in a later cleaning step because it came after an end of line character.

Thanks for reporting the bug and providing the test file. If you see further problems with this please re-open this issue.

d-cameron commented 5 years ago

This is fixed in 1.28.10 in release

I'm still getting failures in my full VCFs on 1.28.10 release. Would you like me to upload another minimal repo VCF?