Open johnstonmj opened 6 months ago
Thanks very much for this report. Would you be willing to share the VCF or a slice of it that is sufficient to reproduce the problem?
I've attached a minimal example here. min_example.zip
Also, sessionInfo()
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)
Matrix products: default
locale:
[1] LC_COLLATE=English_American Samoa.utf8 LC_CTYPE=English_American Samoa.utf8
[3] LC_MONETARY=English_American Samoa.utf8 LC_NUMERIC=C
[5] LC_TIME=English_American Samoa.utf8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] VariantAnnotation_1.48.1 Rsamtools_2.18.0
[3] Biostrings_2.70.3 XVector_0.42.0
[5] SummarizedExperiment_1.32.0 Biobase_2.62.0
[7] GenomicRanges_1.54.1 GenomeInfoDb_1.38.8
[9] IRanges_2.36.0 S4Vectors_0.40.2
[11] MatrixGenerics_1.14.0 matrixStats_1.2.0
[13] BiocGenerics_0.48.1 dplyr_1.1.4
Wonderful example. I think the problem you are observing arises from
setMethod("mapToTranscripts", c("GenomicRanges", "GRangesList"),
function(x, transcripts, ignore.strand=FALSE, intronJunctions=FALSE)
in GenomicFeatures, where there is an overlap test that uses type "within".
The initial developers of both GenomicFeatures and VariantAnnotation are long departed from the project. I think we may have to promote overlap type setting to a parameter but we'll have to discuss internally. @hpages
Ah, so it might be because the deletion extends beyond the end of a single annotated feature? Or potentially extends into the next annotation? That would make sense. Thanks for investigating!
OK -- so what are your thoughts? Would you want to make a PR that identifies the event and emits message or warning, or is the behavior reasonable in this form? We could just improve the documentation.
I have encountered an issue with large INDELs being silently discarded from my results.
This is the offending VCF entry:
contig_xyz 133527 Sniffles2.DEL.50S0 GGATGAACTCTTGGATTCTTGGGCTCATGTCGGACAGGATGCCGCCCCAGCTTCTCAGGTTGATGCCCTGGGATCTAGAATCCAGCAGTGTGGGCAGCACCCGGAACAGCTTGAAGAAGTCCACGTTGGCGTACAGGGTATCCTCGATCCACTGCAGGGTTCCCTGGCTCAGACTGCACAGGGCATATCTGACGGTCTTGGCGCCTCTCCGCTGGCTGAAGATGATGAACCGTTCCAGCAGGGCCTCAGAACAGGCGATATCCTTCAGGGCGAGATCTGGCACGCCATGAGCAAACTGCTCGGGCCGCACTTGGCTGTTGATCAGCAGGTACACCACGCTGTCGCTCAGGCCGATGTTCTTGATGAGGAACAGTGTCAGGGTTTCCTCGTCCTTCAGGATGTCCCGGATTCTGATGCCCCTGCCGGCGATTCTCTCGGGGTGTGTTCTCAGGGTGTCCATGAACTGGCTCAGGATGTGCAGCTCGGTCCAGATTCTGCCCAGGTGCTGAGACTCAGGGGCGTTCATCAGCAGCTCTTGGAAGTCCCGGTACACTCTGGCCAGGATGCTGTTGTTGTAGTTGGACACGATGCCAGGGCTTTCGCCAGGTGTGGGGCTCTGAAAGCAGGGGTTGTTCACGTTGCAGAAGATGCCCTGCAGCCAAGGCAGCATTCCGGCAGAAGGCATGGCCTTGTTGGGGAAGTGACACTCGTGGTGGCTGTACAGAGGATTGGCGTTCCGCAGCCAGATCAGCACCAGAAACAGGCTCAGGGGCCACACGAGTTCCACCACGAATCTGATTTTCTGCCGCTTCCGCAGGGTCCAGTTCTTCCACAGCAGCAGCTGAATCTGCCGCACGAATCCCATGGTGGCCAGCTCGGTCGTCCCGGGGCCTCTACTGTCCAGAGTCCTCCGCGGATCCCGATCTGACGGTTCACTAAACGAGCTCTGTTTATATAGACCTCCCACCGTACACGCCTACCGCCCATTTGCGTCAACGGGGCGGGCGATCGCAGTTGTTACGACATTTTGGAAAGTCCCGTTGATTTTGGTGCCAAAACAAACTCCCATTGACGTCAATGGGGTGGAGACTTGGAAATCCCCGTGAGTCAAACCGCTATCCACGCCCATTGGTGTACTGCCAAAACCGCATCACCATGGTAATAGCGATGACTAATACGTAGATGTACTGCCAAGTAGGAAAGTCCCGTAAGGTCATGTACTGGGCATAATGCCAGGCGGGCCATTTACCGTCATTGACGTCAATAGGGGGCGGACTTGGCATATGATACACTTGATGTACTGCCAAGTGGGCAGTTTACCGTAAATACTCCACCCATTGACGTCAATGGAAAGTCCCTATTGGCGTTACTATGGGAACATACGTCATTATTGACGTCAATGGGCGGGGGTCGTTGGGCGGTCAGCCAGGCGGGCCATTTACCGTAAGTTATGTAACGCGGAACTCCATATATGGGCTATGAACTAATGACCCCGTAATTGATTACTATTAATAACTAGTCAATAATCAATGCCAACATGGCGGTCATATTGGACATGAGCCAATATAAATGTACATATTATGATATAGATACAACGTATGCAATGGCCAATAGCCAATATTGATTTATGCTATATAACCAATGAATAATATGGCTAATGGCCAATATTGAAGATCCCCGGGTACCGAGCTCGAATTCATCGATGATGATCCACTAGTAACGGCCGCCAGTGTGCTGGAATTCGCCCTTCCCGCATGGCATCTCATTACCGCCCGATCCGGCGGTTTCCGCTTCCGTTCCGCATGCTAACGAGGAACGGGCAGGGGGCGGGGCCCGGGCCCCGACTTCCCGGTTCGGCGGTAATGTGATACGAGCCCCGCGCGCCCGTTGGCCGTCCCCGGGCCCCCGGTCCCGCCCGCCGGACGCCGGGACCAACGGGACGGCGGGCGGCCCTTGGGCCGCCCGCCTTGCCGCCCCCCCATTGGCCGGCGGGCGGGACCGCCCCAAGGGGGCGGGGCCGCCGGGTAAAAGAAGTGAGAACGCGAAGCGTTCGCACTTCGTCCCAATATATATATATTATTAGGGCGAAGTGCGAGCACTGGCGCCGTGCCCGACTCCGCGCCGGCCCCGGGGGCGGACCCGGGCGGCGGGGGGCGGGTCTCTCCGGCGCACATAAAGGCCCGGCGCGACCGACGCCCGCAGACGGCGCCGGCCACGAACGACGGGAGCGGCTGCGGAGCACGCGGACCGGGAGCGGGAGTCGCAGAGGGCCGTCGGAGCGGACGGCGTCGGCATCGCGACGCCCCGGCTCGGGATCGGGATCGCATCGGAAAGGGACACGCGGACGCGGGGGGGAAAGACCCGCCCACCCCACCCACGAAACACAGGGGACGCACCCCGGGGGCCTCCGACGACAGAAACCCACCGGTCCGCCTTTTTTGCACGGGTAAGCACCTTGGGTGGGCGGAGGAGGGGGGACGCGGGGGCGGAGGAGGGGGGACGCGGGGGCCGGAGGAGGGGGGACGCGGGGGCGGAGGAGGGGG G 59 PASS PRECISE;SVTYPE=DEL;SVLEN=-2542;END=136069;SUPPORT=259;COVERAGE=335,326,308,260,257;STRAND=+-;AF=0.869;STDEV_LEN=0;STDEV_POS=0 GT:GQ:DR:DV 1/1:60:39:259
This represents a deletion of 2542 bp.
The large deletion is present when I call
rowRanges(vcf)
So it is successfully being read by
readVcf()
However, when I call:
or
predictCoding(vcf, txdb, seqSource=fasta)
this variant position is not included among the results.
With some trial and error, I have determined that truncating the REF sequence to 800 characters allows the variant to be maintained, but 850 characters fails.
Success, 850 bp
contig_xyz 133527 Sniffles2.DEL.50S0 GGATGAACTCTTGGATTCTTGGGCTCATGTCGGACAGGATGCCGCCCCAGCTTCTCAGGTTGATGCCCTGGGATCTAGAATCCAGCAGTGTGGGCAGCACCCGGAACAGCTTGAAGAAGTCCACGTTGGCGTACAGGGTATCCTCGATCCACTGCAGGGTTCCCTGGCTCAGACTGCACAGGGCATATCTGACGGTCTTGGCGCCTCTCCGCTGGCTGAAGATGATGAACCGTTCCAGCAGGGCCTCAGAACAGGCGATATCCTTCAGGGCGAGATCTGGCACGCCATGAGCAAACTGCTCGGGCCGCACTTGGCTGTTGATCAGCAGGTACACCACGCTGTCGCTCAGGCCGATGTTCTTGATGAGGAACAGTGTCAGGGTTTCCTCGTCCTTCAGGATGTCCCGGATTCTGATGCCCCTGCCGGCGATTCTCTCGGGGTGTGTTCTCAGGGTGTCCATGAACTGGCTCAGGATGTGCAGCTCGGTCCAGATTCTGCCCAGGTGCTGAGACTCAGGGGCGTTCATCAGCAGCTCTTGGAAGTCCCGGTACACTCTGGCCAGGATGCTGTTGTTGTAGTTGGACACGATGCCAGGGCTTTCGCCAGGTGTGGGGCTCTGAAAGCAGGGGTTGTTCACGTTGCAGAAGATGCCCTGCAGCCAAGGCAGCATTCCGGCAGAAGGCATGGCCTTGTTGGGGAAGTGACACTCGTGGTGGCTGTACAGAGGATTGGCGTTCCGCAGCCAGATCAGCACCAGAAACAGGCTCAGGGGCCACACGAGTTCCACCACGAATCTGATTTTCTGCCGCTTCCGCAGGGTCCAGTTCTTCCACAGCAGCAGCTGAATCTG G 59 PASS PRECISE;SVTYPE=DEL;SVLEN=-2542;END=136069;SUPPORT=259;COVERAGE=335,326,308,260,257;STRAND=+-;AF=0.869;STDEV_LEN=0;STDEV_POS=0 GT:GQ:DR:DV 1/1:60:39:259
Fails, 900 bp
contig_xyz 133527 Sniffles2.DEL.50S0 GGATGAACTCTTGGATTCTTGGGCTCATGTCGGACAGGATGCCGCCCCAGCTTCTCAGGTTGATGCCCTGGGATCTAGAATCCAGCAGTGTGGGCAGCACCCGGAACAGCTTGAAGAAGTCCACGTTGGCGTACAGGGTATCCTCGATCCACTGCAGGGTTCCCTGGCTCAGACTGCACAGGGCATATCTGACGGTCTTGGCGCCTCTCCGCTGGCTGAAGATGATGAACCGTTCCAGCAGGGCCTCAGAACAGGCGATATCCTTCAGGGCGAGATCTGGCACGCCATGAGCAAACTGCTCGGGCCGCACTTGGCTGTTGATCAGCAGGTACACCACGCTGTCGCTCAGGCCGATGTTCTTGATGAGGAACAGTGTCAGGGTTTCCTCGTCCTTCAGGATGTCCCGGATTCTGATGCCCCTGCCGGCGATTCTCTCGGGGTGTGTTCTCAGGGTGTCCATGAACTGGCTCAGGATGTGCAGCTCGGTCCAGATTCTGCCCAGGTGCTGAGACTCAGGGGCGTTCATCAGCAGCTCTTGGAAGTCCCGGTACACTCTGGCCAGGATGCTGTTGTTGTAGTTGGACACGATGCCAGGGCTTTCGCCAGGTGTGGGGCTCTGAAAGCAGGGGTTGTTCACGTTGCAGAAGATGCCCTGCAGCCAAGGCAGCATTCCGGCAGAAGGCATGGCCTTGTTGGGGAAGTGACACTCGTGGTGGCTGTACAGAGGATTGGCGTTCCGCAGCCAGATCAGCACCAGAAACAGGCTCAGGGGCCACACGAGTTCCACCACGAATCTGATTTTCTGCCGCTTCCGCAGGGTCCAGTTCTTCCACAGCAGCAGCTGAATCTGCCGCACGAATCCCATGGTGGCCAGCTCGGTCGTCCCGGGGCCTCTACTGT G 59 PASS PRECISE;SVTYPE=DEL;SVLEN=-2542;END=136069;SUPPORT=259;COVERAGE=335,326,308,260,257;STRAND=+-;AF=0.869;STDEV_LEN=0;STDEV_POS=0 GT:GQ:DR:DV 1/1:60:39:259
Can you suggest a way to maintain these large INDELs among the results? Alternatively, if it is impossible to maintain large INDELs, could a warning or error be returned instead of a silent discard?
Thanks!