Bioconductor / VariantAnnotation

Annotation of Genetic Variants
https://bioconductor.org/packages/VariantAnnotation
23 stars 20 forks source link

Faithfully recapitulate metadata via a round-trip through `writeVcf` + `readVcf`. #78

Open LTLA opened 6 months ago

LTLA commented 6 months ago

writeVcf performs some unnecessary work that interferes with a perfect roundtrip of a VCF to a file and back. At the very least, these discrepancies interfere with my unit tests that check for correct reproduction of a VCF object from a VCF file; they also have some practical consequences as they introduce differences in the MD5 checksums, which then prevents some deduplication mechanisms in backend storage systems. To illustrate:

fl <- system.file("extdata", "structural.vcf", package="VariantAnnotation")
out <- tempfile()
first <- readVcf(fl)
writeVcf(first, out)
roundtrip <- readVcf(out)

all.equal(roundtrip, first)
## [1] "Attributes: < Component “metadata”: Component “header”: Attributes: < Component “header”: Lengths: 9, 8 > >"
## [2] "Attributes: < Component “metadata”: Component “header”: Attributes: < Component “header”: Names: Lengths (9, 8) differ (string compare on first 8) > >"
## [3] "Attributes: < Component “metadata”: Component “header”: Attributes: < Component “header”: Attributes: < Component ## “listData”: Length mismatch: comparison on first 8 components > > >"
## [4] "Attributes: < Component “metadata”: Component “header”: Attributes: < Component “header”: Attributes: < Component “listData”: Component “fileDate”: Attributes: < Component “listData”: Component “Value”: 1 string mismatch > > > >"
## [5] "Attributes: < Component “metadata”: Component “header”: Attributes: < Component “reference”: Lengths (4, 0) differ (string compare on first 0) > >"

For mismatches 1-3: as documented in its manpage, writeVcf adds some ##contig headers based on the seqinfo() of the input object. However, in this case, all the sequence information is NA, so perhaps writeVcf might be smart enough to omit the ##contig headers altogether, rather than manufacturing some non-informative placeholders:

##contig=<ID=1>
##contig=<ID=2>
##contig=<ID=3>
##contig=<ID=4>

For mismatch 4: as documented, writeVcf replaces the fileDate with the current date. Perhaps you could consider an option to respect any existing date in the VCF object, which is useful for retaining provenance, e.g., if I'm using VariantAnnotation to make some changes to an existing VCF file but I want to keep the date of the original file's creation.

For mismatch 5: I'm not sure what happened here, but I'm guessing this has something to do with the extra contig lines.

Session information ``` R Under development (unstable) (2023-11-29 r85646) Platform: aarch64-apple-darwin22.5.0 Running under: macOS Ventura 13.6.1 Matrix products: default BLAS: /Users/luna/Software/R/trunk/lib/libRblas.dylib LAPACK: /Users/luna/Software/R/trunk/lib/libRlapack.dylib; LAPACK version 3.11.0 locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 time zone: America/Los_Angeles tzcode source: internal attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] VariantAnnotation_1.49.2 Rsamtools_2.19.2 [3] Biostrings_2.71.1 XVector_0.43.0 [5] SummarizedExperiment_1.33.1 Biobase_2.63.0 [7] GenomicRanges_1.55.1 GenomeInfoDb_1.39.5 [9] IRanges_2.37.0 S4Vectors_0.41.3 [11] MatrixGenerics_1.15.0 matrixStats_1.2.0 [13] BiocGenerics_0.49.1 loaded via a namespace (and not attached): [1] KEGGREST_1.43.0 rjson_0.2.21 lattice_0.22-5 [4] vctrs_0.6.5 tools_4.4.0 bitops_1.0-7 [7] generics_0.1.3 curl_5.2.0 parallel_4.4.0 [10] tibble_3.2.1 fansi_1.0.6 AnnotationDbi_1.65.2 [13] RSQLite_2.3.4 blob_1.2.4 pkgconfig_2.0.3 [16] Matrix_1.6-4 BSgenome_1.71.1 dbplyr_2.4.0 [19] lifecycle_1.0.4 GenomeInfoDbData_1.2.11 compiler_4.4.0 [22] stringr_1.5.1 progress_1.2.3 codetools_0.2-19 [25] yaml_2.3.8 RCurl_1.98-1.13 pillar_1.9.0 [28] crayon_1.5.2 BiocParallel_1.37.0 DelayedArray_0.29.0 [31] cachem_1.0.8 abind_1.4-5 tidyselect_1.2.0 [34] digest_0.6.33 stringi_1.8.3 restfulr_0.0.15 [37] dplyr_1.1.4 biomaRt_2.59.0 fastmap_1.1.1 [40] grid_4.4.0 cli_3.6.2 SparseArray_1.3.2 [43] magrittr_2.0.3 S4Arrays_1.3.1 GenomicFeatures_1.55.1 [46] XML_3.99-0.16 utf8_1.2.4 rappdirs_0.3.3 [49] filelock_1.0.3 prettyunits_1.2.0 bit64_4.0.5 [52] httr_1.4.7 bit_4.0.5 png_0.1-8 [55] hms_1.1.3 memoise_2.0.1 BiocIO_1.13.0 [58] BiocFileCache_2.11.1 rtracklayer_1.63.0 rlang_1.1.2 [61] glue_1.6.2 DBI_1.2.0 xml2_1.3.6 [64] R6_2.5.1 GenomicAlignments_1.39.0 zlibbioc_1.49.0 ```
vjcitn commented 6 months ago

I find that when bgzip compression and tabix indexing and TabixFile is used, the round trip with structural.vcf(.gz) succeeds cleanly. Can we leave it at that?

LTLA commented 6 months ago

Why can't a perfect round-trip be achieved without compression/indexing/TabixFile? These things have nothing to do with the inclusion of unnecessary ##config headers or the replacement of the fileDate.

vjcitn commented 6 months ago

I agree that failing to support round trips on our own example data is not good and should be repaired. I do think that the bgzipped/tabix context is more common for this file format so if it constitutes a workaround for you I would like to postpone diving into the internals...

LTLA commented 6 months ago

No rush, I only stumbled across this because of the alabaster.vcf tests.