Open LTLA opened 6 months ago
I find that when bgzip compression and tabix indexing and TabixFile is used, the round trip with structural.vcf(.gz) succeeds cleanly. Can we leave it at that?
Why can't a perfect round-trip be achieved without compression/indexing/TabixFile? These things have nothing to do with the inclusion of unnecessary ##config
headers or the replacement of the fileDate
.
I agree that failing to support round trips on our own example data is not good and should be repaired. I do think that the bgzipped/tabix context is more common for this file format so if it constitutes a workaround for you I would like to postpone diving into the internals...
No rush, I only stumbled across this because of the alabaster.vcf tests.
writeVcf
performs some unnecessary work that interferes with a perfect roundtrip of aVCF
to a file and back. At the very least, these discrepancies interfere with my unit tests that check for correct reproduction of aVCF
object from a VCF file; they also have some practical consequences as they introduce differences in the MD5 checksums, which then prevents some deduplication mechanisms in backend storage systems. To illustrate:For mismatches 1-3: as documented in its manpage,
writeVcf
adds some##contig
headers based on theseqinfo()
of the input object. However, in this case, all the sequence information isNA
, so perhapswriteVcf
might be smart enough to omit the##contig
headers altogether, rather than manufacturing some non-informative placeholders:For mismatch 4: as documented,
writeVcf
replaces thefileDate
with the current date. Perhaps you could consider an option to respect any existing date in theVCF
object, which is useful for retaining provenance, e.g., if I'm using VariantAnnotation to make some changes to an existing VCF file but I want to keep the date of the original file's creation.For mismatch 5: I'm not sure what happened here, but I'm guessing this has something to do with the extra contig lines.
Session information
``` R Under development (unstable) (2023-11-29 r85646) Platform: aarch64-apple-darwin22.5.0 Running under: macOS Ventura 13.6.1 Matrix products: default BLAS: /Users/luna/Software/R/trunk/lib/libRblas.dylib LAPACK: /Users/luna/Software/R/trunk/lib/libRlapack.dylib; LAPACK version 3.11.0 locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 time zone: America/Los_Angeles tzcode source: internal attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] VariantAnnotation_1.49.2 Rsamtools_2.19.2 [3] Biostrings_2.71.1 XVector_0.43.0 [5] SummarizedExperiment_1.33.1 Biobase_2.63.0 [7] GenomicRanges_1.55.1 GenomeInfoDb_1.39.5 [9] IRanges_2.37.0 S4Vectors_0.41.3 [11] MatrixGenerics_1.15.0 matrixStats_1.2.0 [13] BiocGenerics_0.49.1 loaded via a namespace (and not attached): [1] KEGGREST_1.43.0 rjson_0.2.21 lattice_0.22-5 [4] vctrs_0.6.5 tools_4.4.0 bitops_1.0-7 [7] generics_0.1.3 curl_5.2.0 parallel_4.4.0 [10] tibble_3.2.1 fansi_1.0.6 AnnotationDbi_1.65.2 [13] RSQLite_2.3.4 blob_1.2.4 pkgconfig_2.0.3 [16] Matrix_1.6-4 BSgenome_1.71.1 dbplyr_2.4.0 [19] lifecycle_1.0.4 GenomeInfoDbData_1.2.11 compiler_4.4.0 [22] stringr_1.5.1 progress_1.2.3 codetools_0.2-19 [25] yaml_2.3.8 RCurl_1.98-1.13 pillar_1.9.0 [28] crayon_1.5.2 BiocParallel_1.37.0 DelayedArray_0.29.0 [31] cachem_1.0.8 abind_1.4-5 tidyselect_1.2.0 [34] digest_0.6.33 stringi_1.8.3 restfulr_0.0.15 [37] dplyr_1.1.4 biomaRt_2.59.0 fastmap_1.1.1 [40] grid_4.4.0 cli_3.6.2 SparseArray_1.3.2 [43] magrittr_2.0.3 S4Arrays_1.3.1 GenomicFeatures_1.55.1 [46] XML_3.99-0.16 utf8_1.2.4 rappdirs_0.3.3 [49] filelock_1.0.3 prettyunits_1.2.0 bit64_4.0.5 [52] httr_1.4.7 bit_4.0.5 png_0.1-8 [55] hms_1.1.3 memoise_2.0.1 BiocIO_1.13.0 [58] BiocFileCache_2.11.1 rtracklayer_1.63.0 rlang_1.1.2 [61] glue_1.6.2 DBI_1.2.0 xml2_1.3.6 [64] R6_2.5.1 GenomicAlignments_1.39.0 zlibbioc_1.49.0 ```