Bioconductor / VariantAnnotation

Annotation of Genetic Variants
https://bioconductor.org/packages/VariantAnnotation
26 stars 20 forks source link

writeVcf save invalid vcf with metadata out of order #15

Closed apastore closed 5 years ago

apastore commented 5 years ago

If I read a valid vcf with the last 3.8 Bioconductor version 1.28.1

param_T <- ScanVcfParam(sample = samples_names[1],  geno = c("AD", "MQ", "DP", "AF"), info = c("DP","AF", "CALLERS"))
  Tumor <- readVcf(file = vcf_file[i], genome = "GRCH37", param = param_T)
head of  vcf_file[i
##fileformat=VCFv4.2
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>

and then I save it with

writeVcf(Tumor, paste0("./vcf/",samples_names[1], ".vcf"), index = F)

the header is out of order ##fileformat=VCFv4.2 is not on the first line and there is a Strange SAMPLE field

Thanks for you support!

##SAMPLE=Tumor
##commandline="/home/pastore/data/bcbio/anaconda/bin/freebayes -f /lila/data/abdelwao-lab/pastorea/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa --genotype-qualities --strict-vcf --ploidy 2 --targets /lila/data/abdelwao-lab/pastorea/Project/Ben/histiocytosis/hempact/work/freebayes/1/B02_816_T1-1_0_249250621-regions-nolcr.bed --min-repeat-entropy 1 --no-partial-observations --min-alternate-fraction 0.05 --pooled-discrete --pooled-continuous --report-genotype-likelihood-max --allele-balance-priors-off /lila/data/abdelwao-lab/pastorea/Project/Ben/histiocytosis/hempact/work/align/C_K0Y3K8_T001_d/C_K0Y3K8_T001_d-sort.bam /lila/data/abdelwao-lab/pastorea/Project/Ben/histiocytosis/hempact/work/align/PON/PON-sort.bam"
##fileDate=20181115
##fileformat=VCFv4.2
##phasing=none
FreekManders commented 5 years ago

I have the same issue. The order of the headerlines is wrong and the ALT line doesn't contain an id. It becomes:

##ALT=Represents any possible alternative allele at this location

instead of

##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">

I think the problems might be with the makeVcfHeader function, which seems to contain some broken code.

vobencha commented 5 years ago

@apastore @FreekManders I believe this was fixed with these two commits.

commit caaaf75eb2004e04ad92eeadfb568d8992a8ad2a 
Author: vobencha <valerie.obenchain@roswellpark.org>
Date:   Tue Nov 27 12:21:54 2018 -0800

    Fix bug in writeVcf() when 'fileformat' is missing

commit 21c508e83c013b1655789cb832217609d2602e4b
Author: vobencha <valerie.obenchain@roswellpark.org>
Date:   Mon Nov 19 12:33:03 2018 -0800

    writeVcf() now writes 'fileFormat' as the first header line

Please reopen the issue if you're still having problems.

FreekManders commented 5 years ago

Hi @vobencha , I installed the latest version of the package from github. However, the ALT header line is still incorrect. I added my session info below.

R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bindrcpp_0.2.2              VariantAnnotation_1.29.6    usethis_1.4.0               devtools_2.0.1              Rsamtools_1.34.0            Biostrings_2.50.1          
 [7] XVector_0.22.0              SummarizedExperiment_1.12.0 DelayedArray_0.8.0          BiocParallel_1.16.2         matrixStats_0.54.0          Biobase_2.42.0             
[13] GenomicRanges_1.34.0        GenomeInfoDb_1.18.1         IRanges_2.16.0              S4Vectors_0.20.1            BiocGenerics_0.28.0         forcats_0.3.0              
[19] stringr_1.3.1               dplyr_0.7.8                 purrr_0.2.5                 readr_1.3.0                 tidyr_0.8.2                 tibble_1.4.2               
[25] ggplot2_3.1.0               tidyverse_1.2.1            

loaded via a namespace (and not attached):
 [1] nlme_3.1-137             fs_1.2.6                 bitops_1.0-6             lubridate_1.7.4          bit64_0.9-7              progress_1.2.0           httr_1.4.0              
 [8] rprojroot_1.3-2          tools_3.5.1              backports_1.1.3          R6_2.3.0                 DBI_1.0.0                lazyeval_0.2.1           colorspace_1.3-2        
[15] withr_2.1.2              tidyselect_0.2.5         prettyunits_1.0.2        processx_3.2.1           curl_3.2                 bit_1.1-14               compiler_3.5.1          
[22] cli_1.0.1                rvest_0.3.2              xml2_1.2.0               desc_1.2.0               rtracklayer_1.42.1       scales_1.0.0             callr_3.1.0             
[29] digest_0.6.18            pkgconfig_2.0.2          sessioninfo_1.1.1        BSgenome_1.50.0          rlang_0.3.0.1            readxl_1.2.0             rstudioapi_0.8          
[36] RSQLite_2.1.1            bindr_0.1.1              generics_0.0.2           jsonlite_1.6             RCurl_1.95-4.11          magrittr_1.5             GenomeInfoDbData_1.2.0  
[43] Matrix_1.2-14            Rcpp_1.0.0               munsell_0.5.0            stringi_1.2.4            yaml_2.2.0               zlibbioc_1.28.0          pkgbuild_1.0.2          
[50] plyr_1.8.4               grid_3.5.1               blob_1.1.1               crayon_1.3.4             lattice_0.20-35          haven_2.0.0              GenomicFeatures_1.34.1  
[57] hms_0.4.2                ps_1.2.1                 pillar_1.3.1             pkgload_1.0.2            biomaRt_2.38.0           XML_3.98-1.16            glue_1.3.0              
[64] remotes_2.0.2            modelr_0.1.2             cellranger_1.1.0         gtable_0.2.0             assertthat_0.2.0         broom_0.5.1              GenomicAlignments_1.18.0
[71] AnnotationDbi_1.44.0     memoise_1.1.0 
vobencha commented 5 years ago

Hi @FreekManders

Installing packages manually from github is generally not the best approach. We have both an active release (R 3.5.1; Bioconductor 3.8) and devel branch (R 3.6.*; Bioconductor 3.9). If you install manually from github you must be sure you're pulling for the appropriate branch for your version of R. Additionally, the github version often lags behind the version on the canonical git server, git.bioconductor.org.

From your sessionInfo() (thanks for posting) you now have the devel version of VariantAnnotation in your release version of R.

The best way to install Bioconductor packages is to instead use the BiocManager package (replaced the BiocInstaller package). The package will detect your version of R and install the appropriate Bioconductor package versions.

BiocManager::install("VariantAnnotation")

The release version of VariantAnnotation you should be using is 1.28.3 as seen on the release build report:

https://www.bioconductor.org/checkResults/release/bioc-LATEST/

I've tested with VariantAnnotation 1.29.9 (devel) and 1.28.3 (release) using the structural.vcf from the package. In both cases I see 'fileformat' at the top of the output file and ALT is properly written with the ID field.

> fl = system.file("extdata", "structural.vcf", package="VariantAnnotation")
> vcf = readVcf(fl)
> tmp = tempfile()
> writeVcf(vcf, tmp)
> tmp
[1] "/tmp/RtmpdHJDVS/file26b16202f358"

The vcf specs do not specify order in the header fields except to say that fileformat must be first. https://samtools.github.io/hts-specs/VCFv4.2.pdf

The output from the example above does list 'fileformat' first. All major categories are grouped together but header lines such as 'assembly' or 'reference' may not be in slightly different order as the original file. To my knowledge this is not a problem.

If you still see a problem with the ALT field, please provide a small sample file so I can reproduce the issue.

Thanks.

FreekManders commented 5 years ago

Hi @vobencha

Thank you for your helping me. I had switched to the github version of the package, because I thought that version might be newer and fix the issue I was having. I have now removed the package and re-installed it with the BiocManager package.

However, I'm still having the same issue. I attached a modified version of the structural.vcf from the package, which contains the ALT field, as it is present in my vcfs. I also attached the file I get when reading and writing out this example vcf with VariantAnnotation. (I changed the extensions to .txt so I could attach the files) test.txt test_out.txt

As you can see, the ALT field is altered in the output. This alteration results in an error, when trying to read in the file in IGV: Unable to parse header with error: Invalid VCFSimpleHeaderLine: key=ALT name=null

p.s. I noticed that BiocManager installed version 1.28.5 of the package, even though I'm still on R version 3.5.1. However, I don't think this is causing my issue, because I've had the same issue with other versions of the package.

vobencha commented 5 years ago

@FreekManders Thanks for the reproducible example and good catch! This is now fixed in 1.28.7 and 1.29.12. These versions should be available via BiocManager::install() ~noon EST tomorrow. Valerie