Bioconductor / VariantAnnotation

Annotation of Genetic Variants
https://bioconductor.org/packages/VariantAnnotation
23 stars 20 forks source link

readVcf does not recognize my vcf file? #82

Open igrabski opened 2 months ago

igrabski commented 2 months ago

Hi, when I use the readVcf command, I encounter an error that my input file is not recognized as a VCF. The specific command I am using and error is as follows:

> vcf <- readVcf('../../Downloads/dkfz_groebner_2018_exonic(1)/snvs_ICGC_MB64_somatic.vcf','hg19')
[E::COMPAT_bcf_hdr_read] Input is not detected as bcf or vcf format
Error in scanBcfHeader(bf) : no 'header' line "#CHROM POS ID..."?

The VCF files I am trying to read are downloaded from here; the error occurs with any of the files beginning with snvs_ that I have tested.

I am wondering if these VCF files are older / in a non-standard format that readVcf does not recognize? Please let me know what the issue may be or how I can solve it. Thanks!

vjcitn commented 2 months ago

this is what i see when i click on the vcf source link

image

can you snip out the first 500 lines of one of the problematic files as text and attach? Also provide sessionInfo() result after your error is triggered

igrabski commented 2 months ago

Thanks for the fast response!

sessionInfo() is below:

> sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] VariantAnnotation_1.48.1    Rsamtools_2.18.0           
 [3] Biostrings_2.70.1           XVector_0.42.0             
 [5] SummarizedExperiment_1.32.0 Biobase_2.62.0             
 [7] GenomicRanges_1.54.1        GenomeInfoDb_1.38.1        
 [9] IRanges_2.36.0              S4Vectors_0.40.1           
[11] MatrixGenerics_1.14.0       matrixStats_1.1.0          
[13] BiocGenerics_0.48.1        

loaded via a namespace (and not attached):
 [1] KEGGREST_1.42.0          rjson_0.2.21             lattice_0.22-5          
 [4] vctrs_0.6.4              tools_4.3.1              bitops_1.0-7            
 [7] generics_0.1.3           curl_5.1.0               parallel_4.3.1          
[10] tibble_3.2.1             fansi_1.0.5              AnnotationDbi_1.64.1    
[13] RSQLite_2.3.3            blob_1.2.4               pkgconfig_2.0.3         
[16] Matrix_1.6-5             BSgenome_1.70.2          dbplyr_2.4.0            
[19] lifecycle_1.0.4          GenomeInfoDbData_1.2.11  compiler_4.3.1          
[22] stringr_1.5.0            progress_1.2.2           codetools_0.2-19        
[25] yaml_2.3.7               RCurl_1.98-1.13          pillar_1.9.0            
[28] crayon_1.5.2             BiocParallel_1.36.0      DelayedArray_0.28.0     
[31] cachem_1.0.8             abind_1.4-5              tidyselect_1.2.0        
[34] digest_0.6.33            stringi_1.8.1            restfulr_0.0.15         
[37] dplyr_1.1.3              biomaRt_2.58.0           fastmap_1.1.1           
[40] grid_4.3.1               cli_3.6.1                SparseArray_1.2.2       
[43] magrittr_2.0.3           S4Arrays_1.2.0           GenomicFeatures_1.54.1  
[46] XML_3.99-0.15            utf8_1.2.4               rappdirs_0.3.3          
[49] filelock_1.0.2           prettyunits_1.2.0        bit64_4.0.5             
[52] httr_1.4.7               bit_4.0.5                png_0.1-8               
[55] hms_1.1.3                memoise_2.0.1            BiocIO_1.12.0           
[58] BiocFileCache_2.10.1     rtracklayer_1.62.0       rlang_1.1.2             
[61] glue_1.6.2               DBI_1.1.3                xml2_1.3.5              
[64] R6_2.5.1                 GenomicAlignments_1.38.0 zlibbioc_1.48.0   

The VCF files are actually very short (maybe that's also weird...) so I just attached the whole file here. I read it into R using read.table and then saved it using write.table.

snvs_example.txt

mtmorgan commented 2 months ago

I think the sample file is missing a line

##fileformat=VCFv4.1

or similar, which is a required field

igrabski commented 2 months ago

Thanks, that helps it run without errors but then I get an empty object?

> vcf <- readVcf('../../Downloads/dkfz_groebner_2018_exonic(1)/snvs_ICGC_MB64_somatic.vcf','hg19')
> vcf
class: CollapsedVCF 
dim: 16 0 
rowRanges(vcf):
  GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
  DataFrame with 1 column: INFO
  Fields with no header: INFO 
geno(vcf):
  List of length 0:
mtmorgan commented 2 months ago

I guess the files are also missing the ##INFO fields; I'm just parsing the specification and the example file system.file(package="VariantAnnotation", "extdata", "ex2.vcf"). I think the bottom line is that the file is malformed.