PapenfussLab / StructuralVariantAnnotation

R package designed to simplify structural variant analysis
GNU General Public License v3.0

Unrecognised format for variants in CHM Huddleston2016 truth sets #20

Closed lsantuari closed 5 years ago

lsantuari commented 5 years ago

Hello,

When I try to load the truth sets (VCF files) from the Huddleston2016 dataset, I get the following error (for instance, for CHM1):

Error in .breakpointRanges(x, ...) : 
  Unrecognised format for variants chr1:710580_A/<INS>, ... (including other variant IDs)

Thank you in advance for the support.

Cheers, Luca

d-cameron commented 5 years ago

A clean install of R 3.6.1 with the following script:

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("StructuralVariantAnnotation")
library(StructuralVariantAnnotation)
vcf = readVcf("C:\\Users\\Daniel\\Downloads/CHM1_CHM13_pseudodiploid_SVs.vcf")
gr = breakpointRanges(vcf)

Gives the following output:

Warning message:
In .breakpointRanges(x, ...) :
  Found 9 duplicate row names (duplicates renamed).
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252   
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
 [1] StructuralVariantAnnotation_1.0.0 VariantAnnotation_1.30.1         
 [3] Rsamtools_2.0.0                   Biostrings_2.52.0                
 [5] XVector_0.24.0                    SummarizedExperiment_1.14.1      
 [7] DelayedArray_0.10.0               BiocParallel_1.18.0              
 [9] matrixStats_0.54.0                Biobase_2.44.0                   
[11] rtracklayer_1.44.2                GenomicRanges_1.36.0             
[13] GenomeInfoDb_1.20.0               IRanges_2.18.1                   
[15] S4Vectors_0.22.0                  BiocGenerics_0.30.0              

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2               compiler_3.6.1           pillar_1.4.2            
 [4] BiocManager_1.30.4       prettyunits_1.0.2        progress_1.2.2          
 [7] GenomicFeatures_1.36.4   bitops_1.0-6             tools_3.6.1             
[10] zlibbioc_1.30.0          biomaRt_2.40.3           digest_0.6.20           
[13] zeallot_0.1.0            bit_1.1-14               BSgenome_1.52.0         
[16] memoise_1.1.0            RSQLite_2.1.2            tibble_2.1.3            
[19] lattice_0.20-38          pkgconfig_2.0.2          rlang_0.4.0             
[22] Matrix_1.2-17            DBI_1.0.0                GenomeInfoDbData_1.2.1  
[25] dplyr_0.8.3              httr_1.4.0               stringr_1.4.0           
[28] hms_0.5.0                vctrs_0.2.0              tidyselect_0.2.5        
[31] bit64_0.9-7              grid_3.6.1               glue_1.3.1              
[34] R6_2.4.0                 AnnotationDbi_1.46.0     XML_3.98-1.20           
[37] purrr_0.3.2              magrittr_1.5             blob_1.2.0              
[40] backports_1.1.4          GenomicAlignments_1.20.1 assertthat_0.2.1        
[43] stringi_1.4.3            RCurl_1.95-4.12          crayon_1.3.4

d-cameron commented 5 years ago

You might have an old version that predates the fix for this bug: https://github.com/Bioconductor/VariantAnnotation/issues/19
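One way to check which versions are actually installed is sketched below. It uses only base R; the helper name `installed_version` is made up for illustration and is not part of either package.

```r
# Return the installed version of a package as a string, or NA if it is not installed.
# `installed_version` is a hypothetical helper, not part of any package.
installed_version <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_
  }
}

pkgs <- c("VariantAnnotation", "StructuralVariantAnnotation")
versions <- vapply(pkgs, installed_version, character(1))
print(versions)
```

Comparing these against the versions in the `sessionInfo()` output posted in this thread shows whether the two installations differ.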

lsantuari commented 5 years ago

Hi Daniel,

The file "CHM1_CHM13_pseudodiploid_SVs.vcf" works fine for me. However, the files "CHM1_SVs.annotated.vcf.gz" and "CHM13_SVs.annotated.vcf.gz" trigger the "Unrecognised format" error. Do they work for you?

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggplot2_3.2.0                     dplyr_0.8.3                       StructuralVariantAnnotation_1.1.0
 [4] VariantAnnotation_1.31.3          Rsamtools_2.1.3                   Biostrings_2.53.1                
 [7] XVector_0.25.0                    SummarizedExperiment_1.15.5       DelayedArray_0.11.4              
[10] BiocParallel_1.19.0               matrixStats_0.54.0                Biobase_2.45.0                   
[13] rtracklayer_1.45.1                GenomicRanges_1.37.14             GenomeInfoDb_1.21.1              
[16] IRanges_2.19.10                   S4Vectors_0.23.17                 BiocGenerics_0.31.5              

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1               lattice_0.20-38          prettyunits_1.0.2        assertthat_0.2.1         digest_0.6.20           
 [6] BiocFileCache_1.9.1      R6_2.4.0                 RSQLite_2.1.1            httr_1.4.0               pillar_1.4.2            
[11] zlibbioc_1.31.0          rlang_0.4.0              GenomicFeatures_1.37.3   progress_1.2.2           lazyeval_0.2.2          
[16] curl_3.3                 rstudioapi_0.10          blob_1.1.1               Matrix_1.2-17            labeling_0.3            
[21] stringr_1.4.0            RCurl_1.95-4.12          bit_1.1-14               biomaRt_2.41.6           munsell_0.5.0           
[26] compiler_3.6.1           pkgconfig_2.0.2          askpass_1.1              openssl_1.4              tidyselect_0.2.5        
[31] tibble_2.1.3             GenomeInfoDbData_1.2.1   XML_3.98-1.20            withr_2.1.2              crayon_1.3.4            
[36] dbplyr_1.4.2             GenomicAlignments_1.21.4 bitops_1.0-6             rappdirs_0.3.1           grid_3.6.1              
[41] gtable_0.3.0             DBI_1.0.0                magrittr_1.5             scales_1.0.0             stringi_1.4.3           
[46] tools_3.6.1              bit64_0.9-7              BSgenome_1.53.0          glue_1.3.1               purrr_0.3.2             
[51] hms_0.4.2                AnnotationDbi_1.47.0     colorspace_1.4-1         BiocManager_1.30.4       memoise_1.1.0

d-cameron commented 5 years ago

> However, the files "CHM1_SVs.annotated.vcf.gz" and "CHM13_SVs.annotated.vcf.gz" trigger the "Unrecognised format" error. Do they work for you?

They do not work for me because they are malformed VCFs that do not conform to the VCF specification. The problem is the SVTYPE field: if you search and replace their custom values (e.g. SVTYPE=insertion) with the spec-defined values (e.g. SVTYPE=INS), the files load fine.
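The search/replace could be scripted roughly as follows. This is a minimal sketch in base R, not a tested fix for these exact files. Only `SVTYPE=insertion` is named above; the `deletion = "DEL"` entry, the function name `normalise_svtype`, and the output file name are assumptions for illustration, so check which SVTYPE values the files really contain before relying on the mapping.

```r
# Rewrite non-standard SVTYPE values in VCF record lines to spec-defined ones.
# ASSUMPTION: only "insertion" is named in this thread; "deletion" is a guess.
normalise_svtype <- function(lines,
                             mapping = c(insertion = "INS", deletion = "DEL")) {
  is_record <- !startsWith(lines, "#")  # leave header lines untouched
  for (custom in names(mapping)) {
    # Match the whole value only: it must be followed by ';', a tab, or end of line.
    pattern <- paste0("SVTYPE=", custom, "(;|\t|$)")
    replacement <- paste0("SVTYPE=", mapping[[custom]], "\\1")
    lines[is_record] <- gsub(pattern, replacement, lines[is_record])
  }
  lines
}

in_path  <- "CHM1_SVs.annotated.vcf.gz"
out_path <- "CHM1_SVs.fixed.vcf"  # hypothetical output name
if (file.exists(in_path)) {
  con <- gzfile(in_path, "r")
  vcf_lines <- readLines(con)
  close(con)
  writeLines(normalise_svtype(vcf_lines), out_path)
  # The rewritten file can then be loaded as in the script earlier in the thread:
  # vcf <- readVcf(out_path)
  # gr  <- breakpointRanges(vcf)
}
```

Header lines are skipped so that any description text in the `##INFO` definitions is left as it was.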