UMCUGenetics / MutationalPatterns

R package for extracting and visualizing mutational patterns in base substitution catalogues
MIT License
104 stars 45 forks source link

Error converting vcfs to GRanges obejects #14

Closed liggettla closed 7 years ago

liggettla commented 7 years ago

I have a code block that was working a couple of days ago but now is throwing an error, and it is when I use the following code: vcfs <- read_vcfs_as_granges(vcf_files, sample_names, genome = "BSgenome.Hsapiens.UCSC.hg19") which gives me this error: Error in (function (...) : all elements in '...' must be GRanges objects

To troubleshoot i'm just trying to read in two identical vcf files which read in okay, there just seems to be a problem when converting to GRanges objects. I tried just playing with GRanges() directly but I couldn't figure out a fix.

roelj commented 7 years ago

Hi @liggettla . Could you give me an example file so I can reproduce the problem? Thanks!

liggettla commented 7 years ago

No problem here is a test file

a.vcf.zip

roelj commented 7 years ago

Thanks for the quick response. I couldn't reproduce the problem. See my output:

> vcfs = read_vcfs_as_granges(vcf_files, sample_names, genome = "BSgenome.Hsapiens.UCSC.hg19")
Warning in keepSeqlevels(vcf, groups) :
  invalid seqlevels‘chr21’, ‘chr22’were ignored
Warning in FUN(X[[i]], ...) :
  34 position(s) with indels and multiple alternative alleles are removed.
> vcfs
GRangesList object of length 1:
$sampleA 
GRanges object with 1138 ranges and 5 metadata columns:
                     seqnames                 ranges strand | paramRangeID
                        <Rle>              <IRanges>  <Rle> |     <factor>
   chr1:23392660_T/A     chr1 [ 23392660,  23392660]      * |         <NA>
   chr1:48063502_C/T     chr1 [ 48063502,  48063502]      * |         <NA>
   chr1:70914923_G/C     chr1 [ 70914923,  70914923]      * |         <NA>
   chr1:79463642_A/G     chr1 [ 79463642,  79463642]      * |         <NA>
  chr1:115227814_G/A     chr1 [115227814, 115227814]      * |         <NA>
                 ...      ...                    ...    ... .          ...
  chr18:22343090_G/T    chr18   [22343090, 22343090]      * |         <NA>
  chr18:22343095_G/A    chr18   [22343095, 22343095]      * |         <NA>
  chr18:51329242_A/T    chr18   [51329242, 51329242]      * |         <NA>
  chr20:23959915_A/G    chr20   [23959915, 23959915]      * |         <NA>
  chr19:41084477_G/A    chr19   [41084477, 41084477]      * |         <NA>
                                REF                ALT        QUAL      FILTER
                     <DNAStringSet> <DNAStringSetList>   <numeric> <character>
   chr1:23392660_T/A              T                  A    141646.0           .
   chr1:48063502_C/T              C                  T     14079.5           .
   chr1:70914923_G/C              G                  C     15511.1           .
   chr1:79463642_A/G              A                  G     23866.2           .
  chr1:115227814_G/A              G                  A         0.0           .
                 ...            ...                ...         ...         ...
  chr18:22343090_G/T              G                  T 4.95972e+04           .
  chr18:22343095_G/A              G                  A 4.83326e+04           .
  chr18:51329242_A/T              A                  T 1.68995e+04           .
  chr20:23959915_A/G              A                  G 5.22666e-13           .
  chr19:41084477_G/A              G                  A 4.96633e-14           .

-------
seqinfo: 22 sequences from hg19 genome
>

So it's hard for me to figure out what's wrong. Could you give me the output of sessionInfo()? That way I can try to get an equal environment.

Thanks again for reporting the issue!

liggettla commented 7 years ago

Happy to help. It is possible that this is being caused by some dependency inconsistency. I am using a different computer than yesterday, and this might be causing the problem.

Here is the sessionInfo() output: R version 3.4.0 (2017-04-21) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 16.04.2 LTS

Matrix products: default BLAS: /usr/lib/libblas/libblas.so.3.6.0 LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages: [1] MutationalPatterns_1.1.4 BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.44.0
[4] rtracklayer_1.36.0 Biostrings_2.44.0 XVector_0.16.0
[7] NMF_0.20.6 synchronicity_1.1.9.1 bigmemory_4.5.19
[10] bigmemory.sri_0.1.3 Biobase_2.36.0 cluster_2.0.6
[13] rngtools_1.2.4 pkgmaker_0.22 registry_0.3
[16] GenomicRanges_1.28.0 GenomeInfoDb_1.12.0 IRanges_2.10.0
[19] S4Vectors_0.14.0 BiocGenerics_0.22.0 ggplot2_2.2.1
[22] BiocInstaller_1.26.0 devtools_1.12.0

loaded via a namespace (and not attached): [1] httr_1.2.1 jsonlite_1.4 foreach_1.4.3 GenomeInfoDbData_0.99.0
[5] Rsamtools_1.28.0 yaml_2.1.14 RSQLite_1.1-2 backports_1.0.5
[9] lattice_0.20-35 quadprog_1.5-5 digest_0.6.12 RColorBrewer_1.1-2
[13] colorspace_1.3-2 htmltools_0.3.5 Matrix_1.2-8 plyr_1.8.4
[17] XML_3.98-1.6 biomaRt_2.32.0 zlibbioc_1.22.0 xtable_1.8-2
[21] scales_0.4.1 BiocParallel_1.10.0 pracma_2.0.4 git2r_0.18.0
[25] tibble_1.3.0 withr_1.0.2 SummarizedExperiment_1.6.0 GenomicFeatures_1.28.0
[29] lazyeval_0.2.0 magrittr_1.5 memoise_1.1.0 evaluate_0.10
[33] doParallel_1.0.10 tools_3.4.0 matrixStats_0.52.2 gridBase_0.4-7
[37] stringr_1.2.0 munsell_0.4.3 DelayedArray_0.2.0 AnnotationDbi_1.38.0
[41] compiler_3.4.0 grid_3.4.0 RCurl_1.95-4.8 iterators_1.0.8
[45] VariantAnnotation_1.22.0 bitops_1.0-6 base64enc_0.1-3 labeling_0.3
[49] rmarkdown_1.5 gtable_0.2.0 codetools_0.2-15 DBI_0.6-1
[53] curl_2.5 reshape2_1.4.2 R6_2.2.0 GenomicAlignments_1.12.0
[57] gridExtra_2.2.1 knitr_1.15.1 rprojroot_1.2 stringi_1.1.5
[61] Rcpp_0.12.10

roelj commented 7 years ago

Thanks for the sessionInfo(). That looks a lot like my computer at work. Unfortunately, I am not there, so I have to postpone finding a solution to tomorrow, or even Friday.

If you want to try this, you could try to load the sample data that comes with the package (see the commands in the vignette. In the case that works, then we might have to look at the VCF contents. Otherwise, we know it's either a bug in the code, or a dependency problem.

I will let you know the results of loading your sample file at my work computer.

liggettla commented 7 years ago

Good idea, i just grabbed two of your samples and re-ran the code and everything works as expected. I will play with my vcfs to see if I can figure out what the difference might be. I really didn't expect the vcf files to be the difference as everything was working with the same vcf files on my other computer yesterday. In any case I'll see if I can figure out what is going on.

roelj commented 7 years ago

We use VariantAnnotation's readVcf() to read the VCF files. I am also interested in knowing what the differences are, so that maybe we can do something about it.

roelj commented 7 years ago

I have some interesting results.

> vcfs = read_vcfs_as_granges(vcf_files, sample_names, genome = "BSgenome.Hsapiens.UCSC.hg19")
Error in keepSeqlevels(vcf, groups) : invalid seqlevels: chr21, chr22

Then if I modify a.vcf to include the following contigs:

##contig=<ID=chr21,length=48129895>
##contig=<ID=chr22,length=51304566>
##contig=<ID=chrX,length=155270560>
##contig=<ID=chrY,length=59373566>
##contig=<ID=chrM,length=16571>

It seems to run fine:

> vcfs = read_vcfs_as_granges(vcf_files, sample_names, genome = "BSgenome.Hsapiens.UCSC.hg19")
Warning in FUN(X[[i]], ...) :
  34 position(s) with indels and multiple alternative alleles are removed.

So, in a.vcf, there are no chr21 and chr22 variants, which could explain the problem. If I then remove all the ##contig lines from the header in a.vcf, I get:

> vcfs = read_vcfs_as_granges(vcf_files, sample_names, genome = "BSgenome.Hsapiens.UCSC.hg19")
Error in keepSeqlevels(vcf, groups) : 
  invalid seqlevels: chr13, chr14, chr21, chr22, chrY

Then, when I include only the ##contig lines for chr13, chr14, chr21, chr22, and chrY, I get:

> vcfs = read_vcfs_as_granges(vcf_files, sample_names, genome = "BSgenome.Hsapiens.UCSC.hg19")
Warning in .Seqinfo.mergexy(x, y) :
  The 2 combined objects have no sequence levels in common. (Use
  suppressWarnings() to suppress this warning.)
Warning in FUN(X[[i]], ...) :
  34 position(s) with indels and multiple alternative alleles are removed.

So, you'll have to make the VCF files compatible with the way VariantAnnotations readVcf() handles them, and that seems to be a bit tricky. If you find out more, please let me know.

Here is my sessionInfo():

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

Matrix products: default
BLAS: /gnu/store/vr0005kgawx69vx97r86a0g5ispjrhhs-openblas-0.2.19/lib/libopenblasp-r0.2.19.so
LAPACK: /gnu/store/ilcapk65g6p0483nilbyxzwjy2imlryr-r-minimal-3.4.0/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.44.0                  
 [3] rtracklayer_1.36.0                Biostrings_2.44.0                
 [5] XVector_0.16.0                    MutationalPatterns_1.2.0         
 [7] NMF_0.20.6                        synchronicity_1.1.9.1            
 [9] bigmemory_4.5.19                  bigmemory.sri_0.1.3              
[11] Biobase_2.36.0                    cluster_2.0.6                    
[13] rngtools_1.2.4                    pkgmaker_0.22                    
[15] registry_0.3                      GenomicRanges_1.28.0             
[17] GenomeInfoDb_1.12.0               IRanges_2.10.0                   
[19] S4Vectors_0.14.0                  BiocGenerics_0.22.0              

loaded via a namespace (and not attached):
 [1] SummarizedExperiment_1.6.0 VariantAnnotation_1.22.0  
 [3] reshape2_1.4.2             lattice_0.20-35           
 [5] colorspace_1.3-2           GenomicFeatures_1.28.0    
 [7] pracma_2.0.4               XML_3.98-1.6              
 [9] DBI_0.6-1                  BiocParallel_1.10.0       
[11] RColorBrewer_1.1-2         matrixStats_0.52.2        
[13] GenomeInfoDbData_0.99.0    foreach_1.4.3             
[15] plyr_1.8.4                 stringr_1.2.0             
[17] zlibbioc_1.22.0            munsell_0.4.3             
[19] gtable_0.2.0               codetools_0.2-15          
[21] memoise_1.1.0              biomaRt_2.32.0            
[23] doParallel_1.0.10          AnnotationDbi_1.38.0      
[25] Rcpp_0.12.10               xtable_1.8-2              
[27] scales_0.4.1               DelayedArray_0.2.0        
[29] gridExtra_2.2.1            Rsamtools_1.28.0          
[31] ggplot2_2.2.1              digest_0.6.12             
[33] stringi_1.1.5              grid_3.4.0                
[35] quadprog_1.5-5             tools_3.4.0               
[37] bitops_1.0-6               magrittr_1.5              
[39] lazyeval_0.2.0             RCurl_1.95-0.1.2          
[41] tibble_1.3.0               RSQLite_1.1-2             
[43] Matrix_1.2-8               gridBase_0.4-7            
[45] iterators_1.0.8            GenomicAlignments_1.12.0  
[47] compiler_3.4.0            
liggettla commented 7 years ago

Hmm that is strange that the vcf is causing the problems. I will try modifying it when I'm back on my desktop to see if I can replicate what you are seeing. I did try running everything again on my laptop with the same vcf file and everything is still running fine without modifying the input, so it seems like a one of the packages must be functioning a bit differently. I figured MutationalPatterns would be the culprit, but I am running v1.1.4 on both of my machines and you have 1.2.0, so probably not. I did just check what is similar between your computer and my desktop where processing isn't working and different from my laptop where the processing is working, and I get the following list. Maybe something there is causing the problem.

I'll keep playing around, but I also put all our package versions into a script if it makes it a little easier to look at the differences. packageDiffs.zip

BSgenome_1.44.0
rtracklayer_1.36.0
synchronicity_1.1.9.1
bigmemory_4.5.19
bigmemory.sri_0.1.3
GenomicRanges_1.28.0
GenomeInfoDb_1.12.0
IRanges_2.10.0
S4Vectors_0.14.0
BiocGenerics_0.22.0
GenomeInfoDbData_0.99.0
Rsamtools_1.28.0
biomaRt_2.32.0
zlibbioc_1.22.0
BiocParallel_1.10.0
SummarizedExperiment_1.6.0
GenomicFeatures_1.28.0
matrixStats_0.52.2
DelayedArray_0.2.0
AnnotationDbi_1.38.0
VariantAnnotation_1.22.0
GenomicAlignments_1.12.0
roelj commented 7 years ago

This is strange indeed. FWIW, MutationalPatterns 1.1.4 is the same as 1.2.0. We submitted 1.1.4 to Bioconductor, which they bumped to 1.2.0 when Bioconductor 3.5 came out. Code-wise it is exactly the same.

Thanks for making the diff script. I could give you an exact copy of my R environment which can be extracted in /gnu (so it does not interfere with your existing environment). But since that doesn't seem to be working as expected, that may not be too useful.

I'll let you know when I think of something that could help. Please let me know when there's anything I could do or test for you.

jaesvi commented 7 years ago

Hi there, I am having the same problem as @liggettla when trying to load my own VCFs. I can, however, load the example data without issue. My sessionInfo is:

sessionInfo() R version 3.4.0 (2017-04-21) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 16.04.2 LTS

Matrix products: default BLAS: /home/cog/jvalleinclan/bin/R-3.4.0/lib/libRblas.so LAPACK: /home/cog/jvalleinclan/bin/R-3.4.0/lib/libRlapack.so

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages: [1] MutationalPatterns_1.2.0 NMF_0.20.6 Biobase_2.36.0 cluster_2.0.6 rngtools_1.2.4
[6] pkgmaker_0.22 registry_0.3 BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.44.0 rtracklayer_1.36.0
[11] Biostrings_2.44.0 XVector_0.16.0 GenomicRanges_1.28.0 GenomeInfoDb_1.12.0 IRanges_2.10.0
[16] S4Vectors_0.14.0 BiocGenerics_0.22.0 BiocInstaller_1.26.0

loaded via a namespace (and not attached): [1] SummarizedExperiment_1.6.0 VariantAnnotation_1.22.0 reshape2_1.4.2 lattice_0.20-35 colorspace_1.3-2 GenomicFeatures_1.28.0
[7] pracma_2.0.4 XML_3.98-1.6 DBI_0.6-1 BiocParallel_1.10.0 RColorBrewer_1.1-2 matrixStats_0.52.2
[13] GenomeInfoDbData_0.99.0 foreach_1.4.3 plyr_1.8.4 stringr_1.2.0 zlibbioc_1.22.0 munsell_0.4.3
[19] gtable_0.2.0 codetools_0.2-15 memoise_1.1.0 biomaRt_2.32.0 doParallel_1.0.10 AnnotationDbi_1.38.0
[25] Rcpp_0.12.10 xtable_1.8-2 scales_0.4.1 DelayedArray_0.2.0 gridExtra_2.2.1 Rsamtools_1.28.0
[31] ggplot2_2.2.1 digest_0.6.12 stringi_1.1.5 grid_3.4.0 quadprog_1.5-5 tools_3.4.0
[37] bitops_1.0-6 magrittr_1.5 RCurl_1.95-4.8 lazyeval_0.2.0 tibble_1.3.0 RSQLite_1.1-2
[43] Matrix_1.2-10 gridBase_0.4-7 iterators_1.0.8 GenomicAlignments_1.12.0 compiler_3.4.0

Following your discussion, I checked and in my file there are variants in each chromosome and all those chromosomes are being referred in the contig headers. So any idea/help on how to modify these VCFs to make them MutationalPatterns / VariantAnnotations compatible would be appreciated. Cheers, Jose

roelj commented 7 years ago

Hi @jaesvi . Thanks for joining in on this issue. Which program was used to create the VCF file?

Could you try to load them with readVcf() from the VariantAnnotation package, and see if that leads to an error? Maybe that will give us a better clue on what's wrong.

I don't think manually editing the VCF files is a proper solution. If you have sample data to share that produces the same issue, then please submit them here, or e-mail me at the address in the DESCRIPTION file of this package, so I can reproduce the same error locally.

Thanks for your time.

roelj commented 7 years ago

@jaesvi @liggettla : We have most likely found a fix for the problem. Could you try using the Github version (that includes the fix)?

This involves running the following code:

library(devtools)
install_github("UMCUgenetics/MutationalPatterns")

Please let me know whether loading your data now works.

liggettla commented 7 years ago

So I reran everything with the newest build on your github and everything seems to be working as expected. I did initially get an error of an unexpected length of chromosome 19 which I have as 53576266 in my vcf files. When I changed it to the expected length of 59128983 in the vcf files everything worked. But it seems like this might have been caused by some variables saved from previous sessions because I cannot seem to get the error again even with the original length of 53576266 in the vcfs, so probably not an issue on your end. Thanks for the help.

roelj commented 7 years ago

Thanks @liggettla. I'll wait for the results of @jaesvi before pushing this bugfix to Bioconductor.

jaesvi commented 7 years ago

Unfortunately it does not work for me. When I try to load one VCF instead of two I get a more informative error message: vcfs <- read_vcfs_as_granges(vcf_files = vcf_file_test, sample_names = sample_name, genome="BSgenome.Hsapiens.UCSC.hg19") Warning in .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames Error in GenomeInfoDb:::getDanglingSeqlevels(x, new2old = new2old, force = force, : The following seqlevels are to be dropped but are currently in use (i.e. have ranges on them): chrM. Please use the 'pruning.mode' argument to control how to prune 'x', that is, how to remove the ranges in 'x' that are on these sequences. For example, do something like:

seqlevels(x, pruning.mode="coarse") <- new_seqlevels

or

keepSeqlevels(x, new_seqlevels, pruning.mode="coarse")

See ?seqinfo for a description of the pruning modes. This actually makes me realize that I may be facing the old '1' vs 'chr1' problem. Is it the case or should it be compatible? I can send you an e-mail with an example VCF file.

roelj commented 7 years ago

The function can handle the 1 vs chr1 problem -- it converts to a single naming convention. It would be great if we could test and reproduce it with your VCF file, so please do e-mail.

roelj commented 7 years ago

Thank you for providing the sample data.

The error indeed has to do with the MT variants. The read_vcfs_as_granges function would have removed the MT variants, if it weren't for the error.

If you want to keep the MT variants, you could specify it with the group="all" parameter, so:

read_vcfs_as_granges(vcf_files, sample_names, genome = "BSgenome.Hsapiens.UCSC.hg19", group = "all")

In commit 581b818 we changed this behaviour to remove variants that aren't included using the "group" parameter. So, if you don't want to keep the MT variants in your analysis, you can use the latest Github version.

Please let me know whether this indeed solves the problem for you.

jaesvi commented 7 years ago

Hi, Indeed the latest commit solves the problem and I can load not only my test file but all my VCF files. Thanks a lot.

roelj commented 7 years ago

Great!

Thanks @liggettla and @jaesvi for reporting the issues. The fixes will be in the next release that we will make available through Bioconductor.