Closed liggettla closed 7 years ago
Hi @liggettla . Could you give me an example file so I can reproduce the problem? Thanks!
Thanks for the quick response. I couldn't reproduce the problem. See my output:
> vcfs = read_vcfs_as_granges(vcf_files, sample_names, genome = "BSgenome.Hsapiens.UCSC.hg19")
Warning in keepSeqlevels(vcf, groups) :
invalid seqlevels‘chr21’, ‘chr22’were ignored
Warning in FUN(X[[i]], ...) :
34 position(s) with indels and multiple alternative alleles are removed.
> vcfs
GRangesList object of length 1:
$sampleA
GRanges object with 1138 ranges and 5 metadata columns:
seqnames ranges strand | paramRangeID
<Rle> <IRanges> <Rle> | <factor>
chr1:23392660_T/A chr1 [ 23392660, 23392660] * | <NA>
chr1:48063502_C/T chr1 [ 48063502, 48063502] * | <NA>
chr1:70914923_G/C chr1 [ 70914923, 70914923] * | <NA>
chr1:79463642_A/G chr1 [ 79463642, 79463642] * | <NA>
chr1:115227814_G/A chr1 [115227814, 115227814] * | <NA>
... ... ... ... . ...
chr18:22343090_G/T chr18 [22343090, 22343090] * | <NA>
chr18:22343095_G/A chr18 [22343095, 22343095] * | <NA>
chr18:51329242_A/T chr18 [51329242, 51329242] * | <NA>
chr20:23959915_A/G chr20 [23959915, 23959915] * | <NA>
chr19:41084477_G/A chr19 [41084477, 41084477] * | <NA>
REF ALT QUAL FILTER
<DNAStringSet> <DNAStringSetList> <numeric> <character>
chr1:23392660_T/A T A 141646.0 .
chr1:48063502_C/T C T 14079.5 .
chr1:70914923_G/C G C 15511.1 .
chr1:79463642_A/G A G 23866.2 .
chr1:115227814_G/A G A 0.0 .
... ... ... ... ...
chr18:22343090_G/T G T 4.95972e+04 .
chr18:22343095_G/A G A 4.83326e+04 .
chr18:51329242_A/T A T 1.68995e+04 .
chr20:23959915_A/G A G 5.22666e-13 .
chr19:41084477_G/A G A 4.96633e-14 .
-------
seqinfo: 22 sequences from hg19 genome
>
So it's hard for me to figure out what's wrong.
Could you give me the output of sessionInfo()
? That way I can try to get an equal environment.
Thanks again for reporting the issue!
Happy to help. It is possible that this is being caused by some dependency inconsistency. I am using a different computer than yesterday, and this might be causing the problem.
Here is the sessionInfo() output: R version 3.4.0 (2017-04-21) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 16.04.2 LTS
Matrix products: default BLAS: /usr/lib/libblas/libblas.so.3.6.0 LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] MutationalPatterns_1.1.4 BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.44.0
[4] rtracklayer_1.36.0 Biostrings_2.44.0 XVector_0.16.0
[7] NMF_0.20.6 synchronicity_1.1.9.1 bigmemory_4.5.19
[10] bigmemory.sri_0.1.3 Biobase_2.36.0 cluster_2.0.6
[13] rngtools_1.2.4 pkgmaker_0.22 registry_0.3
[16] GenomicRanges_1.28.0 GenomeInfoDb_1.12.0 IRanges_2.10.0
[19] S4Vectors_0.14.0 BiocGenerics_0.22.0 ggplot2_2.2.1
[22] BiocInstaller_1.26.0 devtools_1.12.0
loaded via a namespace (and not attached):
[1] httr_1.2.1 jsonlite_1.4 foreach_1.4.3 GenomeInfoDbData_0.99.0
[5] Rsamtools_1.28.0 yaml_2.1.14 RSQLite_1.1-2 backports_1.0.5
[9] lattice_0.20-35 quadprog_1.5-5 digest_0.6.12 RColorBrewer_1.1-2
[13] colorspace_1.3-2 htmltools_0.3.5 Matrix_1.2-8 plyr_1.8.4
[17] XML_3.98-1.6 biomaRt_2.32.0 zlibbioc_1.22.0 xtable_1.8-2
[21] scales_0.4.1 BiocParallel_1.10.0 pracma_2.0.4 git2r_0.18.0
[25] tibble_1.3.0 withr_1.0.2 SummarizedExperiment_1.6.0 GenomicFeatures_1.28.0
[29] lazyeval_0.2.0 magrittr_1.5 memoise_1.1.0 evaluate_0.10
[33] doParallel_1.0.10 tools_3.4.0 matrixStats_0.52.2 gridBase_0.4-7
[37] stringr_1.2.0 munsell_0.4.3 DelayedArray_0.2.0 AnnotationDbi_1.38.0
[41] compiler_3.4.0 grid_3.4.0 RCurl_1.95-4.8 iterators_1.0.8
[45] VariantAnnotation_1.22.0 bitops_1.0-6 base64enc_0.1-3 labeling_0.3
[49] rmarkdown_1.5 gtable_0.2.0 codetools_0.2-15 DBI_0.6-1
[53] curl_2.5 reshape2_1.4.2 R6_2.2.0 GenomicAlignments_1.12.0
[57] gridExtra_2.2.1 knitr_1.15.1 rprojroot_1.2 stringi_1.1.5
[61] Rcpp_0.12.10
Thanks for the sessionInfo()
. That looks a lot like my computer at work. Unfortunately, I am not there, so I have to postpone finding a solution to tomorrow, or even Friday.
If you want to try this, you could try to load the sample data that comes with the package (see the commands in the vignette. In the case that works, then we might have to look at the VCF contents. Otherwise, we know it's either a bug in the code, or a dependency problem.
I will let you know the results of loading your sample file at my work computer.
Good idea, i just grabbed two of your samples and re-ran the code and everything works as expected. I will play with my vcfs to see if I can figure out what the difference might be. I really didn't expect the vcf files to be the difference as everything was working with the same vcf files on my other computer yesterday. In any case I'll see if I can figure out what is going on.
We use VariantAnnotation
's readVcf()
to read the VCF files. I am also interested in knowing what the differences are, so that maybe we can do something about it.
I have some interesting results.
> vcfs = read_vcfs_as_granges(vcf_files, sample_names, genome = "BSgenome.Hsapiens.UCSC.hg19")
Error in keepSeqlevels(vcf, groups) : invalid seqlevels: chr21, chr22
Then if I modify a.vcf
to include the following contigs:
##contig=<ID=chr21,length=48129895>
##contig=<ID=chr22,length=51304566>
##contig=<ID=chrX,length=155270560>
##contig=<ID=chrY,length=59373566>
##contig=<ID=chrM,length=16571>
It seems to run fine:
> vcfs = read_vcfs_as_granges(vcf_files, sample_names, genome = "BSgenome.Hsapiens.UCSC.hg19")
Warning in FUN(X[[i]], ...) :
34 position(s) with indels and multiple alternative alleles are removed.
So, in a.vcf
, there are no chr21 and chr22 variants, which could explain the problem.
If I then remove all the ##contig
lines from the header in a.vcf
, I get:
> vcfs = read_vcfs_as_granges(vcf_files, sample_names, genome = "BSgenome.Hsapiens.UCSC.hg19")
Error in keepSeqlevels(vcf, groups) :
invalid seqlevels: chr13, chr14, chr21, chr22, chrY
Then, when I include only the ##contig
lines for chr13, chr14, chr21, chr22, and chrY, I get:
> vcfs = read_vcfs_as_granges(vcf_files, sample_names, genome = "BSgenome.Hsapiens.UCSC.hg19")
Warning in .Seqinfo.mergexy(x, y) :
The 2 combined objects have no sequence levels in common. (Use
suppressWarnings() to suppress this warning.)
Warning in FUN(X[[i]], ...) :
34 position(s) with indels and multiple alternative alleles are removed.
So, you'll have to make the VCF files compatible with the way VariantAnnotation
s readVcf()
handles them, and that seems to be a bit tricky. If you find out more, please let me know.
Here is my sessionInfo()
:
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS
Matrix products: default
BLAS: /gnu/store/vr0005kgawx69vx97r86a0g5ispjrhhs-openblas-0.2.19/lib/libopenblasp-r0.2.19.so
LAPACK: /gnu/store/ilcapk65g6p0483nilbyxzwjy2imlryr-r-minimal-3.4.0/lib/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.44.0
[3] rtracklayer_1.36.0 Biostrings_2.44.0
[5] XVector_0.16.0 MutationalPatterns_1.2.0
[7] NMF_0.20.6 synchronicity_1.1.9.1
[9] bigmemory_4.5.19 bigmemory.sri_0.1.3
[11] Biobase_2.36.0 cluster_2.0.6
[13] rngtools_1.2.4 pkgmaker_0.22
[15] registry_0.3 GenomicRanges_1.28.0
[17] GenomeInfoDb_1.12.0 IRanges_2.10.0
[19] S4Vectors_0.14.0 BiocGenerics_0.22.0
loaded via a namespace (and not attached):
[1] SummarizedExperiment_1.6.0 VariantAnnotation_1.22.0
[3] reshape2_1.4.2 lattice_0.20-35
[5] colorspace_1.3-2 GenomicFeatures_1.28.0
[7] pracma_2.0.4 XML_3.98-1.6
[9] DBI_0.6-1 BiocParallel_1.10.0
[11] RColorBrewer_1.1-2 matrixStats_0.52.2
[13] GenomeInfoDbData_0.99.0 foreach_1.4.3
[15] plyr_1.8.4 stringr_1.2.0
[17] zlibbioc_1.22.0 munsell_0.4.3
[19] gtable_0.2.0 codetools_0.2-15
[21] memoise_1.1.0 biomaRt_2.32.0
[23] doParallel_1.0.10 AnnotationDbi_1.38.0
[25] Rcpp_0.12.10 xtable_1.8-2
[27] scales_0.4.1 DelayedArray_0.2.0
[29] gridExtra_2.2.1 Rsamtools_1.28.0
[31] ggplot2_2.2.1 digest_0.6.12
[33] stringi_1.1.5 grid_3.4.0
[35] quadprog_1.5-5 tools_3.4.0
[37] bitops_1.0-6 magrittr_1.5
[39] lazyeval_0.2.0 RCurl_1.95-0.1.2
[41] tibble_1.3.0 RSQLite_1.1-2
[43] Matrix_1.2-8 gridBase_0.4-7
[45] iterators_1.0.8 GenomicAlignments_1.12.0
[47] compiler_3.4.0
Hmm that is strange that the vcf is causing the problems. I will try modifying it when I'm back on my desktop to see if I can replicate what you are seeing. I did try running everything again on my laptop with the same vcf file and everything is still running fine without modifying the input, so it seems like a one of the packages must be functioning a bit differently. I figured MutationalPatterns
would be the culprit, but I am running v1.1.4 on both of my machines and you have 1.2.0, so probably not. I did just check what is similar between your computer and my desktop where processing isn't working and different from my laptop where the processing is working, and I get the following list. Maybe something there is causing the problem.
I'll keep playing around, but I also put all our package versions into a script if it makes it a little easier to look at the differences. packageDiffs.zip
BSgenome_1.44.0
rtracklayer_1.36.0
synchronicity_1.1.9.1
bigmemory_4.5.19
bigmemory.sri_0.1.3
GenomicRanges_1.28.0
GenomeInfoDb_1.12.0
IRanges_2.10.0
S4Vectors_0.14.0
BiocGenerics_0.22.0
GenomeInfoDbData_0.99.0
Rsamtools_1.28.0
biomaRt_2.32.0
zlibbioc_1.22.0
BiocParallel_1.10.0
SummarizedExperiment_1.6.0
GenomicFeatures_1.28.0
matrixStats_0.52.2
DelayedArray_0.2.0
AnnotationDbi_1.38.0
VariantAnnotation_1.22.0
GenomicAlignments_1.12.0
This is strange indeed. FWIW, MutationalPatterns 1.1.4 is the same as 1.2.0. We submitted 1.1.4 to Bioconductor, which they bumped to 1.2.0 when Bioconductor 3.5 came out. Code-wise it is exactly the same.
Thanks for making the diff script. I could give you an exact copy of my R environment which can be extracted in /gnu
(so it does not interfere with your existing environment). But since that doesn't seem to be working as expected, that may not be too useful.
I'll let you know when I think of something that could help. Please let me know when there's anything I could do or test for you.
Hi there, I am having the same problem as @liggettla when trying to load my own VCFs. I can, however, load the example data without issue. My sessionInfo is:
sessionInfo() R version 3.4.0 (2017-04-21) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 16.04.2 LTS
Matrix products: default BLAS: /home/cog/jvalleinclan/bin/R-3.4.0/lib/libRblas.so LAPACK: /home/cog/jvalleinclan/bin/R-3.4.0/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] MutationalPatterns_1.2.0 NMF_0.20.6 Biobase_2.36.0 cluster_2.0.6 rngtools_1.2.4
[6] pkgmaker_0.22 registry_0.3 BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.44.0 rtracklayer_1.36.0
[11] Biostrings_2.44.0 XVector_0.16.0 GenomicRanges_1.28.0 GenomeInfoDb_1.12.0 IRanges_2.10.0
[16] S4Vectors_0.14.0 BiocGenerics_0.22.0 BiocInstaller_1.26.0
loaded via a namespace (and not attached):
[1] SummarizedExperiment_1.6.0 VariantAnnotation_1.22.0 reshape2_1.4.2 lattice_0.20-35 colorspace_1.3-2 GenomicFeatures_1.28.0
[7] pracma_2.0.4 XML_3.98-1.6 DBI_0.6-1 BiocParallel_1.10.0 RColorBrewer_1.1-2 matrixStats_0.52.2
[13] GenomeInfoDbData_0.99.0 foreach_1.4.3 plyr_1.8.4 stringr_1.2.0 zlibbioc_1.22.0 munsell_0.4.3
[19] gtable_0.2.0 codetools_0.2-15 memoise_1.1.0 biomaRt_2.32.0 doParallel_1.0.10 AnnotationDbi_1.38.0
[25] Rcpp_0.12.10 xtable_1.8-2 scales_0.4.1 DelayedArray_0.2.0 gridExtra_2.2.1 Rsamtools_1.28.0
[31] ggplot2_2.2.1 digest_0.6.12 stringi_1.1.5 grid_3.4.0 quadprog_1.5-5 tools_3.4.0
[37] bitops_1.0-6 magrittr_1.5 RCurl_1.95-4.8 lazyeval_0.2.0 tibble_1.3.0 RSQLite_1.1-2
[43] Matrix_1.2-10 gridBase_0.4-7 iterators_1.0.8 GenomicAlignments_1.12.0 compiler_3.4.0
Following your discussion, I checked and in my file there are variants in each chromosome and all those chromosomes are being referred in the contig headers. So any idea/help on how to modify these VCFs to make them MutationalPatterns / VariantAnnotations compatible would be appreciated. Cheers, Jose
Hi @jaesvi . Thanks for joining in on this issue. Which program was used to create the VCF file?
Could you try to load them with readVcf()
from the VariantAnnotation
package, and see if that leads to an error? Maybe that will give us a better clue on what's wrong.
I don't think manually editing the VCF files is a proper solution. If you have sample data to share that produces the same issue, then please submit them here, or e-mail me at the address in the DESCRIPTION file of this package, so I can reproduce the same error locally.
Thanks for your time.
@jaesvi @liggettla : We have most likely found a fix for the problem. Could you try using the Github version (that includes the fix)?
This involves running the following code:
library(devtools)
install_github("UMCUgenetics/MutationalPatterns")
Please let me know whether loading your data now works.
So I reran everything with the newest build on your github and everything seems to be working as expected. I did initially get an error of an unexpected length of chromosome 19 which I have as 53576266
in my vcf files. When I changed it to the expected length of 59128983
in the vcf files everything worked. But it seems like this might have been caused by some variables saved from previous sessions because I cannot seem to get the error again even with the original length of 53576266
in the vcfs, so probably not an issue on your end. Thanks for the help.
Thanks @liggettla. I'll wait for the results of @jaesvi before pushing this bugfix to Bioconductor.
Unfortunately it does not work for me. When I try to load one VCF instead of two I get a more informative error message: vcfs <- read_vcfs_as_granges(vcf_files = vcf_file_test, sample_names = sample_name, genome="BSgenome.Hsapiens.UCSC.hg19") Warning in .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames Error in GenomeInfoDb:::getDanglingSeqlevels(x, new2old = new2old, force = force, : The following seqlevels are to be dropped but are currently in use (i.e. have ranges on them): chrM. Please use the 'pruning.mode' argument to control how to prune 'x', that is, how to remove the ranges in 'x' that are on these sequences. For example, do something like:
seqlevels(x, pruning.mode="coarse") <- new_seqlevels
or
keepSeqlevels(x, new_seqlevels, pruning.mode="coarse")
See ?seqinfo for a description of the pruning modes. This actually makes me realize that I may be facing the old '1' vs 'chr1' problem. Is it the case or should it be compatible? I can send you an e-mail with an example VCF file.
The function can handle the 1
vs chr1
problem -- it converts to a single naming convention.
It would be great if we could test and reproduce it with your VCF file, so please do e-mail.
Thank you for providing the sample data.
The error indeed has to do with the MT variants. The read_vcfs_as_granges
function would have removed the MT variants, if it weren't for the error.
If you want to keep the MT variants, you could specify it with the group="all"
parameter, so:
read_vcfs_as_granges(vcf_files, sample_names, genome = "BSgenome.Hsapiens.UCSC.hg19", group = "all")
In commit 581b818 we changed this behaviour to remove variants that aren't included using the "group" parameter. So, if you don't want to keep the MT variants in your analysis, you can use the latest Github version.
Please let me know whether this indeed solves the problem for you.
Hi, Indeed the latest commit solves the problem and I can load not only my test file but all my VCF files. Thanks a lot.
Great!
Thanks @liggettla and @jaesvi for reporting the issues. The fixes will be in the next release that we will make available through Bioconductor.
I have a code block that was working a couple of days ago but now is throwing an error, and it is when I use the following code:
vcfs <- read_vcfs_as_granges(vcf_files, sample_names, genome = "BSgenome.Hsapiens.UCSC.hg19")
which gives me this error:Error in (function (...) : all elements in '...' must be GRanges objects
To troubleshoot i'm just trying to read in two identical vcf files which read in okay, there just seems to be a problem when converting to GRanges objects. I tried just playing with
GRanges()
directly but I couldn't figure out a fix.