Closed ccasar closed 4 years ago
Thanks for raising your concerns. Regarding the bug in the filterExpression plot, we fixed it in the latest version on github.
Regarding merging data: The more samples, the better the model and the outlier calls The autoencoder should be able to correct for the different batches Merging data is only problematic when different sequencing protocols or tissues where used
When merging data, we recommend to: Check that the size factors are similar Check that a similar set of genes are expressed using the plotExpressedGenes function. Check the plotCountCorHeatmap before and after calling OUTRIDER (normalized = FALSE, normalized = TRUE). The batches should be gone after calling OUTRIDER.
Both outliers that were missed could be due to a very small sample size. In order to better call outliers, you can try to increase the sample size, for example by merging with the same tissue from GTEx.
Thanks for your suggestion on the findEncodingDim function. Can you please provide us with a minimal dataset or example that reproduces your issue. We are happy to incorporate your suggestion once we understand where and how the problem originates.
Best, Vicente
Hi Vicente,
thanks for the tips, we'll look into integrating matching GTEx datasets.
As for the reproducible example, I can't reproduce the error with the example data in your package, so I'll first need the permission to share our data with you.
Best, Christian
Hi Vicente,
can you give me an email address to where I can send our data?
The following code should be able to reproduce the errors we were facing:
library(OUTRIDER)
count_data <- read.table('~/countmatrix.tsv')
# Taken from: ftp://ftp.ensembl.org/pub/release-98/gtf/homo_sapiens/Homo_sapiens.GRCh38.98.gtf.gz
gtf_txdb <- makeTxDbFromGFF('~/Homo_sapiens.GRCh38.98.gtf')
snowparam <- SnowParam(workers = 6)
ods <- OutriderDataSet(countData = count_data)
ods <- filterExpression(x = ods, gtfFile = gtf_txdb, filterGenes = T, savefpkm = T)
ods <- ods[mcols(ods)$passedFilter,]
sampleExclusionMask(ods) <- FALSE
sampleExclusionMask(ods[,c(1,2)]) <- TRUE
ods <- findEncodingDim(ods = ods,
BPPARAM = snowparam)
findEncodingDim
stops with this error:
Error: BiocParallel errors
element index: 1, 2, 3, 4, 5, 6, ...
first error: all(is.finite(value)) is not TRUE
In addition: Warning message:
stop worker failed:
attempt to select less than one element in OneIndex
The error occurs when either using SerialParam, MultiCoreParam or SnowParam and I could reproduce it on two different systems. The error seems to occur only when setting sampleExclusionMask
and using the default value for the params
argument in findEncodingDim
, e.g. when using params = c(4,8)
there is no error.
Best, Christian
Hi Christian, Thanks, you can send me the count matrix to yepez 'at' in.tum.de. We are going to try to reproduce your issue and let you know once we find the bug. Best, Vicente
The email should be on its way.
Hi Chris,
unfortunately we could not reproduce your error yet. Could you please send your 'sessionInfo()'
Thanks
Sure, sessionInfo() for both system that can reproduce the error:
R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Manjaro Linux
Matrix products: default
BLAS: /usr/lib/libblas.so.3.9.0
LAPACK: /usr/lib/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=de_DE.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=de_DE.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices datasets utils methods base
other attached packages:
[1] OUTRIDER_1.4.0 data.table_1.12.6 SummarizedExperiment_1.16.0 DelayedArray_0.12.0
[5] matrixStats_0.55.0 GenomicFeatures_1.38.0 AnnotationDbi_1.48.0 Biobase_2.46.0
[9] GenomicRanges_1.38.0 GenomeInfoDb_1.22.0 IRanges_2.20.0 S4Vectors_0.24.0
[13] BiocGenerics_0.32.0 BiocParallel_1.20.0
loaded via a namespace (and not attached):
[1] colorspace_1.4-1 htmlTable_1.13.2 XVector_0.26.0 base64enc_0.1-3 rstudioapi_0.10
[6] bit64_0.9-7 codetools_0.2-16 splines_3.6.2 PRROC_1.3.1 geneplotter_1.64.0
[11] knitr_1.26 zeallot_0.1.0 Formula_1.2-3 jsonlite_1.6 Rsamtools_2.2.1
[16] annotate_1.64.0 cluster_2.1.0 dbplyr_1.4.2 pheatmap_1.0.12 compiler_3.6.2
[21] httr_1.4.1 backports_1.1.5 assertthat_0.2.1 Matrix_1.2-18 lazyeval_0.2.2
[26] acepack_1.4.1 htmltools_0.4.0 prettyunits_1.0.2 tools_3.6.2 gtable_0.3.0
[31] glue_1.3.1 GenomeInfoDbData_1.2.2 dplyr_0.8.3 rappdirs_0.3.1 Rcpp_1.0.3
[36] vctrs_0.2.0 Biostrings_2.54.0 gdata_2.18.0 rtracklayer_1.46.0 iterators_1.0.12
[41] xfun_0.11 stringr_1.4.0 lifecycle_0.1.0 renv_0.7.1-20 gtools_3.8.1
[46] XML_3.98-1.20 dendextend_1.12.0 MASS_7.3-51.4 zlibbioc_1.32.0 scales_1.1.0
[51] TSP_1.1-7 pcaMethods_1.78.0 hms_0.5.2 RColorBrewer_1.1-2 BBmisc_1.11
[56] curl_4.2 memoise_1.1.0 heatmaply_0.16.0 gridExtra_2.3 ggplot2_3.2.1
[61] biomaRt_2.42.0 rpart_4.1-15 latticeExtra_0.6-28 stringi_1.4.3 RSQLite_2.1.2
[66] genefilter_1.68.0 gclus_1.3.2 foreach_1.4.7 checkmate_1.9.4 seriation_1.2-8
[71] caTools_1.17.1.2 rlang_0.4.2 pkgconfig_2.0.3 bitops_1.0-6 lattice_0.20-38
[76] purrr_0.3.3 GenomicAlignments_1.22.1 htmlwidgets_1.5.1 bit_1.1-14 tidyselect_0.2.5
[81] plyr_1.8.4 magrittr_1.5 DESeq2_1.26.0 R6_2.4.1 gplots_3.0.1.1
[86] Hmisc_4.3-0 DBI_1.0.0 pillar_1.4.2 foreign_0.8-72 survival_3.1-8
[91] RCurl_1.95-4.12 nnet_7.3-12 tibble_2.1.3 crayon_1.3.4 KernSmooth_2.23-16
[96] BiocFileCache_1.10.2 plotly_4.9.1 viridis_0.5.1 progress_1.2.2 locfit_1.5-9.1
[101] grid_3.6.2 blob_1.2.0 digest_0.6.23 webshot_0.5.2 xtable_1.8-4
[106] tidyr_1.0.0 openssl_1.4.1 munsell_0.5.0 registry_0.5-1 viridisLite_0.3.0
[111] askpass_1.1
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices datasets utils
[8] methods base
other attached packages:
[1] OUTRIDER_1.4.0 data.table_1.12.6
[3] SummarizedExperiment_1.16.0 DelayedArray_0.12.0
[5] matrixStats_0.55.0 GenomicFeatures_1.38.0
[7] AnnotationDbi_1.48.0 Biobase_2.46.0
[9] GenomicRanges_1.38.0 GenomeInfoDb_1.22.0
[11] IRanges_2.20.0 S4Vectors_0.24.0
[13] BiocGenerics_0.32.0 BiocParallel_1.20.0
loaded via a namespace (and not attached):
[1] colorspace_1.4-1 htmlTable_1.13.2 XVector_0.26.0
[4] base64enc_0.1-3 rstudioapi_0.10 bit64_0.9-7
[7] codetools_0.2-16 splines_3.6.0 PRROC_1.3.1
[10] geneplotter_1.64.0 knitr_1.26 zeallot_0.1.0
[13] Formula_1.2-3 jsonlite_1.6 Rsamtools_2.2.1
[16] annotate_1.64.0 cluster_2.1.0 dbplyr_1.4.2
[19] pheatmap_1.0.12 compiler_3.6.0 httr_1.4.1
[22] backports_1.1.5 assertthat_0.2.1 Matrix_1.2-17
[25] lazyeval_0.2.2 acepack_1.4.1 htmltools_0.4.0
[28] prettyunits_1.0.2 tools_3.6.0 gtable_0.3.0
[31] glue_1.3.1 GenomeInfoDbData_1.2.2 dplyr_0.8.3
[34] rappdirs_0.3.1 Rcpp_1.0.3 vctrs_0.2.0
[37] Biostrings_2.54.0 gdata_2.18.0 rtracklayer_1.46.0
[40] iterators_1.0.12 xfun_0.11 stringr_1.4.0
[43] lifecycle_0.1.0 renv_0.8.3 gtools_3.8.1
[46] XML_3.98-1.20 dendextend_1.12.0 MASS_7.3-51.4
[49] zlibbioc_1.32.0 scales_1.1.0 TSP_1.1-7
[52] pcaMethods_1.78.0 hms_0.5.2 RColorBrewer_1.1-2
[55] BBmisc_1.11 curl_4.2 memoise_1.1.0
[58] heatmaply_0.16.0 gridExtra_2.3 ggplot2_3.2.1
[61] biomaRt_2.42.0 rpart_4.1-15 latticeExtra_0.6-28
[64] stringi_1.4.3 RSQLite_2.1.2 genefilter_1.68.0
[67] gclus_1.3.2 foreach_1.4.7 checkmate_1.9.4
[70] seriation_1.2-8 caTools_1.17.1.2 rlang_0.4.1
[73] pkgconfig_2.0.3 bitops_1.0-6 lattice_0.20-38
[76] purrr_0.3.3 GenomicAlignments_1.22.1 htmlwidgets_1.5.1
[79] bit_1.1-14 tidyselect_0.2.5 plyr_1.8.4
[82] magrittr_1.5 DESeq2_1.26.0 R6_2.4.1
[85] gplots_3.0.1.1 Hmisc_4.3-0 DBI_1.0.0
[88] pillar_1.4.2 foreign_0.8-72 survival_2.44-1.1
[91] RCurl_1.95-4.12 nnet_7.3-12 tibble_2.1.3
[94] crayon_1.3.4 KernSmooth_2.23-15 BiocFileCache_1.10.2
[97] plotly_4.9.1 viridis_0.5.1 progress_1.2.2
[100] locfit_1.5-9.1 grid_3.6.0 blob_1.2.0
[103] digest_0.6.22 webshot_0.5.1 xtable_1.8-4
[106] tidyr_1.0.0 openssl_1.4.1 munsell_0.5.0
[109] registry_0.5-1 viridisLite_0.3.0 askpass_1.1
I just tried setting up a centos 7 based singularity container and within the container I'm also able to reproduce the error. But I guess it's not too helpful for debugging.
Should I try to set up something like a VirtualBox image that can reproduce the error?
Hello, Christian,
The bug is fixed in the latest github version from a few hours ago.
It depended a bit on the injected outliers and could be reproduced on the second run. Also, it only occurred for certain dimensions, which took some time to figure out. The error was a bug with the exclusion mask, which resulted in values that were too high and were considered infinite. This problem is solved with cc94d3a
Thanks for the detailed information and reports. Nice to see you working with OUTRIDER.
Best wishes, Stefan
Hi Stefan,
yes, can confirm the current github version fixes the bug for me as well, thanks!
great perfect !
Hi,
we have already successfully used OUTRIDER to identify helpful insights into one of our RNA-Seq datasets, but some observation made us wondering if we are missing possible outlier candidates:
We have a base data set of 42 samples and a second data set from similar tissue but different sequencing batch consisting of 6 samples. We tried analysing them on their own and in combination. We observed at least one gene that had a significant adjusted pvalue (<0.1) only after adding the additional samples:
Judging from the plots the expression pattern looks pretty similar (pvalues shown in the popup seem to be incorrect, i.e. different than the ones reported in the actual data), but there was an overall negative shift in the median expression.
Both times we used findEncodingDim with default parameters and in both cases q = 4 was recommended.
We've also seen other cases were from a visual inspection and previous in vitro experiments genes seem aberrant, but OUTRIDER's adjust pvalues are > 0.5, e.g.:
Since you recommend not to combine different datasets due to OUTRIDER not being able to handle batch effects, can you think of a solution to better handle these possible outliers?
Side note: I had to slightly adapt your code, since for our dataset findEncodingDim always ran into errors when iterating through different q values. I've added some error catching to R/method-gridSearch.R: