gagneurlab / OUTRIDER

OUTRIDER: OUTlier in RNA-seq fInDER is an R-based framework to find aberrantly expressed genes in RNA-seq data
MIT License
49 stars 11 forks source link

Missing possible outliers due to high pvalues #20

Closed ccasar closed 4 years ago

ccasar commented 4 years ago

Hi,

we have already successfully used OUTRIDER to identify helpful insights into one of our RNA-Seq datasets, but some observation made us wondering if we are missing possible outlier candidates:

We have a base data set of 42 samples and a second data set from similar tissue but different sequencing batch consisting of 6 samples. We tried analysing them on their own and in combination. We observed at least one gene that had a significant adjusted pvalue (<0.1) only after adding the additional samples:

non_sig_gene

sig_gene

Judging from the plots the expression pattern looks pretty similar (pvalues shown in the popup seem to be incorrect, i.e. different than the ones reported in the actual data), but there was an overall negative shift in the median expression.

Both times we used findEncodingDim with default parameters and in both cases q = 4 was recommended.

We've also seen other cases were from a visual inspection and previous in vitro experiments genes seem aberrant, but OUTRIDER's adjust pvalues are > 0.5, e.g.:

non_aberrant

Since you recommend not to combine different datasets due to OUTRIDER not being able to handle batch effects, can you think of a solution to better handle these possible outliers?

Side note: I had to slightly adapt your code, since for our dataset findEncodingDim always ran into errors when iterating through different q values. I've added some error catching to R/method-gridSearch.R:

-    ods <- OUTRIDER(ods, controlData=TRUE,q=encoding_dim, BPPARAM=BPPARAM, ...)
-    eloss <- evalAucPRLoss(ods)
+   
+    eloss <- tryCatch({
+      ods <- OUTRIDER(ods, controlData=TRUE,q=encoding_dim, BPPARAM=BPPARAM, ...)
+      evalAucPRLoss(ods)
+    }, error = function(e) {
+      warning('Catching error during encoding search')
+      NaN
+    })
vyepez88 commented 4 years ago

Thanks for raising your concerns. Regarding the bug in the filterExpression plot, we fixed it in the latest version on github.

Regarding merging data: The more samples, the better the model and the outlier calls The autoencoder should be able to correct for the different batches Merging data is only problematic when different sequencing protocols or tissues where used

When merging data, we recommend to: Check that the size factors are similar Check that a similar set of genes are expressed using the plotExpressedGenes function. Check the plotCountCorHeatmap before and after calling OUTRIDER (normalized = FALSE, normalized = TRUE). The batches should be gone after calling OUTRIDER.

Both outliers that were missed could be due to a very small sample size. In order to better call outliers, you can try to increase the sample size, for example by merging with the same tissue from GTEx.

Thanks for your suggestion on the findEncodingDim function. Can you please provide us with a minimal dataset or example that reproduces your issue. We are happy to incorporate your suggestion once we understand where and how the problem originates.

Best, Vicente

ccasar commented 4 years ago

Hi Vicente,

thanks for the tips, we'll look into integrating matching GTEx datasets.

As for the reproducible example, I can't reproduce the error with the example data in your package, so I'll first need the permission to share our data with you.

Best, Christian

ccasar commented 4 years ago

Hi Vicente,

can you give me an email address to where I can send our data?

The following code should be able to reproduce the errors we were facing:

library(OUTRIDER)

count_data <- read.table('~/countmatrix.tsv')
# Taken from: ftp://ftp.ensembl.org/pub/release-98/gtf/homo_sapiens/Homo_sapiens.GRCh38.98.gtf.gz
gtf_txdb <- makeTxDbFromGFF('~/Homo_sapiens.GRCh38.98.gtf')
snowparam <- SnowParam(workers = 6) 

ods <- OutriderDataSet(countData = count_data)
ods <- filterExpression(x = ods, gtfFile = gtf_txdb, filterGenes = T, savefpkm = T)
ods <- ods[mcols(ods)$passedFilter,]

sampleExclusionMask(ods) <- FALSE
sampleExclusionMask(ods[,c(1,2)]) <- TRUE

ods <- findEncodingDim(ods = ods,
                       BPPARAM = snowparam)

findEncodingDim stops with this error:

Error: BiocParallel errors
  element index: 1, 2, 3, 4, 5, 6, ...
  first error: all(is.finite(value)) is not TRUE
In addition: Warning message:
stop worker failed:
  attempt to select less than one element in OneIndex 

The error occurs when either using SerialParam, MultiCoreParam or SnowParam and I could reproduce it on two different systems. The error seems to occur only when setting sampleExclusionMask and using the default value for the params argument in findEncodingDim, e.g. when using params = c(4,8) there is no error.

Best, Christian

vyepez88 commented 4 years ago

Hi Christian, Thanks, you can send me the count matrix to yepez 'at' in.tum.de. We are going to try to reproduce your issue and let you know once we find the bug. Best, Vicente

ccasar commented 4 years ago

The email should be on its way.

brechtmann commented 4 years ago

Hi Chris,

unfortunately we could not reproduce your error yet. Could you please send your 'sessionInfo()'

Thanks

ccasar commented 4 years ago

Sure, sessionInfo() for both system that can reproduce the error:

R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Manjaro Linux

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.9.0
LAPACK: /usr/lib/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=de_DE.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] OUTRIDER_1.4.0              data.table_1.12.6           SummarizedExperiment_1.16.0 DelayedArray_0.12.0        
 [5] matrixStats_0.55.0          GenomicFeatures_1.38.0      AnnotationDbi_1.48.0        Biobase_2.46.0             
 [9] GenomicRanges_1.38.0        GenomeInfoDb_1.22.0         IRanges_2.20.0              S4Vectors_0.24.0           
[13] BiocGenerics_0.32.0         BiocParallel_1.20.0        

loaded via a namespace (and not attached):
  [1] colorspace_1.4-1         htmlTable_1.13.2         XVector_0.26.0           base64enc_0.1-3          rstudioapi_0.10         
  [6] bit64_0.9-7              codetools_0.2-16         splines_3.6.2            PRROC_1.3.1              geneplotter_1.64.0      
 [11] knitr_1.26               zeallot_0.1.0            Formula_1.2-3            jsonlite_1.6             Rsamtools_2.2.1         
 [16] annotate_1.64.0          cluster_2.1.0            dbplyr_1.4.2             pheatmap_1.0.12          compiler_3.6.2          
 [21] httr_1.4.1               backports_1.1.5          assertthat_0.2.1         Matrix_1.2-18            lazyeval_0.2.2          
 [26] acepack_1.4.1            htmltools_0.4.0          prettyunits_1.0.2        tools_3.6.2              gtable_0.3.0            
 [31] glue_1.3.1               GenomeInfoDbData_1.2.2   dplyr_0.8.3              rappdirs_0.3.1           Rcpp_1.0.3              
 [36] vctrs_0.2.0              Biostrings_2.54.0        gdata_2.18.0             rtracklayer_1.46.0       iterators_1.0.12        
 [41] xfun_0.11                stringr_1.4.0            lifecycle_0.1.0          renv_0.7.1-20            gtools_3.8.1            
 [46] XML_3.98-1.20            dendextend_1.12.0        MASS_7.3-51.4            zlibbioc_1.32.0          scales_1.1.0            
 [51] TSP_1.1-7                pcaMethods_1.78.0        hms_0.5.2                RColorBrewer_1.1-2       BBmisc_1.11             
 [56] curl_4.2                 memoise_1.1.0            heatmaply_0.16.0         gridExtra_2.3            ggplot2_3.2.1           
 [61] biomaRt_2.42.0           rpart_4.1-15             latticeExtra_0.6-28      stringi_1.4.3            RSQLite_2.1.2           
 [66] genefilter_1.68.0        gclus_1.3.2              foreach_1.4.7            checkmate_1.9.4          seriation_1.2-8         
 [71] caTools_1.17.1.2         rlang_0.4.2              pkgconfig_2.0.3          bitops_1.0-6             lattice_0.20-38         
 [76] purrr_0.3.3              GenomicAlignments_1.22.1 htmlwidgets_1.5.1        bit_1.1-14               tidyselect_0.2.5        
 [81] plyr_1.8.4               magrittr_1.5             DESeq2_1.26.0            R6_2.4.1                 gplots_3.0.1.1          
 [86] Hmisc_4.3-0              DBI_1.0.0                pillar_1.4.2             foreign_0.8-72           survival_3.1-8          
 [91] RCurl_1.95-4.12          nnet_7.3-12              tibble_2.1.3             crayon_1.3.4             KernSmooth_2.23-16      
 [96] BiocFileCache_1.10.2     plotly_4.9.1             viridis_0.5.1            progress_1.2.2           locfit_1.5-9.1          
[101] grid_3.6.2               blob_1.2.0               digest_0.6.23            webshot_0.5.2            xtable_1.8-4            
[106] tidyr_1.0.0              openssl_1.4.1            munsell_0.5.0            registry_0.5-1           viridisLite_0.3.0       
[111] askpass_1.1
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices datasets  utils    
[8] methods   base     

other attached packages:
 [1] OUTRIDER_1.4.0              data.table_1.12.6          
 [3] SummarizedExperiment_1.16.0 DelayedArray_0.12.0        
 [5] matrixStats_0.55.0          GenomicFeatures_1.38.0     
 [7] AnnotationDbi_1.48.0        Biobase_2.46.0             
 [9] GenomicRanges_1.38.0        GenomeInfoDb_1.22.0        
[11] IRanges_2.20.0              S4Vectors_0.24.0           
[13] BiocGenerics_0.32.0         BiocParallel_1.20.0        

loaded via a namespace (and not attached):
  [1] colorspace_1.4-1         htmlTable_1.13.2         XVector_0.26.0          
  [4] base64enc_0.1-3          rstudioapi_0.10          bit64_0.9-7             
  [7] codetools_0.2-16         splines_3.6.0            PRROC_1.3.1             
 [10] geneplotter_1.64.0       knitr_1.26               zeallot_0.1.0           
 [13] Formula_1.2-3            jsonlite_1.6             Rsamtools_2.2.1         
 [16] annotate_1.64.0          cluster_2.1.0            dbplyr_1.4.2            
 [19] pheatmap_1.0.12          compiler_3.6.0           httr_1.4.1              
 [22] backports_1.1.5          assertthat_0.2.1         Matrix_1.2-17           
 [25] lazyeval_0.2.2           acepack_1.4.1            htmltools_0.4.0         
 [28] prettyunits_1.0.2        tools_3.6.0              gtable_0.3.0            
 [31] glue_1.3.1               GenomeInfoDbData_1.2.2   dplyr_0.8.3             
 [34] rappdirs_0.3.1           Rcpp_1.0.3               vctrs_0.2.0             
 [37] Biostrings_2.54.0        gdata_2.18.0             rtracklayer_1.46.0      
 [40] iterators_1.0.12         xfun_0.11                stringr_1.4.0           
 [43] lifecycle_0.1.0          renv_0.8.3               gtools_3.8.1            
 [46] XML_3.98-1.20            dendextend_1.12.0        MASS_7.3-51.4           
 [49] zlibbioc_1.32.0          scales_1.1.0             TSP_1.1-7               
 [52] pcaMethods_1.78.0        hms_0.5.2                RColorBrewer_1.1-2      
 [55] BBmisc_1.11              curl_4.2                 memoise_1.1.0           
 [58] heatmaply_0.16.0         gridExtra_2.3            ggplot2_3.2.1           
 [61] biomaRt_2.42.0           rpart_4.1-15             latticeExtra_0.6-28     
 [64] stringi_1.4.3            RSQLite_2.1.2            genefilter_1.68.0       
 [67] gclus_1.3.2              foreach_1.4.7            checkmate_1.9.4         
 [70] seriation_1.2-8          caTools_1.17.1.2         rlang_0.4.1             
 [73] pkgconfig_2.0.3          bitops_1.0-6             lattice_0.20-38         
 [76] purrr_0.3.3              GenomicAlignments_1.22.1 htmlwidgets_1.5.1       
 [79] bit_1.1-14               tidyselect_0.2.5         plyr_1.8.4              
 [82] magrittr_1.5             DESeq2_1.26.0            R6_2.4.1                
 [85] gplots_3.0.1.1           Hmisc_4.3-0              DBI_1.0.0               
 [88] pillar_1.4.2             foreign_0.8-72           survival_2.44-1.1       
 [91] RCurl_1.95-4.12          nnet_7.3-12              tibble_2.1.3            
 [94] crayon_1.3.4             KernSmooth_2.23-15       BiocFileCache_1.10.2    
 [97] plotly_4.9.1             viridis_0.5.1            progress_1.2.2          
[100] locfit_1.5-9.1           grid_3.6.0               blob_1.2.0              
[103] digest_0.6.22            webshot_0.5.1            xtable_1.8-4            
[106] tidyr_1.0.0              openssl_1.4.1            munsell_0.5.0           
[109] registry_0.5-1           viridisLite_0.3.0        askpass_1.1 
ccasar commented 4 years ago

I just tried setting up a centos 7 based singularity container and within the container I'm also able to reproduce the error. But I guess it's not too helpful for debugging.

Should I try to set up something like a VirtualBox image that can reproduce the error?

loipf commented 4 years ago

Hello, Christian,

The bug is fixed in the latest github version from a few hours ago.

It depended a bit on the injected outliers and could be reproduced on the second run. Also, it only occurred for certain dimensions, which took some time to figure out. The error was a bug with the exclusion mask, which resulted in values that were too high and were considered infinite. This problem is solved with cc94d3a

Thanks for the detailed information and reports. Nice to see you working with OUTRIDER.

Best wishes, Stefan

ccasar commented 4 years ago

Hi Stefan,

yes, can confirm the current github version fixes the bug for me as well, thanks!

loipf commented 4 years ago

great perfect !