andreamrau / coseq

Co-Expression Analysis of Sequencing Data
6 stars 2 forks source link

Cluster assignments using logit transformation passes an empty vector #1

Closed pabodha closed 5 years ago

pabodha commented 5 years ago

Hello,

I am trying to create compareARI() plot for my coseq result object using logit transformation. I used following command to create the object. runLogit <- coseq(counts, K=2:30, model="Normal", transformation="logit")

But the compareARI() passes following error. Error in corrplot(ARI, is.corr = FALSE, method = "color", type = "upper", : cl.lim[1] == 0 && cl.lim[2] == 0 is not TRUE

The summary of the object looks like this

summary(runLogit)
*************************************************
Model: Gaussian_pk_Lk_Ck
Transformation: logit
*************************************************
Clusters fit: 2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30
Clusters with errors: ---
Selected number of clusters via ICL: 3
ICL of selected model: 3670.318
*************************************************
Number of clusters = 3
ICL = 3670.318
*************************************************
Error in names(tab) <- paste("Cluster", names(tab)) :
  'names' attribute [1] must be the same length as the vector [0]

CompareARI() works fine with arcsin transformation.

Could you please help me with solving this issue

sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.9 (Santiago)

Matrix products: default
BLAS: /share/pkg/R/3.5.1_gcc8.1.0/lib64/R/lib/libRblas.so
LAPACK: /share/pkg/R/3.5.1_gcc8.1.0/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] corrplot_0.84
 [2] coseq_1.6.1
 [3] org.Mm.eg.db_3.7.0
 [4] anRichment_0.96
 [5] TxDb.Mmusculus.UCSC.mm10.knownGene_3.4.4
 [6] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
 [7] anRichmentMethods_0.87-1
 [8] WGCNA_1.66
 [9] fastcluster_1.1.25
[10] dynamicTreeCut_1.63-1
[11] GO.db_3.7.0
[12] systemPipeR_1.16.1
[13] ShortRead_1.40.0
[14] IntEREst_1.6.1
[15] edgeR_3.24.3
[16] limma_3.38.3
[17] DESeq2_1.22.2
[18] magrittr_1.5
[19] GenomicAlignments_1.18.1
[20] Rsamtools_1.34.0
[21] Biostrings_2.50.2
[22] XVector_0.22.0
[23] SummarizedExperiment_1.12.0
[24] DelayedArray_0.8.0
[25] matrixStats_0.54.0
[26] GenomicFeatures_1.34.1
[27] AnnotationDbi_1.44.0
[28] Biobase_2.42.0
[29] GenomicRanges_1.34.0
[30] GenomeInfoDb_1.18.1
[31] IRanges_2.16.0
[32] S4Vectors_0.20.1
[33] BiocGenerics_0.28.0
[34] BiocParallel_1.16.5
[35] apeglm_1.4.2

loaded via a namespace (and not attached):
  [1] backports_1.1.3        GOstats_2.48.0         Hmisc_4.1-1
  [4] HTSCluster_2.0.8       plyr_1.8.4             lazyeval_0.2.1
  [7] GSEABase_1.44.0        splines_3.5.1          BatchJobs_1.7
 [10] ggplot2_3.1.0          robust_0.4-18          digest_0.6.18
 [13] foreach_1.4.4          htmltools_0.3.6        checkmate_1.9.1
 [16] memoise_1.1.0          BBmisc_1.11            fit.models_0.5-14
 [19] cluster_2.0.7-1        doParallel_1.0.14      DEXSeq_1.28.1
 [22] annotate_1.60.0        bayesm_3.1-1           prettyunits_1.0.2
 [25] colorspace_1.4-0       blob_1.1.1             rrcov_1.4-7
 [28] xfun_0.4               dplyr_0.7.8            crayon_1.3.4
 [31] RCurl_1.95-4.11        graph_1.60.0           genefilter_1.64.0
 [34] bindr_0.1.1            impute_1.56.0          brew_1.0-6
 [37] survival_2.43-3        sendmailR_1.2-1        iterators_1.0.10
 [40] glue_1.3.0             gtable_0.2.0           zlibbioc_1.28.0
 [43] compositions_1.40-2    seqinr_3.4-5           Rgraphviz_2.26.0
 [46] DEoptimR_1.0-8         DESeq_1.34.1           scales_1.0.0
 [49] pheatmap_1.0.12        mvtnorm_1.0-8          DBI_1.0.0
 [52] Rcpp_1.0.0             plotrix_3.7-4          xtable_1.8-3
 [55] progress_1.2.0         emdbook_1.3.10         htmlTable_1.13.1
 [58] foreign_0.8-71         bit_1.1-14             preprocessCore_1.44.0
 [61] Formula_1.2-3          AnnotationForge_1.24.0 htmlwidgets_1.3
 [64] httr_1.4.0             RColorBrewer_1.1-2     acepack_1.4.1
 [67] pkgconfig_2.0.2        XML_3.98-1.16          nnet_7.3-12
 [70] locfit_1.5-9.1         labeling_0.3           tidyselect_0.2.5
 [73] rlang_0.3.1            munsell_0.5.0          tools_3.5.1
 [76] RSQLite_2.1.1          ade4_1.7-13            stringr_1.3.1
 [79] knitr_1.21             bit64_0.9-7            robustbase_0.93-3
 [82] purrr_0.2.5            bindrcpp_0.2.2         RBGL_1.58.1
 [85] biomaRt_2.38.0         compiler_3.5.1         rstudioapi_0.9.0
 [88] e1071_1.7-0.1          tibble_2.0.1           statmod_1.4.30
 [91] geneplotter_1.60.0     pcaPP_1.9-73           stringi_1.2.4
 [94] lattice_0.20-38        Matrix_1.2-15          tensorA_0.36.1
 [97] capushe_1.1.1          pillar_1.3.1           HTSFilter_1.22.1
[100] data.table_1.12.0      bitops_1.0-6           Rmixmod_2.1.2
[103] rtracklayer_1.42.1     R6_2.3.0               latticeExtra_0.6-28
[106] hwriter_1.3.2          RMySQL_0.10.16         gridExtra_2.3
[109] codetools_0.2-16       boot_1.3-20            energy_1.7-5
[112] MASS_7.3-51.1          assertthat_0.2.0       seqLogo_1.48.0
[115] Category_2.48.0        rjson_0.2.20           GenomeInfoDbData_1.2.0
[118] hms_0.4.2              grid_3.5.1             rpart_4.1-13
[121] class_7.3-15           coda_0.19-2            bbmle_1.0.20
[124] numDeriv_2016.8-1      base64enc_0.1-3

Thanks

Pabodha

andreamrau commented 5 years ago

Hi @pabodha, Could you send me a small reproducible example to help me debug? Also, are you able to run compareARI with a subset of the cluster sizes, e.g. compareARI(runLogit, K = c(3,4,5))?

Based on the output, it looks to me like the ICL selected 3 clusters, but after using assigning observations to clusters using the maximum conditional probability for that model, only 2 clusters have observations assigned to them. To double check that, could you also send the output for table(clusters(runLogit))?

Thanks Andrea

pabodha commented 5 years ago

Hello @andreamrau,

Really appreciate your help.

When I restarted the R session, compareARI() with and without K parameter works fine for my original data set. But the dummy data set I have attached here still produce the same set of errors.

compareARI(runLogit, K=3:5)
Error in matrix(0, nrow = ncol(full_labels), ncol = ncol(full_labels)) :
  non-numeric matrix extent
compareARI(runLogit)
Error in corrplot(ARI, is.corr = FALSE, method = "color", type = "upper",  :
  cl.lim[1] == 0 && cl.lim[2] == 0 is not TRUE

Here is the output from my original dataset for table(clusters())

table(clusters(runL))
Error in names(labels) <- rownames(pp) :
  'names' attribute [633] must be the same length as the vector [0]

with Arcsin transformation the output looks like this

table(clusters(runA))

  1   2   3
184 220 229

Thank you

Pabodha

dummyCounts.txt

pabodha commented 5 years ago

Hello Andrea,

I also get an error while creating cluster profile line plots. Boxplots for the profiles generate fine. I checked whether the graphical device has some issues in my end. But seems its working fine.

plot(runA, graphs="profiles")
$profiles

Warning message:
In grid.Call.graphics(C_lines, x$x, x$y, index, x$arrow) :
  semi-transparency is not supported on this device: reported only once per page

Really appreciate your help.

Pabodha

andreamrau commented 5 years ago

Hello Pabodha,

Thanks for your patience -- I finally got around to fixing this. The issue was related to the fact that several of the Gaussian mixture models that were fit for the logit-transformed data in your dummy data had singular covariance matrices, which led to conditional probabilities of cluster membership equal to NaN for all genes (and all of the downstream problems you saw). The solution in a situation like this is to try a more restrictive form of covariance matrix for the Gaussian mixture model. To make this clearer, an informative error message is now printed:

runLogit <- coseq(dummyCounts, K=2:30, model="Normal", transformation="logit")

****************************************
coseq analysis: Normal approach & logit transformation
K = 2 to 30 
Use set.seed() prior to running coseq for reproducible results.
****************************************
Running K = 2 ...
[...]
Running K = 30 ...
Error in NormMixClus(y_profiles = tcounts$tcounts, K = K, subset = NULL,  : 
  The selected model, which is of the form Gaussian_pk_Lk_Ck, resulted in estimation errors.
This is likely due to singular covariance matrices when this form of Gaussian mixture is used.
Try rerunning coseq with either a spherical (pk_Lk_Bk) or diagonal (pk_Lk_I)
covariance matrix form instead:

coseq(..., GaussianModel = "Gaussian_pk_Lk_Bk")
coseq(..., GaussianModel = "Gaussian_pk_Lk_I")

Following up on that, when the following code is run, there is no error:

runLogit <- coseq(dummyCounts, K=2:30, model="Normal", transformation="logit", 
                  GaussianModel = "Gaussian_pk_Lk_Bk")

And you should be able to use all of the helper functions as usual:

summary(runLogit)
compareARI(runLogit)
clusters(runLogit)

Finally, I also fixed the issue with the plot function for the coseqResults class, it should be working now.

All of these changes have been made to coseq version 1.7.2. They should be live as part of Bioconductor Devel in the next couple of days, but in the meantime you can install it using the devtools package:

library(devtools)
install_github("andreamrau/coseq")

Thanks very much for the bug reports. Don't hesitate to get in touch if you find other issues!