cole-trapnell-lab / garnett

Automated cell type classification
MIT License

"Error: is(object = cds, class2 = "CellDataSet") is not TRUE" & Prediction of large amount of "Unknown" #54

Closed nbxszby416 closed 3 years ago

nbxszby416 commented 3 years ago

Hi, I have two questions.

1> I tried to train my classifier with the function "train_cell_classifier", but I get this error: Error: is(object = cds, class2 = "CellDataSet") is not TRUE. This really confuses me, since "check_markers", "classify_cells", etc. all work, and "train_cell_classifier" itself worked before I changed parts of "train_cell_classifier.R" and sourced it. I have since sourced the original function again, but it still doesn't work…

My sessionInfo() is:
"
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods
[9] base

other attached packages:
[1] org.Hs.eg.db_3.12.0 AnnotationDbi_1.52.0
[3] garnett_0.2.17 monocle3_0.2.3.0
[5] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0
[7] GenomicRanges_1.42.0 GenomeInfoDb_1.26.2
[9] IRanges_2.24.1 S4Vectors_0.28.1
[11] MatrixGenerics_1.2.0 matrixStats_0.57.0
[13] Biobase_2.50.0 BiocGenerics_0.36.0

loaded via a namespace (and not attached):
[1] fs_1.5.0 bitops_1.0-6 usethis_2.0.0
[4] devtools_2.3.2 bit64_4.0.5 doParallel_1.0.16
[7] rprojroot_2.0.2 tools_4.0.3 R6_2.5.0
[10] DBI_1.1.0 colorspace_2.0-0 withr_2.3.0
[13] prettyunits_1.1.1 processx_3.4.5 tidyselect_1.1.0
[16] gridExtra_2.3 curl_4.3 bit_4.0.4
[19] compiler_4.0.3 glmnet_4.0-2 cli_2.2.0
[22] formatR_1.7 desc_1.2.0 DelayedArray_0.16.0
[25] labeling_0.4.2 scales_1.1.1 callr_3.5.1
[28] stringr_1.4.0 digest_0.6.27 rmarkdown_2.6
[31] XVector_0.30.0 pkgconfig_2.0.3 htmltools_0.5.0
[34] sessioninfo_1.1.1 rlang_0.4.10 rstudioapi_0.13
[37] RSQLite_2.2.1 shape_1.4.5 generics_0.1.0
[40] farver_2.0.3 dplyr_1.0.2 RCurl_1.98-1.2
[43] magrittr_2.0.1 GenomeInfoDbData_1.2.4 futile.logger_1.4.3
[46] Matrix_1.3-0 Rcpp_1.0.5 munsell_0.5.0
[49] fansi_0.4.1 viridis_0.5.1 lifecycle_0.2.0
[52] stringi_1.5.3 yaml_2.2.1 zlibbioc_1.36.0
[55] pkgbuild_1.2.0 plyr_1.8.6 grid_4.0.3
[58] blob_1.2.1 ggrepel_0.9.0 forcats_0.5.0
[61] crayon_1.3.4 lattice_0.20-41 splines_4.0.3
[64] ps_1.5.0 knitr_1.30 pillar_1.4.7
[67] igraph_1.2.6 pkgload_1.1.0 reshape2_1.4.4
[70] codetools_0.2-18 futile.options_1.0.1 glue_1.4.2
[73] evaluate_0.14 remotes_2.2.0 lambda.r_1.2.4
[76] BiocManager_1.30.10 vctrs_0.3.6 foreach_1.5.1
[79] testthat_3.0.1 gtable_0.3.0 purrr_0.3.4
[82] assertthat_0.2.1 ggplot2_3.3.3 xfun_0.19
[85] survival_3.2-7 viridisLite_0.3.0 tibble_3.0.4
[88] rly_1.6.2 iterators_1.0.13 tinytex_0.28
[91] memoise_1.1.0 ellipsis_0.3.1
"

And my cds is:
"
class: cell_data_set
dim: 6 94655
metadata(1): cds_version
assays(1): counts
rownames(6): ENSG00000243485 ENSG00000237613 ... ENSG00000239945 ENSG00000237683
rowData names(2): name0 gene_short_name
colnames(94655): AAACATACAAAACG-1_1 AAACATACACGACT-1_1 ... TTTGCATGCGTTGA-1_10 TTTGCATGTGTCCC-1_10
colData names(5): TSNE.1 TSNE.2 FACS_type Size_Factor sample
reducedDimNames(0):
altExpNames(0):
"

2> I ran the whole workflow on the 10X PBMC data, a mixture of 10 datasets (as described in the paper), but the cell_type (prediction) column shows >50% "Unknown" (most of which are CD4 T cells and CD8 T cells). I used the marker file as provided, and I don't know why this happens, since only 26% of cells are unclassified in the paper. Is there anything wrong in my code? I deleted "Dendritic cells" from my marker file since there are no dendritic cells in this dataset (I also tried the version with "Dendritic cells", but it doesn't help). No markers have ambiguity >0.5. The cluster-extended type prediction is OK.

"
fold_name <- c("regulatory_t", "naive_cytotoxic", "memory_t", "cd14_monocytes", "cytotoxic_t", "b_cells", "cd4_t_helper", "cd34", "cd56_nk", "naive_t")
cell_name <- c("CD4 T cells", "CD8 T cells", "CD4 T cells", "Monocytes", "CD8 T cells", "B cells", "CD4 T cells", "CD34+", "NK cells", "CD4 T cells")

matrix1 <- Matrix::readMM(paste0("/xx/", fold_name[1], "/matrix.mtx"))

# pdata
pdata1 <- read.csv(paste0("/xx/", fold_name[1], "/projection.csv"), header=TRUE, sep=",")
rownames(pdata1) <- pdata1[,1]
pdata1 <- pdata1[,-1]
pdata1 <- data.frame(pdata1, FACS_type=c(cell_name[1]))

# fdata
fdata1 <- read.table(paste0("/xx/", fold_name[1], "/genes.tsv"), header=FALSE, sep="\t")
rownames(fdata1) <- fdata1[,1]

# rename
row.names(matrix1) <- row.names(fdata1)
colnames(matrix1) <- row.names(pdata1)
names(fdata1) <- c("name0", "gene_short_name")

# cds
cds1 <- new_cell_data_set(as(matrix1, "dgCMatrix"), cell_metadata = pdata1, gene_metadata = fdata1)
x <- list(cds1)

for (i in 2:10){
  matrix <- Matrix::readMM(paste0("/xx/", fold_name[i], "/matrix.mtx"))

  # pdata
  pdata <- read.csv(paste0("/xx/", fold_name[i], "/projection.csv"), header=TRUE, sep=",")
  rownames(pdata) <- pdata[,1]
  pdata <- pdata[,-1]
  pdata <- data.frame(pdata, FACS_type=c(cell_name[i]))

  # fdata
  fdata <- read.table(paste0("/xx/", fold_name[i], "/genes.tsv"), header=FALSE, sep="\t")
  rownames(fdata) <- fdata[,1]

  # rename
  row.names(matrix) <- row.names(fdata)
  colnames(matrix) <- row.names(pdata)
  names(fdata) <- c("name0", "gene_short_name")

  # cds
  cds0 <- new_cell_data_set(as(matrix, "dgCMatrix"), cell_metadata = pdata, gene_metadata = fdata)
  x <- c(x, cds0)
}
cds <- combine_cds(x)

library(org.Hs.eg.db)
marker_file <- "/xx/pbmc_markers.txt"
pbmc_classifier <- train_cell_classifier(cds = cds, marker_file = marker_file, db = org.Hs.eg.db, cds_gene_id_type = "ENSEMBL", num_unknown = 500, marker_file_gene_id_type = "SYMBOL")

cds <- classify_cells(cds, pbmc_classifier, db = org.Hs.eg.db, cluster_extend = TRUE, cds_gene_id_type = "ENSEMBL")
table(pData(cds)$cell_type)

        B cells           CD34+     CD4 T cells     CD8 T cells Dendritic cells
           9391            3646            8171            7690             653
      Monocytes        NK cells         T cells         Unknown
           2560            5370           12873           44301
"
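To put a number on the share of "Unknown" calls, the counts from the table output can be turned into percentages. This is only a minimal sketch in base R: the counts are copied by hand from the posted table, and the variable names `counts`, `total` and `percent` are illustrative, not part of garnett.

```r
# Counts copied by hand from the posted table(pData(cds)$cell_type) output.
counts <- c(
  "B cells" = 9391, "CD34+" = 3646, "CD4 T cells" = 8171,
  "CD8 T cells" = 7690, "Dendritic cells" = 653, "Monocytes" = 2560,
  "NK cells" = 5370, "T cells" = 12873, "Unknown" = 44301
)

total <- sum(counts)                         # number of cells in the table
percent <- round(100 * counts / total, 1)    # percentage per predicted type

print(total)
print(percent[["Unknown"]])

# With the real cds, the same calls can be broken down by the FACS label
# stored in colData (assumes the FACS_type column built in the code above):
# table(pData(cds)$FACS_type, pData(cds)$cell_type)
```

The commented cross-tabulation at the end is the more informative check, since it shows which sorted populations the "Unknown" cells come from.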

Thank you for your hard work!

nbxszby416 commented 3 years ago

Sorry but I have solved the first problem!! But the second problem still exists.

I'll really appreciate it if you can help.

hpliner commented 3 years ago

Hi, can you be more specific about which pbmc dataset you used that gave different results?

nbxszby416 commented 3 years ago

I used the v1 PBMC dataset (Cell Ranger 1.1.0, from 10x Genomics), which is the combination of these 10 pure cell types: CD14+ Monocytes, CD19+ B cells, CD34+ cells, CD4+ Helper T cells, CD4+/CD25+ Regulatory T cells, CD4+/CD45RA+/CD25− Naive T cells, CD4+/CD45RO+ Memory T cells, CD56+ Natural killer cells, CD8+ Cytotoxic T cells and CD8+/CD45RA+ Naive cytotoxic T cells.

nbxszby416 commented 3 years ago

Sorry for the interruption, but I found that I have not solved the first problem after all... I want to change a small part of the tfidf function and test the result, but I get: Error: is(object = cds, class2 = "CellDataSet") is not TRUE. It seems that even when I source the original "train_cell_classifier.R" file, it doesn't work. What should I do?

Thanks for your work!!

nbxszby416 commented 3 years ago

Hi, it seems the first problem arose because the function I sourced only accepts a "CellDataSet" (the Monocle class), while my cds is a Monocle3 cell_data_set. I switched to Monocle and my test ran smoothly!
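For anyone hitting the same error, a quick check of which class the object actually has can save some time. The sketch below uses only `is()` from the methods package; `describe_cds_class` is a hypothetical helper name, not a garnett, Monocle or Monocle3 function.

```r
library(methods)

# Hypothetical helper (not part of garnett, Monocle or Monocle3):
# report which of the two CDS classes named in this thread an object has.
describe_cds_class <- function(cds) {
  if (is(cds, "CellDataSet")) {
    "CellDataSet (Monocle)"
  } else if (is(cds, "cell_data_set")) {
    "cell_data_set (Monocle3)"
  } else {
    paste0("neither (class: ", class(cds)[1], ")")
  }
}

# Usage with a real object:
# describe_cds_class(cds)
```

If the result does not match the class named in the `is(object = cds, class2 = ...)` error, the function being called was written for the other Monocle version.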

hpliner commented 3 years ago

I believe this is resolved? If not, please reopen