Closed smk5g5 closed 4 years ago
Hello,
The likely issue I see here is the number marker genes you're using. In general, I've found that a few good markers generate much better classifiers than many poorer ones. My marker files never have more than 10-20 and usually have more like 3-4 per cell type.
Alternatively, if the data is already labelled, you can try our new alpha feature for training a marker-free model https://cole-trapnell-lab.github.io/garnett/docs_m3/#1c-train-a-marker-free-classifier Note that this is only currently implemented with monocle3
I'm going to close this, reopen if this is still an issue
Hi,
I am trying to create a custom celltype classifier for my work. I am using The Human Cell Atlas bone marrow single-cell data from "The Human Cell Atlas bone marrow single-cell interactive web portal" paper (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6296228/#R15) . I am running into some problems in terms of creating the reference. The raw counts data I am using is available here (https://www.synapse.org/#!Synapse:syn21869735). It has around 35 cell types but every time I generate the garnett classifier it only recognizes 2/35 cell types as shown below.
I was wondering if garnett classifier for this data has already been generated and if not what am I doing wrong here? Thanks!
Below is the code I used for the same. `library(data.table) library(org.Hs.eg.db) library(garnett) library(stringr) library(dplyr) library(Seurat) exp_mat <- fread(input="ica_bone_marrow_h5-filtered.txt",sep="\t",header = TRUE, stringsAsFact ors=F)
ensembl_genes <- exp_mat$Gene exp_df <- as.data.frame(exp_mat) map_symbol<- AnnotationDbi::select(org.Hs.eg.db,key=ensembl_genes,keytype="ENSEMBL",columns = c("SYMBOL","ENSEMBL")) selected_symbols <- map_symbol[!is.na(map_symbol$SYMBOL),] sel_genes <- unique(selected_symbols$ENSEMBL) colnames(selected_symbols) <- c("Gene","SYMBOL") print(head(selected_symbols)) sel_expmat <- exp_df[exp_df$Gene %in% sel_genes,] Symb_expmat <- merge(sel_expmat,selected_symbols,by="Gene") Symb_expmat$Gene <- NULL aggsymb_mat <- Symb_expmat %>% group_by(SYMBOL) %>% summarize_all(sum) aggsymb_mat <- as.data.frame(aggsymb_mat) rownames(aggsymb_mat) <- aggsymb_mat$SYMBOL aggsymb_mat$SYMBOL <- NULL
cell_assoc <- read.csv("Immune_census/HCA_cellassociations.txt",stringsAsFactors = F) sel_cell_assoc <- cell_assoc[c("Cell","Cluster.Name")] colnames(sel_cell_assoc) <- c("Cell","Cell_types")
cellnames <- character(0)
for (i in 1:length(sel_cell_assoc$Cell)){ test <- unlist(str_split(sel_cell_assoc$Cell[i],'[.]')) cellnames <- c(cellnames, test[1]) }
sel_cell_assoc$Cellnames <- cellnames sel_cell_assoc$Cell <- NULL sel_cell_assoc <- sel_cell_assoc[c("Cellnames","Cell_types")]
myexp_mat <- data.matrix(aggsymb_mat) sel_cell_assoc2 <- sel_cell_assoc[match(colnames(myexp_mat),sel_cell_assoc$Cellnames),] gene_feat <- as.data.frame(rownames(myexp_mat)) colnames(gene_feat) <- c("gene_short_name") rownames(gene_feat) <- gene_feat$gene_short_name
fd <- new("AnnotatedDataFrame", data = gene_feat) rownames(sel_cell_assoc2) <- sel_cell_assoc2$Cellnames
pd <- new("AnnotatedDataFrame", data = sel_cell_assoc2)
human_cellatlas_garnett <- newCellDataSet(myexp_mat, phenoData = pd, featureData = fd) human_cellatlas_garnett <- estimateSizeFactors(human_cellatlas_garnett) save(human_cellatlas_garnett,fd,pd,sel_cell_assoc2,gene_feat,file = "garnett_training_data.RDa ta")
marker_file_path <- './marker_garnett_HCA.txt'
load("garnett_training_data.RData") marker_check <- check_markers(human_cellatlas_garnett, marker_file_path, db=org.Hs.eg.db, cds_gene_id_type = "SYMBOL", marker_file_gene_id_type = "SYMBOL")
human_cellatlas_classifier <- train_cell_classifier(cds = human_cellatlas_garnett, marker_file = marker_file_path, db=org.Hs.eg.db, cds_gene_id_type = "SYMBOL", marker_file_gene_id_type = "SYMBOL") feature_genes <- get_feature_genes(human_cellatlas_classifier, node = "root", db = org.Hs.eg.db) print(head(feature_genes))`
This is how my marker file looks like. marker_garnett_HCA.txt