cole-trapnell-lab / garnett

Automated cell type classification
MIT License
99 stars 24 forks source link

Trouble creating my own classifier #43

Closed smk5g5 closed 4 years ago

smk5g5 commented 4 years ago

Hi,

I am trying to create a custom celltype classifier for my work. I am using The Human Cell Atlas bone marrow single-cell data from "The Human Cell Atlas bone marrow single-cell interactive web portal" paper (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6296228/#R15) . I am running into some problems in terms of creating the reference. The raw counts data I am using is available here (https://www.synapse.org/#!Synapse:syn21869735). It has around 35 cell types but every time I generate the garnett classifier it only recognizes 2/35 cell types as shown below.

Screen Shot 2020-04-09 at 9 39 20 AM

I was wondering if garnett classifier for this data has already been generated and if not what am I doing wrong here? Thanks!

Below is the code I used for the same. `library(data.table) library(org.Hs.eg.db) library(garnett) library(stringr) library(dplyr) library(Seurat) exp_mat <- fread(input="ica_bone_marrow_h5-filtered.txt",sep="\t",header = TRUE, stringsAsFact ors=F)

ensembl_genes <- exp_mat$Gene exp_df <- as.data.frame(exp_mat) map_symbol<- AnnotationDbi::select(org.Hs.eg.db,key=ensembl_genes,keytype="ENSEMBL",columns = c("SYMBOL","ENSEMBL")) selected_symbols <- map_symbol[!is.na(map_symbol$SYMBOL),] sel_genes <- unique(selected_symbols$ENSEMBL) colnames(selected_symbols) <- c("Gene","SYMBOL") print(head(selected_symbols)) sel_expmat <- exp_df[exp_df$Gene %in% sel_genes,] Symb_expmat <- merge(sel_expmat,selected_symbols,by="Gene") Symb_expmat$Gene <- NULL aggsymb_mat <- Symb_expmat %>% group_by(SYMBOL) %>% summarize_all(sum) aggsymb_mat <- as.data.frame(aggsymb_mat) rownames(aggsymb_mat) <- aggsymb_mat$SYMBOL aggsymb_mat$SYMBOL <- NULL

cell_assoc <- read.csv("Immune_census/HCA_cellassociations.txt",stringsAsFactors = F) sel_cell_assoc <- cell_assoc[c("Cell","Cluster.Name")] colnames(sel_cell_assoc) <- c("Cell","Cell_types")

cellnames <- character(0)

for (i in 1:length(sel_cell_assoc$Cell)){ test <- unlist(str_split(sel_cell_assoc$Cell[i],'[.]')) cellnames <- c(cellnames, test[1]) }

sel_cell_assoc$Cellnames <- cellnames sel_cell_assoc$Cell <- NULL sel_cell_assoc <- sel_cell_assoc[c("Cellnames","Cell_types")]

myexp_mat <- data.matrix(aggsymb_mat) sel_cell_assoc2 <- sel_cell_assoc[match(colnames(myexp_mat),sel_cell_assoc$Cellnames),] gene_feat <- as.data.frame(rownames(myexp_mat)) colnames(gene_feat) <- c("gene_short_name") rownames(gene_feat) <- gene_feat$gene_short_name

fd <- new("AnnotatedDataFrame", data = gene_feat) rownames(sel_cell_assoc2) <- sel_cell_assoc2$Cellnames

pd <- new("AnnotatedDataFrame", data = sel_cell_assoc2)

human_cellatlas_garnett <- newCellDataSet(myexp_mat, phenoData = pd, featureData = fd) human_cellatlas_garnett <- estimateSizeFactors(human_cellatlas_garnett) save(human_cellatlas_garnett,fd,pd,sel_cell_assoc2,gene_feat,file = "garnett_training_data.RDa ta")

marker_file_path <- './marker_garnett_HCA.txt'

load("garnett_training_data.RData") marker_check <- check_markers(human_cellatlas_garnett, marker_file_path, db=org.Hs.eg.db, cds_gene_id_type = "SYMBOL", marker_file_gene_id_type = "SYMBOL")

human_cellatlas_classifier <- train_cell_classifier(cds = human_cellatlas_garnett, marker_file = marker_file_path, db=org.Hs.eg.db, cds_gene_id_type = "SYMBOL", marker_file_gene_id_type = "SYMBOL") feature_genes <- get_feature_genes(human_cellatlas_classifier, node = "root", db = org.Hs.eg.db) print(head(feature_genes))`

This is how my marker file looks like. marker_garnett_HCA.txt

hpliner commented 4 years ago

Hello,

The likely issue I see here is the number marker genes you're using. In general, I've found that a few good markers generate much better classifiers than many poorer ones. My marker files never have more than 10-20 and usually have more like 3-4 per cell type.

Alternatively, if the data is already labelled, you can try our new alpha feature for training a marker-free model https://cole-trapnell-lab.github.io/garnett/docs_m3/#1c-train-a-marker-free-classifier Note that this is only currently implemented with monocle3

hpliner commented 4 years ago

I'm going to close this, reopen if this is still an issue