number of genes - Githubissues

igordot commented 3 years ago

I was using the full Seurat object (317,111 cells). I noticed that there are 87,659 features, which is more than expected. It's possible the individual datasets have various discrepancies with regards to gene names and those accumulate during merging.

Some of the genes are listed as both symbols and Ensembl IDs. For example, both "SYNRG" and "ENSG00000006114" are present.

Some of the genes are listed under multiple names. For example, "SNORD39" has 8 slightly different versions:

> grep("SNORD39", rownames(so), value = TRUE)
[1] "SNORD39"                 "SNORD39.1"              
[3] "SNORD39.2"               "SNORD39.3"              
[5] "SNORD39.ENSG00000263723" "SNORD39.ENSG00000274535"
[7] "SNORD39.ENSG00000264379" "SNORD39.ENSG00000264997"

Is this expected? Just wanted to share this in case you have not noticed.

PaulaNietoG commented 3 years ago

Hello! Thanks for pointing this out! It is true there are discrepancies between the gene names. I am working on creating a function that would deal with this! Will let you know when that's available

bjstewart1 commented 3 years ago

Yes this is way too many features. Looks like the feature harmonisation between datasets in constructing the object needs re-visiting.

bjstewart1 commented 3 years ago

This is my attempt to rationalise the gene names. There are various hacky work arounds here. Ideally the feature names should be harmonised before creating the master object I think, but this code whittles the number of features down to 31955 genes all of which with a symbol. This is a much more sensible number I think.

library(reticulate)
ad <- import("anndata")
sc <- import("scanpy")

#read in the TIC atlas
adata <- sc$read_h5ad("data/TICAtlas.h5ad")
adata$X <- adata$raw$X
table(duplicated(adata$var$features))

features <- data.frame("original_features" = adata$var$features, "features" = adata$var$features)
#there are 4 problems here as I see it
#1 symbols which have been duplicated marked with a "."
#2 there are ensemble gene IDs mixed with symbols; 
#3 there are cryptic genes names X[numeric]
#4 there are genes which are lower case like Xist - this should all be human so upper case.

#move everything to uppercase - this is an easy win.
features$features <- toupper(features$features)

#some of these features have ensemble genes already embedded in their name - so let's grab these and convert these to symbols
features$ensemble <-  1:nrow(features) %in% grep("ENSG", features$features) #find the features that have ensemble genes in
ensembles <- unlist(lapply(strsplit(features[features$ensemble, "features"], "[.]"), function(x){x <- unlist(x)
grep("ENS", x, value = TRUE)})) #get the ensemble codes
library(biomaRt) #fire up biomart
ens_mrt <- biomaRt::useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl") #use the ensemble homosapiens stuff
ensemble_genes <- getBM(attributes = c("external_gene_name", "ensembl_gene_id"), filters = "ensembl_gene_id", values = ensembles, mart = ens_mrt) #get a dataframe of ensemble genes and their symbols
ensemble_genes <- ensemble_genes[match(ensembles, ensemble_genes$ensembl_gene_id), ] #match the ensembles to this list
features[features$ensemble, "features"] <- ensemble_genes$external_gene_name #place these symbols back in where the messy ensemble-esque gene names were

#sort out the duplicates with some "[.][digit]" at the end
features$features <- gsub("[.][0-9]", "", features$features)

#now see if we can clean this up further by getting ensemble IDs for the features and mapping back to symbols
feature_genes <- getBM(attributes = c("external_gene_name", "ensembl_gene_id"), filters = "external_gene_name",
                        values = features$features, mart = ens_mrt)
#get unique features from this
feature_genes <- feature_genes[!duplicated(feature_genes$external_gene_name), ]
features$features <- feature_genes[match(features$features, feature_genes$external_gene_name), "external_gene_name"]
#first get rid of the NAs
table(is.na(features$features)) #lots of NAs - but this brings us to about where we need to be in terms of number of genes!

#now aggregate the count matrix over these genes - omitting the NAs
cmat <- adata$X
colnames(cmat) <- features$features
duplicated_features <- na.omit(unique(features$features[duplicated(features$features)]))
cmat_duplicated <- cmat[, features$features %in% duplicated_features]
cmat_duplicated <- do.call(cbind, pblapply(duplicated_features, function(x){
rs <- Matrix::rowSums(cmat_duplicated[, colnames(cmat_duplicated) %in% x])
return(rs)
}))
colnames(cmat_duplicated) <- duplicated_features

#then bind this with the unique features..
cmat_nondup <- cmat[, !features$features %in% duplicated_features]
cmat_nondup <- cmat_nondup[, !is.na(colnames(cmat_nondup))]
cmat <- cbind(cmat_duplicated, cmat_nondup)

new_adata <- ad$AnnData(X = cmat)
new_adata$obs_names <- adata$obs_names
new_adata$var_names <- colnames(cmat)
new_adata$obs <- adata$obs

PaulaNietoG commented 3 years ago

Hi, Benjamin! Thanks a lot for taking some time to develop a workaround for this. Indeed this problem is not trivial and I have still not found the perfect solution, if I am not generating the count matrices myself. I have been using the HGNChelper R package that helps bypass some of this problems (not enough, sadly). I have also noticed that if you filter out genes expressed in less than 5, 10, 20... cells, most of these weird, duplicated genes disappear, even though I don't really like removing genes (especially in an atlas!). But my point is that the phenotype-defining markers are conserved, which is great to know. I'm still trying to find the perfect solution, so thanks a lot for your input and suggestions! 👍

igordot commented 2 years ago

Since the issue is now closed, should I assume the dataset is now final? The latest version of the Seurat object (August 2021) is up to 342,758 cells and 110,536 features.

Single-Cell-Genomics-Group-CNAG-CRG / Tumor-Immune-Cell-Atlas

number of genes #9