Closed rbutleriii closed 7 months ago
@bschilder any idea what this keep popular from orthogene::convert_orthologs
is doing and the expected behaviour for EWCE?
looks like it calls gprofiler for that and I assume does something like this:
y <- data.table(
gorth(
query = specs$GENE,
source_organism = "mmusculus",
target_organism = "hsapiens",
mthreshold = 1,
filter_na = TRUE
)
)
I have looked at their web documentation, and I don't know how they pick favorites, but in my case it would be very helpful to have an ortholog resolution that keeps one of the 1:many
, as using only 1:1
orths drops plenty of key genes. Notably in my case Mt1
is pretty biologically relevant to oligodendrocytes, but results in 10 duplicate human orthologs.
Currently I tend to leave the duplicates, knowing that this does produce duplicates, but i am not sure biomaRt has anything specifying a "best" either.
#' For a given column of mouse ensIDs, replace them with human ensembl gene ids
#' includes a filter for only no pseudo or TEC. Currently still locked at v105
#' of biomaRt until fix for https://github.com/grimbough/biomaRt/issues/61
#'
#' @param dt data table to replace names
#' @param colname name of the column containing mouse ensgs
#' @param keep.mouse boolean, retain the mouse ENSG column?
#'
#' @returns dt with a GENE column with or without the mGENE column
convertMouseHuman = function(dt, colname, keep.mouse=F){
require("biomaRt")
print(paste(length(unique(dt[[colname]])), "genes to look up"))
mouse = useEnsembl("ensembl", dataset="mmusculus_gene_ensembl", version="105")
human = useEnsembl("ensembl", dataset="hsapiens_gene_ensembl", version="105")
# match with ensembl_gene_id across both
x = data.table(getLDS(mart=mouse, attributes=c('ensembl_gene_id'), martL=human,
attributesL=c('ensembl_gene_id', 'chromosome_name',
'gene_biotype'),
filters='ensembl_gene_id', values=unique(dt[[colname]]),
bmHeader=F))
names(x) = make.unique(names(x))
print(paste(nrow(x), "total genes returned"))
# ensembl transcript types to keep, excluding pseudogenes and TECs
keep_categ = grep("pseudo", unique(x$gene_biotype), value=T, invert=T)
keep_categ = grep("TEC", keep_categ, value=T, invert=T)
x = x[gene_biotype %in% keep_categ]
print(paste(nrow(x), "genes correct biotype"))
# use only canonical chromosomes
use_chrs = c(1:22, "X", "Y", "MT")
x = x[chromosome_name %in% use_chrs]
print(paste(nrow(x), "genes in chr 1:22 X Y MT"))
setnames(x, c("ensembl_gene_id", "ensembl_gene_id.1"), c("mGENE", "GENE"))
# replace gene names and write to file
dt = merge(x[, .(GENE, mGENE)], dt, all.y=T, by.x="mGENE", by.y=colname)
dt = dt[!is.na(GENE)]
print(paste(length(unique(dt[["GENE"]])), "unique genes with ensembl ids"))
if (keep.mouse == FALSE) dt[, mGENE := NULL]
return(unique(dt))
}
#' For a given column of Gene Symbols, replace them with Ensembl IDs
#' includes a filter for only protein coding genes
#'
#' @param dt data table to replace names
#' @param colname name of the column containing ensgs
#' @param keep.ensgs boolean, retain the ENSG column?
#'
#' @returns dt with a GENE column with or without the Symbol column
convertGeneSymbol = function(dt, colname, keep.symbols=F){
require("biomaRt")
print(paste(length(unique(dt[[colname]])), "genes to look up"))
ensembl = useEnsembl("ensembl", dataset="mmusculus_gene_ensembl", version=ensver)
# match with ensembl_gene_id across both
x = data.table(getBM(attributes=c("ensembl_gene_id", "external_gene_name",
"gene_biotype", "chromosome_name"),
filters="external_gene_name",
values=unique(dt[[colname]]),
mart=ensembl))
print(paste(nrow(x), "total genes returned"))
# ensembl transcript types to keep, excluding pseudogenes and TECs
keep_categ = grep("pseudo", unique(x$gene_biotype), value=T, invert=T)
keep_categ = grep("TEC", keep_categ, value=T, invert=T)
x = x[gene_biotype %in% keep_categ]
print(paste(nrow(x), "genes correct biotype"))
# use only canonical chromosomes
use_chrs = c(1:19, "X", "Y", "MT")
x = x[chromosome_name %in% use_chrs]
print(paste(nrow(x), "genes in chr 1:19, X, Y, MT"))
setnames(x, c("ensembl_gene_id", "external_gene_name"), c("GENE", "Symbol"))
# replace gene names and write to file
dt = merge(x[, .(GENE, Symbol)], dt, all.y=T, by.x="Symbol", by.y=colname)
dt = dt[!is.na(GENE)]
print(paste(length(unique(dt[["GENE"]])), "unique genes with ensembl ids"))
if (keep.symbols == FALSE) dt[, Symbol := NULL]
return(unique(dt))
}
Looking into this more now, this seems like more of a bug for orthogene. Yes, you would think the 'keep popular' option would return just one gene in the mapping but it doesn't. EWCE uses orthogene for this mapping and you can see that drop_non121() doesn't have a method to handle keep_popular (kp) so it is assumed int he code that there will be a 1:1 return for this which there isn't. @rbutleriii could you create a bug issue on orthogene and reference this instead?
@rbutleriii I would indeed expect keep_popular to return only 1:1 genes, which should be compatible with EWCE
.
https://github.com/neurogenomics/orthogene/blob/bc242c50396018d55fd12e653c0c069bc34dca67/R/check_keep_popular.R#L1
I think logging this as an issue (with example data) would be the best thing to do. I can then look into it there.
Ok, so that took a bit, but eventually I can now reproduce it without the dataset. Turns out the error is not specific to orthogene. It appears to just be that the expression matrix was a data.frame, not a matrix:
# library(data.table)
library(EWCE)
# library(orthogene)
library(gprofiler2)
mGenes = c(
"Hjurp", "6430706D22Rik", "A730008H23Rik", "Arl4c", "Sh3bp4"
)
set.seed(1)
expr = data.frame(matrix(
sample(100, length(mGenes) * 9, replace = TRUE),
nrow = length(mGenes)
))
rownames(expr) = mGenes
ctype = unlist(lapply(c("A", "B", "C"), rep, 3))
x <- drop_uninformative_genes(
exp = expr,
level2annot = ctype,
convert_orths = TRUE,
non121_strategy = "keep_popular",
input_species = "mouse",
output_species = "human"
)
If instead you load it in and keep it a matrix, it runs fine.
expr = matrix(
sample(100, length(mGenes) * 9, replace = TRUE),
nrow = length(mGenes)
)
Thanks for making the reprex, @rbutleriii . Will close this unless there's anything else we can help with.
Thanks @rbutleriii for figuring this out! I actually think this is something EWCE should handle too though so I have pushed a fix sodrop_uninformative_genes()
will now catch cases where expression matrix is a dataframe and convert to a matrix. See version >=1.11.4 for this.
Trying to use the
keep_popular
setting indrop_uninformative_genes
, and I get an error:Looks like it returns duplicates. here are the first 500 of my gene names, looks like
Hjurp
is one of them that causes the error.