Open grst opened 7 years ago
This function can be used to filter expression sets accordingly:
library(Biobase)  # fData(), exprs()

#' Keep only the annotated genes of an expression set.
#'
#' This is done to have the correct background for BioQC.
#' Using all probe IDs as background biases the result towards every signature,
#' because probes with a gene symbol tend to be more highly expressed.
filter_eset = function(eset) {
  gene_symbols = fData(eset)$BioqcGeneSymbol
  # remove rows that have no gene symbol
  eset = eset[(!is.na(gene_symbols)) & (gene_symbols != '-'),]
  # collapse probes that share a symbol to a single row per gene
  eset = eset[keepMaxStatRowInd(exprs(eset), fData(eset)$BioqcGeneSymbol),]
  return(eset)
}
Also, we need to check that the gene symbols are valid HGNC symbols.
My current version of this function looks like this:
library(Biobase)  # fData(), exprs()
library(readr)    # read_tsv()

#' Keep only the annotated genes of an expression set.
#'
#' This is done to have the correct background for BioQC.
#' Using all probe IDs as background biases the result towards every signature,
#' because probes with a gene symbol tend to be more highly expressed.
filter_eset = function(eset) {
  gene_symbols = fData(eset)$BioqcGeneSymbol
  hgnc_symbols = read_tsv("results/hgnc_symbols.tsv")
  # remove rows that have no gene symbol or whose symbol is not a valid HGNC symbol
  eset = eset[(!is.na(gene_symbols)) & (gene_symbols != '-') & (gene_symbols %in% hgnc_symbols$hgnc_symbols),]
  # collapse probes that share a symbol to a single row per gene
  eset = eset[keepMaxStatRowInd(exprs(eset), fData(eset)$BioqcGeneSymbol),]
  return(eset)
}
where hgnc_symbols.tsv is derived from ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/hgnc_complete_set.txt
because of the reasons discussed...
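To make the effect of the filter easier to see, here is a minimal base-R sketch on toy data, without an ExpressionSet. The matrix, the symbol vector, the `valid_symbols` set and the helper name `filter_probes` are all made up for illustration. It also assumes that the collapse step keeps, per symbol, the probe with the highest mean expression; that is my reading of what `keepMaxStatRowInd` is used for here, not a reimplementation of it.

```r
# Toy stand-in for exprs(eset): 5 probes x 3 samples (values are made up).
expr <- matrix(
  c(1, 2, 3,
    4, 5, 6,
    7, 8, 9,
    2, 2, 2,
    9, 9, 9),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("probe", 1:5), paste0("sample", 1:3))
)
# Toy stand-in for fData(eset)$BioqcGeneSymbol.
gene_symbols <- c("TP53", "TP53", "-", NA, "NOTAGENE")
# Toy stand-in for the HGNC symbol table.
valid_symbols <- c("TP53", "BRCA1")

# Hypothetical helper mirroring the two steps of filter_eset.
filter_probes <- function(expr, gene_symbols, valid_symbols) {
  # Step 1: drop probes without a symbol, with '-', or with a non-HGNC symbol.
  keep <- !is.na(gene_symbols) & gene_symbols != "-" &
    gene_symbols %in% valid_symbols
  expr <- expr[keep, , drop = FALSE]
  symbols <- gene_symbols[keep]
  # Step 2 (assumption): per symbol, keep the probe with the highest mean.
  probe_means <- rowMeans(expr)
  best <- vapply(
    split(seq_along(symbols), symbols),
    function(idx) idx[which.max(probe_means[idx])],
    integer(1)
  )
  expr[sort(best), , drop = FALSE]
}

filtered <- filter_probes(expr, gene_symbols, valid_symbols)
rownames(filtered)
# "probe2"
```

Of the five toy probes, three are dropped by the annotation filter and the two TP53 probes are collapsed to one, so the background shrinks from five rows to a single gene-level row.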
I'm not sure what the default behavior should be, though. Taking all genes as background, however, could produce misleading results and a significant 'over-estimation' of signatures.