When running BioQC on an expression set, take only rows with gene symbol as background.

grst commented 7 years ago

because of the reasons discussed...

I'm not sure what the default behavior should be, though. Taking all genes as background could produce misleading results and a significant 'over-estimation' of signatures.
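To illustrate the concern, here is a minimal simulation sketch in base R. It assumes, purely for illustration, that annotated probes are expressed around 8 and unannotated probes around 4 (made-up numbers), and it uses `wilcox.test` as a stand-in for the Wilcoxon-Mann-Whitney test that BioQC runs. The signature is a random set of annotated genes, so it should not come out as enriched.

```r
set.seed(1)

# Made-up expression values for a single sample (assumption, not real data).
annotated   <- rnorm(1000, mean = 8, sd = 1)  # probes with a gene symbol
unannotated <- rnorm(1000, mean = 4, sd = 1)  # probes without a gene symbol

# A random "signature": 50 annotated genes with no real enrichment.
sig_idx   <- sample(length(annotated), 50)
signature <- annotated[sig_idx]

# Background 1: all remaining probes, annotated or not.
p_all <- wilcox.test(signature, c(annotated[-sig_idx], unannotated),
                     alternative = "greater")$p.value

# Background 2: only the remaining annotated probes.
p_annotated <- wilcox.test(signature, annotated[-sig_idx],
                           alternative = "greater")$p.value

print(c(all_probes = p_all, annotated_only = p_annotated))
```

Under these assumptions the random signature looks strongly enriched against the all-probes background, while against annotated probes only the p-value is expected to behave like an ordinary draw under the null.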

grst commented 7 years ago

This function can be used to filter expression sets accordingly:

library(Biobase)  # provides fData() and exprs()

#' Keep only annotated genes in an expression set.
#'
#' This gives BioQC the correct background.
#' Using all probe IDs as background biases the test towards every signature,
#' because probes with a gene symbol tend to be more highly expressed.
filter_eset = function(eset) {
  gene_symbols = fData(eset)$BioqcGeneSymbol
  # remove rows that have no gene symbol
  eset = eset[(!is.na(gene_symbols)) & gene_symbols != '-',]
  eset = eset[keepMaxStatRowInd(exprs(eset), fData(eset)$BioqcGeneSymbol),]
  return(eset)
}

grst commented 7 years ago

Also, we need to check that the gene symbols are valid HGNC symbols.

My current version of this function looks like this:

library(Biobase)  # provides fData() and exprs()
library(readr)    # provides read_tsv()

#' Keep only genes with a valid HGNC symbol in an expression set.
#'
#' This gives BioQC the correct background.
#' Using all probe IDs as background biases the test towards every signature,
#' because probes with a gene symbol tend to be more highly expressed.
filter_eset = function(eset) {
  gene_symbols = fData(eset)$BioqcGeneSymbol
  hgnc_symbols = read_tsv("results/hgnc_symbols.tsv")
  # remove rows without a gene symbol or with a symbol that is not in HGNC
  eset = eset[(!is.na(gene_symbols)) & (gene_symbols != '-') & (gene_symbols %in% hgnc_symbols$hgnc_symbols),]
  eset = eset[keepMaxStatRowInd(exprs(eset), fData(eset)$BioqcGeneSymbol),]
  return(eset)
}

where hgnc_symbols.tsv is derived from ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/hgnc_complete_set.txt
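For anyone who wants to see the filtering logic without an ExpressionSet at hand, here is a minimal base R sketch on a toy matrix. The probe names, values and the `valid_symbols` vector are made up, and the second step assumes that `keepMaxStatRowInd` keeps, for each symbol, the row with the highest mean expression; that is my reading of the function, not something verified here.

```r
# Toy expression matrix: 6 probes x 3 samples (made-up values).
expr <- matrix(
  c(1, 2, 3,
    9, 8, 7,
    4, 4, 4,
    5, 6, 7,
    2, 2, 2,
    6, 6, 9),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("probe", 1:6), paste0("sample", 1:3))
)
symbols <- c("TP53", NA, "-", "TP53", "NOTAGENE", "BRCA1")

# Hypothetical stand-in for the symbols read from hgnc_symbols.tsv.
valid_symbols <- c("TP53", "BRCA1", "EGFR")

# Step 1: keep probes that have a symbol and whose symbol is valid.
keep      <- !is.na(symbols) & symbols != "-" & symbols %in% valid_symbols
expr_f    <- expr[keep, , drop = FALSE]
symbols_f <- symbols[keep]

# Step 2: per symbol, keep the probe with the highest mean expression
# (assumed to mirror what keepMaxStatRowInd does).
row_means <- rowMeans(expr_f)
best <- vapply(split(seq_along(symbols_f), symbols_f),
               function(idx) idx[which.max(row_means[idx])],
               integer(1))
background <- expr_f[sort(best), , drop = FALSE]

print(rownames(background))
```

Here probe2, probe3 and probe5 are dropped in the first step, and probe1 loses to probe4 in the second step because both map to TP53 and probe4 has the higher mean.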

Accio / BioQC, issue #13