Closed signechambers1 closed 4 years ago
Prior art on gene sets:
geneontology.org was the first major driver project to associate genes with biological entities (biological process, cellular component, and molecular function). It started with model organisms and was later extended to human and mouse. Web tools use hypergeometric statistics to calculate the enrichment of a query set of genes among the genes associated with each ontology term.
msigdb.org extended these approaches by allowing genes in the set to be weighted by, for example, the strength of gene expression in a tissue.
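As a rough illustration of the hypergeometric enrichment test mentioned for the Gene Ontology tools, here is a minimal sketch. The function name and parameters are hypothetical and not taken from any of the tools above; it only assumes the usual setup of drawing a query set without replacement from a background population of genes.

```python
from math import comb


def hypergeom_enrichment_pvalue(population: int, term_size: int,
                                query_size: int, overlap: int) -> float:
    """Upper-tail hypergeometric p-value, P(X >= overlap).

    population: number of genes in the background
    term_size:  number of background genes annotated with the term
    query_size: number of genes in the query set
    overlap:    number of query genes annotated with the term
    """
    max_overlap = min(term_size, query_size)
    total = comb(population, query_size)
    # Sum the probability of every overlap at least as large as the observed one.
    tail = sum(
        comb(term_size, k) * comb(population - term_size, query_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 background genes, a term with 100 genes,
# a 50-gene query set, 5 of which carry the term.
p_value = hypergeom_enrichment_pvalue(20000, 100, 50, 5)
print(f"enrichment p-value: {p_value:.3g}")
```

In practice the p-values from many terms would also need a multiple-testing correction, which this sketch leaves out.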
Notes
Note that Seurat finds both positive and negative markers (avg_diff either >0 or <0).
The following differential expression tests are currently supported:
- “wilcox” : Wilcoxon rank sum test (default)
- “bimod” : Likelihood-ratio test for single cell feature expression (McDavid et al., Bioinformatics, 2013)
- “roc” : Standard AUC classifier
- “t” : Student’s t-test
- “poisson” : Likelihood ratio test assuming an underlying Poisson distribution. Use only for UMI-based datasets
- “negbinom” : Likelihood ratio test assuming an underlying negative binomial distribution. Use only for UMI-based datasets
- “LR” : Uses a logistic regression framework to determine differentially expressed genes. Constructs a logistic regression model predicting group membership based on each feature individually and compares this to a null model with a likelihood ratio test.
- “MAST” : GLM framework that treats cellular detection rate as a covariate (Finak et al., Genome Biology, 2015) (Installation instructions)
- “DESeq2” : DE based on a model using the negative binomial distribution (Love et al., Genome Biology, 2014) (Installation instructions)
- [Bias, robustness and scalability in single-cell differential expression analysis](https://www.nature.com/articles/nmeth.4612)
Notes from user requests
Genesets API example
{
  genesets: [
    {
      geneset: "covidGenes",
      genes: ["Apod", "Cd7"],
      expressionMean: [65, 3456, 34, 56], /* vector of expression by cell, summarized by mean */
      userExpressionMetrics: [45, 322, 324] /* vector of expression by cell, summarized by a custom metric, or an object of named vectors */
    },
    {...},
  ]
}
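One way the payload above could be built and serialized is sketched below. The field names (`geneset`, `genes`, `expressionMean`, `userExpressionMetrics`) come from the example; the `GeneSet` class and `to_payload` helper are hypothetical names introduced only for illustration, not an existing cellxgene API.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Union

# A custom metric is either a single vector or an object of named vectors.
Metric = Union[List[float], Dict[str, List[float]]]


@dataclass
class GeneSet:
    geneset: str                      # gene set name
    genes: List[str]                  # member gene symbols
    expressionMean: List[float] = field(default_factory=list)
    userExpressionMetrics: Metric = field(default_factory=list)


def to_payload(genesets: List[GeneSet]) -> str:
    """Serialize gene sets into the {"genesets": [...]} shape shown above."""
    return json.dumps({"genesets": [asdict(g) for g in genesets]})


payload = to_payload([
    GeneSet(
        geneset="covidGenes",
        genes=["Apod", "Cd7"],
        expressionMean=[65, 3456, 34, 56],
        userExpressionMetrics=[45, 322, 324],
    )
])
print(payload)
```

Keeping the optional vectors as defaulted fields means a gene set with only a name and gene list still serializes to a valid payload.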
In the field of genomics (and more generally in bioinformatics), fold changes are defined directly in terms of ratios. If the initial value is A and the final value B, the fold change is defined as B/A. Note that this is different to the definition described above. In other words, a change from 30 to 60 is defined as a fold-change of 2. This is also referred to as a "2-fold increase". Similarly, a change from 30 to 15 is referred to as a "2-fold decrease". In genomics, log ratios are often used for analysis and visualization of fold changes. The log2 (log with base 2) is most commonly used. For example, on a plot axis showing log2-fold-changes, an 8-fold increase will be displayed at an axis value of 3 (since 2^3 = 8).
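The ratio definition above can be checked with a few lines of arithmetic. This is a minimal sketch; the function names are illustrative only.

```python
import math


def fold_change(initial: float, final: float) -> float:
    """Fold change as the plain ratio B/A."""
    return final / initial


def log2_fold_change(initial: float, final: float) -> float:
    """Log2 of the ratio; increases are positive, decreases negative."""
    return math.log2(final / initial)


print(fold_change(30, 60))        # 2.0  (a "2-fold increase")
print(log2_fold_change(30, 60))   # 1.0
print(log2_fold_change(30, 15))   # -1.0 (a "2-fold decrease")
print(log2_fold_change(1, 8))     # 3.0  (an 8-fold increase, since 2^3 = 8)
```

The log scale makes a 2-fold increase and a 2-fold decrease symmetric around zero, which is why it is convenient for plot axes.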
Scanpy rank_genes_groups return value (results are stored on the AnnData object under .uns['rank_genes_groups'])
Returns
names : structured np.ndarray (.uns['rank_genes_groups'])
Structured array to be indexed by group id storing the gene names. Ordered according to scores.
scores : structured np.ndarray (.uns['rank_genes_groups'])
Structured array to be indexed by group id storing the z-score underlying the computation of a p-value for each gene for each group. Ordered according to scores.
logfoldchanges : structured np.ndarray (.uns['rank_genes_groups'])
Structured array to be indexed by group id storing the log2 fold change for each gene for each group. Ordered according to scores. Only provided if method is ‘t-test’ like. Note: this is an approximation calculated from mean-log values.
pvals : structured np.ndarray (.uns['rank_genes_groups'])
p-values.
pvals_adj : structured np.ndarray (.uns['rank_genes_groups'])
Corrected p-values.
pts : pandas.DataFrame (.uns['rank_genes_groups'])
Fraction of cells expressing the genes for each group.
pts_rest : pandas.DataFrame (.uns['rank_genes_groups'])
Only if reference is set to 'rest'. Fraction of cells from the union of the rest of each group expressing the genes.
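To make the "structured array indexed by group id" layout concrete, the sketch below builds toy arrays with the same shape and flattens them into one row per (group, gene). It does not import or call Scanpy; the toy values and the `tidy` helper are assumptions for illustration, standing in for what would be read from `.uns['rank_genes_groups']`.

```python
import numpy as np

groups = ["0", "1"]  # hypothetical cluster ids

# Each field is a group; each record is a rank position within that group.
result = {
    "names": np.array(
        [("Apod", "Cd7"), ("Cd7", "Apod")],
        dtype=[(g, "U16") for g in groups],
    ),
    "scores": np.array(
        [(5.2, 3.1), (-1.0, -2.4)],
        dtype=[(g, "f8") for g in groups],
    ),
}


def tidy(result: dict) -> list:
    """Flatten the per-group structured arrays into a list of row dicts."""
    rows = []
    for group in result["names"].dtype.names:
        genes = result["names"][group]    # gene names, ordered by score
        scores = result["scores"][group]  # matching scores
        for rank, (gene, score) in enumerate(zip(genes, scores)):
            rows.append({"group": group, "rank": rank,
                         "gene": str(gene), "score": float(score)})
    return rows


for row in tidy(result):
    print(row)
```

A flat table like this is closer to the Seurat data frame described below, which may make it a convenient common shape for export.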
Seurat find markers return values
The results data frame has the following columns :
p_val : p-value (unadjusted)
avg_logFC : log fold-change of the average expression between the two groups. Positive values indicate that the feature is more highly expressed in the first group.
pct.1 : The percentage of cells where the feature is detected in the first group
pct.2 : The percentage of cells where the feature is detected in the second group
p_val_adj : Adjusted p-value, based on Bonferroni correction using all features in the dataset.
If the ident.2 parameter is omitted or set to NULL, FindMarkers will test for differentially expressed features between the group specified by ident.1 and all other cells.
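The Bonferroni adjustment behind the `p_val_adj` column can be sketched as follows. This is the textbook correction (multiply each p-value by the number of tests and cap at 1), written in Python for illustration; it is not Seurat's own code, and the function name is hypothetical.

```python
from typing import List


def bonferroni_adjust(p_values: List[float], n_features: int) -> List[float]:
    """Multiply each p-value by the number of features tested, capped at 1."""
    return [min(1.0, p * n_features) for p in p_values]


# Hypothetical example: two genes tested in a dataset with 4 features.
print(bonferroni_adjust([0.125, 0.5], n_features=4))  # [0.5, 1.0]
```

Because the correction uses all features in the dataset rather than only the genes reported, the adjusted values do not change when the result table is filtered.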
The data format for gene sets should also support the export of differential expression results and a user uploading their own distribution for a gene set. If there is a standard format for differential expression output, we should follow that. If not, the format should be as straightforward as possible while serving these use cases.
To do this we should:
We may want to include the following (bolded fields are required, all else optional):
User requests for both p-value and log fold change are [here](https://app.zenhub.com/workspaces/single-cell-5e2a191dad828d52cc78b028/issues/chanzuckerberg/cellxgene/989).