Define gene sets data format

signechambers1 commented 4 years ago

The data format for gene sets should also support the export of differential expression results and a user uploading their own distribution for a gene set. If there is a standard format for differential expression output, we should follow that. If not, the format should be as straightforward as possible while serving these use cases.

To do this we should:

- [x] check how scanpy stores differential expression results
- [x] check how seurat stores differential expression results
- [ ] write RFC with proposed gene set format
- [ ] review
- [ ] sign off

We may want to include the following (bolded fields are required, all else optional):

Geneset Name
Gene Name
Evidence for an individual gene (requested by Valentine - may be text "this is why I included this gene" or a test statistic, or log fold change)
Test power (dependent on evidence being a test statistic)
Custom distribution (numeric field - requested by Gokcen)
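
As a rough sketch only, the fields above could be modeled like this. It assumes, purely for illustration, that the geneset name and gene name are the required fields and that the custom distribution is a list of numbers attached to the whole gene set. The class and attribute names (`GeneEntry`, `GeneSet`, `custom_distribution`, and so on) are hypothetical and not part of any agreed format.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class GeneEntry:
    """One gene within a gene set (hypothetical structure)."""
    gene_name: str                                # assumed required
    evidence: Optional[Union[str, float]] = None  # free text, test statistic, or log fold change
    test_power: Optional[float] = None            # only meaningful for numeric evidence

    def __post_init__(self):
        if not self.gene_name:
            raise ValueError("gene_name is required")
        # Test power depends on the evidence being a test statistic.
        if self.test_power is not None and not isinstance(self.evidence, (int, float)):
            raise ValueError("test_power requires numeric evidence")


@dataclass
class GeneSet:
    """A named collection of genes (hypothetical structure)."""
    geneset_name: str                                  # assumed required
    genes: List[GeneEntry] = field(default_factory=list)
    custom_distribution: Optional[List[float]] = None  # assumed: user-supplied numeric values

    def __post_init__(self):
        if not self.geneset_name:
            raise ValueError("geneset_name is required")


example = GeneSet(
    geneset_name="covidGenes",
    genes=[
        GeneEntry("Apod", evidence=1.5, test_power=0.8),
        GeneEntry("Cd7", evidence="this is why I included this gene"),
    ],
)
```

Keeping evidence as either text or a number mirrors the request above, at the cost of needing a type check wherever the value is consumed.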

User requests for both p-value and log fold change are [here](https://app.zenhub.com/workspaces/single-cell-5e2a191dad828d52cc78b028/issues/chanzuckerberg/cellxgene/989).

ambrosejcarr commented 4 years ago

Prior art on gene sets:

geneontology.org was the first major driver project to associate genes with biological entities (biological process, cellular component, and molecular function). It started with model organisms but extended to human and mouse. There are web tools that use hypergeometric statistics to calculate the enrichment of a query set of genes among the genes associated with each ontology term.
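
For reference, a minimal sketch of the hypergeometric enrichment calculation mentioned above, using only the Python standard library. The function name and parameter names are made up for this example, and it is not taken from any particular web tool. It computes the probability of seeing at least the observed overlap between a query set and a term's gene list by chance.

```python
from math import comb


def hypergeom_enrichment_pvalue(population_size, term_size, query_size, overlap):
    """P(X >= overlap) when query_size genes are drawn without replacement
    from population_size genes, of which term_size are annotated to the term."""
    max_overlap = min(term_size, query_size)
    total = comb(population_size, query_size)
    # Sum the upper tail of the hypergeometric distribution.
    tail = sum(
        comb(term_size, k) * comb(population_size - term_size, query_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Illustrative numbers only: 10 genes in total, 5 annotated to the term,
# a query of 5 genes that are all annotated to the term.
p_value = hypergeom_enrichment_pvalue(10, 5, 5, 5)
```

A small p-value means the query set overlaps the term's genes more than random sampling would suggest.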

msigdb.org extended these approaches by allowing genes in the set to be weighted, for example by the strength of gene expression in a tissue.

MDunitz commented 4 years ago

Notes

[useful paper describing single cell sequencing and analysis process](https://www.embopress.org/doi/full/10.15252/msb.20188746)
[notebook for scanpy differential expression](https://nbisweden.github.io/workshop-scRNAseq/labs/compiled/scanpy/scanpy_05_dge.html)
scanpy rank_genes_groups
scanpy allows for identification of marker genes for clusters via differential expression tests and pseudotemporal ordering via diffusion pseudotime article
[The framework FLOTILLA](https://github.com/yeolab/flotilla) comes with modules for simple visualization, simple clustering, and differential expression testing
CZI workshop of diff expression
Seurat3 selects variable features by directly modeling the mean-variance relationship inherent in single-cell data; this is implemented in the FindVariableFeatures function. By default, it returns 2,000 features per dataset
- "We used the logistic regression differential expression test [85] implemented in the FindMarkers function in Seurat, with the donor as a latent variable (latent.vars="orig.ident", test.use="LR"). We retained the top 25 differentially expressed genes based on highest fold-change expression"
- "To identify differentially-expressed genes between the CD69+ and CD69- sorted populations, we used DESeq2 [88] and filtered for significant genes with a log2-fold change in expression greater than 1.5 and a q-value of less than 0.01"
- "For finding differentially accessible peaks between groups of cells, we used the binarized peak count matrix. The authors instruct that scATAC-seq gene activity score matrix must be preprocessed and filtered, so we applied log-CPM (counts-per-million) normalization, and removed cells with less than 5,000 total peaks detected in the binary peak matrix."
- "We identified differentially accessible peaks between groups of scATAC-seq cells simply by ordering peaks by their fold-change accessibility between the groups, and retaining the top 1,000 peaks that displayed the greatest fold-change in accessibility. We searched for overrepresented DNA sequence motifs in accessible regions using the Homer package [94], using the findMotifsGenome.pl program with default parameters, and the mm9 genome."
- "We then performed differential expressed to identify gene expression markers that were upregulated in each classified cell type. We used a logistic regression test for differential expression [85] on the uncorrected data with replicate as a latent variable, implemented in the FindMarkers function in Seurat (method="LR", latent.vars="orig.ident", assay="RNA")."
- mean-variance dispersion
Seurat feature selection notebook
Seurat clustering tutorial
- "find all markers of cluster 8. thresh.use speeds things up (increase value to increase speed) by only testing genes whose average expression is > thresh.use between cluster. Note that Seurat finds both positive and negative markers (avg_diff either >0 or <0)"
- " note that Seurat has four tests for differential expression:
  ROC test ("roc"), t-test ("t"), LRT test based on zero-inflated data ("bimod", default), LRT test based on tobit-censoring models ("tobit")
  
  The ROC test returns the 'classification power' for any individual marker (ranging from 0 - random, to 1 - perfect). Though not a statistical test, it is often very useful for finding clean markers."
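
To make the "classification power" scale concrete, here is a small sketch. It assumes the power is derived from the AUC as |AUC - 0.5| * 2, so that an AUC of 0.5 (random) maps to 0 and an AUC of 0 or 1 (perfect separation) maps to 1. The function names are made up for this example, the expression values are invented, and this is not Seurat's code.

```python
def auc(group_values, rest_values):
    """Probability that a random cell from the group has higher expression
    than a random cell from the rest (ties count as half)."""
    wins = 0.0
    for g in group_values:
        for r in rest_values:
            if g > r:
                wins += 1.0
            elif g == r:
                wins += 0.5
    return wins / (len(group_values) * len(rest_values))


def classification_power(group_values, rest_values):
    """Assumed definition: 0 for a random marker, 1 for a perfect one."""
    return abs(auc(group_values, rest_values) - 0.5) * 2


# Invented expression values: the marker separates the two groups completely.
clean_marker = classification_power([3, 4, 5], [0, 1, 2])
# Identical distributions in both groups: the marker carries no information.
useless_marker = classification_power([1, 2], [1, 2])
```

The pairwise loop is quadratic in the number of cells, which is fine for illustration but would need a rank-based formulation for real datasets.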
Seurat differential expression
- "The bulk of Seurat’s differential expression features can be accessed through the FindMarkers function. As a default, Seurat performs differential expression based on the non-parameteric Wilcoxon rank sum test. This replaces the previous default test (‘bimod’). To test for differential expression between two specific groups of cells, specify the ident.1 and ident.2 parameters."
- Seurat supports the following differential expression tests:
  - "wilcox" : Wilcoxon rank sum test (default)
  - "bimod" : Likelihood-ratio test for single cell feature expression (McDavid et al., Bioinformatics, 2013)
  - "roc" : Standard AUC classifier
  - "t" : Student's t-test
  - "poisson" : Likelihood ratio test assuming an underlying Poisson distribution. Use only for UMI-based datasets
  - "negbinom" : Likelihood ratio test assuming an underlying negative binomial distribution. Use only for UMI-based datasets
  - "LR" : Uses a logistic regression framework to determine differentially expressed genes. Constructs a logistic regression model predicting group membership based on each feature individually and compares this to a null model with a likelihood ratio test.
  - "MAST" : GLM-framework that treats cellular detection rate as a covariate (Finak et al, Genome Biology, 2015) (Installation instructions)
  - "DESeq2" : DE based on a model using the negative binomial distribution (Love et al, Genome Biology, 2014) (Installation instructions)


- [Bias, robustness and scalability in single-cell differential expression analysis](https://www.nature.com/articles/nmeth.4612)

MDunitz commented 4 years ago

Notes from user requests

"Could we include two additional columns, "fold change" and "p-value" which include the fold change and p-value as are currently included with the histogram plots"

Genesets API example

{
  genesets: [
    {
       geneset: "covidGenes",
       genes: ["Apod", "Cd7"],
       expressionMean: [65, 3456, 34, 56], /* vector of expression by cell by mean */
       userExpressionMetrics: [45, 322, 324] /* vector of expression by cell by a custom metric, or obj of named vectors */
    },
    {...},
  ]
}

Explanation of log fold change

In the field of genomics (and more generally in bioinformatics), fold changes are defined directly in terms of ratios. If the initial value is A and the final value B, the fold change is defined as B/A. In other words, a change from 30 to 60 is defined as a fold-change of 2. This is also referred to as a "2-fold increase". Similarly, a change from 30 to 15 is referred to as a "2-fold decrease".

In genomics, log ratios are often used for analysis and visualization of fold changes. The log2 (log with base 2) is most commonly used. For example, on a plot axis showing log2-fold-changes, an 8-fold increase will be displayed at an axis value of 3 (since 2^3 = 8).
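
The definitions above can be checked with a few lines of Python. The function names are made up for this example, and it uses the same numbers as the explanation.

```python
import math


def fold_change(initial, final):
    """Fold change as the plain ratio B/A."""
    return final / initial


def log2_fold_change(initial, final):
    """log2 of the ratio: positive for increases, negative for decreases."""
    return math.log2(final / initial)


doubling = log2_fold_change(30, 60)     # a 2-fold increase
halving = log2_fold_change(30, 15)      # a 2-fold decrease
eightfold = log2_fold_change(1, 8)      # an 8-fold increase
```

On the log2 scale a 2-fold increase and a 2-fold decrease are symmetric around zero, which is one reason log ratios are preferred for plots. Note that this sketch raises an error when the initial value is zero.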

MDunitz commented 4 years ago

Scanpy rank_genes_groups stores its results in the AnnData object, under .uns['rank_genes_groups']

Returns
names : structured np.ndarray (.uns['rank_genes_groups'])
Structured array to be indexed by group id storing the gene names. Ordered according to scores.

scores : structured np.ndarray (.uns['rank_genes_groups'])
Structured array to be indexed by group id storing the z-score underlying the computation of a p-value for each gene for each group. Ordered according to scores.

logfoldchanges : structured np.ndarray (.uns['rank_genes_groups'])
Structured array to be indexed by group id storing the log2 fold change for each gene for each group. Ordered according to scores. Only provided if method is 't-test' like. Note: this is an approximation calculated from mean-log values.

pvals : structured np.ndarray (.uns['rank_genes_groups'])
p-values.

pvals_adj : structured np.ndarray (.uns['rank_genes_groups'])
Corrected p-values.

pts : pandas.DataFrame (.uns['rank_genes_groups'])
Fraction of cells expressing the genes for each group.

pts_rest : pandas.DataFrame (.uns['rank_genes_groups'])
Only if reference is set to 'rest'. Fraction of cells from the union of the rest of each group expressing the genes.
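
Since these fields are parallel arrays indexed by group, exporting them to a gene set format likely means turning them into one record per gene. The sketch below shows that reshaping with plain dictionaries standing in for the structured arrays, so it runs without scanpy. The helper name `flatten_rank_genes_groups`, the group name `cluster_0`, and all values are invented for illustration.

```python
def flatten_rank_genes_groups(result, group):
    """Turn per-group parallel arrays into one dict per gene.

    `result` is expected to look like adata.uns['rank_genes_groups']:
    a mapping from field name to something indexable by group id.
    """
    fields = ["names", "scores", "logfoldchanges", "pvals", "pvals_adj"]
    # logfoldchanges is not always provided, so only use fields that exist.
    present = [name for name in fields if name in result]
    columns = [result[name][group] for name in present]
    return [dict(zip(present, row)) for row in zip(*columns)]


# Stand-in data with invented values; not real scanpy output.
mock_result = {
    "names": {"cluster_0": ["Apod", "Cd7"]},
    "scores": {"cluster_0": [5.2, 3.1]},
    "pvals": {"cluster_0": [1e-6, 2e-3]},
}
records = flatten_rank_genes_groups(mock_result, "cluster_0")
```

Because the arrays are all ordered by score, zipping them position by position keeps each gene aligned with its own statistics.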

MDunitz commented 4 years ago

Seurat find markers return values

The results data frame has the following columns :

p_val : p-value (unadjusted)
avg_logFC : log fold-change of the average expression between the two groups. Positive values indicate that the feature is more highly expressed in the first group.
pct.1 : The percentage of cells where the feature is detected in the first group
pct.2 : The percentage of cells where the feature is detected in the second group
p_val_adj : Adjusted p-value, based on Bonferroni correction using all features in the dataset.
If the ident.2 parameter is omitted or set to NULL, FindMarkers will test for differentially expressed features between the group specified by ident.1 and all other cells.
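
For context on p_val_adj, a Bonferroni correction multiplies each p-value by the number of tests and caps the result at 1. The sketch below illustrates that calculation in Python. The function name and the numbers are invented, and it is not Seurat's implementation.

```python
def bonferroni_adjust(p_values, n_features=None):
    """Multiply each p-value by the number of tested features, capped at 1.

    n_features defaults to the number of p-values given; pass it explicitly
    when correcting against all features in the dataset.
    """
    n = len(p_values) if n_features is None else n_features
    return [min(1.0, p * n) for p in p_values]


# Invented example: two p-values corrected against 4 features in total.
adjusted = bonferroni_adjust([0.001, 0.5], n_features=4)
```

Because the correction uses all features in the dataset, the adjusted values depend on the dataset size and not only on the genes that end up in an exported gene set.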

chanzuckerberg / cellxgene

Define gene sets data format #1860
