feat(wmg): update CellGuide marker gene pipeline with more robust statistics for calculating marker scores

EDIT: This work has revealed that discrepancies between WMG and CellGuide were due to overly sensitive statistics for calculating marker scores. So this ticket has been changed to resolving those issues in CellGuide so the ultimate goal of sharing the marker genes between WMG and CellGuide can be achieved.

CellGuide marker gene pipeline runs in less than an hour, whereas the one used by the WMG builder takes upwards of 14 hours to complete. With the most up-to-date data consumed by WMG, the pipeline in total takes around 20 hours to complete. While pipeline optimization is usually low-priority considering that it is not user-facing, it would greatly improve our pipeline development cycle.

Below is a list of all the differences between the CellGuide and WMG marker gene pipelines:

Dask is used to parallelize the CellGuide marker gene pipeline
CellGuide marker gene pipeline loads all required data in memory beforehand, whereas WMG requires repeated TileDB queries to the expression summary cube.
WMG marker gene pipeline code complexity is much higher as it accounts for a small edge case where some genes may not have been measured in every dataset. This also requires a specific data artifact (dataset_to_gene_ids.json). CellGuide does not account for this, considering that robust marker genes should be measured in most datasets anyway.

In any case, a thorough examination of the differences in reported marker genes between the two pipelines is warranted before initiating this work. The analysis should include a distribution of marker scores for genes that are identified in one pipeline and not the other. If this distribution tends to be on the lower end of marker scores, then the differences in reported marker genes are not practically meaningful and can be safely ignored.

Key takeaways so far:

The distribution of % gene overlap between marker genes for every <cell type, tissue> group is skewed right with its center-of-mass around ~75% overlap.
The average marker scores for overlapping genes tends to be larger than the average marker scores for non-overlapping genes (median of distribution is 0.9 vs 0.6).

Private Zenhub Image

I investigated similarities and differences between the WMG FMG pipeline and the CellGuide computational marker gene pipeline. I observed significant differences, much more than what was expected from what were presumably really similar algorithms. After careful debugging, I discovered that the root source of all differences between the two pipelines was a specific step in CellGuide where I augment the cell counts dataframe with missing ancestors that are then rescued with a rollup operation.

As a concrete example, Regular Ventricular Cardiac Myocyte in Heart has 52 cell types displayed in WMG. After augmenting this data with ancestors, there are 118 cell types. For the top marker gene in FMG RYR2, the marker score (5th percentile of effect sizes) for the 52 original cell types is 2.43. The marker score for the 118 total cell types after augmentation is 0.4. RYR2 is a good marker for Regular Ventricular Cardiac Myocyte, but its marker score is below our threshold for filtering out genes. I confirmed that the numbers used by both pipelines are exactly the same for the original 52 cell types. The difference is in the number of cell types aggregated. In this example, to achieve comparable results to FMG, the aggregation percentile in CellGuide needs to be increased to ~15%.

Note that there are closely related cell types that also express RYR2 (the histogram bars near 0 above). When augmenting the data, there are more such cell types (second plot). The 5th percentile of the data all of a sudden "jumps" to be close to 0. This indicates that the 5th percentile is a pretty noisy measure of the data, because the shape of the histograms look pretty similar in both plots. I think a simple solution is to simply increase the percentile used in CellGuide to 15%.

Note: we cannot remove the augmentation step from CellGuide because then we would be missing computational marker genes for ~300 cell cards.

I realized that the way I was computing gene overlap before was incorrect. Previously, I was calculating the gene overlap $G$ to be: $$G = \frac{S_1 \cap S_2}{\max(|S_1|,|S_2|)}$$

This penalizes cases where $S_1$ and $S_2$ contain the same marker genes but one contains additional genes. What we really need is to make sure that $S_1$ and $S_2$ contain the same genes: $$G = \frac{S_1 \cap S_2}{\min(|S_1|,|S_2|)}$$

Using the new formula we get the following gene overlap scores: Percentile: 5%

Percentile: 15%

5% looks good now, but 15% looks way better. Keep in mind, though that when using the original formula $$G = \frac{S_1 \cap S_2}{\max(|S_1|,|S_2|)}$$ 15% looks way worse, because it's simply finding more marker genes (the marker score threshold of 0.5 is too low for 15%). See below figures

5%:

15%:

Simply increasing the marker score threshold when using the 15th percentile does not improve the situation as there will be some marker gene sets that become more similar in size to that of FMG but others become less similar.

The conclusion is that the CellGuide and FMG marker gene pipelines are finding similar marker genes. However, I identified edge cases where the 5% percentile rule doesn't work. Ideally, we would come up with an aggregation heuristic that is robust to the edge case described in the above comment.

chanzuckerberg / single-cell-data-portal

feat(wmg): update CellGuide marker gene pipeline with more robust statistics for calculating marker scores #5868