chanzuckerberg / single-cell-data-portal

The data portal supporting the submission, exploration, and management of projects and datasets to cellxgene.
MIT License
64 stars 14 forks source link

feat(wmg): update CellGuide marker gene pipeline with more robust statistics for calculating marker scores #5868

Closed atarashansky closed 1 year ago

atarashansky commented 1 year ago

EDIT: This work has revealed that discrepancies between WMG and CellGuide were due to overly sensitive statistics for calculating marker scores. So this ticket has been changed to resolving those issues in CellGuide so the ultimate goal of sharing the marker genes between WMG and CellGuide can be achieved.

CellGuide marker gene pipeline runs in less than an hour, whereas the one used by the WMG builder takes upwards of 14 hours to complete. With the most up-to-date data consumed by WMG, the pipeline in total takes around 20 hours to complete. While pipeline optimization is usually low-priority considering that it is not user-facing, it would greatly improve our pipeline development cycle.

Below is a list of all the differences between the CellGuide and WMG marker gene pipelines:

In any case, a thorough examination of the differences in reported marker genes between the two pipelines is warranted before initiating this work. The analysis should include a distribution of marker scores for genes that are identified in one pipeline and not the other. If this distribution tends to be on the lower end of marker scores, then the differences in reported marker genes are not practically meaningful and can be safely ignored.

atarashansky commented 1 year ago

Key takeaways so far:

Private Zenhub Image

atarashansky commented 1 year ago

I investigated similarities and differences between the WMG FMG pipeline and the CellGuide computational marker gene pipeline. I observed significant differences, much more than what was expected from what were presumably really similar algorithms. After careful debugging, I discovered that the root source of all differences between the two pipelines was a specific step in CellGuide where I augment the cell counts dataframe with missing ancestors that are then rescued with a rollup operation.

As a concrete example, Regular Ventricular Cardiac Myocyte in Heart has 52 cell types displayed in WMG. After augmenting this data with ancestors, there are 118 cell types. For the top marker gene in FMG RYR2, the marker score (5th percentile of effect sizes) for the 52 original cell types is 2.43. The marker score for the 118 total cell types after augmentation is 0.4. RYR2 is a good marker for Regular Ventricular Cardiac Myocyte, but its marker score is below our threshold for filtering out genes. I confirmed that the numbers used by both pipelines are exactly the same for the original 52 cell types. The difference is in the number of cell types aggregated. In this example, to achieve comparable results to FMG, the aggregation percentile in CellGuide needs to be increased to ~15%.

image

Note that there are closely related cell types that also express RYR2 (the histogram bars near 0 above). When augmenting the data, there are more such cell types (second plot). The 5th percentile of the data all of a sudden "jumps" to be close to 0. This indicates that the 5th percentile is a pretty noisy measure of the data, because the shape of the histograms look pretty similar in both plots. I think a simple solution is to simply increase the percentile used in CellGuide to 15%.

Note: we cannot remove the augmentation step from CellGuide because then we would be missing computational marker genes for ~300 cell cards.

atarashansky commented 1 year ago

I realized that the way I was computing gene overlap before was incorrect. Previously, I was calculating the gene overlap $G$ to be: $$G = \frac{S_1 \cap S_2}{\max(|S_1|,|S_2|)}$$

This penalizes cases where $S_1$ and $S_2$ contain the same marker genes but one contains additional genes. What we really need is to make sure that $S_1$ and $S_2$ contain the same genes: $$G = \frac{S_1 \cap S_2}{\min(|S_1|,|S_2|)}$$

Using the new formula we get the following gene overlap scores: Percentile: 5%

image

Percentile: 15%

image

5% looks good now, but 15% looks way better. Keep in mind, though that when using the original formula $$G = \frac{S_1 \cap S_2}{\max(|S_1|,|S_2|)}$$ 15% looks way worse, because it's simply finding more marker genes (the marker score threshold of 0.5 is too low for 15%). See below figures

5%: image

15%:

image

Simply increasing the marker score threshold when using the 15th percentile does not improve the situation as there will be some marker gene sets that become more similar in size to that of FMG but others become less similar.

The conclusion is that the CellGuide and FMG marker gene pipelines are finding similar marker genes. However, I identified edge cases where the 5% percentile rule doesn't work. Ideally, we would come up with an aggregation heuristic that is robust to the edge case described in the above comment.