AlexsLemonade / OpenScPCA-analysis

An open, collaborative project to analyze data from the Single-cell Pediatric Cancer Atlas (ScPCA) Portal

Exploratory notebook for defining consensus cell types #889

Open allyhawkins opened 1 week ago

allyhawkins commented 1 week ago

If you are filing this issue based on a specific GitHub Discussion, please link to the relevant Discussion.

#853

Describe the goals of the changes to the analysis module.

Once we identify a tool or tools to use for identifying related ontology terms, we will want to explore using them to define consensus cell types with ScPCA samples. I think to start we want to take one sample that we feel confident has normal cells identified by both SingleR and CellAssign and try to define some basic rules for consensus cell typing. Then we can expand to a subset of samples.

What will your pull request contain?

This should be an exploratory notebook that looks at identifying consensus cell types. It's possible that once we get started on this we may want to break this issue up, but I'm filing this now as a placeholder for completing the initial exploration.

Will you require additional software beyond what is already in the analysis module?

Whatever tools we choose to use will be added.

Will you require different computational resources beyond what the analysis module already uses?

TBD

If known, when do you expect to file the pull request?

No response

allyhawkins commented 2 days ago

From https://github.com/AlexsLemonade/OpenScPCA-analysis/issues/888#issuecomment-2501750852:

I think a good first step is to take one of the samples where we know SingleR and CellAssign share some cell types but not others, and calculate this similarity index for all of the cells. We would then plot the distribution and validate that cells we expect to match have high similarity values and cells that should not match have low similarity values. For example, I would expect a high similarity in the scenario where one method labels the cell as a T cell and the other method labels the cell as a CD4 T cell. But I would not expect a match where one method labels the cell as a T cell and the other labels it as a fibroblast.

If we proceed with using this metric, we will need to come up with some sort of threshold that we can use to classify cells as sharing the label vs. not sharing the label. One thought I had was to do permutation testing where we randomly select ontology terms and get a distribution of the similarity index. Then we can test if the observed similarity index is significant compared to the null distribution of values. Those with significant values would then receive a consensus label. I believe this is the approach that is taken by ontologySimilarity::get_sim_p() but the documentation on that function is not super clear so we would need to look into it further.
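As a rough illustration of the permutation idea, here is a minimal Python sketch on a toy ontology. Everything in it is an assumption made for illustration: the `PARENTS` map is a hand-written stand-in for the Cell Ontology, the Jaccard overlap of ancestor sets is a placeholder for whatever similarity index we end up using, and `permutation_p_value` is a hypothetical helper, not the behavior of `ontologySimilarity::get_sim_p()`.

```python
import random

# Toy child -> parents map; a hypothetical stand-in for the Cell Ontology.
PARENTS = {
    "cell": [],
    "leukocyte": ["cell"],
    "lymphocyte": ["leukocyte"],
    "T cell": ["lymphocyte"],
    "B cell": ["lymphocyte"],
    "CD4 T cell": ["T cell"],
    "myeloid cell": ["leukocyte"],
    "macrophage": ["myeloid cell"],
    "connective tissue cell": ["cell"],
    "fibroblast": ["connective tissue cell"],
}


def ancestors(term):
    """Return the term itself plus every term reachable through parent links."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def similarity(a, b):
    """Jaccard overlap of ancestor sets (an assumed placeholder metric)."""
    anc_a, anc_b = ancestors(a), ancestors(b)
    return len(anc_a & anc_b) / len(anc_a | anc_b)


def permutation_p_value(a, b, n_draws=2000, seed=0):
    """Empirical p-value of similarity(a, b) against randomly drawn term pairs."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    terms = sorted(PARENTS)
    observed = similarity(a, b)
    # Null distribution: similarity of random pairs of distinct terms.
    null = [similarity(*rng.sample(terms, 2)) for _ in range(n_draws)]
    n_extreme = sum(value >= observed for value in null)
    # Add-one correction keeps the p-value strictly above zero.
    return (n_extreme + 1) / (n_draws + 1)


p_related = permutation_p_value("T cell", "CD4 T cell")
p_unrelated = permutation_p_value("T cell", "fibroblast")
```

In this toy setup the T cell / CD4 T cell pair should get a much smaller p-value than the T cell / fibroblast pair. Whether a null built from uniformly sampled terms is the right one for real data is an open question for the notebook.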

We will also need to figure out how to actually assign the label to cells that are classified as similar. I think we would want to take the closest shared ancestor and use that ontology ID.
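One possible way to pick the closest shared ancestor is sketched below in Python. The `PARENTS` map is a hand-written toy hierarchy, not the real Cell Ontology, and "closest" is assumed here to mean the shared ancestor that itself has the most ancestors (that is, the deepest one); other definitions are possible, especially where terms have multiple parents.

```python
# Toy child -> parents map; a hypothetical stand-in for the Cell Ontology.
PARENTS = {
    "cell": [],
    "leukocyte": ["cell"],
    "lymphocyte": ["leukocyte"],
    "T cell": ["lymphocyte"],
    "B cell": ["lymphocyte"],
    "CD4 T cell": ["T cell"],
    "effector T cell": ["T cell"],
    "connective tissue cell": ["cell"],
    "fibroblast": ["connective tissue cell"],
}


def ancestors(term):
    """Return the term itself plus every term reachable through parent links."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def closest_shared_ancestor(a, b):
    """Return the deepest term that is an ancestor of (or equal to) both a and b."""
    shared = ancestors(a) & ancestors(b)
    if not shared:
        return None
    # Depth proxy: number of ancestors; ties are broken alphabetically.
    return max(sorted(shared), key=lambda term: len(ancestors(term)))


label_siblings = closest_shared_ancestor("CD4 T cell", "effector T cell")
label_nested = closest_shared_ancestor("T cell", "CD4 T cell")
label_distant = closest_shared_ancestor("T cell", "fibroblast")
```

Note that unrelated terms still share the root ("cell" in this toy map), so this step only makes sense for pairs that have already passed the similarity threshold.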

See also https://github.com/AlexsLemonade/OpenScPCA-analysis/issues/888#issuecomment-2501756400.

In addition to looking at the similarity index it might be helpful to look only at parent terms. One of the things that ontologyIndex stores is the direct parent term, so we could see if terms share a parent term. Any ids that share a parent term could then receive that parent term as the label.

I'm not planning on starting this until ontologies are finished being updated, but I just wanted to jot down my initial ideas after exploring tools in #888.

jashapiro commented 1 day ago

In addition to looking at the similarity index it might be helpful to look only at parent terms. One of the things that ontologyIndex stores is the direct parent term, so we could see if terms share a parent term. Any ids that share a parent term could then receive that parent term as the label.

Do you mean if one is a parent of the other? Sharing a parent doesn't necessarily seem like a good measure on its own. For example, B cells and T cells would share a "lymphocyte" parent, but we would definitely not want to classify those as the same. However, if one method classified a cell as "T cell" and another as "effector T cell", I would say those are "the same", or at least "compatible".

Notably, for that last example, you would have to go up two levels from "effector T cell" (http://purl.obolibrary.org/obo/CL_0000911) to "T cell", which I want to make sure you are able to do. Which is to say that I think you would want to look at the full list of ancestors and see whether the cell type annotation from method A is present in the ancestors of the cell type from method B, or whether the annotation from method B is present in the ancestors of the cell type from method A.
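The ancestor check described above could look something like the following Python sketch. The `PARENTS` map is a toy hierarchy written for illustration (the intermediate "mature T cell" level is an assumption used to show a two-level walk), and `consensus_label` is a hypothetical helper name.

```python
# Toy child -> parents map; a hypothetical stand-in for the Cell Ontology.
PARENTS = {
    "cell": [],
    "lymphocyte": ["cell"],
    "T cell": ["lymphocyte"],
    "B cell": ["lymphocyte"],
    "mature T cell": ["T cell"],  # assumed intermediate level
    "effector T cell": ["mature T cell"],
}


def ancestors(term):
    """Return every term reachable through parent links, excluding the term itself."""
    seen, stack = set(), list(PARENTS[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def consensus_label(label_a, label_b):
    """Return the more general label if the two are compatible, otherwise None."""
    if label_a == label_b:
        return label_a
    if label_a in ancestors(label_b):
        return label_a  # method A's label is an ancestor of method B's label
    if label_b in ancestors(label_a):
        return label_b  # method B's label is an ancestor of method A's label
    return None  # e.g. siblings that only share a parent


compatible = consensus_label("T cell", "effector T cell")
siblings = consensus_label("T cell", "B cell")
```

Here "T cell" and "effector T cell" resolve to "T cell" even though they are two levels apart, while "T cell" and "B cell" get no consensus label despite sharing the "lymphocyte" parent.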