Add `"observation_expression_id"` to `obs`

pablo-gar commented 9 months ago

Motivation

Census users with modeling needs have expressed interest in identifying true numerical duplicates of cells. This differs from `is_primary_data`, which is curation-based: in some cases a cell that is not primary data is also not a true numerical duplicate.

In addition, this would help us identify errors in `is_primary_data` annotations, where two or more duplicate cells have been annotated as `is_primary_data=True`.

Proposed metadata variable

A variable in `obs` that assigns each cell an ID based on its numerical vector of counts. We have built a prototype in Census showing how to implement this; it boils down to hashing the numerical vector for each cell.

See the prototype here. https://github.com/chanzuckerberg/cellxgene-census/blob/main/tools/cell_dup_check/finddups.ipynb
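
To make the idea concrete, here is a minimal sketch of hashing per-cell count vectors. It is not the prototype's code: the function name `expression_ids`, the choice of BLAKE2b, the `float32` cast, and the dense toy matrix are all assumptions made for illustration.

```python
import hashlib

import numpy as np


def expression_ids(counts: np.ndarray) -> list[str]:
    """Return one hex digest per row (cell) of a dense count matrix.

    Hypothetical helper: rows with identical counts get identical IDs.
    """
    # Cast to a fixed dtype so the same values always produce the same bytes.
    counts = np.ascontiguousarray(counts, dtype=np.float32)
    return [
        hashlib.blake2b(row.tobytes(), digest_size=8).hexdigest()
        for row in counts
    ]


# Toy data: cells 0 and 2 have identical count vectors, cell 1 differs.
counts = np.array([[0, 3, 1], [2, 0, 0], [0, 3, 1]])
ids = expression_ids(counts)
```

Here `ids[0]` and `ids[2]` match while `ids[1]` differs, so grouping `obs` by such an ID would surface numerically identical cells. A real implementation would need to work on sparse data and agree on a gene ordering across datasets.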

jahilton commented 9 months ago

Is there a reason that this isn't just kept as internal validation so that users can rely on is_primary_data curation?

pablo-gar commented 3 months ago

There are cells marked as `is_primary_data = False` that are not true numerical replicates of their counterpart primary cells. Modelers have expressed interest in including those for modeling, but currently there is no easy way to identify them.

jahilton commented 3 months ago

Can you clarify why modelers want to identify numerical replicates? Is it so they can avoid/filter them? Or something else?

brianraymor commented 3 months ago

@pablo-gar - is there a 1/2 pager or related census issue that motivated your prototype?

pablo-gar commented 3 months ago

After a call, we decided to investigate the value of this field for external users further, so that we can better understand how to support them.

I recommend the following tasks, which I can own:

1. Interview 3-5 users (modelers) who have expressed interest in getting access to data that are not numerical duplicates even if they come from the same biological source. The main questions we would like to ask are:
- What is your main use for these non-numerical duplicates? So far we have heard that they help with batch correction, and that more data is better.
- What level of granularity are you interested in? The two extremes of the spectrum are: 2 observations from the same cell that differ only in the expression of a single gene, and 2 observations from the same cell that are numerically orthogonal.

2. Characterize the level of granularity that exists in Census in terms of similarity of observations, so that we can understand to what extent observations from the same cell vary. This work can be completed after @mlin completes https://github.com/chanzuckerberg/single-cell/issues/636.

Timeline:

DEFERRED DUE TO INTERNAL ORG CHANGES

1. Interview 3-5 users (modelers).

~June 10-14~: Draft plan and questions, contact users
~June 17-28~: Execute interviews
~July 1-5~: Summarize results

2. Characterize the level of granularity.

~June 10-14~: Draft plan and spec
~June 17-28~: Write up analysis notebooks
~July 1-5~: Summarize results

ivirshup commented 1 month ago

I've been looking into this a bit, and wanted to add a couple points here:

Doc improvements

I think we should have a more nuanced and detailed discussion of what exactly the is_primary_data field means and why you may want to use it in the docs.

Some examples of when you may want to use this:

When you want to look at cell types which only show up when is_primary_data==False

Example

```python
import cellxgene_census

census = cellxgene_census.open_soma(census_version="2024-07-01")
obs = cellxgene_census.get_obs(census, "homo_sapiens")
obs.groupby("cell_type")["is_primary_data"].mean().sort_values().head(10)
```

```
cell_type
A2 amacrine cell                                                     0.0
OFF retinal ganglion cell                                            0.0
ON retinal ganglion cell                                             0.0
stromal cell of lamina propria of small intestine                    0.0
smooth muscle cell of small intestine                                0.0
smooth muscle cell of large intestine                                0.0
leptomeningeal cell                                                  0.0
CD56-positive, CD161-positive immature natural killer cell, human    0.0
gut absorptive cell                                                  0.0
type II NK T cell                                                    0.0
```

When you are training a large model using labels, and the labels differ between studies. If you have no reason to trust one study over the other, you may want to just include the expression profile with both labels during training.

Computationally detecting cells

I haven't been able to run the finddups.ipynb notebook on recent tiledbsoma releases, so I can't see the current status. However, I think there are multiple reasons to believe that it may produce false positives:

- It hashes the `data` and `indptr`, while it should be hashing the `data` and `indices`. This means that the hash isn't actually taking into account which genes are expressed, just the level of expression for a set of unknown genes and the number of genes expressed.
- For low-RNA-content cells or low-read-depth assays I would expect false positives. E.g. if there are ~200 counts for a small set of genes, as may be the case for erythrocytes, some false positives would show up.
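
A small sketch of the first point, using raw CSR arrays. This is not the notebook's code: `row_hashes` is a hypothetical helper, BLAKE2b is an arbitrary choice, and the "values only" variant just mimics a hash that sees expression levels plus the number of expressed genes.

```python
import hashlib

import numpy as np


def row_hashes(data, indices, indptr, use_indices):
    """Hash each CSR row; assumes column indices are sorted within each row."""
    hashes = []
    for start, end in zip(indptr[:-1], indptr[1:]):
        h = hashlib.blake2b(digest_size=8)
        # Expression levels of the non-zero entries in this row.
        h.update(np.asarray(data[start:end], dtype=np.float32).tobytes())
        if use_indices:
            # Which genes (columns) those values belong to.
            h.update(np.asarray(indices[start:end], dtype=np.int64).tobytes())
        else:
            # Only the number of expressed genes, which is all that
            # per-row indptr offsets can contribute.
            h.update(np.int64(end - start).tobytes())
        hashes.append(h.hexdigest())
    return hashes


# Three cells, four genes. Cells 0 and 2 are identical; cell 1 has the
# same values (5, 2) but on different genes.
data = np.array([5, 2, 5, 2, 5, 2], dtype=np.float32)
indices = np.array([0, 2, 1, 3, 0, 2])
indptr = np.array([0, 2, 4, 6])

values_only = row_hashes(data, indices, indptr, use_indices=False)
with_genes = row_hashes(data, indices, indptr, use_indices=True)
```

With `use_indices=False` all three cells collide, including cell 1, which expresses different genes. With `use_indices=True` only cells 0 and 2 share a hash.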

That said, I do think this is probably useful for identifying samples shared between datasets; it just might be wrong on individual cells.

I would like to know which observations are actually the same cell, but I don't think we can do that with high confidence at the moment. In large part this is due to the curation process.
