chanzuckerberg / single-cell-curation

Code and documentation for the curation of cellxgene datasets
MIT License

Add `"observation_expression_id"` to `obs` #703

Open pablo-gar opened 9 months ago

pablo-gar commented 9 months ago

Motivation

Census users with modeling needs have expressed interest in identifying true numerical duplicates among cells. This differs from `is_primary_data`, which is curation-based: in some cases a cell that is not primary data is also not a true numerical duplicate.

In addition, this would help us identify errors in `is_primary_data` annotations, where two or more duplicate cells have been annotated as `is_primary_data=True`.
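A minimal sketch of that annotation check, assuming an `obs` dataframe that already carries a per-cell expression ID. The column name `observation_expression_id` is taken from the issue title, and the helper `flag_conflicting_primary` is hypothetical:

```python
import pandas as pd


def flag_conflicting_primary(
    obs: pd.DataFrame, id_col: str = "observation_expression_id"
) -> pd.DataFrame:
    """Return rows whose expression ID has more than one primary cell.

    `id_col` is an assumed column holding one ID per count vector.
    """
    # Count, per expression ID, how many cells are marked as primary data.
    n_primary = (
        obs["is_primary_data"].astype(int).groupby(obs[id_col]).transform("sum")
    )
    # More than one primary cell per ID suggests an annotation error.
    return obs[n_primary > 1]
```

Rows returned by this helper would be candidates for curation review, not confirmed errors.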

Proposed metadata variable

A variable in `obs` that holds an ID derived from each cell's numerical vector of counts. We have built a prototype in Census showing how to implement this; it boils down to hashing the numerical vector of each cell.

See the prototype here. https://github.com/chanzuckerberg/cellxgene-census/blob/main/tools/cell_dup_check/finddups.ipynb
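As a rough illustration of the idea (not the notebook's actual implementation), the sketch below hashes each row of a sparse count matrix. The function name `expression_ids`, the choice of SHA-1, and the byte layout are all assumptions made for this example:

```python
import hashlib

import numpy as np
from scipy import sparse


def expression_ids(X) -> list[str]:
    """Return one hex digest per row (cell) of a count matrix."""
    X = sparse.csr_matrix(X, copy=True)
    X.sum_duplicates()   # merge repeated entries and sort column indices
    X.eliminate_zeros()  # explicitly stored zeros must not change the hash
    ids = []
    for i in range(X.shape[0]):
        start, end = X.indptr[i], X.indptr[i + 1]
        h = hashlib.sha1()
        # Fixed little-endian dtypes so the digest does not depend on
        # the matrix's storage dtype or the platform byte order.
        h.update(X.indices[start:end].astype("<i8").tobytes())
        h.update(X.data[start:end].astype("<f8").tobytes())
        ids.append(h.hexdigest())
    return ids


# Toy example: rows 0 and 2 hold identical counts, row 1 differs.
counts = np.array([[0, 3, 0, 1], [2, 0, 0, 0], [0, 3, 0, 1]])
ids = expression_ids(counts)
print(ids[0] == ids[2], ids[0] == ids[1])
```

Two cells share an ID here only when their nonzero genes and counts match exactly, so any normalization or gene filtering applied upstream would change the result.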

jahilton commented 9 months ago

Is there a reason that this isn't just kept as internal validation so that users can rely on is_primary_data curation?

pablo-gar commented 3 months ago

There are cells marked as `is_primary_data = False` that are not true numerical replicates of their counterpart primary cells. Modelers have expressed interest in including those cells for modeling, but currently there is no easy way to identify them.

jahilton commented 3 months ago

Can you clarify why modelers want to identify numerical replicates? Is it so they can avoid/filter them? Or something else?

brianraymor commented 3 months ago

@pablo-gar - is there a 1/2 pager or related census issue that motivated your prototype?

pablo-gar commented 3 months ago

After a call, we decided to investigate further the value of this field for external users, so that we can better understand how to support them.

I recommend the following tasks, which I can own:

  1. Interview 3-5 users (modelers) who have expressed interest in getting access to data that are not numerical duplicates even if they come from the same biological source. The main questions we would like to ask are:
    • What is your main use for these non-numerical duplicates? So far we have heard that they help with batch correction, and that more data is better.
    • What level of granularity are you interested in? The two extremes of the spectrum are: two observations from the same cell that differ only in the expression of a single gene, and two observations from the same cell that are numerically orthogonal.
  2. Characterize the level of granularity that exists in Census in terms of similarity of observations, so that we can understand to what extent observations from the same cell vary. This work can be completed after @mlin completes https://github.com/chanzuckerberg/single-cell/issues/636.
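One possible way to place a pair of observations on the spectrum described above is to count the genes whose values differ and to compute the cosine similarity of the two vectors. The sketch below is only an illustration; the helper `compare_observations` and the choice of these two metrics are assumptions, not an agreed method:

```python
import numpy as np


def compare_observations(a, b) -> tuple[int, float]:
    """Return (number of genes that differ, cosine similarity) for two cells."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Genes whose values are not identical between the two observations.
    n_diff = int(np.count_nonzero(a != b))
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Cosine similarity is undefined if either vector is all zeros.
    cosine = float(a @ b / denom) if denom > 0 else float("nan")
    return n_diff, cosine


# One extreme: the vectors differ in a single gene.
print(compare_observations([1, 0, 2, 0], [1, 0, 3, 0]))
# The other extreme: the vectors are orthogonal (cosine similarity of 0).
print(compare_observations([1, 0, 2, 0], [0, 4, 0, 0]))
```

Summarizing these two numbers over pairs of observations known to come from the same cell would give a first picture of how much such observations vary.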

Timeline:

DEFERRED DUE TO INTERNAL ORG CHANGES

1. Interview 3-5 users (modelers).

2. Characterize the level of granularity.

ivirshup commented 1 month ago

I've been looking into this a bit, and wanted to add a couple points here:

Doc improvements

I think we should have a more nuanced and detailed discussion of what exactly the is_primary_data field means and why you may want to use it in the docs.

Some examples of when you may want to use this:

Example

```python
import cellxgene_census

census = cellxgene_census.open_soma(census_version="2024-07-01")
obs = cellxgene_census.get_obs(census, "homo_sapiens")
obs.groupby("cell_type")["is_primary_data"].mean().sort_values().head(10)
```

```
cell_type
A2 amacrine cell                                                     0.0
OFF retinal ganglion cell                                            0.0
ON retinal ganglion cell                                             0.0
stromal cell of lamina propria of small intestine                    0.0
smooth muscle cell of small intestine                                0.0
smooth muscle cell of large intestine                                0.0
leptomeningeal cell                                                  0.0
CD56-positive, CD161-positive immature natural killer cell, human    0.0
gut absorptive cell                                                  0.0
type II NK T cell                                                    0.0
```

Computationally detecting cells

I haven't been able to run the finddups.ipynb notebook on recent tiledbsoma releases, so I can't see its current status. However, I think there are multiple reasons to believe that it may produce false positives.

That said, I do think this is probably useful for identifying samples shared between datasets; it just might be wrong on individual cells.


I would like to know which observations are actually the same cell, but I don't think we can determine that with high confidence at the moment. In large part this is due to the curation process.