chanzuckerberg / single-cell-curation

Code and documentation for the curation of cellxgene datasets
MIT License

Add tissue_type #240

Closed brianraymor closed 1 year ago

brianraymor commented 2 years ago

Context

See Should organoids and cell cultures be updated or removed?


Design

obs (Cell Metadata)

obs is a pandas.DataFrame.

Curators MUST annotate the following columns in the obs dataframe:

tissue_type

Key tissue_type
Annotator Curator
Value categorical with str categories. This MUST be "tissue", "organoid", or "cell culture".


tissue_ontology_term_id

Key tissue_ontology_term_id
Annotator Curator
Value categorical with str categories. If tissue_type is "tissue" or "organoid", this MUST be the most accurate child of UBERON:0001062 for anatomical entity.

If tissue_type is "cell culture" this MUST follow the requirements for cell_type_ontology_term_id.
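The two requirements above can be sketched as a per-row check. This is only an illustration, not the validator used by CELLxGENE Discover: the function name `check_tissue_annotation` is hypothetical, and a simple prefix test (`UBERON:` or `CL:`) stands in for a real ontology lookup, so it does not verify that a term is actually a child of UBERON:0001062 or a valid CL term.

```python
ALLOWED_TISSUE_TYPES = ("tissue", "organoid", "cell culture")

# Assumption: a CURIE prefix check stands in for a real ontology lookup.
REQUIRED_PREFIX = {
    "tissue": "UBERON:",
    "organoid": "UBERON:",
    "cell culture": "CL:",
}


def check_tissue_annotation(tissue_type: str, term_id: str) -> list:
    """Return a list of problems for one obs row; an empty list means none were found."""
    if tissue_type not in ALLOWED_TISSUE_TYPES:
        return [f"invalid tissue_type: {tissue_type!r}"]
    prefix = REQUIRED_PREFIX[tissue_type]
    if not term_id.startswith(prefix):
        return [f"{term_id!r} must start with {prefix!r} when tissue_type is {tissue_type!r}"]
    return []


# Example rows as (tissue_type, tissue_ontology_term_id) pairs.
rows = [
    ("organoid", "UBERON:0000955"),
    ("cell culture", "CL:0000057"),
    ("tissue", "CL:0000057"),        # wrong ontology for "tissue"
    ("Tissue", "UBERON:0000955"),    # wrong capitalization
]
for tissue_type, term_id in rows:
    print(tissue_type, term_id, check_tissue_annotation(tissue_type, term_id))
```

In practice the same check could be applied to each row of the obs dataframe, with the prefix test replaced by a query against the ontology itself.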


...

When a dataset is uploaded, CELLxGENE Discover MUST automatically add the matching human-readable name for the corresponding ontology term to the obs dataframe. Curators MUST NOT annotate the following columns.

tissue

Key tissue
Annotator CELLxGENE Discover
Value categorical with str categories. This MUST be the human-readable name assigned to the value of tissue_ontology_term_id.
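As a rough illustration of how the tissue column could be derived, the sketch below maps each term ID to a label with pandas. The function `add_tissue_labels` and the `TERM_LABELS` table are hypothetical; the two entries are taken from the examples in this thread, and a real implementation would presumably read labels from the ontologies rather than a hard-coded dict.

```python
import pandas as pd

# Hypothetical lookup table; the two entries come from the examples in this thread.
TERM_LABELS = {"UBERON:0000955": "brain", "CL:0000057": "fibroblast"}


def add_tissue_labels(obs: pd.DataFrame, labels: dict) -> pd.DataFrame:
    """Return a copy of obs with a categorical 'tissue' column derived from term IDs."""
    term_ids = obs["tissue_ontology_term_id"].astype(str)
    unknown = sorted(set(term_ids) - set(labels))
    if unknown:
        # Fail loudly rather than silently writing missing labels.
        raise ValueError(f"no label available for: {unknown}")
    result = obs.copy()
    result["tissue"] = pd.Categorical(term_ids.map(labels))
    return result


obs = pd.DataFrame(
    {
        "tissue_ontology_term_id": pd.Categorical(
            ["UBERON:0000955", "CL:0000057", "UBERON:0000955"]
        )
    }
)
labeled = add_tissue_labels(obs, TERM_LABELS)
print(labeled)
```

Returning a copy leaves the curator-supplied dataframe untouched, which keeps the curator-annotated and system-annotated columns easy to tell apart.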



Appendix A. Changelog

schema v4.0.0

jahilton commented 2 years ago

Outdated What does c(primary, established) mean? Is there a reason Tissue is not lowercase? Is part of the proposal to no longer append tissue_ontology_term_id/tissue values with (organoid)/(cell culture)?

For alignment of the two properties, I would propose tissue_ontology_term_id --> sample_ontology_term_id This also avoids annotating tissue_* information for non-tissue data.

BAevermann commented 2 years ago

Outdated
- The primary/established values referred to the types of cell cultures we might want to capture in more detail.
- I fixed the capital T in tissue.
- This would be an alternative to appending it to the Uberon term. From the outreach conducted thus far, we are unique in representing data this way.

I support the renaming to "sample_ontology_term_id". In this case would we propose adding CLO ids to the list of valid values in support of established cultures?

brianraymor commented 2 years ago

Outdated Why would we ask curators to capture primary or established? Can you speak more to the subsequent value for data consumers? Generally, it would be helpful to share use cases or background as @jahilton did in this case to further my education.

BAevermann commented 2 years ago

Outdated Primary cell lines and Established cell lines test very different biological hypotheses.

Primary cell lines are cells that were taken from a donor and cultured for a limited time. These can be used to test things like a disease mechanism in a specific cell type from a patient. Alternatively, these cell lines can be used to investigate normal cell types from donors.

Established cell lines are immortal, either because they were cancerous (isolated from a tumor) or were immortalized by making them cancerous. As a consequence, these cell lines primarily serve to investigate cancer related biology.

From a modeling standpoint, beyond just the interpretation of results, the relationship of the primary cell line to the tissue of origin is much more important than for established cell lines (as they are a bit Frankensteinian). As such, we may want to model the primary cell line experiments as: tissue (Uberon) == tissue of origin --> sample_type == primary cell line --> cell_type == whatever is found. Whereas established cell lines would be modeled as: tissue == (CLO) --> sample_type == established cell line --> cell_type (CL) == "unknown".

- One immediate impact for the user would be the ability to filter for or against cell lines. In my experience as a data analyst, I have done this frequently.
- Another benefit would be to collect the CLO labels of established cell lines. These labels are not only controlled by the domain ontology but stem from the manufacturers/distributors who maintain and monitor these lines.
- Lastly, we could potentially extend our support for cell lines by using the "sample_type" as a filter within tools such as scExpression. As of right now, I believe these experiments are left out because the Uberon_ID+(label) does not make it into the selection list. With the added criteria we can give the UX designers and ultimately the users more ways to interrogate the corpus.

jahilton commented 2 years ago

Outdated At ENCODE, we had a difficult time zeroing in on a single ontology for cell lines. To get full coverage, we would have needed to accept CLO and EFO, maybe BTO, as well. ~~If we had a list of cell lines used thus far in the single-cell community, would be great to see which have terms in which ontologies. If CLO doesn't have some terms, I don't have a lot of faith in getting them added based on this comment.~~

brianraymor commented 2 years ago

Outdated That was timely, I was wondering about EFO and CLO based on this reference.

BAevermann commented 2 years ago

Outdated ~~That comment doesn't bode well. Does the ATCC release a downloadable catalogue? (I don't see it on their website, but I would imagine that we aren't the only ones who would want it...)~~

LuckyMD commented 2 years ago

Outdated Does sample_type also include the way that samples are taken from a donor? In the HLCA we used a field named exactly this to encode things like "donor lung", "biopsy", "brush", "surgical resection", etc (see sample metadata from HLCA in download link here).

If it should contain this, what about cases where a sample was taken from a donor by biopsy and then cultured in a primary culture? Would you then assign two different ontology terms?

If it shouldn't contain this, it might be worth making the naming more clear.

brianraymor commented 1 year ago

Outdated

@BAevermann wrote:
> Does the ATCC release a downloadable catalogue?

I have not seen a consumable spreadsheet.

There are resources like Cell Lines by Gene Mutation. See their Reference Material

brianraymor commented 1 year ago

Per discussion with @BAevermann @jahilton @norbid , there is consensus to make minor changes to the representation to make it easier for downstream dependents to filter cases without making string comparisons on labels. Currently, there is not adequate ontology support to introduce further refinements for cell cultures and organoids as originally suggested.

The proposal is to replace ad-hoc labels with an obs column (sample_type or tissue_type) with an enumeration of values limited to:

Example for brain organoid:

tissue_type: "organoid"
tissue_ontology_term_id: UBERON:0000955
tissue_ontology_label: "brain"

Example for cell culture:

tissue_type: "cell culture"
tissue_ontology_term_id: CL:0000057
tissue_ontology_label: "fibroblast"
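To make the change concrete, here is a minimal sketch of how an existing suffixed value might be split into the two proposed fields. It assumes the legacy format is a term ID optionally followed by " (organoid)" or " (cell culture)", and that a value with no suffix means "tissue". The function name `split_legacy_tissue` and the regular expression are illustrative only, not part of any migration tooling.

```python
import re

# Assumed legacy format: "<term id>", optionally followed by " (organoid)" or " (cell culture)".
LEGACY_PATTERN = re.compile(
    r"^(?P<term>[A-Za-z]+:\d+)(?: \((?P<suffix>organoid|cell culture)\))?$"
)


def split_legacy_tissue(value: str) -> tuple:
    """Split a legacy tissue_ontology_term_id value into (tissue_type, term_id)."""
    match = LEGACY_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognized tissue value: {value!r}")
    # Assumption: a value without a suffix describes a plain tissue sample.
    tissue_type = match.group("suffix") or "tissue"
    return tissue_type, match.group("term")


for legacy in ["UBERON:0000955 (organoid)", "CL:0000057 (cell culture)", "UBERON:0000002"]:
    print(legacy, "->", split_legacy_tissue(legacy))
```

With the type held in its own column, downstream code can compare against the enumeration directly instead of parsing the suffix out of each string.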

CC: @pablo-gar - thoughts?

pablo-gar commented 1 year ago

What's the difference between "Tissue" and "Isolated Cells from a Tissue"?

pablo-gar commented 1 year ago

Also, we will need to do a coordinated effort to update the Census builder, as organoids and cell types must not be included per the census schema.

https://github.com/chanzuckerberg/cellxgene-census/blob/main/docs/cellxgene_census_schema.md#sample-types
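For illustration only, a downstream consumer could apply this kind of exclusion with a single filter on the new column. The sketch below is not the Census builder's code; `select_by_tissue_type` is a hypothetical helper, and it assumes obs already carries the proposed tissue_type column.

```python
import pandas as pd


def select_by_tissue_type(obs: pd.DataFrame, allowed=("tissue",)) -> pd.DataFrame:
    """Return only the rows of obs whose tissue_type is one of the allowed values."""
    mask = obs["tissue_type"].isin(list(allowed))
    return obs.loc[mask]


obs = pd.DataFrame(
    {
        "tissue_type": pd.Categorical(["tissue", "organoid", "cell culture"]),
        "tissue_ontology_term_id": ["UBERON:0000002", "UBERON:0000955", "CL:0000057"],
    },
    index=["cell_a", "cell_b", "cell_c"],
)
tissue_only = select_by_tissue_type(obs)
print(tissue_only)
```

Passing a different `allowed` tuple, for example `("tissue", "organoid")`, would widen the selection without any string matching on labels.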

jahilton commented 1 year ago

I think Enriched, Sorted, or Isolated Cells from a Tissue should not be an option. Thus far, these have been represented the same way as any Tissue sample. So we don't have any way of knowing which Datasets in the corpus currently fit this description. And experimentally, you can get into some serious gray area about how much can be done to a suspension before it's considered enriched/sorted/isolated above normal procedures for tissue sampling.

pablo-gar commented 1 year ago

Agreed. A perfect example is PBMCs: they are strictly isolated cells, but for most analyses/integrations they are effectively treated as "tissue" samples.

Maybe:

brianraymor commented 1 year ago

> Also, we will need to do a coordinated effort to update the Census builder, as organoids and cell types must not be included per the census schema.

When we refine the related parent epic - all downstream dependencies will be modeled in zenhub. That's how we manage schema changes.

brianraymor commented 1 year ago

I'd prefer that Tissue requires UBERON and Cell Culture requires CL. And if there are additional categories, then they must use either UBERON or CL but not both. Otherwise, it's a grab bag.

BAevermann commented 1 year ago

I agree that "Enriched, Sorted, or Isolated Cells from a Tissue" would be a slippery slope, especially since updating published data would be a large curation task. I think we should move forward with the possible entries being Tissue, Cell Culture, and Organoid, and revise if needed.

brianraymor commented 1 year ago

I did a quick assessment. There are NO cases in the corpus like the one described by:

_However, in the case of EPCAM+ cervical cells, use "CL:000066" for epithelial cell of the cervix._

When CL is present in Tissues, it's always annotated as a cell culture:

{'CL:0000010 (cell culture)',
 'CL:0000082 (cell culture)',
 'CL:0000115 (cell culture)',
 'CL:0002322 (cell culture)',
 'CL:0002327 (cell culture)',
 'CL:0002328 (cell culture)',
 'CL:0002633 (cell culture)',
 'CL:0010003 (cell culture)',
 'UBERON:0000002',