Closed brianraymor closed 1 year ago
Outdated
What does `c(primary, established)` mean?
Is there a reason `Tissue` is not lowercase?
Is part of the proposal to no longer append tissue_ontology_term_id/tissue values with (organoid)/(cell culture)?
For alignment of the two properties, I would propose tissue_ontology_term_id --> sample_ontology_term_id
This also avoids annotating tissue_* information for non-tissue data.
Outdated
- The primary/established values referred to the types of cell culture we might want to capture in more detail.
- I fixed the capital T in tissue.
- This would be an alternative to appending it to the Uberon term. From the outreach conducted thus far, we are unique in representing data this way.
I support the renaming to "sample_ontology_term_id". In this case would we propose adding CLO ids to the list of valid values in support of established cultures?
Outdated
Why would we ask curators to capture `primary` or `established`? Can you speak more to the subsequent value for data consumers? Generally, it would be helpful to share use cases or background as @jahilton did in this case to further my education.
Outdated
Primary cell lines and Established cell lines test very different biological hypotheses.
Primary cell lines are cells that were taken from a donor and cultured for a limited time. These can be used to test things like a disease mechanism in a specific cell type from a patient. Alternatively, these cell lines can be used to investigate normal cell types from donors.
Established cell lines are immortal, either because they were cancerous (isolated from a tumor) or were immortalized by making them cancerous. As a consequence, these cell lines primarily serve to investigate cancer-related biology.
From a modeling standpoint, beyond just the interpretation of results, the relationship of the primary cell line to the tissue of origin is much more important than for established cell lines (as they are a bit Frankensteinian). As such, we may want to model the primary cell line experiments as: tissue (Uberon) == tissue of origin --> sample_type == primary cell line --> cell_type == whatever is found. Whereas established cell lines would be modeled as: tissue == (CLO) --> sample_type == established cell line --> cell_type (CL) == "unknown".
- One immediate impact for the user would be the ability to filter for or against cell lines. In my experience as a data analyst, I have done this frequently.
- Another benefit would be to collect the CLO labels of established cell lines. These labels are not only controlled by the domain ontology but stem from the manufacturers/distributors who maintain and monitor these lines.
- Lastly, we could potentially extend our support for cell lines by using the "sample_type" as a filter within tools such as scExpression. As of right now, I believe these experiments are left out because the Uberon_ID+(label) does not make it into the selection list. With the added criteria we can give the UX designers and ultimately the users more ways to interrogate the corpus.
Outdated
At ENCODE, we had a difficult time zeroing in on a single ontology for cell lines. To get full coverage, we would have needed to accept CLO and EFO, maybe BTO, as well.
~~If we had a list of cell lines used thus far in the single-cell community, would be great to see which have terms in which ontologies.
If CLO doesn't have some terms, I don't have a lot of faith in getting them added based on this comment.~~
Outdated
That was timely, I was wondering about EFO and CLO based on this reference.
Outdated
~~That comment doesn't bode well. Does the ATCC release a downloadable catalogue? (I don't see it on their website, but I would imagine that we aren't the only ones who would want it...)~~
Outdated
Does `sample_type` also include the way that samples are taken from a donor? In the HLCA we used a field named exactly this to encode things like "donor lung", "biopsy", "brush", "surgical resection", etc (see sample metadata from HLCA in download link here).
If it should contain this, what about cases where a sample was taken from a donor by biopsy and then cultured in a primary culture? Would you then assign two different ontology terms?
If it shouldn't contain this, it might be worth making the naming more clear.
Outdated
@BAevermann wrote:
> Does the ATCC release a downloadable catalogue?
I have not seen a consumable spreadsheet.
There are resources like Cell Lines by Gene Mutation. See their Reference Material
Per discussion with @BAevermann @jahilton @norbid, there is consensus to make minor changes to the representation so that downstream dependents can filter cases without making string comparisons on labels. Currently, there is not adequate ontology support to introduce further refinements for cell cultures and organoids as originally suggested.
The proposal is to replace ad-hoc labels with an obs column (`sample_type` or `tissue_type`) with an enumeration of values limited to:
Example for brain organoid:
tissue_type: "organoid"
tissue_ontology_term_id: UBERON:0000955
tissue_ontology_label: "brain"
Example for cell culture:
tissue_type: "cell culture"
tissue_ontology_term_id: CL:0000057
tissue_ontology_label: "fibroblast"
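To illustrate the point about filtering without string comparisons on labels, here is a minimal pandas sketch. The `obs` frame is made up for illustration: the first two rows mirror the examples above, and the third (a lung tissue row) is an assumption added for contrast. The column name `tissue_type` is one of the two candidates named in the proposal.

```python
import pandas as pd

# Hypothetical obs rows; values are illustrative only.
obs = pd.DataFrame(
    {
        "tissue_type": ["organoid", "cell culture", "tissue"],
        "tissue_ontology_term_id": ["UBERON:0000955", "CL:0000057", "UBERON:0002048"],
    }
)
obs["tissue_type"] = obs["tissue_type"].astype("category")

# Keep only tissue samples with an exact match on the enumeration,
# rather than parsing " (organoid)" / " (cell culture)" out of a label.
tissue_only = obs[obs["tissue_type"] == "tissue"]

# Select the organoid and cell culture rows in one step.
non_tissue = obs[obs["tissue_type"].isin(["organoid", "cell culture"])]
```

With a dedicated column, a consumer such as the Census builder could exclude organoids and cell cultures with a single equality check.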
CC: @pablo-gar - thoughts?
What's the difference between "Tissue" and "Isolated Cells from a Tissue"?
Also, we will need to do a coordinated effort to update the Census builder, as organoids and cell types must not be included per the census schema.
I think `Enriched, Sorted, or Isolated Cells from a Tissue` should not be an option.
Thus far, these have been represented the same way as any Tissue sample. So we don't have any way of knowing which Datasets in the corpus currently fit this description.
And experimentally, you can get into some serious gray area about how much can be done to a suspension before it's considered enriched/sorted/isolated above normal procedures for tissue sampling.
Agreed. A perfect example is PBMCs: they are strictly isolated cells, but for most analyses/integrations they are effectively treated as "tissue" samples.
Maybe:
> Also, we will need to do a coordinated effort to update the Census builder, as organoids and cell types must not be included per the census schema.
When we refine the related parent epic - all downstream dependencies will be modeled in zenhub. That's how we manage schema changes.
I'd prefer that Tissue requires UBERON and Cell Culture requires CL. And if there are additional categories, then they must use either UBERON or CL but not both. Otherwise, it's a grab bag.
I agree that "Enriched, Sorted, or Isolated Cells from a Tissue" would be a slippery slope, especially since updating published data would be a large curation task. I suggest moving forward with the possible entries being Tissue, Cell Culture, and Organoid, and revising if needed.
I did a quick assessment. There are NO cases in the corpus like the one described by:
_However, in the case of EPCAM+ cervical cells, use "CL:000066" for epithelial cell of the cervix._
When CL is present in Tissues, it's always annotated as a cell culture:
{'CL:0000010 (cell culture)',
'CL:0000082 (cell culture)',
'CL:0000115 (cell culture)',
'CL:0002322 (cell culture)',
'CL:0002327 (cell culture)',
'CL:0002328 (cell culture)',
'CL:0002633 (cell culture)',
'CL:0010003 (cell culture)',
'UBERON:0000002',
Context
See Should organoids and cell cultures be updated or removed?
Design
`obs` (Cell Metadata)
`obs` is a `pandas.DataFrame`.
Curators MUST annotate the following columns in the `obs` dataframe:
`tissue_type`
`str` categories. This MUST be `"tissue"`, `"organoid"`, or `"cell culture"`.
`tissue_ontology_term_id`
`str` categories. If `tissue_type` is `"tissue"` or `"organoid"`, this MUST be the most accurate child of `UBERON:0001062` for anatomical entity.
If `tissue_type` is `"cell culture"` this MUST follow the requirements for `cell_type_ontology_term_id`.
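As a rough illustration of how the two rules above might be checked, here is a minimal sketch. It is not the actual CELLxGENE validator: the function name `check_tissue_columns` is hypothetical, and it only inspects the CURIE prefix. It does not verify that a term descends from `UBERON:0001062`, nor the full requirements for `cell_type_ontology_term_id`, both of which would need an ontology lookup.

```python
import pandas as pd

ALLOWED_TISSUE_TYPES = {"tissue", "organoid", "cell culture"}

# Assumption: the required ontology is inferred from the CURIE prefix alone.
REQUIRED_PREFIX = {"tissue": "UBERON", "organoid": "UBERON", "cell culture": "CL"}


def check_tissue_columns(obs: pd.DataFrame) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    errors = []
    for idx, row in obs.iterrows():
        tissue_type = row["tissue_type"]
        term_id = str(row["tissue_ontology_term_id"])
        if tissue_type not in ALLOWED_TISSUE_TYPES:
            errors.append(f"{idx}: invalid tissue_type {tissue_type!r}")
            continue
        prefix = REQUIRED_PREFIX[tissue_type]
        if not term_id.startswith(prefix + ":"):
            errors.append(
                f"{idx}: {term_id!r} is not a {prefix} term "
                f"for tissue_type {tissue_type!r}"
            )
    return errors
```

Under these assumptions, a row with `tissue_type` of `"cell culture"` and an UBERON term would be reported, which matches the preference that each category use either UBERON or CL but not both.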
...
When a dataset is uploaded, CELLxGENE Discover MUST automatically add the matching human-readable name for the corresponding ontology term to the `obs` dataframe. Curators MUST NOT annotate the following columns.
`tissue`
`str` categories. This MUST be the human-readable name assigned to the value of `tissue_ontology_term_id`.
Appendix A. Changelog
schema v4.0.0
Added `tissue_type`
Removed `" (cell culture)"` and `" (organoid)"` suffixes from `tissue_ontology_term_id` and `tissue`.
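For existing datasets, the suffixed values would need to be split into the two columns. The following is a sketch of one way that could be done, not the project's migration code; the helper names `split_tissue_suffix` and `migrate_obs` are hypothetical. It assumes the old values end with exactly `" (cell culture)"` or `" (organoid)"`, and that anything without a suffix is a tissue.

```python
import pandas as pd

SUFFIX_TO_TYPE = {" (cell culture)": "cell culture", " (organoid)": "organoid"}


def split_tissue_suffix(value: str) -> tuple[str, str]:
    """Return (term id without suffix, tissue_type) for one old-style value."""
    for suffix, tissue_type in SUFFIX_TO_TYPE.items():
        if value.endswith(suffix):
            return value[: -len(suffix)], tissue_type
    # Assumption: unsuffixed values were ordinary tissue samples.
    return value, "tissue"


def migrate_obs(obs: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of obs with suffixes moved into a tissue_type column."""
    out = obs.copy()
    pairs = [split_tissue_suffix(v) for v in out["tissue_ontology_term_id"].astype(str)]
    out["tissue_ontology_term_id"] = pd.Categorical([term for term, _ in pairs])
    out["tissue_type"] = pd.Categorical([kind for _, kind in pairs])
    return out
```

The `tissue` label column is left alone here, since the design states that CELLxGENE Discover adds it from `tissue_ontology_term_id` on upload.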