chanzuckerberg / single-cell-curation

Code and documentation for the curation of cellxgene datasets
MIT License

Evolve development_stage_ontology_term_id to support multiple species #1033

Open brianraymor opened 1 week ago

brianraymor commented 1 week ago

Please review the current schema requirements for development_stage_ontology_term_id.

The proposal is to adopt a conceptual model similar to BGEE. Note: The BGEE model may be outdated. It also does not model all species that are prioritized for CELLxGENE.

The top-level requirement becomes:

This MUST be the most accurate descendant of UBERON:0000105 for life cycle stage, excluding UBERON:0000071 for death stage.
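To make the top-level rule concrete, here is a minimal sketch of how such a check might look. It is not the project's validator. The `TOY:*` identifiers and the `TOY_PARENTS` map are placeholders invented for illustration (a real check would load UBERON), and it assumes that "descendant" is strict, so UBERON:0000105 itself is rejected, and that only the death stage term itself is excluded.

```python
from collections import deque

LIFE_CYCLE_STAGE = "UBERON:0000105"  # life cycle stage (from the proposal)
DEATH_STAGE = "UBERON:0000071"       # death stage (from the proposal)

# Hypothetical child -> parents map. TOY:* terms are placeholders,
# not real ontology identifiers.
TOY_PARENTS = {
    DEATH_STAGE: [LIFE_CYCLE_STAGE],
    "TOY:embryo": [LIFE_CYCLE_STAGE],
    "TOY:blastula": ["TOY:embryo"],
    "TOY:unrelated": ["TOY:other_root"],
}


def ancestors(term, parents):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    queue = deque(parents.get(term, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(parents.get(current, []))
    return seen


def is_valid_stage(term, parents=TOY_PARENTS):
    """True if `term` is a strict descendant of life cycle stage and is not death stage."""
    if term == DEATH_STAGE:
        return False
    return LIFE_CYCLE_STAGE in ancestors(term, parents)
```

Note that a check like this cannot tell whether a term is the "most accurate" one; that part of the rule still depends on the curator.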

For cases where species-specific development stage ontologies such as HsapDv, MmusDv, or ZFS exist, there will also be a table of STRONGLY RECOMMENDED UBERON and species-specific ontology terms.

For cases where species-specific development stage ontologies do not exist, such as for Ambystoma mexicanum (axolotl), AND there is a community source for documented development stages such as a published paper, there will also be a table of STRONGLY RECOMMENDED UBERON terms mapped to the published development stages for the species. (There are details/examples in the linked issue.)

For cases where species-specific development stage ontologies do not exist AND there is no community source for documented development stages such as a published paper, no further STRONGLY RECOMMENDED guidance is offered. Note: This may be theoretical.
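The three cases above amount to a small decision table. A rough sketch, using a hypothetical `guidance_tier` function and made-up label strings that are not part of the schema:

```python
# Hypothetical labels for the three tiers of STRONGLY RECOMMENDED guidance.
SPECIES_ONTOLOGY_TABLE = "UBERON and species-specific ontology terms"
COMMUNITY_SOURCE_TABLE = "UBERON terms mapped to published development stages"
NO_FURTHER_GUIDANCE = "no further guidance"


def guidance_tier(has_species_ontology: bool, has_community_source: bool) -> str:
    """Pick which STRONGLY RECOMMENDED table, if any, applies to a species."""
    if has_species_ontology:
        # e.g. HsapDv, MmusDv, or ZFS exists for the species
        return SPECIES_ONTOLOGY_TABLE
    if has_community_source:
        # e.g. a published paper documents the stages (the axolotl case)
        return COMMUNITY_SOURCE_TABLE
    return NO_FURTHER_GUIDANCE
```

In every tier the top-level UBERON requirement still applies; the function only selects the extra recommended table.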

BAevermann commented 1 week ago

For cases where there is a species-specific development stage ontology, why not consider its usage as "REQUIRED"? I am specifically thinking about human and mouse, as the terms in UBERON are a clear downgrade compared to the curation currently available.

brianraymor commented 1 week ago

We depend on the kindness of curators to define the most accurate development stage terms. For example, the schema only requires

If organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, this MUST be the most accurate descendant of HsapDv:0000001 for life cycle with the following STRONGLY RECOMMENDED: ... followed by a list of HsapDv terms.

There's nothing preventing a submitter from selecting a high-level HsapDv term such as embryonic stage.

Further, the development stage ontologies duplicate the UBERON high-level hierarchical terms for stages such as blastula stage. For example, HsapDv vs UBERON.

The schema could certainly define tables per species with REQUIRED and STRONGLY RECOMMENDED UBERON and species specific ontology terms.

| For | Use |
| --- | --- |
| UBERON stage | A term from the set of Carnegie stages 1-23 (up to 8 weeks after conception; e.g. HsapDv:0000003) |
| UBERON stage | A term from the set of 9 to 38 week post-fertilization human stages (9 weeks after conception and before birth; e.g. HsapDv:0000046) |

If @jahilton and @jychien believe that we could strengthen the requirements for development stages to block high-level stages, then that's another possibility - MUST USE A term from the set of Carnegie stages 1-23
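One way a stricter "MUST USE" rule could block high-level stages is an allow-list per species, sketched below. The `ALLOWED_TERMS` contents are illustrative only: the two HsapDv identifiers are the examples quoted in this thread, the sets are deliberately incomplete, and `TOY:high_level_stage` is a placeholder rather than a real term.

```python
# Hypothetical allow-list keyed by organism term. A real list would
# enumerate every permitted stage term for the species.
ALLOWED_TERMS = {
    "NCBITaxon:9606": {
        "HsapDv:0000003",  # example Carnegie stage term quoted in the thread
        "HsapDv:0000046",  # example 9 to 38 week post-fertilization term
    },
}


def passes_strict_rule(organism_id: str, stage_id: str) -> bool:
    """True only if the stage term is explicitly allowed for the organism."""
    return stage_id in ALLOWED_TERMS.get(organism_id, set())


# A high-level term is rejected simply because it is not on the list.
print(passes_strict_rule("NCBITaxon:9606", "HsapDv:0000003"))
print(passes_strict_rule("NCBITaxon:9606", "TOY:high_level_stage"))
```

The trade-off is maintenance: an allow-list has to be updated whenever the ontology adds stage terms, whereas the current descendant rule does not.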

Currently, we're in the middle of the multiple species and relaxed schema experiment - but if multiple species begin to surface in the CELLxGENE Discover UX, then I'd expect that @niknak33 and @hthomas-czi may prefer to simplify the Development Stages UX Filter to be species neutral and rely more on the UBERON terms. The current design was based on constraints that are no longer valid.

jahilton commented 3 days ago

I would support requiring the species-specific Dv ontology to be used, like we currently do for human & mouse, "For cases where species specific development stages ontologies...exist". I don't see any reason to allow an UBERON term in those cases.