cellannotation / cell-annotation-schema

General, open-standard schema for cell annotations

CAP <-> CAS schema sync #120

Closed: dosumis closed this issue 1 month ago

dosumis commented 1 month ago

Status: DRAFT

Evan wrote:

Note - this repo has now been retired https://github.com/cellannotation/cap_file_planning/

CAS field names and definitions can be reviewed here: https://github.com/cellannotation/cell-annotation-schema/blob/main/build/CAP_schema.md (Notes: this may lag, as it is only updated on release; it also includes CAP-specific fields)

Mapping between CAP schema and CAS is here: https://github.com/cellannotation/cell-annotation-schema/issues/43#issuecomment-1836508993

From eyeballing, the definitions look to be in sync, but the CAP schema file defines various new schema specification keys and splits content between them. Splitting content this way is possible in JSON Schema: additional, unspecified schema keys are allowed, they are just not read by any standard JSON Schema libs. However, splitting out content will have consequences for other projects using the schema, e.g. the BICAN taxonomy editor uses the JSON description field to populate its help fields. To do this strictly, we could specify these keys as JSON Schema extensions so that we can validate them. This would also allow us to specify the intent of each field for other users of the schema.
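To illustrate the point about split content, here is a minimal Python sketch using only the standard library. The key name `x_motivation` and the helper functions are hypothetical, chosen for illustration; they are not taken from CAS or CAP. It shows that a tool reading only `description` never sees guidance moved into a custom key.

```python
# Hypothetical schema fragment. "x_motivation" is an illustrative extension
# key, not a name defined by CAS or CAP.
SCHEMA_FRAGMENT = {
    "properties": {
        "cell_fullname": {
            "type": "string",
            "description": (
                "This MUST be the full-length name for the biological "
                "entity listed in cell_label by the author."
            ),
            "x_motivation": (
                "Scientists MUST provide the full term they prefer, "
                "not an abbreviation."
            ),
        }
    }
}

# The standard JSON Schema keywords used in this fragment.
STANDARD_KEYS = {"type", "description"}


def help_text(schema: dict, field: str) -> str:
    """Return what a tool that reads only `description` would display."""
    return schema["properties"][field].get("description", "")


def extension_keys(schema: dict, field: str) -> list:
    """Return the keys of a field that are not standard keywords here."""
    return sorted(set(schema["properties"][field]) - STANDARD_KEYS)


print(help_text(SCHEMA_FRAGMENT, "cell_fullname"))
print(extension_keys(SCHEMA_FRAGMENT, "cell_fullname"))
```

Under these assumptions, the help text contains only the definition, and the motivation text is reachable only by tools that know about the extra key.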

Example:

CAS:

cell_fullname (string): This MUST be the full-length name for the biological entity listed in cell_label by the author. (If the value in cell_label is the full-length term, this field will contain the same value.) NOTE: any reserved word used in the field 'cell_label' MUST match the value of this field. EXAMPLE 1: Given the matching terms 'LC' and 'luminal cell' used to annotate the same cell(s), then users could use either terms as values in the field 'cell_label'. However, the abbreviation 'LC' CANNOT be provided in the field 'cell_fullname'. EXAMPLE 2: Either the abbreviation 'AC' or the full-length term intended by the author 'GABAergic amacrine cell' MAY be placed in the field 'cell_label', but as full-length term naming this biological entity, 'GABAergic amacrine cell' MUST be placed in the field 'cell_fullname'.
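The reserved-word rule in this definition can be checked mechanically. The sketch below is an illustration only: it assumes the reserved words are the four keywords named in the CAP `value` row quoted later in this issue ('doublets', 'junk', 'unknown', 'NA'), and the function name and sample records are hypothetical. The abbreviation rule is not checked, since it depends on author intent.

```python
# Assumption: reserved words are the keywords listed in the CAP "value" row.
RESERVED_WORDS = frozenset({"doublets", "junk", "unknown", "NA"})


def reserved_word_mismatch(cell_label: str, cell_fullname: str) -> bool:
    """True when cell_label is a reserved word that cell_fullname does not repeat."""
    return cell_label in RESERVED_WORDS and cell_fullname != cell_label


# Hypothetical annotation records for illustration.
annotations = [
    {"cell_label": "AC", "cell_fullname": "GABAergic amacrine cell"},
    {"cell_label": "junk", "cell_fullname": "junk"},
    {"cell_label": "unknown", "cell_fullname": "luminal cell"},
]

violations = [
    a for a in annotations
    if reserved_word_mismatch(a["cell_label"], a["cell_fullname"])
]
print(violations)
```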

CAP (with added comment column)

| column | [cellannotation_setname]--cell_fullname | comment |
| --- | --- | --- |
| definition | This MUST be the full-length name for the biological entity listed in [cellannotation_setname] by the author. If the value in [cellannotation_setname] is the full-length term, this field will contain the same value.<br><br>NOTE: any reserved word used in the field [cellannotation_setname] MUST match the value of this field.<br><br>EXAMPLE 1: Given the matching terms 'LC' and 'luminal cell' used to annotate the same cell(s), then users could use either terms as values in the field [cellannotation_setname]. However, the abbreviation 'LC' CANNOT be provided in this field [cellannotation_setname]--cell_fullname.<br><br>EXAMPLE 2: Either the abbreviation 'AC' or the full-length term intended by the author 'GABAergic amacrine cell' MAY be placed in the field [cellannotation_setname], but as full-length term naming this biological entity, 'GABAergic amacrine cell' MUST be placed in this field [cellannotation_setname]--cell_fullname. | Identical to CAS |
| index | Cell barcode names | Not needed in CAS - implicit in schema |
| dtype | string. One string per cell. | Redundant with formal spec in JSON schema |
| value | The full-length name for the biological entity listed in [cellannotation_setname] by the author. NOTE: if any keyword 'doublets', 'junk', 'unknown', or 'NA' is used as a value in the field [cellannotation_setname], it MUST be used here in this column as well. | Not clear to me why this is not in the definition |
| source | file or UI | CAP specific - we could add to CAP extension |
| required upon publication | yes | CAP specific - we could add to CAP extension |
| column name in obs required upon upload | no | CAP specific - we could add to CAP extension |
| motivation/use case | Scientists MUST provide the full term they prefer for this biological entity (cell type or cell state), not an abbreviation. | Potentially useful to CAS general schema |
| example column name | If the user specified the cell annotation set as 'broad_cells1', then the name of the column in the pandas DataFrame will be 'cell_fullname--broad_cells1'. | This is flattening spec. I guess useful for telling data-submitters how to manually generate flattened content |
| example value | The field [cellannotation_setname] is reserved for abbreviations. It is reserved for any term the researcher/scientist may choose upon annotating cells. The field [cellannotation_setname]--cell_fullname MUST encode the full biological entity the researcher had in mind; no abbreviations allowed.<br><br>EXAMPLE 1:<br>cell_label: 'AC' (abbreviation)<br>cell_fullname: 'GABAergic amacrine cell'<br><br>EXAMPLE 2:<br>cell_label: 'LC' (abbreviation)<br>cell_fullname: 'Luminal cell'<br><br>EXAMPLE 3:<br>cell_label: 'Schwann cell'<br>cell_fullname: 'Schwann cell' (same entry) | Partly redundant with examples in definition. Need a general decision on whether we have a separate field. |
| CAP developer note | For the UI design, if a keyword ('doublets', 'junk', 'unknown', or 'NA') is chosen by the user, automatically fill in the other relevant fields with that same value. | CAP business logic. Could add to CAP extension if requested. |
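The flattening and keyword autofill described in the rows above could be sketched as follows. This is a rough illustration, not CAP's implementation. It assumes the column pattern given in the column header, `[cellannotation_setname]--cell_fullname` (the quoted "example column name" writes the two parts in the opposite order, and which form is authoritative is not settled here). The function names are hypothetical, plain dicts stand in for DataFrame rows, and only the full-name column is filled.

```python
# Keywords named in the CAP "value" row and developer note.
RESERVED_KEYWORDS = ("doublets", "junk", "unknown", "NA")


def fullname_column(annotation_set: str) -> str:
    """Build the flattened column name, assuming the header's pattern."""
    return f"{annotation_set}--cell_fullname"


def autofill_keyword(row: dict, annotation_set: str) -> dict:
    """Copy a reserved keyword from the label column into the full-name column.

    Returns a new dict; rows whose label is not a keyword are returned unchanged.
    """
    filled = dict(row)
    label = row.get(annotation_set)
    if label in RESERVED_KEYWORDS:
        filled[fullname_column(annotation_set)] = label
    return filled


row = autofill_keyword({"broad_cells1": "doublets"}, "broad_cells1")
print(row)
```

Keeping the naming rule in one small function means that a change of separator or ordering would touch a single place.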

How to proceed????

dosumis commented 1 month ago

CC @evanbiederstedt

dosumis commented 1 month ago

In discussion with @evanbiederstedt it became clear that he is happy with the simplifications to the descriptions/guidance in https://github.com/cellannotation/cell-annotation-schema/blob/main/docs/cap_anndata_schema.md . All that remains of relevance are the value and examples.

However, we have decided to use this as an opportunity to review existing field defs in CAS and try to improve them to be more READABLE. This work is described in #122

Related work to add missing fields to the CAS CAP extension is in PR #121

dosumis commented 1 month ago

Closing this ticket as superseded by the ticket and PR in the last comment.