Add CAP specific fields to extension

Added CAP specific fields from https://github.com/cellannotation/cap_file_planning/blob/main/cap_anndata_schema.md#cap-encoding-for-anndata-file to CAP_extension.json.

I have a couple of questions regarding the changes:

How do I handle the "required upon publication" property of the fields?
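JSON Schema draft-07 has no built-in notion of "required upon publication". One possible approach, sketched here as an assumption rather than as how CAS or CAP handle it, is to keep the base schema permissive and derive a stricter schema at publication time. The helper names `publication_schema` and `missing_required`, and the choice of which fields become mandatory, are hypothetical.

```python
import copy


def publication_schema(schema: dict, required_on_publication: list[str]) -> dict:
    """Return a copy of `schema` whose top-level `required` list also
    includes the fields that only become mandatory at publication time."""
    strict = copy.deepcopy(schema)
    required = strict.setdefault("required", [])
    for field in required_on_publication:
        if field not in required:
            required.append(field)
    return strict


def missing_required(document: dict, schema: dict) -> list[str]:
    """List top-level required fields that are absent from `document`."""
    return [f for f in schema.get("required", []) if f not in document]


# Illustrative only: which fields are required upon publication is an assumption.
draft = {"required": ["author_name"], "properties": {}}
strict = publication_schema(draft, ["cap_publication_title", "cap_publication_version"])
doc = {"author_name": "Jane Doe", "cap_publication_title": "Example workspace"}
print(missing_required(doc, draft))   # []
print(missing_required(doc, strict))  # ['cap_publication_version']
```

With this split, a draft file validates against the permissive schema while the publishing tool checks the stricter copy.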

The following file is generated when the extension and the general schema are merged:

{
"$schema": "http://json-schema.org/draft-07/schema#",
"title": "General Cell Annotation Open Standard",
"description": "A general, open-standard schema for cell annotations which records connections, types, provenance and evidence.\nThis is designed not to tie-in to a single project (i.e. no tool-specific fields in core schema),and allows for extensions to support ad hoc user fields, new formal schema extensions, and project/tool specific metadata.",
"type": "object",
"definitions": {
"Labelset": {
  "properties": {
    "name": {
      "type": "string",
      "description": "name of annotation key"
    },
    "description": {
      "type": "string",
      "description": "Some text describing what types of cell annotation this annotation key is used to record"
    },
    "annotation_method": {
      "description": "The method used for creating the cell annotations. This MUST be one of the following strings: `'algorithmic'`, `'manual'`, or `'both'` ",
      "type": "string",
      "enum": [
        "algorithmic",
        "manual",
        "both"
      ]
    },
    "automated_annotation": {
      "type": "object",
      "$ref": "#/definitions/automated_annotation"
    }
  },
  "required": [
    "name"
  ]
},
"automated_annotation": {
  "type": "object",
  "description": "A set of fields for recording the details of the automated annotation algorithm used.\n(Common 'automated annotation methods' would include PopV, Azimuth, CellTypist, scArches, etc.)",
  "properties": {
    "algorithm_name": {
      "type": "string",
      "description": "The name of the algorithm used. It MUST be a string of the algorithm's name."
    },
    "algorithm_version": {
      "type": "string",
      "description": "The version of the algorithm used (if applicable). It MUST be a string of the algorithm's version, which is typically in the format '[MAJOR].[MINOR]', but other versioning systems are permitted (based on the algorithm's versioning)."
    },
    "algorithm_repo_url": {
      "type": "string",
      "description": "This field denotes the URL of the version control repository associated with the algorithm used (if applicable). It MUST be a string of a valid URL."
    },
    "reference_location": {
      "type": "string",
      "description": "This field denotes a valid URL of the annotated dataset that was the source of annotated reference data. \nThis MUST be a string of a valid URL. The concept of a 'reference' specifically refers to 'annotation transfer' algorithms, whereby a 'reference' dataset is used to transfer cell annotations to the 'query' dataset.",
      "$comment": "This must be optional as it does not apply in all cases - e.g. in the case ML based annotation with no single reference dataset."
    }
  },
  "required": [
    "algorithm_name",
    "algorithm_version",
    "algorithm_repo_url"
  ]
},
"Annotation": {
  "type": "object",
  "description": "A collection of fields recording a cell type/class/state annotation on some set of cells, supporting evidence and provenance. As this is intended as a general schema, compulsory fields are kept to a minimum. However, tools using this schema are encouraged to specify a larger set of compulsory fields for publication. \n\nNote: This schema deliberately allows for additional fields in order to support ad hoc user fields, new formal schema extensions and project/tool specific metadata.",
  "required": [
    "labelset",
    "cell_label"
  ],
  "properties": {
    "labelset": {
      "description": "The unique name of the set of cell annotations. \nEach cell within the AnnData/Seurat file MUST be associated with a 'cell_label' value in order for this to be a valid 'cellannotation_setname'.",
      "type": "string"
    },
    "cell_label": {
      "description": "This denotes any free-text term which the author uses to annotate cells, i.e. the preferred cell label name used by the author. Abbreviations are acceptable in this field; refer to 'cell_fullname' for related details. \nCertain key words have been reserved:\n- `'doublets'` is reserved for encoding cells defined as doublets based on some computational analysis\n- `'junk'` is reserved for encoding cells that failed sequencing for some reason, e.g. few genes detected, high fraction of mitochondrial reads\n- `'unknown'` is explicitly reserved for unknown or 'author does not know'\n- `'NA'` is incomplete, i.e. no cell annotation was provided",
      "type": "string"
    },
    "cell_fullname": {
      "description": "This MUST be the full-length name for the biological entity listed in `cell_label` by the author. (If the value in `cell_label` is the full-length term, this field will contain the same value.) \nNOTE: any reserved word used in the field 'cell_label' MUST match the value of this field. \n\nEXAMPLE 1: Given the matching terms 'LC' and 'luminal cell' used to annotate the same cell(s), then users could use either terms as values in the field 'cell_label'. However, the abbreviation 'LC' CANNOT be provided in the field 'cell_fullname'. \n\nEXAMPLE 2: Either the abbreviation 'AC' or the full-length term intended by the author 'GABAergic amacrine cell' MAY be placed in the field 'cell_label', but as full-length term naming this biological entity, 'GABAergic amacrine cell' MUST be placed in the field 'cell_fullname'.",
      "type": "string"
    },
    "cell_ontology_term_id": {
      "description": "This MUST be a term from either the Cell Ontology (https://www.ebi.ac.uk/ols/ontologies/cl) or from some ontology that extends it by classifying cell types under terms from the Cell Ontology\ne.g. the Provisional Cell Ontology (https://www.ebi.ac.uk/ols/ontologies/pcl) or the Drosophila Anatomy Ontology (DAO) (https://www.ebi.ac.uk/ols4/ontologies/fbbt).\n\nNOTE: The closest available ontology term matching the value within the field 'cell_label' (at the time of publication) MUST be used.\nFor example, if the value of 'cell_label' is 'relay interneuron', but this entity does not yet exist in the ontology, users must choose the closest available term in the CL ontology. In this case, it's the broader term 'interneuron' i.e.  https://www.ebi.ac.uk/ols/ontologies/cl/terms?obo_id=CL:0000099.",
      "type": "string"
    },
    "cell_ontology_term": {
      "description": "This MUST be the human-readable name assigned to the value of 'cell_ontology_term_id'",
      "type": "string"
    },
    "cell_ids": {
      "type": "array",
      "description": "List of cell barcode sequences/UUIDs used to uniquely identify the cells within the AnnData/Seurat matrix. Any and all cell barcode sequences/UUIDs MUST be included in the AnnData/Seurat matrix.",
      "items": {
        "type": "string",
        "description": "Cell barcode sequences/UUIDs used to uniquely identify the cells within the AnnData/Seurat matrix. Any and all cell barcode sequences/UUIDs MUST be included in the AnnData/Seurat matrix."
      }
    },
    "rationale": {
      "description": "The free-text rationale which users provide as justification/evidence for their cell annotations. \nResearchers are encouraged to use this field to cite relevant publications in-line using standard academic citations of the form `(Zheng et al., 2020)` This human-readable free-text MUST be encoded as a single string.\nAll references cited SHOULD be listed using DOIs under rationale_dois. There MUST be a 2000-character limit.",
      "type": "string",
      "maxLength": 2000
    },
    "rationale_dois": {
      "description": "A list of valid publication DOIs cited by the author to support or provide justification/evidence/context for 'cell_label'.",
      "type": "array",
      "items": {
        "type": "string"
      }
    },
    "marker_gene_evidence": {
      "description": "List of gene names explicitly used as evidence for this cell annotation. Each gene MUST be included in the matrix of the AnnData/Seurat file.",
      "type": "array",
      "items": {
        "type": "string",
        "description": "Gene names explicitly used as evidence, which MUST be in the matrix of the AnnData/Seurat file"
      }
    },
    "synonyms": {
      "description": "This field denotes any free-text term of a biological entity which the author associates as synonymous with the biological entity listed in the field 'cell_label'.\nIn the case whereby no synonyms exist, the authors MAY leave this as blank, which is encoded as 'NA'. However, this field is NOT OPTIONAL.",
      "type": "array",
      "items": {
        "type": "string",
        "description": "List of synonyms"
      }
    }
  }
}
},
"required": [
"author_name",
"annotations",
"cellannotation_schema_version",
"cellannotation_timestamp",
"cellannotation_version",
"cellannotation_url"
],
"properties": {
"cellannotation_schema_version": {
  "description": "The schema version, the cell annotation open standard. Current version MUST follow 0.1.0\nThis versioning MUST follow the format `'[MAJOR].[MINOR].[PATCH]'` as defined by Semantic Versioning 2.0.0, https://semver.org/",
  "type": "string"
},
"cellannotation_timestamp": {
  "description": "The timestamp of all cell annotations published (per dataset). This MUST be a string in the format `'%yyyy-%mm-%dd %hh:%mm:%ss'`",
  "type": "string",
  "format": "date-time"
},
"cellannotation_version": {
  "description": "The version for all cell annotations published (per dataset). This MUST be a string. The recommended versioning format is `'[MAJOR].[MINOR].[PATCH]'` as defined by Semantic Versioning 2.0.0, https://semver.org/",
  "type": "string"
},
"cellannotation_url": {
  "description": "A persistent URL of all cell annotations published (per dataset). ",
  "type": "string"
},
"author_name": {
  "description": "This MUST be a string in the format `[FIRST NAME] [LAST NAME]`",
  "type": "string"
},
"author_contact": {
  "description": "This MUST be a valid email address of the author",
  "type": "string",
  "format": "email"
},
"orcid": {
  "description": "This MUST be a valid ORCID for the author",
  "type": "string"
},
"labelsets": {
  "type": "array",
  "items": {
    "$ref": "#/definitions/Labelset"
  }
},
"annotations": {
  "type": "array",
  "items": {
    "$ref": "#/definitions/Annotation"
  }
},
"cap_publication_title": {
  "type": "string",
  "description": "The title of the publication on CAP (i.e. a published collection of datasets, the \"CAP Workspace\".). The title of the publication on CAP. (NOTE: the term \"publication\" refers to the workspace published on CAP with a version and timestamp.) This MUST be less than or equal to N characters, and this MUST be encoded as a single string."
},
"cap_publication_description": {
  "type": "string",
  "description": "The description of the publication on CAP. The description of the publication on CAP. (NOTE: the term \"publication\" refers to the workspace published on CAP with a version and timestamp.) This MUST be less than or equal to N characters, and this MUST be encoded as a single string."
},
"cap_publication_url": {
  "type": "string",
  "description": "A persistent URL of the publication on CAP. (NOTE: the term \"publication\" refers to the workspace published on CAP with a version and timestamp.)"
},
"cap_publication_timestamp": {
  "type": "string",
  "description": "The timestamp of the CAP publication. This MUST be a string in the format %yyyy-%MM-%dd'T'%hh:%mm:%ss. This value will be overwritten by the newest timestamp upon a new publication."
},
"cap_publication_version": {
  "type": "string",
  "description": "The (latest) version of the CAP publication. This value will be overwritten by the newest version upon a new publication (and automatically incremented). This versioning MUST follow the format 'v' + '[integer]', whereby newer versions must be naturally incremented."
},
"cap_author_name": {
  "type": "string",
  "description": "The author name from the CAP username who published this cell annotation set. This MUST be a string in the format [FIRST NAME] [LAST NAME]."
},
"cap_author_contact": {
  "type": "string",
  "description": "The contact email address from the CAP username who published this cell annotation set. This MUST be a valid email address of the author."
},
"cap_author_orcid": {
  "type": "string",
  "description": "The ORCID ID associated with the CAP username/author who published the cell annotation set. This MUST be a valid ORCID for the author."
}
}
}
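The merge tooling that produced the file above is not shown in this thread. As a rough sketch of what such a merge could look like, assuming both files are plain JSON Schema dictionaries, the hypothetical helper `deep_merge` below combines nested objects recursively, unions lists such as `required`, and lets the extension override scalar values.

```python
import copy


def deep_merge(base, extra):
    """Merge `extra` into `base` without mutating either argument.

    - dicts are merged key by key, recursively
    - lists are unioned, keeping `base` order first
    - any other value is taken from `extra`
    """
    if isinstance(base, dict) and isinstance(extra, dict):
        out = copy.deepcopy(base)
        for key, value in extra.items():
            if key in base:
                out[key] = deep_merge(base[key], value)
            else:
                out[key] = copy.deepcopy(value)
        return out
    if isinstance(base, list) and isinstance(extra, list):
        return copy.deepcopy(base) + [copy.deepcopy(x) for x in extra if x not in base]
    return copy.deepcopy(extra)


# Toy inputs standing in for the general schema and CAP_extension.json.
general = {
    "required": ["author_name"],
    "properties": {"author_name": {"type": "string"}},
}
extension = {
    "properties": {"cap_publication_title": {"type": "string"}},
}
merged = deep_merge(general, extension)
print(sorted(merged["properties"]))  # ['author_name', 'cap_publication_title']
```

Because the extension here contributes no `required` entries, the merged schema keeps only the general schema's required fields, which matches the `required` list in the generated file above.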

Fields like author_name, author_contact and author_orcid seem redundant. Is this acceptable?

There are some fields in https://github.com/cellannotation/cap_file_planning/blob/main/cap_anndata_schema.md#cap-encoding-for-anndata-file that do not exist in the general schema. Am I supposed to add those fields to the extension as well?

@evanbiederstedt - rather than add these CAP specific fields, for most or all fields can we use the generic equivalents and move the CAP business logic (e.g. "NOTE: the term \"publication\" refers to the workspace published on CAP with a version and timestamp.") into the specification of how CAP should write to the generic fields? This could live in a separate field in the CAP extension.

I suspect one major issue is around making sure CAP has credit, maybe we could do this with a separate field for recording annotation tool - this could be used by our Taxonomy Development Tools and Cytosplore (both for BICAN).

Also - I think the fields @ubyndr has copied across don't yet reflect the CAP mechanism for dealing with multiple authors.

Ping @evanbiederstedt - can we reconcile? See comments above.

@evanbiederstedt comment on slack:

"Yes, let's remove the CAP prefixes."

So I think the next step is to remove these and merge fields whose names are otherwise identical. If the CAP definition doesn't say anything substantive over the CAS def other than mentioning CAP, the CAP def should be discarded. Anything else we can discuss.
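The step described above can be sketched as follows. This is only an illustration, assuming the merged schema's `properties` is a plain dictionary; the function name `strip_prefix` and the toy descriptions are hypothetical. A `cap_` field whose bare name already exists in the general schema is dropped in favour of the general definition and reported, so that any substantive difference in wording can still be discussed.

```python
def strip_prefix(properties: dict, prefix: str = "cap_"):
    """Rename `prefix`-ed fields to their bare names.

    Returns (merged_properties, collisions), where `collisions` lists bare
    names that already existed, in which case the existing definition wins.
    """
    # Start from the fields that carry no prefix.
    merged = {k: v for k, v in properties.items() if not k.startswith(prefix)}
    collisions = []
    for name, spec in properties.items():
        if not name.startswith(prefix):
            continue
        bare = name[len(prefix):]
        if bare in merged:
            collisions.append(bare)  # keep the general definition
        else:
            merged[bare] = spec
    return merged, collisions


# Toy subset of the merged schema shown earlier in this thread.
props = {
    "author_name": {"description": "general"},
    "author_contact": {"description": "general"},
    "orcid": {"description": "general"},
    "cap_author_name": {"description": "CAP"},
    "cap_author_contact": {"description": "CAP"},
    "cap_author_orcid": {"description": "CAP"},
    "cap_publication_title": {"description": "CAP"},
}
merged, collisions = strip_prefix(props)
print(collisions)  # ['author_name', 'author_contact']
```

Note that name matching alone does not merge `cap_author_orcid` with `orcid`: stripping the prefix yields `author_orcid`, so near-duplicates like this still need a manual decision.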

Given the schema release, @rm1113 would be in the best position to review this

@mfutey please also check how up to date the field descriptions are, if you could

Rolled schema files: schemas.zip

BICAN schema documentation

# General Cell Annotation Open Standard

*A general, open-standard schema for cell annotations which records connections, types, provenance and evidence. This is designed not to tie-in to a single project (i.e. no tool-specific fields in core schema), and allows for extensions to support ad hoc user fields, new formal schema extensions, and project/tool specific metadata.*

- [Properties](#properties)

## Properties

- **`matrix_file_id`** *(string)*: A resolvable ID for a cell by gene matrix file in the form namespace:accession, e.g. CellXGene_dataset:8e10f1c4-8e98-41e5-b65f-8cd89a887122. Please see https://github.com/cellannotation/cell-annotation-schema/registry/registry.json for supported namespaces.
- **`cellannotation_schema_version`** *(string)*: The schema version, the cell annotation open standard. Current version MUST follow 0.1.0. This versioning MUST follow the format `'[MAJOR].[MINOR].[PATCH]'` as defined by Semantic Versioning 2.0.0, https://semver.org/.
- **`cellannotation_timestamp`** *(string, format: date-time)*: The timestamp of all cell annotations published (per dataset). This MUST be a string in the format `'%yyyy-%mm-%dd %hh:%mm:%ss'`.
- **`cellannotation_version`** *(string)*: The version for all cell annotations published (per dataset). This MUST be a string. The recommended versioning format is `'[MAJOR].[MINOR].[PATCH]'` as defined by Semantic Versioning 2.0.0, https://semver.org/.
- **`cellannotation_url`** *(string)*: A persistent URL of all cell annotations published (per dataset).
- **`author_name`** *(string)*: This MUST be a string in the format `[FIRST NAME] [LAST NAME]`.
- **`author_contact`** *(string, format: email)*: This MUST be a valid email address of the author.
- **`orcid`** *(string)*: This MUST be a valid ORCID for the author.
- **`labelsets`** *(list)*
  - **`name`** *(string, required)*: name of annotation key.
  - **`description`** *(string)*: Some text describing what types of cell annotation this annotation key is used to record.
  - **`annotation_method`** *(string)*: The method used for creating the cell annotations. This MUST be one of the following strings: `'algorithmic'`, `'manual'`, or `'both'`. Must be one of: `["algorithmic", "manual", "both"]`.
  - **`automated_annotation`** *(object)*:
    - **`algorithm_name`** *(string, required)*: The name of the algorithm used. It MUST be a string of the algorithm's name.
    - **`algorithm_version`** *(string, required)*: The version of the algorithm used (if applicable). It MUST be a string of the algorithm's version, which is typically in the format '[MAJOR].[MINOR]', but other versioning systems are permitted (based on the algorithm's versioning).
    - **`algorithm_repo_url`** *(string, required)*: This field denotes the URL of the version control repository associated with the algorithm used (if applicable). It MUST be a string of a valid URL.
    - **`reference_location`** *(string)*: This field denotes a valid URL of the annotated dataset that was the source of annotated reference data. This MUST be a string of a valid URL. The concept of a 'reference' specifically refers to 'annotation transfer' algorithms, whereby a 'reference' dataset is used to transfer cell annotations to the 'query' dataset.
  - **`rank`** *(integer)*: A number indicating relative granularity with 0 being the most specific. Use this where a single dataset has multiple keys that are used consistently to record annotations at different levels of granularity.
- **`annotations`** *(list)*
  - **`labelset`** *(string, required)*: The unique name of the set of cell annotations. Each cell within the AnnData/Seurat file MUST be associated with a 'cell_label' value in order for this to be a valid 'cellannotation_setname'.
  - **`cell_label`** *(string, required)*: This denotes any free-text term which the author uses to annotate cells, i.e. the preferred cell label name used by the author. Abbreviations are acceptable in this field; refer to 'cell_fullname' for related details. Certain key words have been reserved:
    - `'doublets'` is reserved for encoding cells defined as doublets based on some computational analysis
    - `'junk'` is reserved for encoding cells that failed sequencing for some reason, e.g. few genes detected, high fraction of mitochondrial reads
    - `'unknown'` is explicitly reserved for unknown or 'author does not know'
    - `'NA'` is incomplete, i.e. no cell annotation was provided.
  - **`cell_fullname`** *(string)*: This MUST be the full-length name for the biological entity listed in `cell_label` by the author. (If the value in `cell_label` is the full-length term, this field will contain the same value.) NOTE: any reserved word used in the field 'cell_label' MUST match the value of this field.

    EXAMPLE 1: Given the matching terms 'LC' and 'luminal cell' used to annotate the same cell(s), then users could use either terms as values in the field 'cell_label'. However, the abbreviation 'LC' CANNOT be provided in the field 'cell_fullname'.

    EXAMPLE 2: Either the abbreviation 'AC' or the full-length term intended by the author 'GABAergic amacrine cell' MAY be placed in the field 'cell_label', but as full-length term naming this biological entity, 'GABAergic amacrine cell' MUST be placed in the field 'cell_fullname'.
  - **`cell_ontology_term_id`** *(string)*: This MUST be a term from either the Cell Ontology (https://www.ebi.ac.uk/ols/ontologies/cl) or from some ontology that extends it by classifying cell types under terms from the Cell Ontology, e.g. the Provisional Cell Ontology (https://www.ebi.ac.uk/ols/ontologies/pcl) or the Drosophila Anatomy Ontology (DAO) (https://www.ebi.ac.uk/ols4/ontologies/fbbt).

    NOTE: The closest available ontology term matching the value within the field 'cell_label' (at the time of publication) MUST be used. For example, if the value of 'cell_label' is 'relay interneuron', but this entity does not yet exist in the ontology, users must choose the closest available term in the CL ontology. In this case, it's the broader term 'interneuron' i.e. https://www.ebi.ac.uk/ols/ontologies/cl/terms?obo_id=CL:0000099.
  - **`cell_ontology_term`** *(string)*: This MUST be the human-readable name assigned to the value of 'cell_ontology_term_id'.
  - **`cell_ids`** *(list)*: List of cell barcode sequences/UUIDs used to uniquely identify the cells within the AnnData/Seurat matrix. Any and all cell barcode sequences/UUIDs MUST be included in the AnnData/Seurat matrix.
  - **`rationale`** *(string)*: The free-text rationale which users provide as justification/evidence for their cell annotations. Researchers are encouraged to use this field to cite relevant publications in-line using standard academic citations of the form `(Zheng et al., 2020)`. This human-readable free-text MUST be encoded as a single string. All references cited SHOULD be listed using DOIs under rationale_dois. There MUST be a 2000-character limit.
  - **`rationale_dois`** *(list)*: A list of valid publication DOIs cited by the author to support or provide justification/evidence/context for 'cell_label'.
  - **`marker_gene_evidence`** *(list)*: List of gene names explicitly used as evidence for this cell annotation. Each gene MUST be included in the matrix of the AnnData/Seurat file.
  - **`synonyms`** *(list)*: This field denotes any free-text term of a biological entity which the author associates as synonymous with the biological entity listed in the field 'cell_label'. In the case whereby no synonyms exist, the authors MAY leave this as blank, which is encoded as 'NA'. However, this field is NOT OPTIONAL.
  - **`cell_set_accession`** *(string)*: An identifier that can be used to consistently refer to the set of cells being annotated, even if the cell_label changes.
  - **`parent_cell_set_accessions`** *(list)*: A list of accessions of cell sets that subsume this cell set. This can be used to compose hierarchies of annotated cell sets, built from a fixed set of clusters.
  - **`transferred_annotations`** *(list)*
    - **`transferred_cell_label`** *(string)*: Transferred cell label.
    - **`source_taxonomy`** *(string)*: PURL of source taxonomy.
    - **`source_node_accession`** *(string)*: accession of node that label was transferred from.
    - **`algorithm_name`** *(string)*
    - **`comment`** *(string)*: Free text comment on annotation transfer.

CAP schema documentation

# General Cell Annotation Open Standard

*A general, open-standard schema for cell annotations which records connections, types, provenance and evidence. This is designed not to tie-in to a single project (i.e. no tool-specific fields in core schema), and allows for extensions to support ad hoc user fields, new formal schema extensions, and project/tool specific metadata.*

- [Properties](#properties)

## Properties

- **`matrix_file_id`** *(string)*: A resolvable ID for a cell by gene matrix file in the form namespace:accession, e.g. CellXGene_dataset:8e10f1c4-8e98-41e5-b65f-8cd89a887122. Please see https://github.com/cellannotation/cell-annotation-schema/registry/registry.json for supported namespaces.
- **`cellannotation_schema_version`** *(string)*: The schema version, the cell annotation open standard. Current version MUST follow 0.1.0. This versioning MUST follow the format `'[MAJOR].[MINOR].[PATCH]'` as defined by Semantic Versioning 2.0.0, https://semver.org/.
- **`cellannotation_timestamp`** *(string, format: date-time)*: The timestamp of all cell annotations published (per dataset). This MUST be a string in the format `'%yyyy-%mm-%dd %hh:%mm:%ss'`.
- **`cellannotation_version`** *(string)*: The version for all cell annotations published (per dataset). This MUST be a string. The recommended versioning format is `'[MAJOR].[MINOR].[PATCH]'` as defined by Semantic Versioning 2.0.0, https://semver.org/.
- **`cellannotation_url`** *(string)*: A persistent URL of all cell annotations published (per dataset).
- **`author_name`** *(string)*: This MUST be a string in the format `[FIRST NAME] [LAST NAME]`.
- **`author_contact`** *(string, format: email)*: This MUST be a valid email address of the author.
- **`orcid`** *(string)*: This MUST be a valid ORCID for the author.
- **`labelsets`** *(list)*
  - **`name`** *(string, required)*: name of annotation key.
  - **`description`** *(string)*: Some text describing what types of cell annotation this annotation key is used to record.
  - **`annotation_method`** *(string)*: The method used for creating the cell annotations. This MUST be one of the following strings: `'algorithmic'`, `'manual'`, or `'both'`. Must be one of: `["algorithmic", "manual", "both"]`.
  - **`automated_annotation`** *(object)*:
    - **`algorithm_name`** *(string, required)*: The name of the algorithm used. It MUST be a string of the algorithm's name.
    - **`algorithm_version`** *(string, required)*: The version of the algorithm used (if applicable). It MUST be a string of the algorithm's version, which is typically in the format '[MAJOR].[MINOR]', but other versioning systems are permitted (based on the algorithm's versioning).
    - **`algorithm_repo_url`** *(string, required)*: This field denotes the URL of the version control repository associated with the algorithm used (if applicable). It MUST be a string of a valid URL.
    - **`reference_location`** *(string)*: This field denotes a valid URL of the annotated dataset that was the source of annotated reference data. This MUST be a string of a valid URL. The concept of a 'reference' specifically refers to 'annotation transfer' algorithms, whereby a 'reference' dataset is used to transfer cell annotations to the 'query' dataset.
- **`annotations`** *(list)*
  - **`labelset`** *(string, required)*: The unique name of the set of cell annotations. Each cell within the AnnData/Seurat file MUST be associated with a 'cell_label' value in order for this to be a valid 'cellannotation_setname'.
  - **`cell_label`** *(string, required)*: This denotes any free-text term which the author uses to annotate cells, i.e. the preferred cell label name used by the author. Abbreviations are acceptable in this field; refer to 'cell_fullname' for related details. Certain key words have been reserved:
    - `'doublets'` is reserved for encoding cells defined as doublets based on some computational analysis
    - `'junk'` is reserved for encoding cells that failed sequencing for some reason, e.g. few genes detected, high fraction of mitochondrial reads
    - `'unknown'` is explicitly reserved for unknown or 'author does not know'
    - `'NA'` is incomplete, i.e. no cell annotation was provided.
  - **`cell_fullname`** *(string)*: This MUST be the full-length name for the biological entity listed in `cell_label` by the author. (If the value in `cell_label` is the full-length term, this field will contain the same value.) NOTE: any reserved word used in the field 'cell_label' MUST match the value of this field.

    EXAMPLE 1: Given the matching terms 'LC' and 'luminal cell' used to annotate the same cell(s), then users could use either terms as values in the field 'cell_label'. However, the abbreviation 'LC' CANNOT be provided in the field 'cell_fullname'.

    EXAMPLE 2: Either the abbreviation 'AC' or the full-length term intended by the author 'GABAergic amacrine cell' MAY be placed in the field 'cell_label', but as full-length term naming this biological entity, 'GABAergic amacrine cell' MUST be placed in the field 'cell_fullname'.
  - **`cell_ontology_term_id`** *(string)*: This MUST be a term from either the Cell Ontology (https://www.ebi.ac.uk/ols/ontologies/cl) or from some ontology that extends it by classifying cell types under terms from the Cell Ontology, e.g. the Provisional Cell Ontology (https://www.ebi.ac.uk/ols/ontologies/pcl) or the Drosophila Anatomy Ontology (DAO) (https://www.ebi.ac.uk/ols4/ontologies/fbbt).

    NOTE: The closest available ontology term matching the value within the field 'cell_label' (at the time of publication) MUST be used. For example, if the value of 'cell_label' is 'relay interneuron', but this entity does not yet exist in the ontology, users must choose the closest available term in the CL ontology. In this case, it's the broader term 'interneuron' i.e. https://www.ebi.ac.uk/ols/ontologies/cl/terms?obo_id=CL:0000099.
  - **`cell_ontology_term`** *(string)*: This MUST be the human-readable name assigned to the value of 'cell_ontology_term_id'.
  - **`cell_ids`** *(list)*: List of cell barcode sequences/UUIDs used to uniquely identify the cells within the AnnData/Seurat matrix. Any and all cell barcode sequences/UUIDs MUST be included in the AnnData/Seurat matrix.
  - **`rationale`** *(string)*: The free-text rationale which users provide as justification/evidence for their cell annotations. Researchers are encouraged to use this field to cite relevant publications in-line using standard academic citations of the form `(Zheng et al., 2020)`. This human-readable free-text MUST be encoded as a single string. All references cited SHOULD be listed using DOIs under rationale_dois. There MUST be a 2000-character limit.
  - **`rationale_dois`** *(list)*: A list of valid publication DOIs cited by the author to support or provide justification/evidence/context for 'cell_label'.
  - **`marker_gene_evidence`** *(list)*: List of gene names explicitly used as evidence for this cell annotation. Each gene MUST be included in the matrix of the AnnData/Seurat file.
  - **`synonyms`** *(list)*: This field denotes any free-text term of a biological entity which the author associates as synonymous with the biological entity listed in the field 'cell_label'. In the case whereby no synonyms exist, the authors MAY leave this as blank, which is encoded as 'NA'. However, this field is NOT OPTIONAL.
- **`publication_title`** *(string)*: The title of the publication on CAP (i.e. a published collection of datasets, the "CAP Workspace".). The title of the publication on CAP. (NOTE: the term "publication" refers to the workspace published on CAP with a version and timestamp.) This MUST be less than or equal to N characters, and this MUST be encoded as a single string.
- **`publication_description`** *(string)*: The description of the publication on CAP. The description of the publication on CAP. (NOTE: the term "publication" refers to the workspace published on CAP with a version and timestamp.) This MUST be less than or equal to N characters, and this MUST be encoded as a single string.
- **`publication_url`** *(string)*: A persistent URL of the publication on CAP. (NOTE: the term "publication" refers to the workspace published on CAP with a version and timestamp.)
- **`publication_timestamp`** *(string)*: The timestamp of the CAP publication. This MUST be a string in the format %yyyy-%MM-%dd'T'%hh:%mm:%ss. This value will be overwritten by the newest timestamp upon a new publication.
- **`publication_version`** *(string)*: The (latest) version of the CAP publication. This value will be overwritten by the newest version upon a new publication (and automatically incremented). This versioning MUST follow the format 'v' + '[integer]', whereby newer versions must be naturally incremented.
- **`author_orcid`** *(string)*: The ORCID ID associated with the CAP username/author who published the cell annotation set. This MUST be a valid ORCID for the author.

@evanbiederstedt @dosumis

We still have the cap_ prefixes in the CAP AnnData file schema.

    workspace_title = "cap_publication_title"
    workspace_description = "cap_publication_description"
    workspace_url = "cap_publication_url"
    authors_list = "cap_publication_authors_list"
    publication_timestamp = "cap_publication_timestamp"
    publication_version = "cap_publication_version"
    main_author = "cap_author_name"
    main_author_orcid = "cap_author_orcid"
    main_author_contact = "cap_author_contact"

I worry users will be confused by fields like publication_title: when I see this field, the first thing I think of is the paper title, not a project title on the data portal. Maybe let's discuss each field one by one?

We'll make modifications as appropriate

@ubyndr @evanbiederstedt I reviewed the descriptions and they look good to me but as noted above this needs to be updated to reflect changes to the schema:

Some fields are missing:
canonical_marker_genes, category_fullname, category_cell_ontology_exists, category_cell_ontology_term_id, category_cell_ontology_term, cell_ontology_assessment, authors_list
Many of the field names need to be updated; see #94.
There seem to be duplicates, e.g. "author_name" and "cap_author_name", "orcid" and "cap_author_orcid", etc.

These are in the CAP extension:

canonical_marker_genes, category_fullname, category_cell_ontology_exists, category_cell_ontology_term_id, category_cell_ontology_term

This is TBA:

authors_list - see #41

Implemented updates based on the latest feedback and comments.

Remove the author fields from this PR. They don't correspond to current names in https://github.com/cellannotation/cell-annotation-schema/blob/main/docs/cap_anndata_schema.md. We should fix in the general schema ASAP. After that we can merge.

cellannotation / cell-annotation-schema

Add CAP specific fields to extension #46