chanzuckerberg / single-cell-curation

Code and documentation for the curation of cellxgene datasets
MIT License

Reserved Names and Uniqueness requirements must be clarified #641

Closed brianraymor closed 8 months ago

brianraymor commented 1 year ago

Design

Note: I prefer the deliberately hand-wavy term "metadata fields" to getting into the details of keys or column names, which depend on the AnnData section in question. And obviously, some AnnData sections do not allow duplicates. I could create a table per AnnData section if that would be preferred.

General Requirements

...

Reserved Names. The names of metadata fields MUST NOT start with "__". The names of the metadata fields specified by the schema are reserved for the purposes and specifications described in the schema.

Unique Names. The names of schema and data submitter metadata fields in obs and var MUST be unique. For example, duplicate "feature_biotype" keys in AnnData var are not allowed.
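The two requirements above could be checked along these lines. This is only a minimal sketch, not the validator's actual implementation: the function name `check_field_names`, the error messages, and the sample field `my_author_field` are hypothetical, and it assumes the field names are available as a plain list of strings.

```python
from collections import Counter

# Prefix that the Reserved Names requirement forbids for metadata field names.
RESERVED_PREFIX = "__"


def check_field_names(names):
    """Return a list of violation messages for a sequence of field names.

    Hypothetical helper: flags names starting with "__" and names that
    appear more than once. An empty list means no violations were found.
    """
    errors = []
    # Reserved Names: no field name may start with "__".
    for name in names:
        if name.startswith(RESERVED_PREFIX):
            errors.append(f"reserved prefix: {name!r}")
    # Unique Names: every field name may appear only once.
    for name, count in Counter(names).items():
        if count > 1:
            errors.append(f"duplicate name: {name!r} appears {count} times")
    return errors


# Illustrative column names, including a duplicate and a reserved prefix.
var_columns = ["my_author_field", "feature_biotype", "feature_biotype", "__private"]
for message in check_field_names(var_columns):
    print(message)
```

For an AnnData object, the input would presumably be something like `list(adata.var.columns)` or `list(adata.obs.columns)`, since both are pandas DataFrames.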

Note: I will also be remodeling the Annotator in all metadata fields from:

Key myKey
Annotator Curator
Value numpy.ndarray


to something like:

Key myKey
Annotator Curator MUST annotate.
Value numpy.ndarray


Context

See #single-cell-four.

Reserved Names in General Requirements has been previously reviewed, but needs further clarification for new readers.

Note: when I revisited the rationale for using "Key" as the standard name in the schema field tables, I was reminded that keys() is an operation common to both obs (a DataFrame) and uns (a dictionary):


adata.obs.keys()
adata.uns.keys()

danieljhegeman commented 9 months ago

Changed

Uniqueness must be enforced for both author and schema fields

to

Uniqueness must be enforced for both author-provided and schema-defined fields

for clarity