AnnData file integration user stories

dosumis commented 4 months ago

Stories:

Story 1. As a taxonomy editor, I want my taxonomy to be linked to the cells being annotated so that my annotations can be synchronised with a representation of the data about these cells, e.g. a cell-by-gene matrix in a file (e.g. AnnData) or a matrix store DB. I want this in order to be able to update the AnnData file for use in analysis, which will in turn inform edits to the taxonomy. Without a robust system of linking to cell IDs, we are relying on names and/or cluster IDs to link the two, and there is a serious danger that name or ID changes will break these links. TDT solutions:

CAS stores cell IDs for clusters. A link to an H5AD file supports initial population of these IDs to a taxonomy seeded from a spreadsheet.
TDT supports testing of a linked AnnData file to see if its annotations are in sync with a taxonomy and can be safely updated.
TDT supports updating of cell annotations in an AnnData file from a linked taxonomy.
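
A minimal sketch of what the sync test could look like, assuming the taxonomy side is available as simplified CAS-style records (each with a `cell_label` and a list of `cell_ids`) and the AnnData side as a mapping from cell ID to label (for example built from one column of `adata.obs`). The function name `check_sync` and the report keys are hypothetical and are not TDT's actual API.

```python
from typing import Dict, List, Set


def check_sync(annotations: List[dict], obs_labels: Dict[str, str]) -> Dict[str, Set[str]]:
    """Compare taxonomy cell IDs with per-cell labels taken from an AnnData file.

    annotations: simplified CAS-style records, each with "cell_label" and "cell_ids".
    obs_labels:  cell ID -> label, e.g. derived from a column of adata.obs.
    """
    report: Dict[str, Set[str]] = {
        "missing_from_file": set(),  # in the taxonomy, absent from the file
        "label_mismatch": set(),     # present in both, but labels differ
        "unannotated": set(),        # in the file, not covered by the taxonomy
    }
    seen: Set[str] = set()
    for record in annotations:
        for cell_id in record["cell_ids"]:
            seen.add(cell_id)
            if cell_id not in obs_labels:
                report["missing_from_file"].add(cell_id)
            elif obs_labels[cell_id] != record["cell_label"]:
                report["label_mismatch"].add(cell_id)
    report["unannotated"] = set(obs_labels) - seen
    return report


def is_in_sync(report: Dict[str, Set[str]]) -> bool:
    """The file is in sync only if every category of the report is empty."""
    return not any(report.values())


# Illustrative data only.
annotations = [
    {"cell_label": "L2/3 IT", "cell_ids": ["c1", "c2"]},
    {"cell_label": "Pvalb", "cell_ids": ["c3", "c9"]},
]
obs_labels = {"c1": "L2/3 IT", "c2": "L2/3 IT_old", "c3": "Pvalb", "c4": "Sst"}
report = check_sync(annotations, obs_labels)
```

Because the comparison is keyed on cell IDs rather than on cluster names, a renamed cluster shows up as a `label_mismatch` that can be fixed, instead of silently breaking the link.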

Story 2. As a taxonomy editor planning to publish an AnnData file to CZ CELLxGENE, I want to generate an AnnData file for submission to CELLxGENE that is synchronised with the latest release of my taxonomy (resolvable via a Persistent URL). This means that:

annotations are synchronised with my taxonomy
appropriate cell ontology annotations are present in the cell_type field mandated by CZ CELLxGENE
other details of my taxonomy are stored in the AnnData file header.
(Note - we also need mechanisms to synchronise other metadata with the fields mandated by CZ CELLxGENE, but this is probably outside the scope of TDT)
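
As a rough sketch of the export step, assuming simplified CAS-style records that carry `cell_ontology_term_id` alongside `cell_ids`: the function below derives a per-cell ontology term mapping (which could populate the cell type column in `adata.obs`) and a small header dictionary (which could be stored in `adata.uns`). The function name, the header keys and the PURL are hypothetical placeholders, not TDT or CELLxGENE names.

```python
from typing import Dict, List, Tuple


def build_submission_fields(
    annotations: List[dict], taxonomy_purl: str, taxonomy_version: str
) -> Tuple[Dict[str, str], Dict[str, str], List[str]]:
    """Derive per-cell ontology terms and header metadata from a taxonomy.

    Returns (cell ID -> ontology term ID, header dict, labels lacking a term).
    """
    cell_type_ids: Dict[str, str] = {}
    missing_terms: List[str] = []
    for record in annotations:
        term_id = record.get("cell_ontology_term_id")
        if not term_id:
            # Flag cell sets that cannot yet satisfy the cell type requirement.
            missing_terms.append(record["cell_label"])
            continue
        for cell_id in record["cell_ids"]:
            cell_type_ids[cell_id] = term_id
    # Hypothetical header keys; with anndata these could be stored in adata.uns.
    header = {"taxonomy_purl": taxonomy_purl, "taxonomy_version": taxonomy_version}
    return cell_type_ids, header, missing_terms


# Illustrative data only; the PURL is a placeholder.
annotations = [
    {"cell_label": "Neuron", "cell_ontology_term_id": "CL:0000540", "cell_ids": ["c1", "c2"]},
    {"cell_label": "Astro", "cell_ontology_term_id": "CL:0000127", "cell_ids": ["c3"]},
    {"cell_label": "Unknown_7", "cell_ids": ["c4"]},
]
cell_type_ids, header, missing_terms = build_submission_fields(
    annotations, "https://purl.example.org/taxonomy/demo", "1.0.0"
)
```

Returning the labels that lack an ontology term makes it possible to block an export until every cell set has an appropriate annotation.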

Story 3. As a taxonomy editor, I wish to edit or validate a list of marker genes in TDT, ensuring that they are in sync with the genes in a reference cell-by-gene matrix. TDT support requires:

[ ] Populate gene reference table from Linked CxG file.
[ ] Addition of support for validating gene lists and generating reports - for any column with data-type gene list
[ ] Autosuggest for gene list fields.
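
A small sketch of the validation and autosuggest tasks, assuming the gene reference table is available as a plain list of gene symbols (for example taken from `adata.var_names` of the linked file). Both function names are hypothetical, and matching is deliberately simple: exact match for validation, case-insensitive prefix match for suggestions.

```python
from typing import Dict, Iterable, List


def validate_gene_list(genes: Iterable[str], reference_genes: Iterable[str]) -> Dict[str, List[str]]:
    """Split a marker gene list into genes found / not found in the reference."""
    reference = set(reference_genes)
    report: Dict[str, List[str]] = {"valid": [], "unknown": []}
    for gene in genes:
        report["valid" if gene in reference else "unknown"].append(gene)
    return report


def suggest_genes(prefix: str, reference_genes: Iterable[str], limit: int = 5) -> List[str]:
    """Return up to `limit` reference genes starting with `prefix` (case-insensitive)."""
    wanted = prefix.lower()
    matches = sorted(g for g in reference_genes if g.lower().startswith(wanted))
    return matches[:limit]


# Illustrative data only.
reference_genes = ["GAD1", "GAD2", "SLC17A7", "PVALB", "SST"]
gene_report = validate_gene_list(["GAD1", "Gad2", "PVALB"], reference_genes)
suggestions = suggest_genes("ga", reference_genes)
```

Here `Gad2` is reported as unknown because validation is case-sensitive, while the autosuggest would still offer `GAD2` as a correction.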


dosumis commented 3 months ago

@AvolaAmg @hkir-dev - please review

AvolaAmg commented 3 months ago

I re-worded story 1 and 2 a little bit. Feel free to incorporate it as much as you think is needed.

User story 1 - Keeping the taxonomy up to date as an editor and maintaining information related to cell IDs

As an editor of a particular taxonomy, I want the information in the taxonomy to be synchronised with the cells I am annotating. In doing so, the extended information on the cells will also be synchronised with the taxonomy. An example of extended information for a particular cell is the gene matrix related to that cell, which can be stored as an AnnData file (extension .h5ad) or as a DB matrix.

What is a DB matrix? This is not part of the user story, but I would like to include a short definition of it to be clearer.

The purpose of this integration is to facilitate updating the AnnData file, which is essential for conducting analyses that can lead to further refinements in the taxonomy. Currently, cell IDs are linked to the taxonomy through cluster IDs or cluster names. This is a risky approach, as a change to a cluster name or cluster ID could break the link between cell IDs and clusters. To avoid breaking these links, the Taxonomy Development Tools (TDT) offer several solutions.

As a taxonomy editor, I can avoid losing the link between cell IDs and cluster IDs by using TDT, which stores the information related to each cell ID (including cluster IDs) in the Cell Annotation Schema (CAS) format. In addition, as an editor I want an up-to-date taxonomy that takes into account the information present in the linked AnnData file for each cell ID, and I want the taxonomy and the AnnData file to stay synchronised when changes are applied. To maintain the link between cell IDs and the cell information present in AnnData files, TDT offers the following built-in features:

TDT populates a new taxonomy by extracting information from an .h5ad file and stores the cell ID information under the Cell Annotation Schema;
TDT supports testing of a linked AnnData file to see if its annotations are in sync with a taxonomy and can be safely updated;
TDT supports updating the cell annotations in an AnnData file from a linked taxonomy.
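
The update step can be sketched as follows, assuming simplified CAS-style records (each with a `cell_label` and a list of `cell_ids`) and a cell ID to label mapping standing in for a column of `adata.obs`. The function name `update_labels` is hypothetical and this is not TDT's actual implementation.

```python
from typing import Dict, List


def update_labels(annotations: List[dict], obs_labels: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of obs_labels with labels refreshed from the taxonomy.

    Cells are matched by ID, so renaming a cluster in the taxonomy does not
    break the link. Cells absent from the file are ignored, and cells not
    covered by the taxonomy keep their existing label.
    """
    updated = dict(obs_labels)
    for record in annotations:
        for cell_id in record["cell_ids"]:
            if cell_id in updated:
                updated[cell_id] = record["cell_label"]
    return updated


# Illustrative data only: the cluster "Pvalb_1" was renamed to "Pvalb" in the taxonomy.
annotations = [{"cell_label": "Pvalb", "cell_ids": ["c1", "c2", "c9"]}]
obs_labels = {"c1": "Pvalb_1", "c2": "Pvalb_1", "c3": "Sst"}
updated = update_labels(annotations, obs_labels)
```

Returning a new mapping rather than editing in place leaves the original labels available for a before/after comparison ahead of writing the file.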

These TDT features are designed to streamline the process of maintaining an accurate and up-to-date taxonomy, safeguarding against the potential pitfalls associated with data linkage and synchronisation.

Story 2. User story of an editor - data synchronisation for CELLxGENE submission

As a taxonomy editor, I plan to publish the taxonomy data of a specific dataset to the [CZ CELLxGENE](https://cellxgene.cziscience.com/docs/032__Contribute%20and%20Publish%20Data) corpus. I want to generate an AnnData file for submission to CELLxGENE that is synchronised with the latest release of my taxonomy (resolvable via a Persistent URL). To submit the data of the taxonomy I have curated, I need an AnnData file containing the information required by CELLxGENE. Some of the required information is described in the [CZ CELLxGENE](https://cellxgene.cziscience.com/docs/032__Contribute%20and%20Publish%20Data) documentation and includes Cell Ontology information in the metadata, which is held in the cell_type field in the standard_category in CELLxGENE. Hence, as a taxonomy editor, in order to publish the taxonomy data to CZ CELLxGENE I need to make sure that:

Annotations are synchronised with my taxonomy
Appropriate cell ontology annotations are present in the cell_type field mandated by CZ CELLxGENE. The cell_type field includes standard cell type terms without abbreviations.
Other details of my taxonomy are stored in the AnnData file header.

TDT allows me to fulfil these requirements once I export my taxonomy as an AnnData file.

Potential Story 4 : User story of a taxonomy editor - data curation

As a taxonomy editor, I want to curate a taxonomy and include relevant information about cell types so that it can be used once the taxonomy is published. I want to add cell types that are not yet present in my taxonomy, or expand the information on cell types that are. I want to include ontology information on these cell types in order to have an exhaustive taxonomy, to export the taxonomy as an AnnData file for submission to the CELLxGENE corpus, and to keep a single version of my taxonomy across users so that curation remains a collaborative effort.

By using the Taxonomy Development Tools (TDT), I can include Cell Ontology terms when defining specific cell IDs. TDT supports the [Ontology Lookup Service]() so I can specify the Cell Ontology term and its corresponding ID using TDT. I can also describe the cell hierarchy by specifying the parent term of a specific cell type and including its Cell Ontology term. As an editor, I want to include as much information as possible about a specific cell type. I can do this by filling in the fields supported in TDT, which follow the Cell Annotation Schema (CAS) rules. By entering information in the mandatory fields present in TDT, I can build an exhaustive taxonomy with standard elements that can be compared to other taxonomies and that are useful when I plan to transfer information from one taxonomy to another. In addition to the standard CAS fields, I can also add customisable fields in TDT that are specific to the taxonomy I am working on and could be relevant for the description of specific cell types.

I think we need to come up with some examples of customisable fields that are outside CAS. I need to look into it.

brain-bican / taxonomy-development-tools