Add hash accession support

dosumis commented 6 months ago

Cell sets can be uniquely identified by generating a hash from a sorted list of the cell_ids that define the set, given a hashing algorithm strong enough to prevent clashes.

Having this unique identifier for cell sets is useful for linking and resolution*. It is also useful for change management: if annotations are associated with a cell set whose composition has changed, a changed hash can act as a quick warning.
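A minimal sketch of the idea, assuming cell IDs are plain strings without newlines. The function name `cell_set_hash`, the 8-byte digest size, and the choice of `hashlib.blake2b` are illustrative assumptions, not the cas_tools implementation:

```python
import hashlib


def cell_set_hash(cell_ids, digest_size=8):
    """Return a hex digest for a set of cell IDs, independent of input order."""
    # Deduplicate and sort so the same set always produces the same bytes.
    canonical = "\n".join(sorted(set(cell_ids)))
    return hashlib.blake2b(canonical.encode("utf-8"), digest_size=digest_size).hexdigest()


# Same cells in a different order give the same hash.
original = cell_set_hash(["cell_3", "cell_1", "cell_2"])
reordered = cell_set_hash(["cell_1", "cell_2", "cell_3"])

# Dropping a cell changes the hash, which is the "quick warning" case.
changed = cell_set_hash(["cell_1", "cell_2"])

print(original == reordered, original != changed)
```

The comparison only says that the composition differs, not which cells were added or removed.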

TBD: Use as primary ID or keep as separate cell_set_ID ?

It is tempting to use such hashes as accessions. However, annotations are attached to a combination of cell set + labelset. It is possible for a single dataset to have the same cell set represented in different labelsets. To achieve uniqueness in this context we would need either to (a) add the labelset into the hash input, in which case we lose the ability to generate the hash from sets of cell_ids in any context, and changing the labelset name would change the output; or (b) extend the output with something associated with the labelset, e.g. rank. It would then be possible to derive the cell_set hash by string parsing, but this feels hacky and relies on all labelsets having a rank.
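To make the trade-off between (a) and (b) concrete, here is a rough sketch. All function names, the `.` separator for the rank suffix, and the digest size are assumptions made up for illustration:

```python
import hashlib


def cell_set_hash(cell_ids, digest_size=8):
    """Hash of the cell set alone (order-independent)."""
    canonical = "\n".join(sorted(set(cell_ids)))
    return hashlib.blake2b(canonical.encode("utf-8"), digest_size=digest_size).hexdigest()


def hash_with_labelset(cell_ids, labelset, digest_size=8):
    """Option (a): the labelset name is part of the hash input."""
    canonical = labelset + "\n" + "\n".join(sorted(set(cell_ids)))
    return hashlib.blake2b(canonical.encode("utf-8"), digest_size=digest_size).hexdigest()


def hash_with_rank(cell_ids, rank):
    """Option (b): append the labelset rank to the plain cell set hash."""
    return f"{cell_set_hash(cell_ids)}.{rank}"


def strip_rank(extended_hash):
    """Recover the plain cell set hash from an option (b) string."""
    # A hex digest never contains ".", so splitting on the last "." is safe.
    return extended_hash.rsplit(".", 1)[0]


cells = ["cell_1", "cell_2"]

# (a) Renaming the labelset changes the output even though the cells are the same.
a_old = hash_with_labelset(cells, "Cluster")
a_new = hash_with_labelset(cells, "Cluster_v2")

# (b) The cell set hash stays recoverable, but only via string parsing.
b_rank0 = hash_with_rank(cells, 0)
b_rank1 = hash_with_rank(cells, 1)
```

With (a) there is no way to get back to the plain cell set hash; with (b) there is, at the cost of requiring every labelset to have a rank.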

*cas_tools already supports hash generation from IDs using the blake2b algorithm to produce a compact string while reducing the danger of clashes to negligible levels. See hash_demo.py

dosumis commented 6 months ago

Decision:

We have 2 accessions:

- cell_set hash accession
- cell_set/labelset hash accession (could be done with a namespace, or by making a hash from the combination of cell IDs and labelset). We encourage the world to use this one.

How should namespacing work?

Using namespaces: Accession = {labelset}:{cell_set_hash}

e.g.:

Cluster:34ghbz45 Supercluster:34ghbz45

But we need to enforce within BICAN: no special characters (except '_-.') or spaces in labelset names. Also enforce character limit.

Note CellXGene has the same limits on special characters but no limit on length.
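A possible sketch of building and validating a namespaced accession, using the colon separator from the examples above. The 50-character limit, the restriction to ASCII letters and digits, and the function names are placeholder assumptions, since no actual limit has been agreed here:

```python
import re

# Hypothetical limit; the real character limit is still to be decided.
MAX_LABELSET_LENGTH = 50

# ASCII letters and digits plus the allowed '_', '-', '.'; no spaces.
LABELSET_PATTERN = re.compile(r"[A-Za-z0-9_.-]+")


def make_accession(labelset, cell_set_hash):
    """Return '{labelset}:{cell_set_hash}' after validating the labelset name."""
    if len(labelset) > MAX_LABELSET_LENGTH:
        raise ValueError(f"labelset name longer than {MAX_LABELSET_LENGTH} characters")
    if not LABELSET_PATTERN.fullmatch(labelset):
        raise ValueError(f"labelset name has disallowed characters: {labelset!r}")
    return f"{labelset}:{cell_set_hash}"


def split_accession(accession):
    """Split an accession back into (labelset, cell_set_hash)."""
    # ':' cannot occur in a valid labelset name, so the first ':' is the separator.
    labelset, _, cell_set_hash = accession.partition(":")
    return labelset, cell_set_hash


print(make_accession("Cluster", "34ghbz45"))
print(split_accession("Supercluster:34ghbz45"))
```

Because the separator is not a legal labelset character, the accession can be parsed back without ambiguity.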

satra commented 6 months ago

perhaps the hash could simply be a dirty bit detector rather than which specific bits are dirty. but only if cells are uniquely identified and persisted. alternatively simply going with a unique id and leaving any matching/diffing to a different system may make sense.

dosumis commented 6 months ago

> perhaps the hash could simply be a dirty bit detector rather than which specific bits are dirty.

That's the idea.

General agreement at the meeting was to go with

{labelset}:{cell_sethash} # might be safer to use `` given need for further namespacing. e.g. Cluster:34ghbz45 Supercluster:34ghbz45

This => a unique identifier for annotated object = a set of cells + labelset.

However, I would like to document use cases & consequences for implementation cleanly first.

Implementing above the hierarchy level => accession changes every time the hierarchy changes. This is slightly tricky for tool builders.

Use cases:

encoding hierarchy independent of labels
- Rearranging a hierarchy means generating new accessions for the cell set whose position has changed in the hierarchy, and for all other cell sets above it in the branches it has moved between. We then use these new accessions in relationships that record the changed hierarchy. This is not impossible, but it is a significant problem for tool developers to solve. We would at least want to have a simple Python library that supported it, and probably one in R too.
identifying annotated objects & retrieving object/annotation in the context of multiple versioned taxonomy files.
- A user shares an accession Cluster:34ghbz45 and the name of a taxonomy - perhaps as part of the output of MapMyCells. There is no guarantee that this accession will be present in the latest version of the taxonomy. The user would need the version too and easy access to that version. However, the PURL system can support this.
identifying annotated objects & retrieving object/annotation in the context of a knowledge graph / DB
- A DB/KG could potentially track annotations through different versions so that any incidence of Cluster:34ghbz45 can be found - without having to know versions. This would always retrieve the same object (set of cells) but might retrieve multiple versions of annotations.
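For the first use case (rearranging a hierarchy), a rough sketch of how a helper library might find which cell sets need new accessions. It assumes a parent cell set is the union of its children's cells; the data layout and all names are invented for illustration:

```python
import hashlib


def cell_set_hash(cell_ids, digest_size=8):
    """Order-independent hash of a set of cell IDs."""
    canonical = "\n".join(sorted(set(cell_ids)))
    return hashlib.blake2b(canonical.encode("utf-8"), digest_size=digest_size).hexdigest()


def collect_cells(node, children, leaf_cells):
    """Cells of a node: its own cells plus those of all its descendants."""
    cells = set(leaf_cells.get(node, ()))
    for child in children.get(node, ()):
        cells |= collect_cells(child, children, leaf_cells)
    return cells


def all_hashes(children, leaf_cells):
    """Map every node in the hierarchy to the hash of its cell set."""
    nodes = set(children) | set(leaf_cells)
    for kids in children.values():
        nodes |= set(kids)
    return {n: cell_set_hash(collect_cells(n, children, leaf_cells)) for n in nodes}


# Toy taxonomy: three clusters under two superclusters.
leaf_cells = {"c1": ["a", "b"], "c2": ["c"], "c3": ["d"]}
before = {"root": ["s1", "s2"], "s1": ["c1", "c2"], "s2": ["c3"]}
# Cluster c2 moves from supercluster s1 to supercluster s2.
after = {"root": ["s1", "s2"], "s1": ["c1"], "s2": ["c3", "c2"]}

old, new = all_hashes(before, leaf_cells), all_hashes(after, leaf_cells)
changed = sorted(n for n in old if old[n] != new[n])
print(changed)
```

In this toy case only the two superclusters the cluster moved between get new hashes. The moved cluster keeps its cell set hash because its cells are unchanged, although its full accession would still change if the move also changed its labelset.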

satra commented 6 months ago

do cells have specific identifiers and is there some relation that identifies which cells belong to which cellset (or whichever grouping that is a collection of cells).

dosumis commented 6 months ago

> do cells have specific identifiers and is there some relation that identifies which cells belong to which cellset (or whichever grouping that is a collection of cells).

Yes to both.

cellannotation / cell-annotation-schema

Add hash accession support #71