cellannotation / cell-annotation-schema

General, open-standard schema for cell annotations

Add hash accession support #71

Open dosumis opened 6 months ago

dosumis commented 6 months ago

Cell sets can be uniquely identified by generating a hash from a sorted list of the cell_ids that define the set, given a sufficiently powerful hashing algo to prevent clashes.
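A minimal sketch of the idea, assuming Python's standard `hashlib.blake2b` with an 8-byte digest and newline-joined IDs (so IDs must not contain newlines). The function name and parameters are illustrative only and are not taken from cas_tools:

```python
import hashlib


def cell_set_hash(cell_ids, digest_size=8):
    """Return a compact hex hash for a set of cell IDs, independent of input order."""
    # Deduplicate and sort so the same set always produces the same input bytes.
    canonical = "\n".join(sorted(set(cell_ids)))
    return hashlib.blake2b(canonical.encode("utf-8"), digest_size=digest_size).hexdigest()


a = cell_set_hash(["cell_3", "cell_1", "cell_2"])
b = cell_set_hash(["cell_1", "cell_2", "cell_3"])
print(a == b)  # True: order of the input list does not matter
```

A shorter digest gives a more compact string at the cost of a higher clash probability, so the digest size is a trade-off to settle per deployment.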

Having this unique identifier for cell sets is useful for linking and resolution* - It is also useful for change management - if annotations are associated with a cell set whose composition has changed, a hash can act as a quick warning.
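The change-management use could look roughly like the following. This is a sketch under assumptions: the hash is BLAKE2b over the sorted IDs via the standard library, and `composition_changed` is a hypothetical helper, not an existing cas_tools function. It only signals that the set differs, not which cells were added or removed:

```python
import hashlib


def cell_set_hash(cell_ids):
    """Order-independent hash of a set of cell IDs (assumed scheme, for illustration)."""
    canonical = "\n".join(sorted(set(cell_ids)))
    return hashlib.blake2b(canonical.encode("utf-8"), digest_size=8).hexdigest()


def composition_changed(stored_hash, current_cell_ids):
    """Return True if the current cell set no longer matches the hash stored with the annotation."""
    return cell_set_hash(current_cell_ids) != stored_hash


# Hash recorded when the annotation was made.
stored = cell_set_hash(["c1", "c2", "c3"])

unchanged = composition_changed(stored, ["c3", "c2", "c1"])  # same cells, different order
changed = composition_changed(stored, ["c1", "c2"])          # one cell dropped
```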

TBD: Use as primary ID or keep as separate cell_set_ID ?

It is tempting to use such hashes as accessions. However, annotations are attached to a combination of cell set + labelset. It is possible for a single dataset to have the same cell set represented in different labelsets. To achieve uniqueness in this context we would need either to (a) add labelset into the hash input - in which case we lose the ability to generate from sets of cell_ids in any context, and changing the labelset name would change the output; or (b) extend the output with something associated with the labelset - e.g. rank. It would then be possible to derive the cell_set hash by string parsing, but this feels hacky - and relies on all labelsets having a rank.

*cas_tools already supports hash generation from IDs using the blake2b algo to produce a compact string while reducing the danger of clashes to negligible levels. See hash_demo.py

dosumis commented 6 months ago

Decision:

We have 2 accessions:

How should namespacing work?

Using namespaces: Accession = {labelset}:{cell_set_hash}

e.g.:

Cluster:34ghbz45 Supercluster:34ghbz45

But we need to enforce within BICAN: no special characters (except '_-.') or spaces in labelset names. Also enforce character limit.

Note CellXGene has the same limits on special characters but no limit on length.
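To make the naming constraint concrete, here is a sketch of building and splitting accessions in the colon-separated form used in the examples above. Everything named here is hypothetical: the thread does not give a value for the character limit, so `MAX_LABELSET_LEN = 64` is a placeholder, and restricting names to ASCII letters and digits plus `_`, `-`, `.` is one reading of "no special characters":

```python
import re

# Placeholder value: a limit is to be enforced, but no number has been agreed.
MAX_LABELSET_LEN = 64
# ASCII letters, digits and the permitted characters '_', '-', '.'; no spaces.
LABELSET_RE = re.compile(r"[A-Za-z0-9_.\-]+")


def make_accession(labelset, cell_set_hash):
    """Build '{labelset}:{cell_set_hash}' after validating the labelset name."""
    if len(labelset) > MAX_LABELSET_LEN or not LABELSET_RE.fullmatch(labelset):
        raise ValueError(f"invalid labelset name: {labelset!r}")
    return f"{labelset}:{cell_set_hash}"


def split_accession(accession):
    """Recover (labelset, cell_set_hash) from an accession."""
    # ':' cannot occur in a valid labelset name, so the first ':' is the separator.
    labelset, _, cell_set_hash = accession.partition(":")
    return labelset, cell_set_hash


print(make_accession("Cluster", "34ghbz45"))       # Cluster:34ghbz45
print(make_accession("Supercluster", "34ghbz45"))  # Supercluster:34ghbz45
```

Because the separator is excluded from labelset names, the cell set hash can be recovered unambiguously, and the same cell set in two labelsets yields two distinct accessions sharing one hash.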

satra commented 6 months ago

perhaps the hash could simply be a dirty bit detector rather than which specific bits are dirty. but only if cells are uniquely identified and persisted. alternatively simply going with a unique id and leaving any matching/diffing to a different system may make sense.

dosumis commented 6 months ago

> perhaps the hash could simply be a dirty bit detector rather than which specific bits are dirty.

That's the idea.

General agreement at the meeting was to go with

{labelset}:{cell_set_hash} # might be safer to use `` given need for further namespacing. e.g. Cluster:34ghbz45 Supercluster:34ghbz45

This => a unique identifier for annotated object = a set of cells + labelset.

However, I would like to document use cases & consequences for implementation cleanly first.

Implementing above the hierarchy level => accession changes every time the hierarchy changes. This is slightly tricky for tool builders.

Use cases:

satra commented 6 months ago

do cells have specific identifiers and is there some relation that identifies which cells belong to which cellset (or whichever grouping that is a collection of cells)?

dosumis commented 6 months ago

> do cells have specific identifiers and is there some relation that identifies which cells belong to which cellset (or whichever grouping that is a collection of cells)?

Yes to both.