Open dosumis opened 6 months ago
Decision:
We have 2 accessions:
How should namespacing work?
Using namespaces: Accession = {labelset}_{cell_set_hash}
e.g.:
Cluster:34ghbz45 Supercluster:34ghbz45
But we need to enforce within BICAN: no special characters (except '_-.') or spaces in labelset names. Also enforce character limit.
Note CellXGene has the same limits on special characters but no limit on length.
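A minimal sketch of how the labelset naming rule could be checked. It assumes the permitted characters are ASCII letters and digits plus `_`, `-` and `.`; `MAX_LABELSET_LEN` and `is_valid_labelset_name` are hypothetical names, and the limit of 32 is a placeholder since no length has been agreed here.

```python
import re

# Hypothetical limit: a character limit is called for but no value has been agreed.
MAX_LABELSET_LEN = 32

# Assumption: ASCII letters and digits, plus the three allowed characters '_', '-', '.'.
_LABELSET_NAME = re.compile(r"[A-Za-z0-9_.-]+")


def is_valid_labelset_name(name: str, max_len: int = MAX_LABELSET_LEN) -> bool:
    """Return True if name uses only permitted characters and fits the length limit."""
    return len(name) <= max_len and _LABELSET_NAME.fullmatch(name) is not None
```

Rejecting `:` in labelset names also keeps it free for use as a separator in accessions.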
perhaps the hash could simply be a dirty bit detector rather than which specific bits are dirty. but only if cells are uniquely identified and persisted. alternatively simply going with a unique id and leaving any matching/diffing to a different system may make sense.
perhaps the hash could simply be a dirty bit detector rather than which specific bits are dirty.
That's the idea.
General agreement at the meeting was to go with
{labelset}:{cell_set_hash} # might be safer to use `` given the need for further namespacing. e.g. Cluster:34ghbz45, Supercluster:34ghbz45
This => a unique identifier for annotated object = a set of cells + labelset.
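To make the consequences for tool builders concrete, here is a rough sketch of composing and parsing accessions of the `labelset:hash` form. `Accession` and `parse_accession` are hypothetical names for illustration, not part of any existing tool, and the sketch assumes labelset names never contain `:`.

```python
from typing import NamedTuple


class Accession(NamedTuple):
    """An annotated object: a labelset plus the hash of a set of cells."""

    labelset: str
    cell_set_hash: str

    def __str__(self) -> str:
        return f"{self.labelset}:{self.cell_set_hash}"


def parse_accession(text: str) -> Accession:
    """Split 'labelset:hash' on the first ':' and reject incomplete input."""
    labelset, sep, cell_set_hash = text.partition(":")
    if not sep or not labelset or not cell_set_hash:
        raise ValueError(f"not a labelset:hash accession: {text!r}")
    return Accession(labelset, cell_set_hash)
```

With this shape, `Cluster:34ghbz45` and `Supercluster:34ghbz45` parse to different accessions that share one `cell_set_hash`, i.e. the same cells annotated in two labelsets.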
However, I would like to document use cases & consequences for implementation cleanly first.
Implementing above the hierarchy level => accession changes every time the hierarchy changes. This is slightly tricky for tool builders.
Use cases:
A user has Cluster:34ghbz45 and the name of a taxonomy, perhaps as part of the output of MapMyCells. There is no guarantee that this accession will be present in the latest version of the taxonomy. The user would need the version too, and easy access to that version. However, the PURL system can support this.
Cluster:34ghbz45 can be found without having to know versions. This would always retrieve the same object (set of cells) but might retrieve multiple versions of annotations.
Do cells have specific identifiers, and is there some relation that identifies which cells belong to which cell set (or whichever grouping is a collection of cells)?
Yes to both.
Cell sets can be uniquely identified by generating a hash from a sorted list of the cell_ids that define the set, given a hashing algo strong enough to prevent clashes.
Having this unique identifier for cell sets is useful for linking and resolution*. It is also useful for change management: if annotations are associated with a cell set whose composition has changed, a hash mismatch can act as a quick warning.
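A minimal sketch of both ideas: hashing a sorted list of cell IDs, then using the hash as a quick change warning. It uses Python's standard `hashlib.blake2b` as a stand-in and is not necessarily how cas_tools does it; the function names, the newline separator and the 5-byte digest size are all assumptions for illustration.

```python
import hashlib
from typing import Iterable


def cell_set_hash(cell_ids: Iterable[str], digest_size: int = 5) -> str:
    """Hash a set of cell IDs, independent of input order and duplicates."""
    # Sort the de-duplicated IDs so the same set always yields the same bytes.
    # Assumption: cell IDs never contain a newline, so it is a safe separator.
    canonical = "\n".join(sorted(set(cell_ids))).encode("utf-8")
    return hashlib.blake2b(canonical, digest_size=digest_size).hexdigest()


def cell_set_changed(stored_hash: str, current_cell_ids: Iterable[str]) -> bool:
    """Quick warning: True if the current cells no longer match the stored hash."""
    return cell_set_hash(current_cell_ids) != stored_hash
```

A mismatch only says that the composition changed, not which cells were added or removed; working that out would be left to a separate diffing step. A shorter digest gives a more compact string at the cost of a higher chance of clashes.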
TBD: Use as primary ID or keep as separate cell_set_ID ?
It is tempting to use such hashes as accessions. However, annotations are attached to a combination of cell set + labelset. It is possible for a single dataset to have the same cell set represented in different labelsets. To achieve uniqueness in this context we would need either to (a) add labelset into the hash input - in which case we lose the ability to generate the hash from sets of cell_ids in any context, and changing the labelset name would change the output; or (b) extend the output with something associated with the labelset - e.g. rank. It would then be possible to derive the cell_set hash by string parsing, but this feels hacky - and relies on all labelsets having a rank.
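The trade-off between (a) and (b) can be sketched as follows. All function names, the `_{rank}` suffix format and the use of `hashlib.blake2b` with a 5-byte digest are assumptions made for illustration, not an agreed design.

```python
import hashlib


def _digest(text: str) -> str:
    # Hex output never contains '_', which makes the suffix in option (b) parseable.
    return hashlib.blake2b(text.encode("utf-8"), digest_size=5).hexdigest()


def accession_option_a(labelset: str, cell_ids: list[str]) -> str:
    """Option (a): the labelset name is part of the hash input."""
    return _digest("\n".join([labelset, *sorted(cell_ids)]))


def accession_option_b(rank: int, cell_ids: list[str]) -> str:
    """Option (b): plain cell-set hash extended with the labelset rank."""
    return f"{_digest(chr(10).join(sorted(cell_ids)))}_{rank}"


def cell_set_hash_from_option_b(accession: str) -> str:
    """Recover the cell-set hash from an option (b) accession by string parsing."""
    return accession.rsplit("_", 1)[0]
```

Under (a), renaming a labelset changes every accession in it, and the cell-set hash cannot be recovered from the accession. Under (b), two labelsets over the same cells share a recoverable hash, but every labelset must have a rank.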
*cas_tools already supports hash generation from IDs, using the blake2b algo to produce a compact string while reducing the danger of clashes to negligible levels. See hash_demo.py