dandi / dandi-schema

Schemata for DANDI archive project
Apache License 2.0

Unclear scope of session_id, tissue_id, cell_id, etc #141

Open yarikoptic opened 4 years ago

yarikoptic commented 4 years ago

It came up in a discussion with @bendichter that nothing mandates whether these IDs (and actually subject_id as well) should be defined globally, per dataset, per subject, and/or per session, etc. Global scope would be best but is infeasible. It would also likely require longer IDs, making organized filenames inconveniently long.

Some of them seem to have an inherent upper scope limit, at least in the vast majority of cases: e.g., a tissue sample with a given ID can't come from different subjects, and a cell can't come from different tissues. The situation is not as clear at the subject or session level. Is a subject ID defined for the dataset, or is it centralized to a lab or some other entity: an archive (e.g., GUID in NDA), a country (SSN), a registry (ORCID), or global (a yet-to-be-established interplanetary registry)? For sessions: do two subjects with the same session ID (e.g., based on date) participate in a joint behavioral experiment, or does the session ID just disambiguate within a subject? Does a cell ID disambiguate within a session, a subject, or a tissue?

I think that for each of the IDs we (or nwb? bccn?) should establish a clear default scope, but we might want to allow explicitly stating or modifying it in dandiset.yaml.

.... more later, need to run

satra commented 4 years ago

for biccn, this will be eventually coordinated by the bcdc.

but in general, we should separate the notion of an identifier for an entity from its names or relations. internally, every object should have a uuid, but this may be mapped to other ids (e.g., lab, bids, nda). consider bids, where individual subject names are only unique within a dataset.

indeed provenance model can be used to relate these uuids, and properties of these objects can be used to relate to lab/dataset/consortium level ids.

yarikoptic commented 4 years ago

> consider bids, where individual subject names are only unique within a dataset.

where is that stated? It could be a center-wide id, allowing easy reuse of data across studies within a center.

satra commented 4 years ago

sorry, i should have been clearer. you can reuse IDs, but there is no formal mechanism to link IDs across datasets. thus, from a dataset perspective, an ID is only meaningful within that dataset. for example, there is no:

datasetX:idY isSameAs datasetZ:idV

the only link in bids is within the folder structure. an isolated file simply has an id.

bendichter commented 4 years ago

Thanks for bringing this up, Yarik. Good job explaining the issue. As far as I can tell, neither NWB nor DANDI enforces ID scopes at this point. Currently here are the common (implicit) scopes I'm seeing from labs:

global > lab > dandiset > subject > session

id: scope
subject: lab
session: lab (usually date-based)
cell_id: dandiset
tissue: subject or dandiset (where present)
slice: subject (where present)

We haven't dealt with subjects that are recorded in the same session, so we don't currently have a convention for that.

As you can see, there really isn't any consistency here. IDs are not following their minimal scope, nor the maximal scope, and while these are the most common, there will certainly be differences in scope in new datasets if we do not specify this.
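As a rough illustration only, the scope ordering and the commonly observed scopes listed above could be encoded as follows. The names `Scope`, `OBSERVED_SCOPES`, and `is_unique_within` are hypothetical and exist in neither NWB nor DANDI; where two scopes were listed ("subject or dandiset"), the sketch assumes the narrower one.

```python
from enum import IntEnum


class Scope(IntEnum):
    # Ordered from narrowest to broadest, mirroring
    # global > lab > dandiset > subject > session.
    SESSION = 1
    SUBJECT = 2
    DANDISET = 3
    LAB = 4
    GLOBAL = 5


# Commonly observed (implicit) scopes, copied from the list above.
# These are observations, not rules enforced by any tool.
OBSERVED_SCOPES = {
    "subject": Scope.LAB,
    "session": Scope.LAB,
    "cell_id": Scope.DANDISET,
    "tissue": Scope.SUBJECT,  # listed as "subject or dandiset"; narrower assumed
    "slice": Scope.SUBJECT,
}


def is_unique_within(id_kind: str, container: Scope) -> bool:
    """Return True if IDs of this kind are expected to be unique
    across everything inside the given container scope."""
    return OBSERVED_SCOPES[id_kind] >= container
```

Under these assumptions, `is_unique_within("subject", Scope.DANDISET)` holds because a lab-scoped ID is also unique within any one dandiset, while `is_unique_within("tissue", Scope.DANDISET)` does not.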

We can broaden the scope by concatenating IDs. For instance, if subject_id has scope dandiset and tissue has scope subject, we can make a new tissue id that has scope dandiset with [tissue_id]-[subject_id]. We can narrow the scope by coining new IDs, unique only within the desired scope.
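A minimal sketch of the concatenation idea, using made-up IDs. The helper name `broaden_scope` is hypothetical, and the separator check is an added assumption rather than anything proposed in this thread: without it, IDs that already contain "-" could produce ambiguous results.

```python
def broaden_scope(local_id: str, parent_id: str, sep: str = "-") -> str:
    """Combine a narrowly scoped ID with its parent's ID, in the
    [tissue_id]-[subject_id] order described above."""
    # Assumption: reject IDs that already contain the separator, otherwise
    # ("a-b", "c") and ("a", "b-c") would both map to "a-b-c".
    for part in (local_id, parent_id):
        if sep in part:
            raise ValueError(f"{part!r} contains the separator {sep!r}")
    return f"{local_id}{sep}{parent_id}"


# Two subjects that each have a tissue sample labelled "t1" (made-up IDs).
samples = [("t1", "sub01"), ("t1", "sub02"), ("t2", "sub01")]

# tissue_id alone is unique only within a subject; the combined ID is
# unique within the dandiset, provided subject_id is unique there.
dandiset_level_ids = [broaden_scope(t, s) for t, s in samples]
```

This is also where the filename-length concern from the first comment shows up: every broadening step makes the ID longer.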

In NWB, every container and dataset (but not attribute) is assigned a UUID, but these will be unique for each NWBFile, so I don't think that's going to help us here.

satra commented 4 years ago

remember that for DANDI there is human data involved, and some datasets will have GUIDs, but this can really be handled through metadata linkages.

slice_uuid isPartOf tissue_uuid .
tissue_uuid isPartOf subject_uuid .

session_uuid is orthogonal to the partonomy and is more about the provenance of some entity and related activity.

cell_uuid wasGeneratedBy session_uuid .
cell_uuid a Entity .
session_uuid a Activity .

a uuid can have many labels/ids

subject_uuid isSameAs some_other_uuid . 
subject_uuid alt_id "ds:000001::sub:001" .
subject_uuid alt_id "org:allen::sub:001" .

perhaps i am missing something, or this is being constrained not from an information perspective but by the current technologies being used.
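To make the linkage idea concrete, here is a small sketch in plain Python rather than an RDF library. It stores the statements above as (subject, predicate, object) tuples and keeps scoped labels separate from the UUIDs. The helper names `resolve` and `part_of_chain` are hypothetical, and the labels are the example strings from the comment.

```python
import uuid

# Internal identifiers: one opaque UUID per entity or activity.
subject, tissue, slice_, cell, session = (uuid.uuid4() for _ in range(5))

# (subject, predicate, object) statements mirroring the triples above.
triples = {
    (slice_, "isPartOf", tissue),
    (tissue, "isPartOf", subject),
    (cell, "wasGeneratedBy", session),
}

# One UUID can carry many scoped labels.
alt_ids = {
    subject: {"ds:000001::sub:001", "org:allen::sub:001"},
}


def resolve(label: str):
    """Return the UUID that carries the given external label, or None."""
    for node, labels in alt_ids.items():
        if label in labels:
            return node
    return None


def part_of_chain(node):
    """Follow isPartOf links upward and return the containing entities,
    nearest first. Assumes each entity has at most one direct parent."""
    chain = []
    current = node
    while True:
        parents = [o for (s, p, o) in triples if s == current and p == "isPartOf"]
        if not parents:
            return chain
        current = parents[0]
        chain.append(current)
```

Both the dataset-level and the organization-level label resolve to the same subject UUID, and the session stays outside the part-of chain, matching the point that it is orthogonal to the partonomy.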

yarikoptic commented 2 years ago

Note: I am moving to dandi-schema where I think such questions/discussions belong