the subject / sample question

lzehl commented 3 years ago

This task force is coordinated by @chrisvdt and @lzehl. Instructions for contributing: The first comment provides a description of the issue and is meant to trigger discussion and collect possible solutions or concrete ideas and aspects that are important to consider. Its content can be adapted over time. For change or extension requests, please get in touch with @lzehl. Ongoing discussion around this issue can be held through comments (as usual).

GENERAL TOPIC: While BIDS currently focuses on living (human) beings as a whole (defined in BIDS as subject or participant), neuroscience in general can be conducted on any living or dead (human or non-human) being as a whole ("subject") and on any living or dead tissue sample extracted from that being ("tissue sample"). A being can range from a human to an animal to a single-celled organism. What would a BIDS extension look like that covers all these subjects and tissue samples?

Other efforts to coordinate with on this topic:

microscopy BEP (https://bids.neuroimaging.io/bep031)
SPARC (contact: @tgbugs)

THE FOLDER HIERARCHY ISSUE: The classical BIDS model foresees a hierarchical folder structure with an inheritance principle for the metadata associated with each hierarchical level. For the raw data on living (or deceased) human beings (subjects) that means:

rawdata/
....sub-(label)/
........ses-(label)/ (optionally neglected if there is only one session)
............(data-type)/ (e.g., func, anat, dwi)

To make sure we are on the same page, here are the BIDS definitions of some terms:

Dataset - a set of neuroimaging and behavioral data acquired for a purpose of a particular study. A dataset consists of data acquired from one or more subjects, possibly from multiple sessions.
Subject - [sub-(label)] a person or animal participating in the study. Used interchangeably with term Participant.
Session - [ses-(label)] a logical grouping of neuroimaging and behavioral data consistent across subjects. Session can (but doesn't have to) be synonymous to a visit in a longitudinal study. In general, subjects will stay in the scanner during one session. However, for example, if a subject has to leave the scanner room and then be re-positioned on the scanner bed, the set of MRI acquisitions will still be considered as a session and match sessions acquired in other subjects. Similarly, in situations where different data types are obtained over several visits (for example fMRI on one day followed by DWI the day after) those can be grouped in one session. Defining multiple sessions is appropriate when several identical or similar data acquisitions are planned and performed on all -or most- subjects, often in the case of some intervention between sessions (for example, training).
Data type - [(data-type)] a functional group of different types of data. BIDS defines eight data types: func (task based and resting state functional MRI), dwi (diffusion weighted imaging), fmap (field inhomogeneity mapping data such as field maps), anat (structural imaging such as T1, T2, PD, and so on), meg (magnetoencephalography), eeg (electroencephalography), ieeg (intracranial electroencephalography), beh (behavioral). Data files are contained in a directory named for the data type. In raw datasets, the data type directory is nested inside subject and (optionally) session directories.

[Note that we do not have to stick to those definitions, but if we deviate from them we should state it explicitly to avoid misunderstandings in the discussions.]

To trigger the discussions: Let us assume we generalize the definition of a "subject" to be the "thing" that is studied (a whole species [living or dead] or any part of a species [living or dead]). If we strictly follow the inheritance principle of BIDS, the following structure could be assumed for several use cases:

rawdata/
....sub-(label)/ (e.g., a mouse)
........ses-(label)/ (optionally neglected if there is only one session)
............(data-type)/ (e.g., func, anat, dwi)
........sub-(label)/ (e.g., the whole brain of that mouse)
............ses-(label)/ (optionally neglected if there is only one session)
................(data-type)/ (e.g., anat, dwi, fixation)
............sub-(label)/ (e.g., a slice of that brain of that mouse)
................ses-(label)/ (optionally neglected if there is only one session)
....................(data-type)/ (e.g., func [e.g., patch-clamp], anat [e.g., histology])
................sub-(label)/ (e.g., a biopsy of that slice of that brain of that mouse)
....................ses-(label)/ (optionally neglected if there is only one session)
........................(data-type)/ (e.g., RNA analysis)
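As a rough illustration of the nested layout above, here is a minimal Python sketch that builds the directory for a chain of specimens. The function name `nested_path` and the labels are hypothetical and not part of BIDS; the sketch only shows how each derived part adds one more `sub-` level to the path.

```python
from pathlib import PurePosixPath


def nested_path(chain, session=None, data_type=None, root="rawdata"):
    """Build the directory for the last specimen in `chain`.

    `chain` lists labels from the whole organism down to the studied part,
    e.g. ["mouse01", "brain", "slice03"]. Each label becomes one nested
    sub-<label>/ directory, mirroring the strict-inheritance layout.
    """
    if not chain:
        raise ValueError("chain needs at least one label")
    path = PurePosixPath(root)
    for label in chain:
        path /= f"sub-{label}"
    if session is not None:  # the session level may be omitted
        path /= f"ses-{session}"
    if data_type is not None:
        path /= data_type
    return path


# A whole animal, then a slice of the brain of that animal.
print(nested_path(["mouse01"], session="01", data_type="func"))
print(nested_path(["mouse01", "brain", "slice03"], data_type="anat"))
```

With this layout the provenance of a part is readable from the path alone, at the cost of paths that grow with every derivation step.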

Questions:
1) What are the advantages and disadvantages of such a solution?
2) Is it wise to group whole species together with extracted parts under one common term (e.g., "subject" or "specimen")?
2.1) To what extent do relevant metadata differ between a "subject" and a "tissue sample" (as defined in the general description)?
3) To what extent does the "provenance" of a subject (as a whole or as a part) need to be covered in the repository/folder structure?
3.1) Could a subject folder also be interpreted as a group/collection while still allowing the identification of a member of that group/collection?
4) What would a solution for the opposite approach look like (keeping everything in a flat structure)?
4.1) How would that affect the metadata storage / concept of BIDS?

[NOTE: Please get in touch with @lzehl to request changes / extensions for this first comment to keep it up-to-date with the results of the discussions in the remaining comments.]

chrisvdt commented 3 years ago

In the "openMINDS_core - full metadata model for a dataset" slide Lyuba showed us, there was a tag named ProtocolExecution linked to study target. In our chat during the last meeting I commented on this, and Tom Gillespie noted, I quote: "protocol execution is the process of an experimenter carrying out an experiment". In BIDS, this would be an experimental session, and the study target would be the subject.

As far as I understand from ontological descriptions in general, names are defined by their relationships. A parent has a child. A child can be anything, even a parent, but a child is a child because it has a parent. So a ProtocolExecution (experimental session) has a study target, or subject, on which the experiment is performed.

As an experimenter, I have no problem calling the subject of my experiment the subject, whether my subject is an animal, a tissue/blood sample or whatever. But of course this holds only in the context of an experiment, and I think this is how we need to solve this problem about subjects or samples. I record samples from a subject in an experiment. The names depend on the relationship.

apdavison commented 3 years ago

@chrisvdt I would say that sometimes a protocol execution corresponds to an experimental session. However, in general several protocols might be carried out during a single session; conversely, a protocol could span multiple sessions (e.g. training a rat).

@lzehl please could you edit the issue description to explain in more detail what the question is?

apdavison commented 3 years ago

A case study for needing both a subject and multiple samples at different levels would be paired patch-clamp recordings, in which an experimenter records from two neurons simultaneously in a brain slice in vitro, with two different electrodes. We will have metadata for each of the recorded cells (e.g. the electrode solution, access resistance), for the brain slice (e.g. the bath solution), and for the subject (sex, genotype, tracer injection prior to brain slicing, etc.)
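To make the three metadata levels in this case study concrete, here is a small Python sketch in which each recorded cell resolves its metadata through the slice and the subject. All keys and values are made-up placeholders, and `resolved_metadata` is a hypothetical helper, not an existing BIDS or openMINDS function.

```python
from collections import ChainMap

# Placeholder metadata for one subject, one slice and two recorded cells.
subject = {"sex": "F", "genotype": "wild type"}
brain_slice = {"bath_solution": "solution X"}
cell_a = {"electrode_solution": "solution Y", "access_resistance_MOhm": 12.0}
cell_b = {"electrode_solution": "solution Y", "access_resistance_MOhm": 15.0}


def resolved_metadata(*levels):
    """Merge metadata levels given from most specific to least specific.

    ChainMap looks keys up in the first mapping that contains them, so a
    more specific level overrides a more general one.
    """
    return dict(ChainMap(*levels))


for name, cell in (("cell_a", cell_a), ("cell_b", cell_b)):
    meta = resolved_metadata(cell, brain_slice, subject)
    print(name, sorted(meta))
```

Both cells share the slice-level and subject-level entries but keep their own electrode metadata, which is the situation a single subject level cannot express.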

satra commented 3 years ago

in dandi we are using the following hierarchy of information for the moment: subject > tissue sample > slice > cell/probe (everything is optional except subject, and anything can have a one-to-many mapping). the one thing that's harder in the current schema is when a cell id, say in a volume image, corresponds to multiple slices.
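The hard case mentioned here can be sketched in a few lines of Python. This is not the DANDI schema; `Record`, its fields and the ids are hypothetical and only illustrate why a cell that spans several slices no longer fits a strict one-to-many hierarchy.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    id: str
    level: str  # "subject", "tissue_sample", "slice" or "cell"
    parents: list = field(default_factory=list)


def is_tree(records):
    """Return True if every record has at most one parent."""
    return all(len(r.parents) <= 1 for r in records)


records = [
    Record("sub-01", "subject"),
    Record("tissue-A", "tissue_sample", ["sub-01"]),
    Record("slice-1", "slice", ["tissue-A"]),
    Record("slice-2", "slice", ["tissue-A"]),
    # A cell identified in a volume image that spans two slices.
    Record("cell-7", "cell", ["slice-1", "slice-2"]),
]

print(is_tree(records[:4]))  # only subject, tissue sample and slices
print(is_tree(records))      # including the cell with two parent slices
```

Once any record has two parents, the relations form a graph rather than a tree, so they can no longer be encoded by folder nesting alone.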

lzehl commented 3 years ago

> In the "openMINDS_core - full metadata model for a dataset" slide Lyuba showed us, there was a tag named ProtocolExecution linked to study target. In our chat during the last meeting I commented on this, and Tom Gillespie noted, I quote: "protocol execution is the process of an experimenter carrying out an experiment". In BIDS, this would be an experimental session, and the study target would be the subject.
>
> As far as I understand from ontological descriptions in general, names are defined by their relationships. A parent has a child. A child can be anything, even a parent, but a child is a child because it has a parent. So a ProtocolExecution (experimental session) has a study target, or subject, on which the experiment is performed.
>
> As an experimenter, I have no problem calling the subject of my experiment the subject, whether my subject is an animal, a tissue/blood sample or whatever. But of course this holds only in the context of an experiment, and I think this is how we need to solve this problem about subjects or samples. I record samples from a subject in an experiment. The names depend on the relationship.

Although I think this is a bit off-topic in this issue, I need to correct the interpretation of openMINDS a bit here (sorry that this did not become clear in the documentation; an update there will follow shortly).

The ProtocolExecution is a schema that captures a single execution of an experimental or analysis process and therefore has "inputs", "outputs" and "study targets". Here is an example:

The brains of two Subjects, or actually the default mode network of two Subjects at specific SubjectStage(s), e.g. at age 40 years (sub-01) and at age 32 years (sub-02), were imaged in a resting-state fMRI experiment. In openMINDS we would register:

two Subjects, each connected to the corresponding SubjectStage
two Subject(Stage)-specific ProtocolExecutions, each taking one SubjectStage as "input", producing a corresponding FileInstance as "output" and linking to a generic Protocol
one generic Protocol, linking the used "technique" (fMRI), the "task" (resting state) and the "study target" (default mode network)

If the "study target" differs between ProtocolExecutions, it can also be specified there. Within openMINDS a Subject (or actually its SubjectStage) cannot be a "study target", but has to be an "input" or "output" to a ProtocolExecution.

Possible "input" to a ProtocolExecution: Subject(Group)Stage, TissueSample(Collection)Stage, FileInstance, FileBundle
Possible "output" to a ProtocolExecution: Subject(Group)Stage, TissueSample(Collection)Stage, FileInstance, FileBundle

A "study target" can be attached to a Protocol or a ProtocolExecution. Possible "study targets" (selection): a specific species, a specific cell type, a specific network, a specific receptor type, a specific brain region, etc.
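The relationships described above can be sketched with two plain Python dataclasses. This is not the openMINDS schema or its Python API; the class layout, field names and string values are simplifying assumptions that only mirror the resting-state example from this comment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Protocol:
    technique: str
    task: str
    study_target: Optional[str] = None


@dataclass
class ProtocolExecution:
    protocol: Protocol
    inputs: list   # e.g. subject stages, tissue sample stages or files
    outputs: list  # e.g. files or new subject / tissue sample stages
    study_target: Optional[str] = None  # set only if it differs per execution

    def effective_study_target(self):
        """Use the execution's own target, else fall back to the protocol's."""
        return self.study_target or self.protocol.study_target


rest = Protocol("fMRI", "resting state", "default mode network")
executions = [
    ProtocolExecution(rest, ["sub-01 at age 40 years"], ["file for sub-01"]),
    ProtocolExecution(rest, ["sub-02 at age 32 years"], ["file for sub-02"]),
]

for execution in executions:
    print(execution.inputs, "->", execution.effective_study_target())
```

In this sketch the subject stage only ever appears as an input, while the study target lives on the protocol (or on an individual execution when it differs).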

In BIDS this would mean the ProtocolExecution will be at least a session, but could also be more fine-grained than that (e.g. files that belong to one data-type). The Subject (or actually its Stage) would remain the input to that ProtocolExecution.

The "study target" depends on the experiment (technique/task):

for a resting state task it could be the default mode network
for a tracer injection it could be inter-region connectivity
for autoradiography it could be a specific receptor type
etc

@chrisvdt : I hope I could clarify a bit more the openMINDS model part of this? Let me know if you have questions on my summary.

chrisvdt commented 3 years ago

@lzehl : Based on your last description I think I have interpreted study target correctly as an entity that depends on an experiment. Although not the subject or subject stage (because in your model they are inputs), it is something about that subject, in your example the "default network" that is studied.

What if I give this "default network" an id that I can use as a reference to a subject stage or actual subject? I am not an archivist, I'm an experimentalist; my life revolves around experiments, so if I do an experiment on "default network with id" I would usually call this my subject. But let's call it ExSubject for now.

@apdavison mentioned the in vitro slice recording with simultaneously recorded neural activity. (A similar case might be a two-photon imaging sequence with hundreds of simultaneously recorded neurons.) Experimentally it makes no difference how many cells or signals are recorded; I simply need to put my session in one (1) context, which would be the in vitro slice as ExSubject.

What is a session in this context? @apdavison : I see that there can be several interpretations of what a protocol execution could be.

In recording time series, my view of a session is simply one time-series recording (which can include various behavioural modalities in separate files) using one stimulus or intervention sequence. I might have to use ExSubject multiple times when I apply different experimental interventions in subsequent recording sessions, which indicates that ExSubject should have a unique id.

So, coming back to the questions posed at the beginning:

How would the following tissue sample options fit in such a hierarchical structure?
1.1) a tissue sample that originated from a subject (e.g., an extracted brain)
1.2) a tissue sample collection originating from a subject (e.g., multiple biopsies)
1.3) tissue samples that originate from a tissue sample (e.g., brain slices of an extracted brain)
1.4) artificial tissue samples (e.g., cell lines)

These questions are relevant from an archivist's point of view, but not from an experimental view. I think the questions should be:
1.1) Is a tissue sample a result collected during an experimental session, or is it that which is studied and collected from during that session?
1.2) What in my experimental session should have a unique id that characterizes both the origin and the target of my study?

I am stressing the experimental view because BIDS and any alternative schema should be applied by experimentalists in the context of their research. Fitting this research into MINDS is in my view a whole other level of conceptualization. My concern arises because I really think we should reduce this schema to its bare essentials; otherwise it will get too complicated.

robertoostenveld commented 3 years ago

for background, in BIDS for human cognitive neuroimaging I would say the hierarchy is:

subject (expressed as directory) - session (expressed as directory) - run (expressed in multiple files) - trial (expressed in a single file and in events.tsv)

SylvainTakerkart commented 3 years ago

Following our discussion during the April 8, 2021 meeting of the INCF Working Group on neuroscience data structure, we reached a consensus within the group to keep working with BIDS' current hierarchy and to deal with the information that is not represented in the hierarchy itself (e.g., tissue, slice, cell) by adding it to the metadata, rather than trying to add extra levels to the hierarchy. This extra information (tissue, slice, cell, probe, sample, etc.) might be added as new BIDS entities (i.e., in the filenames); we will open an issue on the BIDS GitHub repo to discuss this further directly at the BIDS level (rather than discussing it separately in different BEPs).

satra commented 3 years ago

my suggestion would be, whenever the new issue is posted, to simply have a sample entity and then to add a sample_type (block, slice, cell, etc.) in the proposed samples.tsv. however, this would require sample entities to be unique to disambiguate those levels. in dandi we add them in names to simplify filename-based search, but i wanted to stick them in the metadata. (@yarikoptic made me take them out :) ) but in seriousness this goes more to a broader identifier discussion.
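A minimal Python sketch of this suggestion follows. The column names `sample_id` and `participant_id`, the labels, the `photo` suffix and the helper functions are assumptions made for illustration; only the sample entity, samples.tsv and `sample_type` come from the comment above. Uniqueness is checked per participant here, which is one possible reading of the requirement.

```python
import csv
import io
from collections import Counter

# Hypothetical samples.tsv content: one row per sample, typed by sample_type.
samples_tsv = (
    "sample_id\tparticipant_id\tsample_type\n"
    "sample-A\tsub-01\tblock\n"
    "sample-B\tsub-01\tslice\n"
    "sample-C\tsub-01\tcell\n"
)


def read_samples(text):
    """Parse tab-separated text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def duplicated_ids(rows):
    """Return (participant, sample) pairs that occur more than once."""
    counts = Counter((r["participant_id"], r["sample_id"]) for r in rows)
    return sorted(key for key, n in counts.items() if n > 1)


def file_stem(row, suffix):
    """Flat filename stem: the sample is an entity, not a folder level."""
    return f"{row['participant_id']}_{row['sample_id']}_{suffix}"


rows = read_samples(samples_tsv)
print(duplicated_ids(rows))
print([file_stem(r, "photo") for r in rows])
```

Because the level (block, slice, cell) is stored only in `sample_type`, two samples at different levels must not share a label, which is why the uniqueness check matters.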

SylvainTakerkart commented 3 years ago

for info, an issue has been opened directly at the BIDS level so that we deal with this in a common manner across BEPs (at the moment, the animal-ephys BEP and the microscopy BEP)...

so unless there are specific non-BIDS topics, I suggest we keep exchanging over there: https://github.com/bids-standard/bids-specification/issues/779

tgbugs commented 3 years ago

Here is the promised write-up with an overview of the problem space, a potential model, and a review of the trade-offs that I see for BIDS based on my experience implementing and maintaining the SDS and its validation pipelines. I'm also dropping this in https://github.com/bids-standard/bids-specification/issues/779.

https://github.com/SciCrunch/sparc-curation/blob/master/docs/participants.org

If you have targeted questions or comments you can leave them on this commit: https://github.com/SciCrunch/sparc-curation/commit/c5968b94ca6de568f58acf146fce1f20140c7fcf

lzehl commented 3 years ago

summary of conclusions of the microscopy BEP (by @tgbugs):

suggestion 1: extend the notion of "subject" to include samples (rejected for now; "subject" has to be an organism)
suggestion 2: introduce a new entity called "sample" (with a "type" attribute identifying the type of sample)

Suggestion 2 is part of the microscopy BEP PR: https://github.com/bids-standard/bids-specification/pull/812

robertoostenveld commented 2 years ago

I just read in the paper "A collaborative resource platform for non-human primate neuroimaging" on PRIME-RE:

"Standardization efforts in human neuroimaging have yielded the Brain Imaging Data Structure (BIDS) standard, a widely adopted file-naming and data-structure convention that facilitates data and resource sharing. Whereas a range of tools on PRIME-RE understand data in BIDS format, the compatibility of NHP data with the BIDS structure is not perfect, again due to some unique challenges in NHP neuroimaging that are not present in human neuroimaging. There are, however, ongoing efforts to either expand the BIDS format so that it can better incorporate the idiosyncrasies of NHP data, or create a derivative NHP version that fulfills specific requirements to this research field."

(emphasis mine)

The specific experiences of those authors (probably all from an imaging rather than ephys background) might be relevant to follow up.

INCF / neuroscience-data-structure

the subject / sample question #9