Closed mariehbourget closed 2 years ago
thanks @mariehbourget for this! we clearly need something of this type for BEP032 but haven't had a look at it precisely yet... we will take care of this in the next few weeks with @JuliaSprenger and others...
`sample` is unique throughout the dataset, so that if `sample-1` exists in `sub-01`, there cannot be a `sub-02_sample-1`. This is an unusual requirement compared to existing entities. Could you elaborate on the motivation?
How does pathology/diagnosis overlap with phenotype?
regarding pathology, this should be annotated at the sample level in that one may have a tumor sample in one case versus a non-tumor location in the same participant. thus pathology goes with sample rather than participant. indeed for human and potentially other species there should be a Dx column for diagnosis. note that this could also vary by date and could therefore be in the sessions.tsv rather than participants.tsv. more generally there should be a conversation about inheritance of properties from participants to sample, when all samples share those properties.
also the same samples could be used in multiple sessions. hence having a mechanism to consolidate that would be necessary, and hence samples are similar to participants in that sense. in many cases, samples, rather than participants are often the primary entity in studies. and in keeping with bids it was discussed in the subgroups that a dataset should connect a sample to a participant even if the participant details are unknown.
btw, samples could also be used in human MR scans (e.g., left/right hemi ex vivo, brainstem, etc.), and hence samples should be considered as a generic concept in bids, rather than specialized just for microscopy/ephys.
How does pathology/diagnosis overlap with phenotype?
@dorahermes - one example use case could be something like a diagnosis column that says Major Depressive Disorder (or ICD10 code), but the diagnosis itself could have been attached to a phenotype file(s) (e.g., KSADS, HAMD, etc.) or simply a clinical evaluation which may not have a phenotypic assessment in many cases.
- Sample entity: As this is written, it seems to require that `sample` is unique throughout the dataset, so that if `sample-1` exists in `sub-01`, there cannot be a `sub-02_sample-1`. This is an unusual requirement compared to existing entities. Could you elaborate on the motivation?
The intention is not to require "unique" sample_id across a dataset. We think people should be able to have the same sample_id for two different subjects as you are suggesting. In samples.tsv, that would give something like this:
| sample_id | participant_id | sample_type |
|---|---|---|
| sample-1 | sub-1 | tissue |
| sample-1 | sub-2 | tissue |
| sample-1 | sub-3 | tissue |
| sample-2 | sub-3 | tissue |
| sample-3 | sub-3 | tissue |
So the "unique" identifier is the combination of `sample_id` and `participant_id`, and not `sample_id` alone.
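As a minimal sketch of what that composite key implies, the check below looks for repeated (`participant_id`, `sample_id`) pairs in rows copied from the example table above. The function name `duplicate_keys` is hypothetical and not part of any BIDS tooling.

```python
from collections import Counter

# Rows mirror the example samples.tsv above: (sample_id, participant_id, sample_type).
rows = [
    ("sample-1", "sub-1", "tissue"),
    ("sample-1", "sub-2", "tissue"),
    ("sample-1", "sub-3", "tissue"),
    ("sample-2", "sub-3", "tissue"),
    ("sample-3", "sub-3", "tissue"),
]


def duplicate_keys(rows):
    """Return the (participant_id, sample_id) pairs that occur more than once."""
    counts = Counter(
        (participant_id, sample_id) for sample_id, participant_id, _ in rows
    )
    return sorted(key for key, n in counts.items() if n > 1)


# sample-1 appears three times, but never twice for the same participant,
# so the composite key is still unique for every row.
print(duplicate_keys(rows))
```

A repeated `sample_id` is accepted here as long as the `participant_id` differs; only an exact repeat of the pair would be reported.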
So the "unique" identifier is the combination of `sample_id` and `participant_id`, and not `sample_id` alone.
this sounds ok to me! we should just give an example that is a bit more telling than just "sample-1", "sample-2" to be immediately understandable just by looking at it... question: what do experimenters use as user-friendly ids for their samples?
We should also discuss if (and how) we want to encode an additional identifier when a sample is derived from another sample (e.g., a slice is derived from a block of tissue).
we had a discussion in our last BEP32 meeting about the possibility of adding several entities ('sample', but also 'slice' and 'tissue')... I don't want to deviate from the goal of this thread, but maybe we should have this discussion globally here? I mean, asking ourselves how many entities should be added and which ones? or whether adding just the 'sample' entity and dealing with everything else through the 'sample_type' can cover all the targeted use cases? with this latter solution, indeed, the quoted question (i.e. "how do we encode the fact that a slice is derived from a block of tissue") should be addressed!
small detail: although it is just an example / suggestion, the current specification mentions "group" as one of the columns in `participants.tsv`, so if "pathology/diagnosis" finds its way into participants, this example might need to be amended or clarified, otherwise this could lead to some confusion.
If sample labels can be reused across subjects, I think we can do the following:
1) Drop the `participant_id` column.
2) Follow the inheritance principle.
If the sample labels are the same across subjects, a global `samples.tsv` would provide the information needed. If they vary across subjects, then a set of `sub-<label>/sub-<label>_samples.tsv` files can be created.
regarding pathology, this should be annotated at the sample level in that one may have a tumor sample in one case versus a non-tumor location in the same participant.
I think @satra's suggestion here is good, and that making `pathology` a column in `samples.tsv` would resolve the concerns I had above.
Indeed for human and potentially other species there should be a Dx column for diagnosis. note that this could also vary by date and could therefore be in the sessions.tsv rather than participants.tsv.
Yes, I think diagnosis as a session-level variable makes sense. As an aside, I don't think we have a principle that says how to do session-level variables for single-session studies that omit the `ses-<label>/` directory, but that would be worth clarifying if we add variables that are useful in single-session contexts.
@SylvainTakerkart
we had discussion in our last BEP32 meeting about the possibility of adding several entities ('sample', but also 'slice' and 'tissue')...
We had similar discussions in BEP031 for other additional entities. The way we handled this so far is based on what entities are needed to distinguish between 2 different files of the same subject. For example, metadata like “sample_type” (primary cell, tissue, etc) is a unique attribute of the sample itself and would not change for the same subject_sample. In those cases, we think the information would be best encoded in metadata and not in the filename.
how do we encode the fact that a slice is derived from a block of tissue
I would suggest adding a `derived_from` column in `samples.tsv` to cover this. Ex: `sample-X` from `sub-1` is a block of tissue that is imaged. Then `sample-X` is cut into slices named `sample-x1`, `sample-x2`, `sample-x3` by the experimenter and imaged. The link between the samples could be in `samples.tsv` as:
| sample_id | participant_id | sample_type | derived_from |
|---|---|---|---|
| sample-X | sub-1 | tissue | n/a |
| sample-x1 | sub-1 | tissue | sample-X |
| sample-x2 | sub-1 | tissue | sample-X |
| sample-x3 | sub-1 | tissue | sample-X |
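To illustrate how a `derived_from` column could be used, here is a rough sketch that walks the chain of parent samples for the `sub-1` rows above. It assumes labels are unique within one subject and that `n/a` means "not derived from another sample"; the helper name `ancestry` is made up for this example.

```python
# sample_id -> derived_from for the sub-1 rows of the example table above.
# None stands in for the "n/a" cell (assumption: n/a means no parent sample).
derived_from = {
    "sample-X": None,
    "sample-x1": "sample-X",
    "sample-x2": "sample-X",
    "sample-x3": "sample-X",
}


def ancestry(sample_id, derived_from):
    """Return the chain of parent samples, nearest parent first."""
    chain = []
    seen = {sample_id}
    parent = derived_from.get(sample_id)
    while parent is not None:
        if parent in seen:
            # A sample cannot be its own ancestor; flag the bad table.
            raise ValueError(f"cycle in derived_from at {parent}")
        chain.append(parent)
        seen.add(parent)
        parent = derived_from.get(parent)
    return chain


print(ancestry("sample-x1", derived_from))
```

Because the relationship lives in the table, a slice of a slice only adds one more row; the filename still carries a single `sample-<label>`.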
@effigies
If sample labels can be reused across subjects, I think we can do the following:
- Drop the participant_id column.
- Follow the inheritance principle.
If the sample labels are the same across subjects, a global samples.tsv would provide the information needed.
I’m not sure I understand you on this.
Ex: 2 subjects (`sub-1` and `sub-2`) have a sample named `sample-1`. However, the metadata of the `sample-1` from `sub-1` is not necessarily the same as for the `sample-1` from `sub-2`. I don’t understand the utility of a global file without the `participant_id` column, as it would not make the distinction between the two.
@mariehbourget in the SPARC Dataset Structure we also include a "derived_from" (i.e. wasDerivedFromSample) in the samples metadata file: https://docs.google.com/presentation/d/1EQPn1FmANpPsFt3CguU-JOQVMMlJsNXluQAK_gb2qVg/edit#slide=id.p9
@mariehbourget
Ex: 2 subjects (`sub-1` and `sub-2`) have a sample named `sample-1`. However, the metadata of the `sample-1` from `sub-1` is not necessarily the same as for the `sample-1` from `sub-2`. I don’t understand the utility of a global file without the `participant_id` column, as it would not make the distinction between the two.
If the metadata for `sample-1` is the same across subjects, it can be placed in a global file. If it differs, it can be placed in `sub-1/sub-1_samples.tsv` and `sub-2/sub-2_samples.tsv`.
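The lookup order described here could be sketched as follows. This is only an illustration of the proposal in this comment, not how the BIDS validator resolves files; `samples_file_for` and the set of paths are hypothetical.

```python
def samples_file_for(participant_id, existing_files):
    """Pick the samples table that applies to one participant.

    existing_files is a set of paths relative to the dataset root.
    A subject-level file takes precedence over the global one.
    """
    specific = f"{participant_id}/{participant_id}_samples.tsv"
    if specific in existing_files:
        return specific
    if "samples.tsv" in existing_files:
        return "samples.tsv"
    return None


# Hypothetical dataset: sub-1 has its own table, sub-2 falls back to the global one.
files = {"samples.tsv", "sub-1/sub-1_samples.tsv"}
print(samples_file_for("sub-1", files))
print(samples_file_for("sub-2", files))
```

Under this scheme a tool that wants one dataset-wide table has to visit every subject folder and merge the results, which is the reconstruction cost raised later in the thread.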
I think it'd be great to hear from @tgbugs here... if we manage to handle all this consistently across BEP31, BEP32 and SPARC, that'd be fantastic to facilitate future inter-operability... (as was just said in the BEP31 meeting ;) )
From discussions at the meeting, I think the global `samples.tsv` might be compelling for this use case. My concerns are primarily aesthetic, preferring the file location to match the objects being described, but if doing it that way would require everybody to reconstruct the global table in software, it's not worth forcing.
Here is my write-up with an overview of the problem space, a potential model, and a review of the trade-offs that I see for BIDS based on my experience implementing and maintaining the SDS and its validation pipelines. I'm also dropping this in https://github.com/INCF/neuroscience-data-structure/issues/9.
https://github.com/SciCrunch/sparc-curation/blob/master/docs/participants.org
If you have targeted questions or comments you can leave them on this commit. https://github.com/SciCrunch/sparc-curation/commit/c5968b94ca6de568f58acf146fce1f20140c7fcf
@effigies your concerns about forcing the reconstruction of the global table are well founded and I discuss the trade-offs in detail.
Thank you everyone for your comments, suggestions and feedback!
@tgbugs thank you very much for your insightful comment in https://github.com/SciCrunch/sparc-curation/blob/master/docs/participants.org. I am responding here so that the discussion stays centralized within a single issue thread (otherwise it is difficult to keep a clear track history of the discussion).
Regarding the `subject` entity: We are strongly in favor of keeping the current BIDS definition of `subject`, i.e. "a person or animal participating in the study". This is important to ensure compatibility between BIDS modalities (ex: a study with both microscopy and MRI where the subject must refer to the same organism).
Regarding “collective participants” such as “populations” or “pool”: This may go beyond the scope of this discussion. Moreover, based on our multiple meetings with the different groups who have given us feedback and exposed typical use case scenarios, this scenario currently falls in the 20% of the 80/20 BIDS principle. It seems reasonable to think that the label of the `sub-<label>` or `sample-<label>` could be used to describe this particular case (e.g. `sub-pool01`).
Regarding different “experimental groups” (e.g. control and treatment): `group` is already mentioned as an example field for this purpose in `participants.tsv` in the current stable version of the BIDS standard (1.6.0).
Regarding the `sample` entity: Our proposition adds a single new entity `sample` to the file name to describe any `sample_type` (tissue, primary cell, etc). The `sample_type` is described in `samples.tsv` as it is an attribute of the `sample` and not necessary to distinguish between two files of the same sample. The advantage is to have a simple structure where each file has a unique identifier (e.g.: `sub-1_sample-1`). Because this prefix is in the file name, there would be no ambiguity between files if 2 subjects have the same sample numbers (e.g. `sub-1_sample-1` and `sub-2_sample-1`).
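A small sketch of how the subject and sample labels could be read back from such a filename follows. The regular expression, the `file_key` helper, and the `photo.jpg` suffix are illustrative assumptions rather than anything defined by the BEP.

```python
import re

# Matches "sub-<label>", "ses-<label>" or "sample-<label>" at the start of the
# name or right after an underscore. Labels are assumed to be alphanumeric.
ENTITY = re.compile(r"(?:^|_)(sub|ses|sample)-([A-Za-z0-9]+)")


def file_key(filename):
    """Return the (subject label, sample label) pair encoded in a filename."""
    entities = dict(ENTITY.findall(filename))
    return entities.get("sub"), entities.get("sample")


# Hypothetical filenames: same sample label, different subjects.
print(file_key("sub-1_sample-1_photo.jpg"))
print(file_key("sub-2_ses-01_sample-1_photo.jpg"))
```

The two files share the label `sample-1`, yet their keys differ because the subject label is always part of the name.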
Regarding file system: All files from the same subject remain in the `sub-XX` folder and not in additional nested folders. This structure avoids the added complexity and pitfalls of a nested folder structure mentioned by @tgbugs.
Regarding metadata files: The proposed solution with the `sample` entity requires only one additional file (`samples.tsv`) at the root of the dataset. The column `participant_id` is common to both `participants.tsv` and `samples.tsv` to join the tables. It seems reasonable to use two different files to distinguish between attributes of the `subject` and attributes of the `sample`.
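For illustration, joining the two tables on `participant_id` might look like the sketch below, using only the standard library. The table contents (including the species values) and the helper names `read_tsv` and `join_samples` are invented for the example.

```python
import csv
import io

# Hypothetical file contents; a real dataset would read these from disk.
PARTICIPANTS_TSV = (
    "participant_id\tspecies\n"
    "sub-1\tMus musculus\n"
    "sub-2\tMus musculus\n"
)
SAMPLES_TSV = (
    "sample_id\tparticipant_id\tsample_type\n"
    "sample-1\tsub-1\ttissue\n"
    "sample-1\tsub-2\ttissue\n"
)


def read_tsv(text):
    """Parse tab-separated text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def join_samples(participants_text, samples_text):
    """Attach each sample row to the attributes of its subject."""
    participants = {
        row["participant_id"]: row for row in read_tsv(participants_text)
    }
    joined = []
    for sample in read_tsv(samples_text):
        # Raises KeyError if a sample refers to a subject that is not listed.
        subject = participants[sample["participant_id"]]
        joined.append({**subject, **sample})
    return joined


for row in join_samples(PARTICIPANTS_TSV, SAMPLES_TSV):
    print(row["participant_id"], row["sample_id"], row["species"])
```

Subject attributes stay in one place and are copied onto samples only at read time, which is the separation of concerns argued for above.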
A specific entity in the filename should only be used when there is a need to distinguish two files from the same subject and sample (e.g. for microscopy: `session`, `stain`, `chunk`, `run`).
`participants.tsv` should be reserved for subjects’ attributes (e.g. `age`, `sex`, `species`, `diagnosis`).
`samples.tsv` should be reserved for samples’ attributes (e.g. `sample_type`, `derived_from`, `pathology`).
Hi @jcohenadad, thanks for taking a look. Here are my thoughts.
With respect to compatibility between modalities, the only thing that matters in that case is the distinct identity of the participant, not distinct identifier type.
The underlying conceptual types for the entities referred to by the identifiers remain distinct (organismal subjects are not biological samples). The type of the identifier in how it defines a namespace for uniquely identifying distinct individuals is different from that conceptual type. I am suggesting to extend the `sub-` identifier type to be used to name anything in a BIDS dataset that has data about it. This is consistent with how `sub-` is used in BIDS.
The underlying conceptual type of course must be retained, it couldn't be otherwise, the key is to deconflate the conceptual type from the identifier type. This means making the conceptual participant type explicit in the schema rather than implicit in the identifier type.
As such, my suggestion does not prevent the ability to distinguish metadata for participants subjected to different modalities, because it is about the type of the identifier, not the identity of the identifier. If I have sub-1 that was subjected to both microscopy and MRI, then both have the same identifier because there was only one participant that had measurements made on it [cue the Far Side "It's a mammoth" cartoon]. If there was a sample derived from sub-1 that was subjected to microscopy then I would simply call it sub-2. The identity of the organismal subject and the sample subject are thus differentiated, without adding complexity to the model by forcing them to have different identifier types.
I can understand the desire to force the conceptual types (e.g. of subject vs sample) onto the identifiers, but this doesn't actually buy anything and only increases complexity, because `sub-` can already distinguish sub-1 from sub-2 in the modality use case. Yes, we can also distinguish sub-1 from sam-1, but why add that complexity?
The suggestion to use sub-pool01 will not work unless the meaning of `sub-` is extended in the way I propose, because the metadata requirements for pools cannot be enforced correctly unless there is a way to distinguish between collective and atomic participants independent of their identifier type prefix.
Even if this is a 20% use case, it is one that must be considered when designing the 80% case because of the fundamental differences in what can be required in the metadata of atomic vs collective entities.
If BIDS cannot distinguish between those cases, then it will become an issue down the line, because there will be datasets e.g. in SPARC that we will not be able to convert to BIDS in a way where we can correctly enforce the structure of the metadata. For consortia looking to adopt a standard that they can use across all datasets, this will cause BIDS to continue to be unattractive.
Yep. That section was included for completeness and for other audiences as BIDS already does this.
My suggestion is that rather than adding this complexity to only samples, that it be added to all participants. Eventually it will be needed for everything, and if it is only applied to samples then there will be duplicated effort and duplicate schemas down the line.
Furthermore and more importantly, I strongly warn against allowing sample identifiers to be reused across subjects. We just made a change to prevent this in the SPARC data structure because of the numerous implementation, usability, and ambiguity issues that it causes.
Sample IDs should at the very least be unique per dataset. Speaking from experience, allowing non-unique sample ids that must be composed with a subject identifier to form a primary key is a bad idea. I'm happy to elaborate on the years of headaches that it has caused.
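A dataset-wide uniqueness rule of this kind could be checked with something like the sketch below. It is not taken from the SPARC tooling; `reused_sample_ids` is a hypothetical helper and the rows are made up.

```python
from collections import defaultdict


def reused_sample_ids(rows):
    """Map each sample_id used by more than one participant to those participants.

    rows is an iterable of (sample_id, participant_id) pairs.
    """
    owners = defaultdict(set)
    for sample_id, participant_id in rows:
        owners[sample_id].add(participant_id)
    return {
        sample_id: sorted(subjects)
        for sample_id, subjects in owners.items()
        if len(subjects) > 1
    }


# Hypothetical rows: sample-1 is reused by two subjects, sample-2 is not.
rows = [
    ("sample-1", "sub-1"),
    ("sample-1", "sub-2"),
    ("sample-2", "sub-3"),
]
print(reused_sample_ids(rows))
```

An empty result would mean that `sample_id` alone works as a primary key for the dataset, with no need to compose it with the subject identifier.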
I should note that nesting files with sam-x in their name encounters the exact same issue as nesting folders and thus the problem remains.
In theory this could be mitigated to some extent by forcing users to always include the subject id from which a sample was derived but as I mention above, creating composite primary keys from subject sample pairs is not a good idea. The problems that it causes down the line are simply not worth the trade-off of being able to detect that a sample file has been put in the wrong subject folder. Further, what happens in cases with multiply derived samples? sub-1-sam-1-sam-1?
There are many other lurking issues here and I suggest avoiding them entirely.
It is reasonable to use two different files to represent two different schema. However, the problem is that there are going to be more than two files in the future if the principle is "a new file for every new schema and/or every new participant type."
Having a sparse tabular schema is also a reasonable way to solve this problem, which is significantly more attractive because BIDS accepts JSON as a metadata format, where the sparseness is not an issue. It also has the benefit of only requiring the modification of the schema to add a new set of fields for a new conceptual participant type, rather than the addition of a whole new file. Some back of the napkin math suggests that creating a new table per participant type will wind up with BIDS eventually having well over a dozen different files one for each of the participant types that I enumerate.
While adding additional files to avoid sparseness may seem reasonable if there is only a single new table, it does not seem reasonable if there will eventually be multiple new files and tables.
In a sense splitting samples.tsv into its own file is trying to solve a user interface problem in the data model. I suggest that BIDS not try to solve the user interface problem here and avoid multiplying specialized metadata files.
With regard to the suggestions.
Also re: https://github.com/INCF/neuroscience-data-structure/issues/9
Hi @tgbugs, thank you for the clarifications.
This discussion is touching on some of the core decisions made by the BIDS community. It would be great if some of the BIDS maintainers/steering could chip in as well @effigies @robertoostenveld.
I am suggesting to extend the sub- identifier type to be used to name anything in a BIDS dataset that has data about it. This is consistent with how sub- is used in BIDS.
If there was a sample derived from sub-1 that was subjected to microscopy then I would simply call it sub-2. The identity of the organismal subject and the sample subject are thus differentiated, without adding complexity to the model
We were advised by the BIDS steering group (@robertoostenveld) to not extend the definition of the `subject` entity. In that regard, the `subject` definition of "a person or animal participating in the study" seems important to the BIDS community to preserve consistency across modalities.
We’ve tried different configurations in the early development of the microscopy BEP and we agree that adding many different entities to describe different use cases adds undue complexity to the model. Therefore, we proposed the `sample` entity which would be used for the different specimen types you mentioned. Our idea is not to add an entity to every possible case but to use `sample` as the entity for different `sample_type` such as “whole organ”, “tissue”, “cells”, etc.
The advantages are that it retains the definition of `subject` without adding multiple layers of complexity to the scheme, and covers all atomic participant types.
The suggestion to use sub-pool01 will not work unless the meaning of sub- is extended in the way I propose, because the metadata requirements for pools cannot be enforced correctly unless there is a way to distinguish between collective and atomic participants independent of their identifier type prefix. Even if this is a 20% use case, it is one that must be considered when designing the 80% case because of the fundamental differences in what can be required in the metadata of atomic vs collective entities.
As far as we know, the current BIDS specification does not explicitly cover “collective” participants, hence the suggestion to name the `subject` with the pool name in the absence of standardization. Again, it may be out of scope for the current issue. With that being said, I will let the BIDS community chip in if there are plans for that in the future.
I strongly warn against allowing sample identifiers to be reused across subjects. We just made a change to prevent this in the SPARC data structure because of the numerous implementation, usability, and ambiguity issues that it causes. Sample IDs should at the very least be unique per dataset.
I should note that nesting files with sam-x in their name encounters the exact same issue as nesting folders and thus the problem remains. In theory this could be mitigated to some extent by forcing users to always include the subject id from which a sample was derived but as I mention above, creating composite primary keys from subject sample pairs is not a good idea. The problems that it causes down the line are simply not worth the trade-off of being able to detect that a sample file has been put in the wrong subject folder.
We understand your concerns in cases where the `subject` from whom the sample is derived would not be explicit. In this BEP, the derivation of a sample from a subject is enforced in the filename itself. An individual file will always have both the `subject_id` and the `sample_id` within its name. So the composite key of sub-sample is not only present in metadata but it corresponds directly to a unique filename (nested or not, misplaced or not).
From an experimental point of view, it also makes sense for people to name their samples the way they want for the same subject without having to take into account `sample_id` from previous acquisitions on other subjects. In addition, it is usual in BIDS to deal with the same key-value pair across a dataset with other entities such as `session`. Enforcing a unique `sample_id` would be an unusual requirement. Again, I would appreciate it if someone from the core BIDS could chip in (@effigies), as I am uncomfortable speaking as a spokesperson for BIDS strategic decisions.
Further, what happens in cases with multiply derived samples? sub-1-sam-1-sam-1?
This was addressed earlier in the thread where we suggested to add a `derived_from` column in the `samples.tsv` file. There is always only one instance of `sample-<label>` in the filename.
It is reasonable to use two different files to represent two different schema. However, the problem is that there are going to be more than two files in the future if the principle is "a new file for every new schema and/or every new participant type." [...] Some back of the napkin math suggests that creating a new table per participant type will wind up with BIDS eventually having well over a dozen different files one for each of the participant types that I enumerate.
As mentioned earlier, the addition of the `sample` entity and `samples.tsv` file would already cover new participant types (at least “atomic” ones), so we are not worried about dozens of files being added in the future.
Hi @jcohenadad, I'll leave some thoughts while awaiting responses from others. Since the discussion has strayed into cross BEP and core BIDS territory, this is understandable.
to not extend the definition of the subject entity
Absolutely. However, I wonder if that suggestion was made in a context where identifier type and conceptual type were conflated. Retaining the definition of subject while extending the scope of `sub-` should be possible by adding (initially) a separate definition for sample and a `subject type` or `participant type` column to participants.tsv. This is a unifying and regularizing generalization of the rather awkward `sample type` (I say this also having the same awkward `sample type` in the SDS schema that I maintain).
From an experimental point of view, it also makes sense for people to name their samples the way they want for the same subject without having to take into account `sample_id` from previous acquisitions on other subjects
However, from a data sharing point of view, they probably should take that into account. There are countless `sample-1`s in the world, and having to carry around a composite primary key of `dataset-id`, `subject-id`, and `sample-id` without any way to reduce them to a single unique identifier for the individual participant seems like it will induce complexity on any implementations of BIDS in the future. Furthermore, it complicates communication about samples because unique sample ids cannot be generated without the subject id to qualify them unless all communicating parties agree on the convention for converting composite primary keys into unique ids.
Relevant to a later point, the generalization of this reasoning is that `participant-1-participant-1-participant-1-participant-1` should be allowed as an identifier because each prior participant is distinct and it should be up to the experimenter how to identify their participants. Part of the current BEP tries to deal with this by forcing sample ids to be unique if they were derived from another sample, however the `isDerivedFrom` relationship applies with domain and range subject and sample in addition to sample and sample, so the lack of enforcement of unique identifiers for samples derived from subjects is inconsistent with respect to the `isDerivedFrom` relationship.
Enforcing a unique `sample_id` would be an unusual requirement.
But according to the proposal this is in fact already required for samples derived from other samples.
This was addressed earlier in the thread where we suggested to add a `derived_from` column in the `samples.tsv` file. There is always only one instance of `sample-<label>` in the filename.
There are many cases where samples and not subjects are shipped from one lab to another and then from the shipped samples further samples are derived. That is to say, there are labs for which someone else's sample is their subject.
If we were to apply the logic articulated above for subjects, the experimentalists should likewise not have to care about the fact they derived one sample from another, so long as they keep track of which sample they derived it from and thus that `sub-1-sam-1-sam-1` should be allowed (re: infinitely nested `participant-1`).
Requiring different practices for identifier generation due to an arbitrary distinction between subject and sample (is a cadaver a sample?) seems like a design flaw. The restriction that only sample ids must be unique and enforcing that only on derived samples but not on samples derived directly from subjects* would significantly complicate the underlying data model and ontology.
* This isn't actually the requirement, it is more that all transitively derived samples from the same subject have to have unique identifiers. This gets extremely messy if you start deriving samples from populations of subjects because now the samples probably have to be uniquely identified up to the population not the subject, so the generalization of the uniqueness would require further specification (and thus complexity) in the future to correctly deal with such cases.
there are presently 4 explicit generic levels over which the acquisition of "data" can be iterated. I won't summarize the definitions here, but they can be found on https://bids-specification.readthedocs.io/en/stable/02-common-principles.html
There are also multiple domain specific levels over which the acquisition of "data" can be iterated. For example over multiple voxels in fMRI, or multiple channels in EEG, or multiple timepoints (in either type of data). For MRI there can also be multiple echoes, or multiple contrast enhancing agents, or tracers.
The idea from @tgbugs to "extend the sub- identifier type to be used to name anything in a BIDS dataset that has data about it" leads to the question: why would you not extend the meaning of session, or scan, or run instead?
Or should one be allowed to do `sub-ses1scan1run1` and use only `sub-` for whatever thing that repeats? Changing the entities that represent iterations of data acquisition would be technically possible, but breaks the meaning of those entities and hence would better fit BIDS 2.x (considering semantic versioning, and hence 1.x and 2.x being incompatible).
it might be worth splitting this issue into two (or three)
`_sample-<label>` entity (IMHO quite straightforward)
The last two relate also to "stimuli BEP" wannabe issue (see e.g. https://github.com/bids-standard/bids-specification/issues/751#issuecomment-820800800 I also generalize "similarly") and IMHO orthogonal issues to the first one ("samples" entity) and interrelated within since with reordering you would get top level "
As for the last one -- we could gain "scans.json (#789) and even `sessions.tsv/.json` at top level could be useful (e.g. to provide characteristics for e.g. "preoperative", "postoperative" sessions etc) independent of top "iteration" level (currently fixed to `sub`).
@robertoostenveld thank you very much.
I think that BIDS 2.X is probably the right venue for my suggestions, given the constraints on 1.X. In that context I only have one suggestion for this thread, which is to require that sample identifiers be unique per dataset, not per subject.
why would you not extend the meaning of session, or scan, or run instead?
The only reason would be if there was a required metadata structure that was associated with some experimental process that could not be captured at one of those levels, or if there were more levels that were required. Otherwise the only reason would be because someone doesn't like the naming of the three levels.
In SPARC we have called the abstraction of those three into a single term `performance` or `protocol execution` variously. It corresponds to the performance of a protocol aka the carrying out of some experimental process. The distinction between session, scan, and run has to do with the particular nesting of repeated structure that is common to many MRI experiments, and which is shared with a variety of other modalities beyond MRI.
For the most part these don't need to be extended because they are distinct only in how they are named and in that they support 3 levels of repeated structure. There might be some experimental designs that need slightly more expressivity, or that might need/want to associate slightly different metadata with a particular repeated process, in which case the abstracted solution might help.
@yarikoptic I think the 3 can be broken up as you suggest, with a note that there is an interaction between `_sample-<label>` and `participants.tsv` depending on what uniqueness constraints are required.
I'm also in favor of addressing these issues step by step, this would fit the needs of development of BEP32 (which are strongly overlapping with the ones of BEP31, if not strictly identical)! and the first step (addition of the sample entity) will already allow us to move forward!
what's the next step?
I would think a PR for "addition of _sample-
I would also file a separate issue (or better even a PR) suggesting additional (RECOMMENDED) columns to participants.tsv/.json.
Thank you everyone for your feedback! As suggested by @yarikoptic and as discussed in today’s BEP031 meeting, we will move forward with separate PRs, starting with the “Addition of sample entity”.
Hi everyone! The first PR (#812) for the addition of the sample entity is now open.
Closing this since #816 is now merged
Context and motivation
Hi BIDS community!
As part of the development of the Microscopy BEP (BEP031), we want to add a new `sample` entity to BIDS. This `sample` entity was introduced in order to distinguish different tissue samples from the same subject.
The `sample` entity may also be used by the Animal Ephys BEP (BEP032 @SylvainTakerkart) and could benefit other modalities as well.
This issue aims to start a discussion about the details of the sample entity between the 2 BEP groups and with the BIDS community. It will also facilitate the breaking down of BEPs in smaller modules by adding the sample entity as a separate PR.
Definition of the sample entity
To ensure compatibility with other BIDS modalities, the `subject` entity should correspond to the participant (e.g. a human, a mouse, etc). To identify multiple tissue samples from the same subject, we define the sample entity in BEP031 as:
It is positioned after the optional `session` entity in the filename:
samples.tsv file
In BEP031, a samples.tsv file was added at the root of the dataset along with participants.tsv.
The samples.tsv file would have 2 required columns:
- `sample_id`: corresponding to `sample-<label>` of the filename
- `participant_id`: corresponding to `sub-<label>` of the filename
Another column `sample_type` was also suggested as required:
- `sample_type`: kind of sample from ENCODE BiosampleType
We should also discuss if (and how) we want to encode an additional identifier when a sample is derived from another sample (e.g., a slice is derived from a block of tissue).
participants.tsv file
As part of the subject vs. sample definitions, we would also like to add 2 columns to the participants.tsv file:
- `species`: string corresponding to the Binomial species name from NCBI Taxonomy, required when different from “Homo sapiens”. We think species should be in participants.tsv and not samples.tsv as it is an attribute of the subject and not the sample.
- `pathology`: required when different from “Healthy”. In that case, pathology could be in either participants.tsv or samples.tsv as appropriate (e.g. healthy and non-healthy biopsy samples from the same subject).
Examples
File hierarchy and naming:
participants.tsv:
samples.tsv: