SylvainTakerkart opened this issue 4 years ago
We can of course (and should) get inspiration from how BIDS does things for human neuroimaging. Here is an example from the BIDS main webpage (on the right-hand side): https://bids.neuroimaging.io/assets/img/dicom-reorganization-transparent-white_1000x477.png . But we obviously might deviate from this!
I would like to suggest the following hierarchy: Project_name/Dataset_name/subject_name/date_time_session/raw (versus /derived). Within such a folder, all files associated with the same session could have the same ID, which makes it easy to automate the retrieval of files.
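To make the retrieval point concrete, here is a minimal sketch (all names below are hypothetical illustrations, not part of the proposal):

```python
from pathlib import Path

# Hypothetical instance of the proposed hierarchy.
session_dir = Path("ProjectA/Dataset1/subject01/20200131_1430/raw")

# If every file from one session carries the same session ID,
# retrieval can be automated with a single glob:
session_id = "subject01-20200131_1430"
session_files = sorted(session_dir.glob(f"{session_id}_*"))
# e.g. subject01-20200131_1430_ephys.dat, subject01-20200131_1430_behavior.csv
```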
Learning from the BIDS issues so as not to repeat them would be good as well. There is a project to have a JSON schema describing BIDS that I think should be used as a model, to avoid the inconsistencies that we have in BIDS. This PR is pointing to the JSON schema code in BIDS.
I would like to suggest the following hierarchy: Project_name/Dataset_name/subject_name/date_time_session/raw (versus /derived). Within such a folder, all files associated with the same session could have the same ID, which makes it easy to automate the retrieval of files.
thanks for this suggestion!
an interesting (and necessary) topic we'll have to discuss will be coming to an agreement on what a "project" is, what a "dataset" is, etc.; from all our discussions with the experimenters (mostly electrophysiologists working on non-human primates and rodents) within our institute, this was not as simple as one could expect ;)
Yes, I can understand this; we've also had our discussions about this. So, a bit of explanation. For us, a Project is like the introduction to an article: it covers the background of your research, affiliations, authors, hypotheses, and related work. Within a project you might run different experiments, using different methods, with different sets of animals. So here we think you should define different datasets. A dataset prescribes how data should be preprocessed, which is determined by which subjects were included and what methods were used; this should be consistent within a dataset. Below this level there could be more levels, but at the least I think there should be a Subject level and a Date_time_session level.
I've included a distinction between raw and derived because, when you share data, you could decide to share only raw or only derived data, and having this separation at the base of the structure would make that easier. (I think our aim should be to make data more accessible and shareable.)
thanks for your explanations!
Project_name /Dataset_name /subject_name /date_time_session /raw (versus /derived)
We're actually very close to what you're suggesting in what we're locally trying to implement... Here is what we're suggesting for now:
exp-NAME/sub-GUID/ses-YYYYMMDD_XXX_BBBB/rawdata
exp-NAME/sub-GUID/ses-YYYYMMDD_XXX_BBBB/derivatives
exp-NAME/sub-GUID/ses-YYYYMMDD_XXX_BBBB/metadata
Interestingly, we did not use the concepts of Project and Dataset, but we used the concept of Experiment, with one less level of hierarchy compared to what you describe... Our specs are here: https://int-nit.github.io/AnDOChecker/ . And we have the equivalent of the BIDS validator running for these specs: https://andocheck.int.univ-amu.fr/ . But we still consider this to be under development (we only have a few beta testers internally for now...), so we're totally up for merging / fusing / adapting / tweaking, which is what we hope can happen with this discussion!!!
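For illustration, a toy check of the ses- naming pattern above (the regex is inferred from the example given here, not taken from the actual AnDOChecker rules):

```python
import re

# Pattern inferred from ses-YYYYMMDD_XXX_BBBB; the real AnDO spec may differ.
SES_RE = re.compile(r"^ses-\d{8}_[^_/]+_[^_/]+$")

for name in ["ses-20200131_001_B042", "ses-2020_1_31"]:
    print(name, "->", bool(SES_RE.match(name)))
# ses-20200131_001_B042 -> True
# ses-2020_1_31 -> False
```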
Some details:
pinging @satra @yarikoptic @bendichter: you guys made the choice to go flatter (instead of using several levels of hierarchy, as in BIDS and in the two aforementioned suggestions) with dandi, right? could you quickly summarize why?
pinging @tgbugs; could you also tell us what you came up with for sparc?
pinging @lepmik; also, how do you think such a directory structure would interact with your exdir?
@SylvainTakerkart NWB files have internal structure that accommodates raw data, derived data, and metadata all in one file. The original DANDI layout deferred to NWB files for within-session data organization, and the file organization only handles the super-session info: dandiset and subject. That said, we have been talking on the DANDI team about separating raw from derived data as you have here, and we can discuss changing our format to match yours in that aspect.
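For readers less familiar with NWB, a minimal sketch of that internal split (the file name below is made up):

```python
from pynwb import NWBHDF5IO

# One NWB file holds raw and derived data together: 'acquisition' stores
# raw streams, 'processing' stores derived results (file name hypothetical).
with NWBHDF5IO("sub-01_ses-20200131.nwb", "r") as io:
    nwbfile = io.read()
    print(list(nwbfile.acquisition))  # raw data objects
    print(list(nwbfile.processing))   # derived/processed modules
```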
@SylvainTakerkart - thanks for the ping. here are a few thoughts on the issue.
in bids i made a suggestion to consider individual-level data vs data that is generated from aggregating information across individuals (https://github.com/bids-standard/bids-2-devel/issues/43). it makes for a cleaner and more objective separation of the information contained in a folder. with the growing hardware ecosystem in neurophys for freely behaving experiments (openephys + bonsai), we are also likely to see data that involves multiple interacting participants. we would still be able to organize information from the point of view of a participant, but that may include data about other participants.
the words rawdata, derivatives, and metadata are often linked quite directly to acquisition instruments, experimental techniques, pipelines, and implementation strategies. what is a derivative today could become raw tomorrow, as instruments integrate more processing into them. i would suggest avoiding that nomenclature if possible. this was also another reason for the participant level organization.
re: flattening in DANDI: the primary reason is the use of NWB, which is itself an organized store of information and metadata and contains most of the details within one construct. we did not want to replicate it outside, and we limit any replication to the kinds of information that a neurophys researcher may typically want from perusing the filename.
another consideration in DANDI is a world of objects where the data never hits the filesystem, or where the filesystem is an object store. this is increasingly true for larger data, with APIs providing access to the information. in such a mode, organization at the folder level may be a short-term consideration for people working on laptops/desktops with local storage and in more traditional HPC settings. given some of the data coming into DANDI, data transfer in general would be too expensive (from a time and perhaps a cost perspective); therefore, accessing pieces of information as necessary for computation is where we are likely headed. this doesn't mean you cannot store information in a filesystem, but we are going to move a little closer to our API for our data search clients and use datalad as the filesystem model. if you consider pybids, the first thing it does is generate a database index. but unlike bids, where most datasets are a few GB at most, neurophys data is being generated at much larger sizes, easily hitting TBs in many situations.
our current thinking in dandi is driven to support a changing landscape in neurophysiology over the next 5-10 years, a bit more than what people have been doing traditionally. we have been asked multiple times to store hundreds of TBs for single datasets. that's a scale at which most people will not be looking at filesystem organization. we can always provide a view of the underlying data through a metadata remapping model. i personally consider bids to be a practical and efficient view of a more complex information model (one that we are capturing in a more structured manner in NIDM, for example).
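as a side note, a toy version of the kind of index pybids builds first (illustrative Python, not pybids itself):

```python
import re
from collections import defaultdict
from pathlib import Path

def build_index(root):
    """Map (subject, session) -> list of files, so repeated queries
    don't re-walk the filesystem (toy sketch, not pybids)."""
    index = defaultdict(list)
    for f in Path(root).rglob("*"):
        if not f.is_file():
            continue
        sub = re.search(r"sub-([A-Za-z0-9]+)", f.as_posix())
        ses = re.search(r"ses-([A-Za-z0-9]+)", f.as_posix())
        index[(sub.group(1) if sub else None,
               ses.group(1) if ses else None)].append(f)
    return index
```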
FWIW, on the aspect of sub-GUID/ses-YYYYMMDD_XXX_BBBB/rawdata, which exemplifies the suggestion that the filename itself should not reflect the metadata encoded in the upper directory names: I originally also argued for BIDS not to duplicate sub- and ses- in the filename, since they are already present in the directory names. The main argument, IIRC, for keeping them in the filename as well was to make file names unambiguous across subjects/sessions, so they can be copied/shared (with collaborators) etc. without causing confusion. Moreover, pragmatically, many tools display only the filename, not the full path to the file, in their file listings. So when I load multiple files in some viewer, I can tell from the filename alone which subject they belong to, instead of wondering which of 10 "rawdata" files this one came from. In summary: I found it a bit more cumbersome, but useful, to have sub- and ses- in the filenames as well.
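A small illustration of that last point (hypothetical filename): even when a tool shows only the basename, the entities remain recoverable:

```python
from pathlib import Path

# Hypothetical path following the sub-/ses- convention.
f = Path("exp-NAME/sub-01/ses-20200131/rawdata/sub-01_ses-20200131_ephys.nix")

# A viewer that shows only f.name still reveals subject and session:
entities = dict(part.split("-", 1) for part in f.stem.split("_")[:2])
print(entities)  # {'sub': '01', 'ses': '20200131'}
```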
Some examples of the hierarchy we came up with are in this folder https://github.com/SciCrunch/sparc-curation/tree/master/resources/DatasetTemplate/primary.
At the top level we have 3 folders for different stages of data processing: source, primary, and derivative.
Within primary, the folder-level entities that we identified are pool, subject, sample, and performance.
There is a nasty issue with splitting subjects and samples: they really occupy the same location in the data structure, and they are overly narrow in that they are not sufficient to capture higher levels of organization such as populations or subject groups. Therefore I suggest using specimen as a top-level entity, which can capture populations, subjects, samples, etc. This leads to sparse tables, but it vastly simplifies the data model and the implementation of the validator code (the subjects/samples split is a pain to implement and maintain). I also suggest this because sample type is nearly always a required field, and thus can be used to distinguish not only sample types but also individual subjects, populations, etc. without further complicating the model. Happy to discuss this at length.
We don't have any requirement for the folder naming conventions aside from the fact that they should not have spaces in them. We also suggest not doing unfriendly things like giving a subject an identifier that evokes a sample: e.g., heart-2 is a very bad identifier for a mouse. We map the type of entity a folder represents based on the metadata sheets, not on the prefix, with an exception for performance, where perf-X is a required prefix (pool- might become required in the future; we don't have many examples of datasets that have used pools yet). We will likely continue to revisit these design decisions. One note is that many groups have simply followed the convention of using sub- and sam- as prefixes. One other issue we have encountered is that researchers often reuse sample ids across subjects (which should be impossible; if it is pooling, then pool- should be used, not subject), so we have to construct a primary key from subject id + sample id. Sometimes investigators do that themselves, so you wind up with keys that look like sub-x_sub-x-sam-y. There isn't an easy way around this unless you can get a validator deployed locally in labs prior to data submission.
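A minimal sketch of that composite-key construction (the helper and its normalization are illustrative; the exact SPARC rules may differ):

```python
def primary_key(subject_id: str, sample_id: str) -> str:
    """Qualify a sample id by its subject id, since sample ids
    alone may be reused across subjects (toy sketch)."""
    sub = subject_id.removeprefix("sub-")  # tolerate pre-prefixed input
    sam = sample_id.removeprefix("sam-")   # to avoid sub-x_sub-x-sam-y keys
    return f"sub-{sub}_sam-{sam}"

print(primary_key("x", "y"))          # sub-x_sam-y
print(primary_key("sub-x", "sam-y"))  # sub-x_sam-y (idempotent on prefixes)
```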
With some colleagues, we are working on a folder template for projects inside the gin-tonic project, and we are working on our first report. One interesting thing here is that we got similar feedback: one project is made up of different experiments. We plan on creating a different data folder for each experiment/dataset. We did not go into the details of defining one experiment (yet), nor into the sub-structure of the data folder. How an experiment will be defined remains open (example cases: same method but different animals; same animal but a different method; same method and same animal but re-tested at a different age; different animals and different methods but made for the same purpose; ...).
I like the Source, Primary, Derivative distinction. I have heard that this is a problem with BIDS: researchers are supposed to save and archive the source data, but the BIDS-formatted data is what would go into the Primary folder here (and BIDS has a raw/derived data distinction...). Derivative would be everything that can be trashed, because one could reproduce it, right? But this would mean that this distinction only makes sense at the higher-order folders, and that this project would define the structure of the Primary folder. It would also mean that one can use the Source folder to feed in data in whatever way the researcher sees fit, and have a data curation (manual or automatic) process to change the source data into the form we want in the Primary folder. In terms of data management, Source is archived, Primary is the FAIR (open) version of the data, and Derivative is trashed upon project completion.
(Note that what I mean by derivative here would still be data files, like pivot tables, summaries, or similar files. Figures and analyses should not be saved with the data.)
I still think that for animal studies, having the subject as a high-level folder makes little sense. As noted above by @tgbugs, one file often has data from multiple subjects; animals are often tested in groups (especially in non-rodent research), or data from different subjects are recorded in the same spreadsheet/video. I do not see any data-management reason to split data by animal/subject. I would be curious to know why BIDS put subject at the highest level; I would guess it is because people were used to emphasizing the human subject in early studies? (Which would argue against this structure for animal research.)
A basic tip in file naming is to have unique file names: the location where a file is stored should not be necessary to derive information about the file. An exception can be made for readme files.
If GUID means this: https://de.wikipedia.org/wiki/Globally_Unique_Identifier, then it would make the name of the file too long. An internally used, shorter ID (RFID chip number, animal ID used in the animal facility, ...) would make more sense. As long as there is metadata describing the ID used, we should be safe, while keeping some human readability.
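To make the length argument concrete (the short id below is invented):

```python
import uuid

print(len(str(uuid.uuid4())))  # 36 characters: unwieldy inside a filename
print(len("rat0042"))          # 7 characters: a short facility id stays readable
```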
The same survey about gin tonic made us think that researchers tend to prefer flat structures.
In another project, I also used a strategy where the metadata indicates the file path, so that the directory structure is irrelevant for the data analysis (the code reads the metadata to access the data anyway). It is a pretty simple way to analyze data from different sources without having to move or rename files manually...
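A minimal sketch of that strategy, assuming a hypothetical metadata table with 'subject' and 'path' columns:

```python
import csv
from pathlib import Path

def files_for_subject(metadata_csv, subject):
    """Yield data file paths for one subject; the analysis code only
    reads the metadata table, so the on-disk layout does not matter."""
    with open(metadata_csv, newline="") as fh:
        for row in csv.DictReader(fh):
            if row["subject"] == subject:
                yield Path(row["path"])
```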
I have been implementing validation of the SPARC Data Structure directory structure and have some further notes as a result of the process. If the sub- and sam- prefixes are not required in identifiers, then it is not possible to statically verify that the structure of the hierarchy is putatively correct without also having the specimen metadata files on hand with the types for those identifiers.
@SylvainTakerkart - thanks for pushing this along. in addition to @tgbugs comments, we (in DANDI) have been working on the data/information model representing objects of interest (datasets, individual objects), with serialization to disk as a transform of that model. part of this consideration is driven by the sizes of individual data files (in the TBs range and growing) that we are expecting over the next year, and our changing needs to support on-the-fly access to data or pieces of data over a network call. we are still tweaking the model, which would be serializable to disk, and will be releasing an updated API server by the end of the year together with datalad datasets. so any discussion of serialization is indeed a good thing to continue, but i wanted to put the object + metadata access consideration in play as well.
Are there example shared datasets to learn from? For example, if you look at the already shared dataset: what is good about it, what would you do differently?
Or imagine an already shared dataset: how would it look if you were to reorganize it into the newly proposed structure? Would all data and metadata find a logical place in the new structure? Is there (meta)data that does not fit, or is there (meta)data missing from the already shared dataset that you consider crucial?
Hi, a few thoughts on the discussion:
On a practical note, maybe we should decouple the discussion of directory organization from the discussion of file naming: decide on directory organization first, and file naming next? There may be overlap, but it may help simplify the decision-making process. Later we might try to enumerate the options under each to make any voting easier, e.g. subject vs session first, etc.
Completely agree with the principle of separation into raw data and derived/processed data for reasons of curation as well as sharing.
The discussion, though informed by a great deal of collective experience, may be becoming too abstract at this stage. There is a risk that, if we don't capture a wide enough range of the most frequent neuroscience use-cases at the outset, converging on a common format for the range of different experimental designs and modalities may become overly time-consuming, with many iterations.
To make the discussion more concrete, following on from @robertoostenveld's suggestion, we might collect a range of existing datasets representing the most common use-cases we expect the format to cope with, to see how they might fit any scheme we come up with. The Buzsaki lab, for example, has for many years shared their rodent hippocampal recording datasets, many of which are available at crcns.org. In the process they have created a custom (session-based) data structure for their increasingly complicated combined ephys and behavioural experiments (see https://buzsakilab.com/wp/data-structure-and-format/). So perhaps we could consider using one or more of their datasets as a use-case for rodent in vivo behavioural recording datasets, and add a few others from other domains, such as VSD, to encompass more of the general problem we are attempting to solve?
Regarding 3 and 4: you may want to consider the https://en.wikipedia.org/wiki/Pareto_principle which states (or claims) that about 80% of the value comes from 20% of the cases, or 80% of the revenue comes from 20% of the customers, or 20% of goods in a shop make up 80% of the sales, etc.
So rather than making a very broad overview of all possible cases, which will subsequently be very difficult to work with, you could try to identify the 20% of cases that represent the most value. I.e., rather than investing in dealing with (relatively) exceptional cases, first invest in the bulk.
When it comes to the number of use-cases, I suggested one or more from rat ephys in vivo and a few from other domains. So, to keep things manageable, a total of around five or six use-cases might be enough to start with, and each could be selected as the most representative use-case by those with the most knowledge of each domain.
Are there example shared datasets to learn from? For example, if you look at the already shared dataset: what is good about it, what would you do differently?
@robertoostenveld - didn't see this earlier. you can look at the datasets on https://dandiarchive.org (there are about 38 dandisets covering 4 species, intra/extracellular and optical recordings)
Are there example shared datasets to learn from? For example, if you look at the already shared dataset: what is good about it, what would you do differently?
Yes, this is an important question!!
This dataset (utah array ephys, several sessions, several animals, exhaustive metadata) could also be useful: https://doi.org/10.1038/sdata.2018.55 https://doi.gin.g-node.org/10.12751/g-node.f83565/
Are there example shared datasets to learn from? For example, if you look at the already shared dataset: what is good about it, what would you do differently?
actually, I've opened a new thread to centralize a list of potentially useful datasets: https://github.com/INCF/neuroscience-data-structure/issues/7
This issue is where we'll discuss the directory structure that will contain the data and metadata, the organization of the directory and sub-directories, the number of levels in the hierarchy of the directories, the naming of the directories and sub-directories, the naming of the files...
The different elements that might be included in this hierarchy are: the experiment itself, the subject, the recording session etc.