Is your feature request related to a problem? Please describe.
Sometimes, access to parts of the metadata can be restricted to certain users. These limitations can be complex, as there can be more than two access tiers.
These access limitations are implemented by splitting the metadata across several files with different access rules.
However, this poses a number of problems:
Validation tests only run on recordings.csv and children.csv. Custom metadata files can't be tested, because the package has no way of knowing where to find them.
Users must modify or fine-tune their analyses to make them work with such datasets, depending on the arbitrary file layout chosen by the dataset owners.
Describe the solution you'd like
This needs more thought, but ideally the solution should allow all of the following:
If some metadata fields are located in dataframes other than the defaults (i.e. recordings.csv and children.csv), then they should be indexed somewhere, so that we get a mapping from (table, column) -> filename, e.g.:
(children, languages) -> confidential/children.csv.
The python package should automatically gather these fields when loading a project (if they are available locally), and run the validation tests on the metadata once everything has been merged.
We should allow some overall description of the content of private metadata at the highest level. This can help describe the content of the dataset, without revealing identifying information.
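To make the first two points concrete, here is a minimal sketch of what the package could do when loading a project, assuming pandas. `FIELD_INDEX`, `load_table`, and the `child_id` key column are hypothetical names for illustration, not the package's actual API:

```python
from pathlib import Path

import pandas as pd

# Hypothetical index: (table, column) -> file holding that column.
FIELD_INDEX = {
    ("children", "languages"): "confidential/children.csv",
}

def load_table(root: Path, table: str, key: str) -> pd.DataFrame:
    """Load the default table, then pull in every indexed field whose
    file is available locally, so validation can run on the merged result."""
    df = pd.read_csv(root / "metadata" / f"{table}.csv")
    for (tbl, column), filename in FIELD_INDEX.items():
        if tbl != table or not (root / filename).exists():
            continue  # field belongs to another table, or the user lacks access
        extra = pd.read_csv(root / filename)[[key, column]]
        # drop any placeholder column before merging the restricted values in
        df = df.drop(columns=[column], errors="ignore").merge(extra, on=key, how="left")
    return df
```

Fields whose files are not available to the current user are simply skipped, so the same loading code works across access tiers.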
Zero metadata approach
We could merge all the files automatically, e.g. merge metadata/recordings.csv with everything inside of metadata/recordings/* (same for children.csv). I think this is my favorite approach, but then we need a rule to decide which file takes priority in case of conflicting columns; alphabetical order could work. I have started implementing this.
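A sketch of this zero-metadata merge, assuming pandas and using alphabetical order so that the last file alphabetically wins on conflicting columns (the opposite convention would work just as well); `merge_table` and the `key` parameter are hypothetical names:

```python
from pathlib import Path

import pandas as pd

def merge_table(metadata_path: Path, table: str, key: str) -> pd.DataFrame:
    """Merge metadata/<table>.csv with every CSV under metadata/<table>/,
    visiting files in alphabetical order so later files win on conflicts."""
    merged = pd.read_csv(metadata_path / f"{table}.csv")
    extra_dir = metadata_path / table
    if extra_dir.is_dir():
        for csv in sorted(extra_dir.glob("*.csv")):  # alphabetical priority
            extra = pd.read_csv(csv)
            # columns present in both: the later file's values replace the old ones
            conflicts = [c for c in extra.columns if c in merged.columns and c != key]
            merged = merged.drop(columns=conflicts).merge(extra, on=key, how="left")
    return merged
```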
Flexible approach
Maybe some file metadata/description.csv with the following structure:
| table    | field     | file                      | priority | description        | values         |
|----------|-----------|---------------------------|----------|--------------------|----------------|
| children | languages | confidential/children.csv | 0        | children languages | english,french |
The priority value can be set so that if several files may contain the same field, the one with the highest priority that is available is chosen. This is useful, for instance, when fake dates are provided to most users, but we still want to preserve the correct dates somewhere.
From this, it is easy to merge all available dataframes dynamically (which the package would do by itself, for those who are interested in using our python API)
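The priority rule could be resolved with something like the sketch below, assuming pandas and a description.csv already loaded as a dataframe; `resolve_fields` is a hypothetical name:

```python
from pathlib import Path

import pandas as pd

def resolve_fields(description: pd.DataFrame, root: Path) -> dict:
    """Map each documented (table, field) to the available file with the
    highest priority, skipping files the current user cannot read."""
    chosen = {}  # (table, field) -> (file, priority)
    for row in description.itertuples():
        if not (root / row.file).exists():
            continue  # this access tier is not available locally
        key = (row.table, row.field)
        if key not in chosen or row.priority > chosen[key][1]:
            chosen[key] = (row.file, row.priority)
    return {k: file for k, (file, _) in chosen.items()}
```

With the resulting mapping, the package can merge the selected files into the default tables before running validation.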
Only non-standard fields should be documented in metadata/description.csv. Note that this can also solve the problem of non-standard fields in general, which require some documentation.
Simpler approach
Maybe some file metadata/description.csv with the following structure:
What do you think?