Is your feature request related to a problem? Please describe.
Sometimes, access to parts of the metadata can be restricted to certain users. These limitations can be complex, as there can be more than two access tiers.
These access limitations are implemented by splitting the metadata across several files with different access rules.
However, this poses a number of problems:
Validation tests only run on recordings.csv and children.csv. Custom metadata files can't be tested, because the package has no way of knowing where to find them.
Users must modify or fine-tune their analyses to make them work with such datasets, depending on the arbitrary file layout chosen by the dataset owners.
Describe the solution you'd like
This needs more thought, but ideally the solution should allow all of the following:
If some metadata fields are located in dataframes other than the defaults (i.e. recordings.csv and children.csv), then they should be indexed somewhere, so that we get a mapping from (table, column) -> filename, e.g.:
(children, languages) -> confidential/children.csv.
The python package should automatically gather these fields when loading a project (if they are available locally), and run the validation tests on the metadata once everything has been merged.
We should allow some overall description of the content of private metadata at the highest level. This can help describe the content of the dataset, without revealing identifying information.
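To make the first two points concrete, here is a minimal sketch of what the package could do when loading a project, assuming pandas. `FIELD_INDEX`, `load_table`, and the `child_id` key column are hypothetical names for illustration, not the package's actual API:

```python
from pathlib import Path

import pandas as pd

# Hypothetical index: (table, column) -> file holding that column.
FIELD_INDEX = {
    ("children", "languages"): "confidential/children.csv",
}

def load_table(root: Path, table: str, key: str) -> pd.DataFrame:
    """Load the default table, then pull in every indexed field whose
    file is available locally, so validation can run on the merged result."""
    df = pd.read_csv(root / "metadata" / f"{table}.csv")
    for (tbl, column), filename in FIELD_INDEX.items():
        if tbl != table or not (root / filename).exists():
            continue  # field belongs to another table, or the user lacks access
        extra = pd.read_csv(root / filename)[[key, column]]
        # drop any placeholder column before merging the restricted values in
        df = df.drop(columns=[column], errors="ignore").merge(extra, on=key, how="left")
    return df
```

Fields whose files are not available to the current user are simply skipped, so the same loading code works across access tiers.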
Zero metadata approach
We could merge all the files automatically, e.g. merge metadata/recordings.csv with everything inside of metadata/recordings/* (same for children.csv). I think this is my favorite approach, but then we need a rule to decide which file takes priority in case of conflicting columns; alphabetical order could work. I have started implementing this.
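A sketch of this zero-metadata merge, assuming pandas and using alphabetical order so that the last file alphabetically wins on conflicting columns (the opposite convention would work just as well); `merge_table` and the `key` parameter are hypothetical names:

```python
from pathlib import Path

import pandas as pd

def merge_table(metadata_path: Path, table: str, key: str) -> pd.DataFrame:
    """Merge metadata/<table>.csv with every CSV under metadata/<table>/,
    visiting files in alphabetical order so later files win on conflicts."""
    merged = pd.read_csv(metadata_path / f"{table}.csv")
    extra_dir = metadata_path / table
    if extra_dir.is_dir():
        for csv in sorted(extra_dir.glob("*.csv")):  # alphabetical priority
            extra = pd.read_csv(csv)
            # columns present in both: the later file's values replace the old ones
            conflicts = [c for c in extra.columns if c in merged.columns and c != key]
            merged = merged.drop(columns=conflicts).merge(extra, on=key, how="left")
    return merged
```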
Flexible approach
Maybe some file metadata/description.csv with the following structure:
| table    | field     | file                      | priority | description        | values         |
|----------|-----------|---------------------------|----------|--------------------|----------------|
| children | languages | confidential/children.csv | 0        | children languages | english,french |
The priority value can be set so that if several files may contain the same field, the one with the highest priority that is available is chosen. This is useful, for instance, when fake dates are provided to most users, but we still want to preserve the correct dates somewhere.
From this, it is easy to merge all available dataframes dynamically (which the package would do by itself, for those who are interested in using our python API)
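The priority rule could be resolved with something like the sketch below, assuming pandas and a description.csv already loaded as a dataframe; `resolve_fields` is a hypothetical name:

```python
from pathlib import Path

import pandas as pd

def resolve_fields(description: pd.DataFrame, root: Path) -> dict:
    """Map each documented (table, field) to the available file with the
    highest priority, skipping files the current user cannot read."""
    chosen = {}  # (table, field) -> (file, priority)
    for row in description.itertuples():
        if not (root / row.file).exists():
            continue  # this access tier is not available locally
        key = (row.table, row.field)
        if key not in chosen or row.priority > chosen[key][1]:
            chosen[key] = (row.file, row.priority)
    return {k: file for k, (file, _) in chosen.items()}
```

With the resulting mapping, the package can merge the selected files into the default tables before running validation.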
Only non-standard fields should be documented in metadata/description.csv. Note that this can also solve the problem of non-standard fields in general, which require some documentation.
Simpler approach
Maybe some file metadata/description.csv with the following structure:
What do you think?