Documenting datasets - Githubissues

lucasgautheron commented 3 years ago

Is your feature request related to a problem? Please describe.

Datasets always need to be documented. Documentation may include information about:

The authors
The data collection process (population, method, etc.)
The variables included in the data (table, name, description, values)
Any errors known by the authors

Ideally, some of the documentation should be machine-readable in order to improve discoverability. Machine-readability may also be exploited by DataLad's metadata extractors.

For instance, GIN uses the DataCite schema (written in YAML) to generate the DOI and the metadata associated with it: https://gin.g-node.org/G-Node/Info/wiki/DOIfile#creating-a-datacite-metadata-file.

The variables can also be documented using machine-readable formats. The most obvious candidates are CSV, YAML, or XML. However, it is likely that some of this information won't fit in rigid structures. For such information, we should probably encourage people to use formats such as Markdown rather than docx.

Describe the solution you'd like

README.md recommended
documentation folder for additional documentation
documentation/children.csv and documentation/recordings.csv for standardized documentation, with 4 columns:
- variable name
- variable description
- variable values
- variable scope (who has access to it)
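
As a rough illustration of how such a file could be exploited by a script, here is a minimal sketch that checks a documentation CSV for the four columns listed above. The header names (`variable`, `description`, `values`, `scope`), the function `check_documentation` and the sample rows are all hypothetical and only serve the example; none of this is an existing ChildProject or DataLad API.

```python
import csv
import io

# Hypothetical header names for the four proposed columns (an assumption,
# the proposal above does not fix the exact spelling of the headers).
EXPECTED_COLUMNS = ["variable", "description", "values", "scope"]


def check_documentation(stream):
    """Return a list of problems found in a variable-documentation CSV."""
    reader = csv.DictReader(stream)
    header = reader.fieldnames or []

    # Stop early if the file does not have the expected columns.
    missing = [c for c in EXPECTED_COLUMNS if c not in header]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]

    problems = []
    seen = set()
    # Data rows are counted from 1; the header row is not counted.
    for row_number, row in enumerate(reader, start=1):
        name = (row["variable"] or "").strip()
        if not name:
            problems.append(f"row {row_number}: empty variable name")
        elif name in seen:
            problems.append(f"row {row_number}: duplicate variable '{name}'")
        seen.add(name)
        if not (row["description"] or "").strip():
            problems.append(f"row {row_number}: '{name}' has no description")
    return problems


# Illustrative content for documentation/children.csv (made-up rows).
SAMPLE = """variable,description,values,scope
child_id,unique identifier of the child,string,public
child_dob,date of birth of the child,YYYY-MM-DD,confidential
"""

if __name__ == "__main__":
    for problem in check_documentation(io.StringIO(SAMPLE)):
        print(problem)
```

To check a real file, the same function could be called with `open("documentation/children.csv", newline="")` instead of the in-memory sample.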

alecristia commented 3 years ago

this sounds great, and I second the solution.

I also wonder whether we want to add something nobody includes but will be increasingly necessary, I think: the proof of ethical permission for the data collection & sharing, and a sample consent form. That will definitely not be machine-readable for now.

How about contact information for the authors? For EL1000, see table under this header. Author contact info should stay with the data, and can change too (e.g. if someone retires).

lucasgautheron commented 3 years ago

Regarding authorship, can we use this format? https://gin.g-node.org/G-Node/Info/wiki/DOIfile#creating-a-datacite-metadata-file

On GIN, once this file has been created, the information is shown at the bottom of the repository's main page; see here for instance: https://gin.g-node.org/LAAC-LSCP/managing-storing-sharing-paper

LAAC-LSCP / ChildProject

Documenting datasets #207