centre-for-humanities-computing / danish-foundation-models

A project for training foundational Danish language models
https://foundationmodels.dk
MIT License

Add datasheets #265

Open KennethEnevoldsen opened 5 months ago

KennethEnevoldsen commented 5 months ago

Agreed with @peterbjorgensen to add datasheets to our datasets.

@jankounchained, will you add one for NCC? @TTTTao725, will you add one for the datasets you created? @peterbjorgensen, we can discuss it next time you are in.

For now, feel free to keep them minimal (we can always expand them later). Is there anything we feel the datasheet should contain at a minimum?

It might be useful to look at:

T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford. Datasheets for datasets. arXiv preprint arXiv:1803.09010, 2018.

previous datasheets: https://github.com/centre-for-humanities-computing/danish-foundation-models/tree/main/docs/datasheets

peterbjorgensen commented 5 months ago

As I wrote in another issue:

For each dataset we should have a dataset card or datasheet in the same style as Hugging Face dataset cards (https://huggingface.co/docs/hub/datasets-cards). I prefer to keep the dataset cards on GitHub so that we can track changes. The filename of the datasheet should match the "source" identifier, i.e. {source}.md. The data card starts with a YAML header to make it machine readable, followed by descriptions in markdown. I can see this becoming a problem if a dataset contains sub-sources with different licenses, for example. In that case the license field in the YAML should be a dictionary that maps each sub-source to its specific license.
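To make the proposed layout concrete, here is a minimal sketch of how a `{source}.md` datasheet could be located and split into its YAML header and markdown body. It assumes the header is delimited by `---` lines, as in Hugging Face dataset cards, and that datasheets live under `docs/datasheets`. The function names, sub-source names, and example licenses are hypothetical, and the header is returned as raw text rather than parsed.

```python
from pathlib import Path


def datasheet_path(source: str, root: str = "docs/datasheets") -> Path:
    """Return the expected datasheet location for a source identifier."""
    return Path(root) / f"{source}.md"


def split_datasheet(text: str) -> tuple[str, str]:
    """Split a datasheet into (raw YAML header, markdown body).

    Assumes the header is delimited by lines containing only '---'.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("datasheet must start with a '---' YAML header")
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            header = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:]).strip()
            return header, body
    raise ValueError("YAML header is not closed with '---'")


# Hypothetical datasheet with a per-sub-source license dictionary.
example = """---
license:
  subsource_a: cc0-1.0
  subsource_b: cc-by-4.0
---
# Example dataset

Free-text description in markdown.
"""

header, body = split_datasheet(example)
```

The raw header string could then be handed to any YAML parser; the split itself only needs the standard library.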

Alternatively, the license field could be a keyword, e.g. multiple, and then we add a "license" field to the "metadata" dictionary of each document. I think I prefer the YAML dictionary approach, because the idea is that the datasheets make it possible to select datasets based on the metadata without reading through the actual data first.
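As a rough illustration of why the dictionary approach helps with selection, the sketch below filters sources by license using only the already-parsed headers, without touching the documents themselves. The source names, sub-source names, licenses, and the `"*"` marker for a dataset-wide license are all assumptions made for this example.

```python
# Parsed YAML headers keyed by source identifier (all values hypothetical).
datasheets = {
    "source_a": {"license": "cc0-1.0"},
    "source_b": {"license": {"sub_1": "cc0-1.0", "sub_2": "cc-by-nc-4.0"}},
}


def licenses_of(header: dict) -> dict:
    """Normalize the license field to a {sub_source: license} mapping."""
    lic = header["license"]
    if isinstance(lic, dict):
        return dict(lic)
    # "*" is a made-up marker for a license covering the whole dataset.
    return {"*": lic}


def select(datasheets: dict, allowed: set) -> list:
    """Return (source, sub_source) pairs whose license is in `allowed`."""
    return [
        (source, sub)
        for source, header in datasheets.items()
        for sub, lic in licenses_of(header).items()
        if lic in allowed
    ]


selected = select(datasheets, {"cc0-1.0"})
```

With the keyword approach, the same selection would require reading the "metadata" dictionary of every document in `source_b`.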

License should also be a required field in my opinion.
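If the license does become a required field, a small check along these lines could run over the parsed headers, for instance in CI. This is only a sketch under the assumption that a license is either a non-empty string or a non-empty dictionary of sub-source licenses; the function name and error messages are hypothetical.

```python
def validate_license(source: str, header: dict) -> list:
    """Return a list of error messages; an empty list means the header passes."""
    errors = []
    lic = header.get("license")
    if lic is None or lic == "" or lic == {}:
        errors.append(f"{source}: missing required field 'license'")
    elif isinstance(lic, dict):
        # Dictionary form: every sub-source needs a non-empty license string.
        for sub, value in lic.items():
            if not isinstance(value, str) or not value:
                errors.append(f"{source}: sub-source '{sub}' has no license")
    elif not isinstance(lic, str):
        errors.append(f"{source}: 'license' must be a string or a dictionary")
    return errors


# Hypothetical headers for illustration.
ok = validate_license("source_a", {"license": {"sub_1": "cc0-1.0"}})
missing = validate_license("source_b", {"language": "da"})
partial = validate_license("source_c", {"license": {"sub_1": "cc0-1.0", "sub_2": ""}})
```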