bigscience-workshop / metadata

Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.
Apache License 2.0

Add example that builds a dataset #146

Closed SaulLu closed 2 years ago

SaulLu commented 2 years ago

I thought it would be useful to add an example to the repository showing how to test the rendering of the final dataset locally (i.e. using the latest version of the dataset) with the chosen parameters.

Currently this script does not work because the metadata is not stored in the metadata column but in dedicated columns, one per metadata type. In other words, these scripts can serve as a playground for whoever takes care of adapting the scripts to the new dataset format :slightly_smiling_face:
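To make the format mismatch concrete, here is a minimal sketch of one way to adapt an example: gather the per-type columns back into a single `metadata` list. The column names (`metadata_url`, `metadata_timestamp`, `metadata_html`) and the helper `merge_metadata_columns` are hypothetical and only illustrate the idea; the real dataset's columns may be named and structured differently.

```python
from typing import Any, Dict, List

# Hypothetical column names (assumption); the real dataset's columns may differ.
METADATA_COLUMNS = ["metadata_url", "metadata_timestamp", "metadata_html"]


def merge_metadata_columns(example: Dict[str, Any]) -> Dict[str, Any]:
    """Gather the per-type metadata columns into a single `metadata` list."""
    merged: List[Any] = []
    for column in METADATA_COLUMNS:
        # A missing or empty (None) column contributes nothing.
        merged.extend(example.get(column) or [])
    # Keep every non-metadata field (e.g. the text) unchanged.
    out = {k: v for k, v in example.items() if k not in METADATA_COLUMNS}
    out["metadata"] = merged
    return out
```

If the dataset is loaded with the Hugging Face `datasets` library, a function of this shape could presumably be applied with `Dataset.map`, but that has not been checked against the current dataset.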

On my side, I used these scripts to check that we can load the current dataset.

A possible improvement to this script would be to add a parameter to save the dataset.
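A rough sketch of what such a parameter could look like, assuming a plain `argparse` entry point (the scripts in this PR may be configured differently). The flag name `--save_dir` and the helper `maybe_save` are hypothetical; `save_to_disk` is the method Hugging Face `datasets.Dataset` objects provide for writing a dataset to a directory.

```python
import argparse
from typing import Any, List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Build the dataset locally.")
    # Hypothetical flag (assumption): when omitted, nothing is written to disk.
    parser.add_argument(
        "--save_dir",
        type=str,
        default=None,
        help="Directory in which to save the built dataset (optional).",
    )
    return parser.parse_args(argv)


def maybe_save(dataset: Any, save_dir: Optional[str]) -> bool:
    """Save `dataset` if a directory was given; return whether it was saved."""
    if save_dir is None:
        return False
    # `dataset` is expected to expose `save_to_disk`, as `datasets.Dataset` does.
    dataset.save_to_disk(save_dir)
    return True
```

Keeping the default at `None` means the current behaviour (build and inspect only) stays unchanged unless the flag is passed.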