bigscience-workshop / metadata

Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.
Apache License 2.0

feat: v2.yaml for the resampled training set #183

Closed: tianjianjiang closed this 1 year ago

tianjianjiang commented 1 year ago

Please note that the values for random_sample_metadata_weights are still TBD.

FYI: @jordiclive @norakassner @timoschick, CC @cccntu. For now, I will add some comments to explain the current situation.