bigscience-workshop / metadata

Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.
Apache License 2.0

Add code to sample multiple metadata #103

Closed cccntu closed 2 years ago

cccntu commented 2 years ago

Following a discussion with @timoschick, I added a function that samples metadata as follows:

  1. Choose, uniformly at random, how many metadata types to include.
  2. Randomly choose that many metadata types to include.
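
For illustration, the two steps above could be sketched roughly as follows. This is not the code from this PR: the function name `sample_metadata_types`, its parameters, and the example type names are hypothetical, and the sketch assumes that at least one type is always kept (whether zero types is a valid draw is not stated here) and that the available type names are distinct.

```python
import random
from typing import List, Optional, Sequence


def sample_metadata_types(
    available_types: Sequence[str],
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Pick a random subset of metadata types (hypothetical helper)."""
    rng = rng or random.Random()
    if not available_types:
        return []
    # Step 1: uniformly decide how many types to include.
    # Assumption: the count ranges from 1 to len(available_types), inclusive.
    num_types = rng.randint(1, len(available_types))
    # Step 2: randomly choose that many distinct types.
    chosen = set(rng.sample(list(available_types), num_types))
    # Return the chosen types in their original order.
    return [t for t in available_types if t in chosen]


if __name__ == "__main__":
    # Example type names are assumptions based on the repo description.
    types = ["url", "timestamp", "website_description", "html"]
    print(sample_metadata_types(types, random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which should make the function easier to test.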

WIP: still need to add tests and the code that actually calls this function.

cccntu commented 2 years ago

Updates:

Note: by default, data_config.metadata_config.random_sample_metadata is set to False. We may want to set it to True by default in case we forget to enable it.