bigscience-workshop / metadata

Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.
Apache License 2.0

change: retrieve the url from a column instead of from the metadata #104

Closed SaulLu closed 2 years ago

SaulLu commented 2 years ago

Since the dataset we are building has a dedicated column containing the URL of each document (cc @tianjianjiang), I suggest that we read the URL from this column rather than searching for it in the metadata column. This should be faster, because it avoids scanning the metadata entries of every example.

@shanyas10 and @cccntu, is that OK?

cc @timoschick for visibility
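
To make the proposal above concrete, here is a minimal sketch of the two lookup strategies. It is illustrative only: the column names `url` and `metadata`, the layout of the metadata entries (a list of dicts with `key` and `value` fields), and the helper names `url_from_metadata` and `url_from_column` are all assumptions, not code taken from this PR.

```python
from typing import Optional

# Hypothetical example row. The column names and the metadata layout
# (a list of {"key", "type", "value"} dicts) are assumptions for illustration.
example = {
    "text": "Some document text.",
    "url": "https://example.com/page",
    "metadata": [
        {"key": "timestamp", "type": "global", "value": "2020-01-01"},
        {"key": "url", "type": "global", "value": "https://example.com/page"},
    ],
}


def url_from_metadata(row: dict) -> Optional[str]:
    """Previous approach (assumed): scan the metadata entries for the URL."""
    for entry in row.get("metadata", []):
        if entry.get("key") == "url":
            return entry.get("value")
    return None


def url_from_column(row: dict, column: str = "url") -> Optional[str]:
    """Proposed approach: read the URL directly from its own column."""
    return row.get(column)


# Both strategies agree on this row; the column lookup skips the scan.
print(url_from_column(example))
print(url_from_metadata(example) == url_from_column(example))
```

The column lookup is a single dictionary access, whereas the metadata lookup has to walk the list of entries for every example. Under these assumptions, that difference is where the expected speed-up would come from.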