bigscience-workshop / metadata

Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.
Apache License 2.0
30 stars 12 forks

feat: add a feature to choose where to extract metadata #116

Closed: SaulLu closed this 2 years ago

SaulLu commented 2 years ago

Here is the beginning of the PR discussed in the meeting. This PR includes:

Ready for review:

cc @timoschick (a more in-depth review of the HtmlProcessor and UrlProcessor parts would be a great help!)