bigscience-workshop / metadata

Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.

Apache License 2.0

30 stars 12 forks

feat: add a feature to choose where to extract metadata #116

Closed SaulLu closed 2 years ago

SaulLu commented 2 years ago

Here is the beginning of the PR discussed in the meeting. This PR includes:

the possibility to choose the column into which the metadata are extracted
creation of that column if it does not exist
a new UrlProcessor that extracts the URL in the same format as the other metadata
tests that extract several types of metadata in cascade, in two cases: 1) each type is extracted into a different column and 2) all types are extracted into the same column. Can you all check whether the extraction on this toy dataset suits you? :pray:
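
To make the idea concrete, here is a minimal sketch of extracting into a chosen column, creating the column when it is missing, and chaining two extractions. Everything in it is an assumption made for illustration: the helper name `extract_into_column`, the `{"key", "type", "value"}` entry layout, the column names, and the toy rows are hypothetical and are not taken from the code in this PR.

```python
from typing import Any, Callable, Dict, List, Optional

# A batch maps column names to lists of per-row values (hypothetical layout).
Batch = Dict[str, List[Any]]


def extract_into_column(
    batch: Batch,
    target_col: str,
    key: str,
    extract: Callable[[Dict[str, Any]], Optional[Any]],
) -> Batch:
    """Append one metadata entry per row to `target_col`.

    Hypothetical helper: the column is created if it does not exist,
    and rows for which `extract` returns None get no entry.
    """
    n_rows = len(next(iter(batch.values())))

    # Create the target column if it does not exist yet.
    if target_col not in batch:
        batch[target_col] = [[] for _ in range(n_rows)]

    for i in range(n_rows):
        row = {name: values[i] for name, values in batch.items()}
        value = extract(row)
        if value is not None:
            # Assumed entry format, shared by every metadata type.
            batch[target_col][i].append({"key": key, "type": "global", "value": value})
    return batch


# Toy dataset (made up for this sketch).
toy = {"text": ["hello world", "foo"], "url": ["https://example.org/a", None]}

# Case 1: each metadata type goes into its own column.
separate = extract_into_column(dict(toy), "metadata_url", "url", lambda r: r["url"])
separate = extract_into_column(separate, "metadata_length", "length", lambda r: len(r["text"]))

# Case 2: both metadata types are appended to the same column.
same = extract_into_column(dict(toy), "metadata", "url", lambda r: r["url"])
same = extract_into_column(same, "metadata", "length", lambda r: len(r["text"]))
```

In this sketch, the second call in each case finds either a fresh column to create or an existing one to append to, which is the behaviour the cascade tests are meant to exercise.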

Ready for review:

Timestamp @cccntu
Website Desc @shanyas10
Entities @manandey (please note that I've added some mocks to test your entity extraction in our GitHub workflows)
Datasource & Generation Length @chkla (I've pinged you on some parts of the code because I've included some changes I suggested yesterday)

cc @timoschick (if you could do a more in-depth review of the HtmlProcessor and UrlProcessor parts, it would be a great help!)