bigscience-workshop / metadata

Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.
Apache License 2.0

Add filters to `HtmlProcessor` #127

Closed SaulLu closed 2 years ago

SaulLu commented 2 years ago

This is long overdue! And here it is: this PR allows you to select the HTML tags that will actually be added to the examples. Several rules can be applied:

Needed for #85
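
To give a rough idea of what selecting tags could look like, here is a minimal sketch. It is not the code from this PR: the `HtmlTag` record, the `filter_tags` helper and both rules (an allow-list of tag names and a minimum span length) are hypothetical and only there for illustration, since the actual rules are not listed in this thread.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass
class HtmlTag:
    """Hypothetical record for one HTML tag found in an example."""

    name: str   # tag name, e.g. "div", "a", "p"
    start: int  # character offset where the tagged span starts
    end: int    # character offset where the tagged span ends


def filter_tags(
    tags: Iterable[HtmlTag],
    allowed_names: Optional[Set[str]] = None,
    min_span_chars: int = 0,
) -> List[HtmlTag]:
    """Keep only the tags that pass every rule (both rules are assumptions)."""
    kept = []
    for tag in tags:
        # Assumed rule 1: drop tags whose name is not in the allow-list.
        if allowed_names is not None and tag.name not in allowed_names:
            continue
        # Assumed rule 2: drop tags that wrap fewer than min_span_chars characters.
        if tag.end - tag.start < min_span_chars:
            continue
        kept.append(tag)
    return kept


tags = [HtmlTag("div", 0, 120), HtmlTag("a", 10, 14), HtmlTag("script", 20, 90)]
kept = filter_tags(tags, allowed_names={"div", "a"}, min_span_chars=5)
print([t.name for t in kept])  # prints ['div']
```

With no rules passed, `filter_tags(tags)` returns every tag unchanged, so filtering stays opt-in in this sketch.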

SaulLu commented 2 years ago

I'm merging this because it's really needed, and I think we might forget it if we don't. I have no problem with making changes to it afterwards :smiley: