bigscience-workshop / metadata

Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.
Apache License 2.0

Handle the comment specific type not recognized by pyarrow #83

Closed SaulLu closed 2 years ago

SaulLu commented 2 years ago

This PR simply converts lxml's specific Comment type, which is not recognized by pyarrow (the underlying format used by datasets).
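For readers unfamiliar with the issue, here is a minimal sketch of the kind of conversion described above. It is not the code from this PR. It uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml so that it runs without extra dependencies. In both libraries the `tag` of a comment node is the `Comment` factory function rather than a string, which is the sort of value a columnar format cannot store as text. The helper names `tag_to_str` and `collect_tags` and the placeholder value `"comment"` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch


def tag_to_str(tag):
    """Return a plain-string tag name (hypothetical helper)."""
    if isinstance(tag, str):
        return tag
    if tag is ET.Comment:
        # Comment nodes carry the Comment factory function as their tag.
        return "comment"  # placeholder name chosen for this sketch
    # Fallback for other non-string tags, e.g. processing instructions.
    return str(tag)


def collect_tags(root):
    """Collect every tag in document order as plain strings."""
    return [tag_to_str(element.tag) for element in root.iter()]


# Build a tiny tree by hand: <div><!-- a note --><p>hello</p></div>
root = ET.Element("div")
root.append(ET.Comment(" a note "))
ET.SubElement(root, "p").text = "hello"

tags = collect_tags(root)
print(tags)
```

With lxml the same check would presumably compare against `lxml.etree.Comment` instead of `ET.Comment`. Once every tag is a plain string, the values can be stored in a string column.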

SaulLu commented 2 years ago

I'm merging it, but don't hesitate to leave a review afterwards :)