bigscience-workshop / metadata

Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.
Apache License 2.0

feat: change how the entity extraction process uses ids #115

Closed SaulLu closed 2 years ago

SaulLu commented 2 years ago

Before this PR, named-entity extraction required an "id" column. With this PR, I propose generating the ids needed for batching in the REL library on the fly, so that no ids column has to be designed upstream.
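
A minimal sketch of the idea, not the code from this PR: ids are derived from each example's position in the batch, used as keys for the entity-linking input, and then used again to put the results back in order. The helper names (`build_keyed_batch`, `restore_order`), the `doc_` prefix, and the `{id: [text, spans]}` input layout are assumptions made for illustration.

```python
from typing import Dict, List


def build_keyed_batch(texts: List[str], offset: int = 0) -> Dict[str, list]:
    """Key each text by an id generated from its position in the batch.

    Hypothetical helper: the `{id: [text, spans]}` layout is an assumption
    about what the downstream linker expects, with an empty span list
    meaning that no mentions are supplied in advance.
    """
    return {f"doc_{offset + i}": [text, []] for i, text in enumerate(texts)}


def restore_order(results_by_id: Dict[str, list], n: int, offset: int = 0) -> List[list]:
    """Return per-example results in batch order.

    Examples with no entry in `results_by_id` (e.g. no entity found)
    get an empty list so the output stays aligned with the input.
    """
    return [results_by_id.get(f"doc_{offset + i}", []) for i in range(n)]


# Usage: the ids exist only for the duration of one batch.
batch = build_keyed_batch(["Obama visited Germany.", "No entities here."])
fake_results = {"doc_0": [("Obama", "Barack_Obama")]}  # stand-in for linker output
aligned = restore_order(fake_results, n=2)
```

Because the ids are regenerated for every batch, the dataset itself does not need to carry an "id" column.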

cc @timoschick, @norakassner

manandey commented 2 years ago

LGTM! Thanks! :)