bigscience-workshop / metadata

Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.
Apache License 2.0

Update tokenizer in evaluation #191

Closed manandey closed 1 year ago

manandey commented 1 year ago

-> Updated the tokenizer. It is hardcoded for now, so the repo and subfolder will need to be changed if we switch tokenizers when training again.
-> Commented out the code that set the logits of entity-related special tokens to -inf, until the NaN issue is fixed.
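
For context, here is a minimal sketch of what setting logits to -inf for a set of token ids does, and one common way it can lead to NaNs. It is not the evaluation code from this repo: the helper names `mask_logits` and `softmax`, the example logits, and the `entity_token_ids` list are all hypothetical, and the NaNs mentioned above may have a different cause.

```python
import math

NEG_INF = float("-inf")


def mask_logits(logits, banned_ids):
    """Return a copy of `logits` with the banned token ids set to -inf."""
    banned = set(banned_ids)
    return [NEG_INF if i in banned else x for i, x in enumerate(logits)]


def softmax(logits):
    """Softmax with the usual max-subtraction for numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical vocabulary of four tokens; ids 2 and 3 stand in for
# entity-related special tokens (an assumption, not the real ids).
logits = [2.0, 1.0, 0.5, -1.0]
entity_token_ids = [2, 3]

# Masked tokens get probability exactly 0; the rest are renormalized.
probs = softmax(mask_logits(logits, entity_token_ids))

# If every position ends up masked, max(logits) is -inf and
# (-inf) - (-inf) is NaN, so the whole distribution becomes NaN.
all_masked = softmax(mask_logits(logits, range(len(logits))))
```

When at least one token stays unmasked, the masked tokens can never be chosen and the output is well defined. The all-masked case is one thing worth checking when NaNs show up after this kind of masking.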