NielsRogge / Transformers-Tutorials

This repository contains demos I made with the Transformers library by HuggingFace.

Donut finetuning RVL-CDIP ipynb -- add class names to tokenizer as empty strings? #353

Open plamb-viso opened 9 months ago

plamb-viso commented 9 months ago

The ipynb states:

Prepare dataset
The first thing we'll do is add the class names as added tokens to the vocabulary of the decoder of Donut, and the corresponding tokenizer.

And then shows:

additional_tokens = ["", "", "", "", "", "", "",
  "", "", "", "", "", "",
  "", "", ""]

Why does this step add empty strings rather than, e.g., these class names:

id2label = {
  0: "letter",
  1: "form",
  2: "email",
  3: "handwritten",
  4: "advertisement",
  5: "scientific_report",
  6: "scientific_publication",
  7: "specification",
  8: "file_folder",
  9: "news_article",
  10: "budget",
  11: "invoice",
  12: "presentation",
  13: "questionnaire",
  14: "resume",
  15: "memo"
}

NielsRogge commented 9 months ago

It's because you're reading the notebook on GitHub. If you open the notebook in Colab, you will see the classes.

:)
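
A minimal sketch of why a rendered page might show empty strings, assuming the notebook writes each class token in an angle-bracket form such as `<letter/>`. That format is an assumption: the page quoted above only shows `""`, and the thread does not say what the actual tokens are. `strip_tags` is a hypothetical stand-in for an HTML renderer that drops anything that looks like a tag; it is not GitHub's actual rendering code.

```python
import re

# Class names copied from the id2label mapping quoted in the question.
id2label = {
    0: "letter", 1: "form", 2: "email", 3: "handwritten",
    4: "advertisement", 5: "scientific_report", 6: "scientific_publication",
    7: "specification", 8: "file_folder", 9: "news_article", 10: "budget",
    11: "invoice", 12: "presentation", 13: "questionnaire", 14: "resume",
    15: "memo",
}

# ASSUMPTION: tokens are written as tag-like strings, one per class name.
additional_tokens = [f"<{name}/>" for name in id2label.values()]


def strip_tags(text: str) -> str:
    """Hypothetical stand-in for a renderer that drops tag-like substrings."""
    return re.sub(r"<[^>]*>", "", text)


# What a viewer that swallows tags would display for each token.
rendered = [strip_tags(token) for token in additional_tokens]

print(additional_tokens[:3])
print(rendered[:3])
```

With a real tokenizer from the Transformers library, a list like this would typically be passed to `tokenizer.add_tokens(additional_tokens)`, followed by `resize_token_embeddings(len(tokenizer))` on the decoder. Those calls are not run in the sketch above.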