elixir-nx / bumblebee

Pre-trained Neural Network models in Axon (+ 🤗 Models integration)

Token Truncation differs from Transformers implementation #306

Status: Closed (closed by thiagopromano 6 months ago)

thiagopromano commented 6 months ago

While using distilbert-base-uncased to tokenize a few sentences, I've noticed that the tokens, and as a result any classification built on them, differ between Bumblebee and the Python Transformers library.

For example, on Elixir:

Mix.install([
  {:nx, "~> 0.5"},
  {:bumblebee, "~> 0.4.2"}
])

{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "distilbert-base-uncased"})

text = "This is a very long text."
Bumblebee.apply_tokenizer(tokenizer, text,
  length: 10,
  return_token_type_ids: false
)

%{
  "attention_mask" => #Nx.Tensor<
    u32[1][10]
    [
      [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
    ]
  >,
  "input_ids" => #Nx.Tensor<
    u32[1][10]
    [
      [101, 2023, 2003, 1037, 2200, 2146, 3793, 1012, 102, 0]
    ]
  >
}

Bumblebee.apply_tokenizer(tokenizer, text,
  length: 5,
  return_token_type_ids: false
)

%{
  "attention_mask" => #Nx.Tensor<
    u32[1][5]
    [
      [1, 1, 1, 1, 1]
    ]
  >,
  "input_ids" => #Nx.Tensor<
    u32[1][5]
    [
      [101, 2023, 2003, 1037, 2200]
    ]
  >
}

We can see that when truncating, Bumblebee doesn't add the [SEP] (102) token at the end.
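
For instance, the missing token can be checked directly on the length: 5 result above (a quick sketch; it only inspects the last column of the tensor Bumblebee returned):

inputs = Bumblebee.apply_tokenizer(tokenizer, text, length: 5, return_token_type_ids: false)

inputs["input_ids"]
|> Nx.slice_along_axis(4, 1, axis: 1)
|> Nx.to_flat_list()
#=> [2200], an ordinary word-piece id, not the [SEP] id 102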

While on Transformers:

import torch
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

device = torch.device('cpu')
text = 'This is a very long text.'

tokens = tokenizer(text, truncation=True, padding=True, return_tensors="pt", max_length=10).to(device)
{'input_ids': tensor([[ 101, 2023, 2003, 1037, 2200, 2146, 3793, 1012,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}

tokens = tokenizer(text, truncation=True, padding=True, return_tensors="pt", max_length=5).to(device)
{'input_ids': tensor([[ 101, 2023, 2003, 1037,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}

Transformers always keeps the 102 token at the end, even when truncating. I noticed this behavior because a model trained with Transformers was not giving the same results when run through Bumblebee.Text.text_classification. After a bit of digging, I found out that the mismatch only happened when the token sequence was truncated.
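
Until the truncation behavior matches, one possible workaround is to post-process the tokenized tensors yourself: truncate to the target length and overwrite the last kept position with the [SEP] id, which is effectively what the Transformers fast tokenizer does. This is a minimal sketch, not Bumblebee's own API; the TruncationWorkaround module, the truncate_keeping_sep/2 helper, and the hard-coded [SEP] id 102 (specific to this checkpoint) are all my assumptions.

# Hypothetical helper, not part of Bumblebee: truncate already-tokenized
# inputs and force the last kept position to be the [SEP] token id,
# mirroring the Transformers output shown above.
defmodule TruncationWorkaround do
  # 102 is the [SEP] id for distilbert-base-uncased; other checkpoints differ.
  @sep_id 102

  def truncate_keeping_sep(inputs, length) do
    input_ids = inputs["input_ids"]

    if Nx.axis_size(input_ids, 1) <= length do
      # Already short enough, nothing to do
      inputs
    else
      truncated = Nx.slice_along_axis(input_ids, 0, length, axis: 1)
      batch = Nx.axis_size(truncated, 0)

      # Overwrite the final column with the [SEP] id
      sep = Nx.broadcast(Nx.tensor(@sep_id, type: Nx.type(truncated)), {batch, 1})
      input_ids = Nx.put_slice(truncated, [0, length - 1], sep)

      attention_mask = Nx.slice_along_axis(inputs["attention_mask"], 0, length, axis: 1)

      %{inputs | "input_ids" => input_ids, "attention_mask" => attention_mask}
    end
  end
end

inputs = Bumblebee.apply_tokenizer(tokenizer, text, return_token_type_ids: false)
TruncationWorkaround.truncate_keeping_sep(inputs, 5)
#=> input_ids [[101, 2023, 2003, 1037, 102]], matching the Transformers output above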