elixir-nx / bumblebee

Pre-trained Neural Network models in Axon (+ 🤗 Models integration)

Token Truncation differs from Transformers implementation #306

Status: Closed (closed by thiagopromano 6 months ago)

thiagopromano commented 6 months ago

While using distilbert-base-uncased to tokenize a few sentences, I've noticed that the tokens, and as a result any classification built on them, differ between Bumblebee and the Python Transformers library.

For example, on Elixir:

Mix.install([
  {:nx, "~> 0.5"},
  {:bumblebee, "~> 0.4.2"}
])

{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "distilbert-base-uncased"})

text = "This is a very long text."
Bumblebee.apply_tokenizer(tokenizer, text,
  length: 10,
  return_token_type_ids: false
)

%{
  "attention_mask" => #Nx.Tensor<
    u32[1][10]
    [
      [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
    ]
  >,
  "input_ids" => #Nx.Tensor<
    u32[1][10]
    [
      [101, 2023, 2003, 1037, 2200, 2146, 3793, 1012, 102, 0]
    ]
  >
}

Bumblebee.apply_tokenizer(tokenizer, text,
  length: 5,
  return_token_type_ids: false
)

%{
  "attention_mask" => #Nx.Tensor<
    u32[1][5]
    [
      [1, 1, 1, 1, 1]
    ]
  >,
  "input_ids" => #Nx.Tensor<
    u32[1][5]
    [
      [101, 2023, 2003, 1037, 2200]
    ]
  >
}

We can see that when truncating, Bumblebee doesn't add the [SEP] (102) token at the end.
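
For instance, the missing token can be checked directly on the length: 5 result above (a quick sketch; it only inspects the last column of the tensor Bumblebee returned):

inputs = Bumblebee.apply_tokenizer(tokenizer, text, length: 5, return_token_type_ids: false)

inputs["input_ids"]
|> Nx.slice_along_axis(4, 1, axis: 1)
|> Nx.to_flat_list()
#=> [2200], an ordinary word-piece id, not the [SEP] id 102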

While on Transformers:

import torch
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

device = torch.device('cpu')
text = 'This is a very long text.'

tokens = tokenizer(text, truncation=True, padding=True, return_tensors="pt", max_length=10).to(device)
{'input_ids': tensor([[ 101, 2023, 2003, 1037, 2200, 2146, 3793, 1012,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}

tokens = tokenizer(text, truncation=True, padding=True, return_tensors="pt", max_length=5).to(device)
{'input_ids': tensor([[ 101, 2023, 2003, 1037,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}

Transformers always keeps the 102 token at the end, even when truncating. I noticed this behavior because a model trained with Transformers was not giving the same results when run through Bumblebee.Text.text_classification. After a bit of digging, I found out that the mismatch only happened when the token sequence was truncated.
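
Until the truncation behavior matches, one possible workaround is to post-process the tokenized tensors yourself: truncate to the target length and overwrite the last kept position with the [SEP] id, which is effectively what the Transformers fast tokenizer does. This is a minimal sketch, not Bumblebee's own API; the TruncationWorkaround module, the truncate_keeping_sep/2 helper, and the hard-coded [SEP] id 102 (specific to this checkpoint) are all my assumptions.

# Hypothetical helper, not part of Bumblebee: truncate already-tokenized
# inputs and force the last kept position to be the [SEP] token id,
# mirroring the Transformers output shown above.
defmodule TruncationWorkaround do
  # 102 is the [SEP] id for distilbert-base-uncased; other checkpoints differ.
  @sep_id 102

  def truncate_keeping_sep(inputs, length) do
    input_ids = inputs["input_ids"]

    if Nx.axis_size(input_ids, 1) <= length do
      # Already short enough, nothing to do
      inputs
    else
      truncated = Nx.slice_along_axis(input_ids, 0, length, axis: 1)
      batch = Nx.axis_size(truncated, 0)

      # Overwrite the final column with the [SEP] id
      sep = Nx.broadcast(Nx.tensor(@sep_id, type: Nx.type(truncated)), {batch, 1})
      input_ids = Nx.put_slice(truncated, [0, length - 1], sep)

      attention_mask = Nx.slice_along_axis(inputs["attention_mask"], 0, length, axis: 1)

      %{inputs | "input_ids" => input_ids, "attention_mask" => attention_mask}
    end
  end
end

inputs = Bumblebee.apply_tokenizer(tokenizer, text, return_token_type_ids: false)
TruncationWorkaround.truncate_keeping_sep(inputs, 5)
#=> input_ids [[101, 2023, 2003, 1037, 102]], matching the Transformers output above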