While using the `distilbert-base-uncased` tokenizer on a few sentences, I've noticed that the tokens, and as a result any classification done, differ between Bumblebee and the Python Transformers library. For example, in Elixir:
We can see that when truncating, Bumblebee doesn't add the `[SEP]` (102) token at the end. On the Transformers side, meanwhile:
Transformers always keeps the token 102 at the end, even when truncating. I noticed this behavior because the model trained using Transformers was not giving the same results when run through `Bumblebee.Text.text_classification`. After a bit of digging, I found that the mismatch only happens when the token sequence is truncated.
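To make the difference concrete, here is a plain-Python sketch (illustrative only; the token ids are arbitrary and this is not either library's actual implementation) of the two truncation orders that would produce exactly this mismatch:

```python
CLS, SEP = 101, 102  # DistilBERT's special token ids

def truncate_after_specials(ids, max_length):
    # Add [CLS]/[SEP] first, then cut to max_length: if the sentence
    # is long enough, the trailing [SEP] falls off.
    return ([CLS] + ids + [SEP])[:max_length]

def truncate_before_specials(ids, max_length):
    # Reserve room for [CLS]/[SEP], cut the content, then wrap:
    # the trailing [SEP] always survives.
    return [CLS] + ids[: max_length - 2] + [SEP]

sentence = [2023, 6251, 2003, 2146]  # some word-piece ids
print(truncate_after_specials(sentence, 4))   # ends without 102
print(truncate_before_specials(sentence, 4))  # ends with 102
```

The second ordering matches what I observe from Transformers (102 always last); the first matches the Bumblebee output above.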