elixir-nx / bumblebee

Bumblebee.apply_tokenizer fails for empty text #373

Closed. yujonglee closed this issue 1 month ago.

yujonglee commented 1 month ago

** (ArgumentError) cannot build an empty tensor
    (nx 0.7.2) lib/nx.ex:1966: Nx.from_binary/3
    (bumblebee 0.5.3) lib/bumblebee/text/pre_trained_tokenizer.ex:397: Bumblebee.Text.PreTrainedTokenizer.u32_binaries_to_tensor/1
    (bumblebee 0.5.3) lib/bumblebee/text/pre_trained_tokenizer.ex:304: Bumblebee.Text.PreTrainedTokenizer.apply/2
    iex:3: (file)

To reproduce:

{:ok, tok} = Bumblebee.load_tokenizer({:hf, "Xenova/gpt-4"}, type: :gpt2)
Bumblebee.apply_tokenizer(tok, "")

jonatanklosko commented 1 month ago

It is expected to fail, but we can improve the error message, yeah :) In your app, you should check whether the text is blank and skip tokenization in that case.
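A minimal sketch of such a blank check might look like the following. The `SafeTokenizer` module and its function names are hypothetical and not part of Bumblebee; the only Bumblebee call used is `Bumblebee.apply_tokenizer/2` from the reproduction above. Treating whitespace-only text as blank is an assumption made here for simplicity.

```elixir
defmodule SafeTokenizer do
  # Hypothetical helper module, not part of Bumblebee.

  # Returns true for "" and for whitespace-only strings.
  # Counting whitespace-only text as blank is an assumption of this sketch.
  def blank?(text) when is_binary(text), do: String.trim(text) == ""

  # Skips tokenization for blank text instead of letting it raise.
  def tokenize(tokenizer, text) when is_binary(text) do
    if blank?(text) do
      {:error, :blank_text}
    else
      {:ok, Bumblebee.apply_tokenizer(tokenizer, text)}
    end
  end
end
```

With the tokenizer from the reproduction, `SafeTokenizer.tokenize(tok, "")` would return `{:error, :blank_text}` rather than raising, and the caller can decide how to handle that case.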

jonatanklosko commented 1 month ago

Actually, for some tokenizers an empty string works; it depends on whether they add special tokens (in Stable Diffusion we actually do tokenize an empty string). Either way, I added a more specific error for when tokenization returns zero tokens (3a597615344def7dd2e8f8df2dfe21ad758e094a).
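Since whether an empty string tokenizes depends on the tokenizer, one option is to attempt tokenization and turn the failure into a tagged tuple. The sketch below is hypothetical and not part of Bumblebee. It rescues `ArgumentError`, which is the exception shown in the stack trace for bumblebee 0.5.3; that the more specific error added in the commit has the same exception type is an assumption.

```elixir
defmodule TokenizeOrError do
  # Hypothetical wrapper, not part of Bumblebee.
  # `tokenize_fun` is any 1-arity function, for example
  # &Bumblebee.apply_tokenizer(tokenizer, &1).
  def try_tokenize(tokenize_fun, text)
      when is_function(tokenize_fun, 1) and is_binary(text) do
    {:ok, tokenize_fun.(text)}
  rescue
    # Assumption: a zero-token result surfaces as ArgumentError,
    # as in the 0.5.3 stack trace. This also catches unrelated
    # ArgumentErrors raised inside tokenize_fun.
    e in ArgumentError -> {:error, Exception.message(e)}
  end
end
```

This keeps tokenizers that add special tokens working for empty input, at the cost of a broader rescue than an explicit blank check.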