elixir-nx / bumblebee

Pre-trained Neural Network models in Axon (+ 🤗 Models integration)

Error when using TinyLlama #325

Closed trickster closed 4 months ago

trickster commented 5 months ago

TinyLlama uses the same architecture and tokenizer as Llama 2.

When I try to create the serving, I get the following error.

Here is the full output of a Livebook:

System.put_env("EXLA_TARGET", "cuda120")

Mix.install([
  {:bumblebee, github: "elixir-nx/bumblebee"},
  {:nx, github: "elixir-nx/nx", sparse: "nx", override: true},
  {:exla, github: "elixir-nx/nx", sparse: "exla", override: true},
  {:kino, "~> 0.11.0"}
])

Application.put_env(:exla, :clients,
  cuda: [platform: :cuda, preallocate: false],
  rocm: [platform: :rocm, preallocate: false],
  tpu: [platform: :tpu, preallocate: false],
  host: [platform: :host, preallocate: false]
)

Nx.global_default_backend(EXLA.Backend)

Section

llama = "TinyLlama/TinyLlama-1.1B-Chat-v0.4"
# llama = "cognitivecomputations/dolphin-llama2-7b"
"TinyLlama/TinyLlama-1.1B-Chat-v0.4"
{:ok, model_info} = Bumblebee.load_model({:hf, llama})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, llama})
{:ok, generation_config} = Bumblebee.load_generation_config({:hf, llama})
generation_config = Bumblebee.configure(generation_config, max_new_tokens: 500)

09:57:51.608 [info] XLA service 0x7f8280020c90 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:

09:57:51.608 [info]   StreamExecutor device (0): NVIDIA L4, Compute Capability 8.9

09:57:51.609 [info] Using BFC allocator.

09:57:51.609 [info] XLA backend will use up to 21225406464 bytes on device 0 for BFCAllocator.

09:57:51.844 [info] Loaded cuDNN version 8904

09:57:51.858 [info] Using nvlink for parallel linking
%Bumblebee.Text.GenerationConfig{
  max_new_tokens: 500,
  min_new_tokens: nil,
  max_length: nil,
  min_length: nil,
  strategy: %{type: :greedy_search},
  decoder_start_token_id: nil,
  forced_bos_token_id: nil,
  forced_eos_token_id: nil,
  forced_token_ids: [],
  suppressed_token_ids: [],
  no_repeat_ngram_length: nil,
  temperature: nil,
  bos_token_id: nil,
  eos_token_id: nil,
  pad_token_id: nil,
  extra_config: nil
}
tiny_llama_serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 1028],
    preallocate_params: true,
    stream: true,
    defn_options: [debug: true, client: :cuda, compiler: EXLA]
  )
%Nx.Serving{
  module: Nx.Serving.Default,
  arg: #Function<0.20657473/2 in Bumblebee.Text.TextGeneration.generation/4>,
  client_preprocessing: #Function<1.20657473/1 in Bumblebee.Text.TextGeneration.generation/4>,
  client_postprocessing: #Function<2.20657473/2 in Bumblebee.Text.TextGeneration.maybe_stream/3>,
  streaming: %{hooks: [:token]},
  batch_size: 1,
  distributed_postprocessing: &Function.identity/1,
  process_options: [batch_keys: [sequence_length: 1028]],
  defn_options: [debug: true, client: :cuda, compiler: EXLA]
}
Kino.start_child({Nx.Serving, name: TinyLlamaServing, serving: tiny_llama_serving})
{:error,
 {:shutdown,
  {:failed_to_start_child, Nx.Serving,
   {%Protocol.UndefinedError{protocol: Nx.LazyContainer, value: nil, description: ""},
    [
      {Nx.LazyContainer.Atom, :traverse, 3, [file: ~c"lib/nx/lazy_container.ex", line: 91]},
      {Nx, :to_tensor, 1, [file: ~c"lib/nx.ex", line: 2067]},
      {Nx, :broadcast, 3, [file: ~c"lib/nx.ex", line: 3702]},
      {Bumblebee.Text.Generation, :"__defn:init_sequences__", 3,
       [file: ~c"lib/bumblebee/text/generation.ex", line: 469]},
      {Bumblebee.Text.Generation, :"__defn:greedy__", 7,
       [file: ~c"lib/bumblebee/text/generation.ex", line: 419]},
      {Bumblebee.Text.Generation, :"__defn:generate_impl__", 8,
       [file: ~c"lib/bumblebee/text/generation.ex", line: 357]},
      {Nx.Defn.Compiler, :runtime_fun, 3, [file: ~c"lib/nx/defn/compiler.ex", line: 173]},
      {EXLA.Defn, :"-compile/8-fun-3-", 4, [file: ~c"lib/exla/defn.ex", line: 411]}
    ]}}}}
user_input = Kino.Input.textarea("User prompt", default: "What is love?")
user = Kino.Input.read(user_input)

prompt = """
[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
#{user} [/INST] \
"""

Nx.Serving.batched_run(TinyLlamaServing, prompt) |> Enum.each(&IO.write/1)

I get the {:error, {:shutdown, {:failed_to_start_child, Nx.Serving, ...}}} error shown above.

jonatanklosko commented 5 months ago

The generation_config.json doesn't have pad_token_id or eos_token_id, which should generally be set. The model card says it's a fine-tuned version of TinyLlama/TinyLlama-1.1B-intermediate-step-715k-1.5T, which does have these in its config. You can set them manually:

generation_config = Bumblebee.configure(generation_config, pad_token_id: 0, eos_token_id: 1, bos_token_id: 2)
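
For context, here is a minimal sketch of the full serving setup with those ids filled in. The id values 0/1/2 are the ones suggested above (taken from the base model's config); verify them against the tokenizer you actually load before relying on them:

repo = {:hf, "TinyLlama/TinyLlama-1.1B-Chat-v0.4"}

{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

# generation_config.json for this repo leaves pad/eos/bos unset (nil),
# which is what made init_sequences crash, so set them explicitly
generation_config =
  Bumblebee.configure(generation_config,
    max_new_tokens: 500,
    pad_token_id: 0,
    eos_token_id: 1,
    bos_token_id: 2
  )

serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 1028],
    defn_options: [compiler: EXLA]
  )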

We should have a better error message, so let's keep this open.

jonatanklosko commented 4 months ago

Closed in e59bb28.