edgenai / edgen

⚡ Edgen: Local, private GenAI server alternative to OpenAI. No GPU required. Run AI models locally: LLMs (Llama2, Mistral, Mixtral...), Speech-to-text (whisper) and many others.
https://docs.edgen.co/
Apache License 2.0

Tokio runtime panicking due to `llama_cpp::LlamaSession::context_size` using `block_on` #88

Status: Closed (closed by denwong47 6 months ago)

denwong47 commented 6 months ago

Hi maintainers, I am not entirely sure whether this is a problem with edgen, llama_cpp-rs, or my own setup, so I apologize in advance if I'm missing something obvious here.

I cloned main (936a45afedbb208a177038c7379341a52b911786) to build from source, then did a release build and ran the server. Axum started listening correctly and everything looked good:

RUST_BACKTRACE=full CUDA_ROOT=/usr/include/_remapped cargo run --features llama_cuda --release -- serve -g -b http://my_host:54321

However, if I hit the v1/chat/completions endpoint as per the example, the following panic occurs:

thread 'tokio-runtime-worker' panicked at /home/my_username/.cargo/git/checkouts/llama_cpp-rs-b9d51cabb4b43824/1141010/crates/llama_cpp/src/lib.rs:887:27:
Cannot block the current thread from within a runtime. This happens because a function attempted to block the current thread while the thread is being used to drive asynchronous tasks.

The relevant frames of the backtrace are:

  ...
  18:     0x5593475ee443 - core::option::expect_failed::h0d6627132effeebe
                               at /rustc/b66b7951b9b4258fc433f2919e72598fbcc1816e/library/core/src/option.rs:1985:5
  19:     0x559347ff727d - tokio::future::block_on::block_on::h57b11492fc0329f5
  20:     0x559347ff2d21 - llama_cpp::LlamaSession::context_size::h7a5ce26c608b13f9
  21:     0x559347a3d8b4 - llama_cpp::LlamaSession::start_completing_with::h9d25e453a5e19b0a
  22:     0x5593479b3576 - <core::pin::Pin<P> as core::future::future::Future>::poll::hf7d715e681ad5cb3
  23:     0x559347a0585f - <F as axum::handler::Handler<(M,T1),S>>::call::{{closure}}::h650dac4864c27520
  ...
  39:     0x559347a463af - tokio::runtime::task::harness::Harness<T,S>::poll::h3ae60ec675d41951
  40:     0x5593481ecd93 - tokio::runtime::scheduler::multi_thread::worker::Context::run_task::h5f54c3255e052ad4
  41:     0x5593481e6c14 - tokio::runtime::context::scoped::Scoped<T>::set::h4261ff6b5d63c680
  42:     0x5593481af564 - tokio::runtime::context::runtime::enter_runtime::hb3189b7b6d405fc5
  43:     0x5593481ecaac - tokio::runtime::scheduler::multi_thread::worker::run::h11ac423a0e22051e
  ...

This occurs with or without --features llama_cuda.

This does not occur, however, if I check out v0.1.2 instead, which produces token-by-token output:

data: {"id":"856034ed-0e9e-4a56-94de-c9cb0be6cc90","choices":[{"delta":{"content":"Hello","role":null},"finish_reason":null,"index":0}],"created":1708805524,"model":"main","system_fingerprint":"edgen-0.1.2","object":"text_completion"}

data: {"id":"95a36b4b-4c7b-42fd-aa90-9fbfca770057","choices":[{"delta":{"content":"!","role":null},"finish_reason":null,"index":0}],"created":1708805524,"model":"main","system_fingerprint":"edgen-0.1.2","object":"text_completion"}

data: {"id":"24483190-b48d-48cc-ad0b-7459be014740","choices":[{"delta":{"content":" How","role":null},"finish_reason":null,"index":0}],"created":1708805524,"model":"main","system_fingerprint":"edgen-0.1.2","object":"text_completion"}

...

And so on, generating the sentence "Hello! How can I assist you today?", which I assume is the expected behaviour.
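
As an aside, the chunk shape above is straightforward to parse; here is a minimal serde sketch of my own (the struct names are mine and these are not edgen's actual types, the fields simply mirror the output above):

```rust
use serde::Deserialize;

// Struct names are hypothetical; fields mirror the streamed chunks above.
#[derive(Debug, Deserialize)]
struct Delta {
    content: Option<String>,
    role: Option<String>,
}

#[derive(Debug, Deserialize)]
struct Choice {
    delta: Delta,
    finish_reason: Option<String>,
    index: u32,
}

#[derive(Debug, Deserialize)]
struct Chunk {
    id: String,
    choices: Vec<Choice>,
    created: u64,
    model: String,
    system_fingerprint: String,
    object: String,
}

fn main() -> Result<(), serde_json::Error> {
    let raw = r#"{"id":"856034ed-0e9e-4a56-94de-c9cb0be6cc90","choices":[{"delta":{"content":"Hello","role":null},"finish_reason":null,"index":0}],"created":1708805524,"model":"main","system_fingerprint":"edgen-0.1.2","object":"text_completion"}"#;
    let chunk: Chunk = serde_json::from_str(raw)?;
    // Prints Some("Hello") for the first streamed token.
    println!("{:?}", chunk.choices[0].delta.content);
    Ok(())
}
```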

Looking at the panic, it seems to come from the fact that llama_cpp::LlamaSession::context_size internally started calling block_on last week, which the existing tokio::Runtime does not appreciate.
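
For what it's worth, the same class of panic can be reproduced with a minimal sketch of my own (not llama_cpp's actual code, and the helper name is made up): any attempt to block on a future from a thread that is already driving the Tokio runtime trips this check:

```rust
use tokio::runtime::Handle;

// A synchronous helper that bridges into async by blocking on a future.
// Fine from a plain OS thread, but it panics with "Cannot block the current
// thread from within a runtime" when the caller is a runtime worker thread.
fn context_size_like_helper() -> usize {
    Handle::current().block_on(async { 4096 })
}

#[tokio::main]
async fn main() {
    // Handler-style code like this runs on a Tokio worker thread, so the
    // block_on above panics, matching frames 19-20 in the backtrace above.
    let _ = context_size_like_helper();
}
```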

Is this a bug, and if not, can anyone point me in the right direction here? Many thanks in advance.

System

- Debian GNU/Linux 11 (bullseye), x86_64
- rustc 1.75.0-beta.3 (b66b7951b 2023-11-20) as per rust-toolchain.toml (the same happens on 1.76 stable)

pedro-devv commented 6 months ago

It's actually the other way around: the problem is the lack of a block_on. main is using an old version of llama_cpp, which is doing a blocking_read in an async environment. While I push changes to main, running cargo update should solve the issue.
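
For context, tokio's RwLock::blocking_read is documented to panic when called from within an async context (it blocks on the read() future internally), which matches the tokio::future::block_on frame in the backtrace above. A minimal sketch of that failure mode, my own code rather than llama_cpp's:

```rust
use tokio::sync::RwLock;

#[tokio::main]
async fn main() {
    let lock = RwLock::new(0u32);

    // blocking_read is meant for synchronous code that shares a lock with
    // async code; called from inside the runtime it blocks on the read()
    // future and panics with the same message as in the report above.
    let guard = lock.blocking_read();
    println!("{}", *guard);
}
```

Since llama_cpp is pulled in as a git dependency, cargo update moves the checkout recorded in Cargo.lock to the latest tracked revision, which no longer takes that blocking path.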

denwong47 commented 6 months ago

Thank you @pedro-devv!