Closed: ivanbaldo closed this issue 7 months ago.
Hi. Testing on AWS g5.2xlarge, which has an Nvidia A10G, fails. Running with RUST_BACKTRACE=1 we get:
0: candle_core::error::Error::bt
1: candle_core::cuda_backend::<impl core::convert::From<candle_core::cuda_backend::CudaError> for candle_core::error::Error>::from
2: <candle_core::cuda_backend::CudaStorage as candle_core::backend::BackendStorage>::matmul
3: candle_core::storage::Storage::matmul
4: candle_core::tensor::Tensor::matmul
5: <candle_nn::linear::Linear as candle_core::Module>::forward
6: <candle_transformers::models::with_tracing::Linear as candle_core::Module>::forward
7: mistralrs_core::models::llama::Llama::forward
8: <mistralrs_core::pipeline::llama::LlamaPipeline as mistralrs_core::pipeline::Pipeline>::forward
9: mistralrs_core::engine::Engine::run
10: std::sys_common::backtrace::__rust_begin_short_backtrace
11: core::ops::function::FnOnce::call_once{{vtable.shim}}
12: std::sys::pal::unix::thread::Thread::new::thread_start
13: <unknown>
14: <unknown>
`. Please raise an issue.
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: <mistralrs_core::pipeline::llama::LlamaPipeline as mistralrs_core::pipeline::Pipeline>::forward
3: mistralrs_core::engine::Engine::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
thread 'tokio-runtime-worker' panicked at mistralrs-server/src/main.rs:548:30:
called `Result::unwrap()` on an `Err` value: RecvError
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: core::result::unwrap_failed
3: <F as axum::handler::Handler<(M,T1,T2),S>>::call::{{closure}}
4: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
5: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
6: <tower::util::map_response::MapResponseFuture<F,N> as core::future::future::Future>::poll
7: <tower::util::oneshot::Oneshot<S,Req> as core::future::future::Future>::poll
8: hyper::proto::h1::dispatch::Dispatcher<D,Bs,I,T>::poll_catch
9: <hyper::server::conn::http1::UpgradeableConnection<I,S> as core::future::future::Future>::poll
10: <axum::serve::Serve<M,S> as core::future::into_future::IntoFuture>::into_future::{{closure}}::{{closure}}
11: tokio::runtime::task::core::Core<T,S>::poll
12: tokio::runtime::task::raw::poll
13: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
14: tokio::runtime::scheduler::multi_thread::worker::Context::run
15: tokio::runtime::context::runtime::enter_runtime
16: tokio::runtime::scheduler::multi_thread::worker::run
17: tokio::runtime::task::core::Core<T,S>::poll
18: tokio::runtime::task::harness::Harness<T,S>::poll
19: tokio::runtime::blocking::pool::Inner::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
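The two backtraces are related: the engine thread dies first on the CUDA matmul error, and the HTTP handler then panics at mistralrs-server/src/main.rs:548 because the reply it is waiting on can never arrive. Below is a minimal sketch of that failure mode, assuming the server hands each request to the engine over a channel and unwraps the reply; this illustrates the pattern only and is not the actual mistral.rs code.

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel::<String>();

    // Stand-in for the engine thread: it fails (a deliberate panic here,
    // analogous to the CUDA matmul error) before ever sending a response,
    // so the sender is dropped when the thread unwinds.
    let engine = thread::spawn(move || {
        let _tx = tx;
        panic!("engine failed before producing a response");
    });
    let _ = engine.join(); // the engine panic is observed but ignored

    // Stand-in for the request handler: with every sender gone, recv()
    // returns Err(RecvError), and unwrapping it panics with
    // "called `Result::unwrap()` on an `Err` value: RecvError".
    let response = rx.recv().unwrap();
    println!("{response}");
}
```

In other words, the RecvError panic is a symptom; the root cause is the matmul failure inside the engine.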
Server launched with:
mistralrs-server --port 8080 llama -m meta-llama/Llama-2-7b-chat-hf
The test is with https://github.com/hamelsmu/llama-inference/blob/master/anyscale/client.py, with OPENAI_API_BASE=http://localhost:8080/v1 and OPENAI_API_KEY=none. Let me know if more info is needed, etc. Thanks!
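For reference, a hypothetical minimal Rust equivalent of that client call, assuming the server exposes the standard OpenAI-compatible /v1/chat/completions route under the base URL above; the crate choices, payload fields, and prompt are illustrative and not taken from the linked client.py.

```rust
// Sketch dependencies (Cargo.toml):
//   reqwest = { version = "0.12", features = ["blocking", "json"] }
//   serde_json = "1"
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Assumed OpenAI-style chat completion payload.
    let body = json!({
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [{ "role": "user", "content": "Say hello." }],
        "max_tokens": 32
    });

    let resp = reqwest::blocking::Client::new()
        // OPENAI_API_BASE=http://localhost:8080/v1 plus the chat completions path.
        .post("http://localhost:8080/v1/chat/completions")
        .bearer_auth("none") // mirrors OPENAI_API_KEY=none
        .json(&body)
        .send()?;

    println!("{}", resp.text()?);
    Ok(())
}
```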
@ivanbaldo, this should be fixed in #38. Could you please try it on that branch?
Thanks! Now it failed because of not enough CUDA memory (strange, since the A10 has 24 GB; I will open another issue for that), but this specific issue seems solved now.