EricLBuehler / mistral.rs

Blazingly fast LLM inference.
MIT License

Model failed with error `matmul is only supported for contiguous tensors lstride: [159744, 1] rstride: [1, 4096] mnk: (1, 32000, 4096)` #37

Closed: ivanbaldo closed this issue 7 months ago

ivanbaldo commented 7 months ago

Hi. Testing on an AWS g5.2xlarge instance (which has an NVIDIA A10G) fails with:

Serving on http://0.0.0.0:8080.
thread '<unnamed>' panicked at mistralrs-core/src/pipeline/llama.rs:420:17:
Model failed with error `matmul is only supported for contiguous tensors lstride: [159744, 1] rstride: [1, 4096] mnk: (1, 32000, 4096)`. Please raise an issue.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'tokio-runtime-worker' panicked at mistralrs-server/src/main.rs:548:30:
called `Result::unwrap()` on an `Err` value: RecvError

Let me know if more info is needed, etc. Thanks!
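For context, the panic is raised by candle when a matmul operand has a memory layout the backend cannot use directly, typically a sliced or transposed view that still carries the strides of its parent buffer. Below is a rough sketch of that pattern, assuming the `candle_core` crate; the shapes, the helper name `last_token_logits`, and the `.contiguous()` call are purely illustrative and are not the mistral.rs code or the eventual fix.

```rust
// Illustrative sketch only (not the mistral.rs pipeline code or the fix from #38).
// candle's CUDA matmul rejects operand layouts it cannot map onto cuBLAS, which is
// what "matmul is only supported for contiguous tensors" means. A "last token of
// the hidden states" view keeps the strides of the full buffer; .contiguous()
// copies it into a plain row-major tensor before the matmul.
use candle_core::{Device, IndexOp, Result, Tensor};

fn last_token_logits(hidden: &Tensor, lm_head_w: &Tensor) -> Result<Tensor> {
    // hidden: (batch = 1, seq_len, 4096); keep only the final position.
    let (_b, seq_len, _h) = hidden.dims3()?;
    let last = hidden.i((.., seq_len - 1..seq_len, ..))?.squeeze(1)?; // (1, 4096) view

    // Without this copy the view's strides come from the full (1, seq_len, 4096)
    // buffer (e.g. [159744, 1] when seq_len = 39, as in the panic message),
    // which the CUDA backend can refuse.
    let last = last.contiguous()?;

    // lm_head_w: (32000, 4096), transposed to (4096, 32000); mnk = (1, 32000, 4096).
    last.matmul(&lm_head_w.t()?)
}

fn main() -> Result<()> {
    let dev = Device::cuda_if_available(0)?; // falls back to CPU without a GPU
    let hidden = Tensor::randn(0f32, 1f32, (1, 39, 4096), &dev)?;
    let lm_head_w = Tensor::randn(0f32, 1f32, (32000, 4096), &dev)?;
    println!("{:?}", last_token_logits(&hidden, &lm_head_w)?.shape());
    Ok(())
}
```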

ivanbaldo commented 7 months ago

Running with `RUST_BACKTRACE=1` we get:

   0: candle_core::error::Error::bt
   1: candle_core::cuda_backend::<impl core::convert::From<candle_core::cuda_backend::CudaError> for candle_core::error::Error>::from
   2: <candle_core::cuda_backend::CudaStorage as candle_core::backend::BackendStorage>::matmul
   3: candle_core::storage::Storage::matmul
   4: candle_core::tensor::Tensor::matmul
   5: <candle_nn::linear::Linear as candle_core::Module>::forward
   6: <candle_transformers::models::with_tracing::Linear as candle_core::Module>::forward
   7: mistralrs_core::models::llama::Llama::forward
   8: <mistralrs_core::pipeline::llama::LlamaPipeline as mistralrs_core::pipeline::Pipeline>::forward
   9: mistralrs_core::engine::Engine::run
  10: std::sys_common::backtrace::__rust_begin_short_backtrace
  11: core::ops::function::FnOnce::call_once{{vtable.shim}}
  12: std::sys::pal::unix::thread::Thread::new::thread_start
  13: <unknown>
  14: <unknown>
`. Please raise an issue.
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: <mistralrs_core::pipeline::llama::LlamaPipeline as mistralrs_core::pipeline::Pipeline>::forward
   3: mistralrs_core::engine::Engine::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
thread 'tokio-runtime-worker' panicked at mistralrs-server/src/main.rs:548:30:
called `Result::unwrap()` on an `Err` value: RecvError
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: <F as axum::handler::Handler<(M,T1,T2),S>>::call::{{closure}}
   4: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
   5: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
   6: <tower::util::map_response::MapResponseFuture<F,N> as core::future::future::Future>::poll
   7: <tower::util::oneshot::Oneshot<S,Req> as core::future::future::Future>::poll
   8: hyper::proto::h1::dispatch::Dispatcher<D,Bs,I,T>::poll_catch
   9: <hyper::server::conn::http1::UpgradeableConnection<I,S> as core::future::future::Future>::poll
  10: <axum::serve::Serve<M,S> as core::future::into_future::IntoFuture>::into_future::{{closure}}::{{closure}}
  11: tokio::runtime::task::core::Core<T,S>::poll
  12: tokio::runtime::task::raw::poll
  13: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
  14: tokio::runtime::scheduler::multi_thread::worker::Context::run
  15: tokio::runtime::context::runtime::enter_runtime
  16: tokio::runtime::scheduler::multi_thread::worker::run
  17: tokio::runtime::task::core::Core<T,S>::poll
  18: tokio::runtime::task::harness::Harness<T,S>::poll
  19: tokio::runtime::blocking::pool::Inner::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

ivanbaldo commented 7 months ago

Server launched with: `mistralrs-server --port 8080 llama -m meta-llama/Llama-2-7b-chat-hf`. The test uses https://github.com/hamelsmu/llama-inference/blob/master/anyscale/client.py with `OPENAI_API_BASE=http://localhost:8080/v1` and `OPENAI_API_KEY=none`.
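If it helps to reproduce without the Python client, here is a minimal sketch of an equivalent request, assuming the `reqwest` (with the `blocking` and `json` features) and `serde_json` crates and the usual OpenAI-compatible `/v1/chat/completions` route; the prompt and `max_tokens` value are illustrative.

```rust
// Hedged sketch of an equivalent test request (not the client.py used in the issue).
// Assumes reqwest = { features = ["blocking", "json"] } and serde_json as dependencies.
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let base = "http://localhost:8080/v1"; // same value as OPENAI_API_BASE above
    let body = json!({
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [{ "role": "user", "content": "Say hello in one sentence." }],
        "max_tokens": 32
    });

    let resp = reqwest::blocking::Client::new()
        .post(format!("{base}/chat/completions"))
        .bearer_auth("none") // mirrors OPENAI_API_KEY=none from the reproduction
        .json(&body)
        .send()?
        .text()?;
    println!("{resp}");
    Ok(())
}
```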

EricLBuehler commented 7 months ago

@ivanbaldo, this should be fixed in #38. Could you please try it on that branch?

ivanbaldo commented 6 months ago

Thanks! It now fails because of insufficient CUDA memory (strange, since the A10G has 24 GB; I'll open a separate issue for that), but this specific issue seems solved now.