Cuda error when running gemma2

Describe the bug

When I run this command, I get the error below:

RUST_BACKTRACE=full CUDA_LAUNCH_BLOCKING=1 target/release/mistralrs-server -i --isq Q4K -n "1:16;2:16;3:10" --no-paged-attn plain -m google/gemma-2-9b-it -a gemma2

2024-08-27T05:53:27.880023Z INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2024-08-27T05:53:27.880068Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-08-27T05:53:27.880088Z INFO mistralrs_server: Model kind is: normal (no quant, no adapters)
2024-08-27T05:53:27.880235Z INFO mistralrs_core::pipeline::normal: Loading tokenizer.json at google/gemma-2-9b-it
2024-08-27T05:53:27.880270Z INFO mistralrs_core::pipeline::normal: Loading tokenizer.json locally at google/gemma-2-9b-it/tokenizer.json
2024-08-27T05:53:27.880279Z INFO mistralrs_core::pipeline::normal: Loading config.json at google/gemma-2-9b-it
2024-08-27T05:53:27.880298Z INFO mistralrs_core::pipeline::normal: Loading config.json locally at google/gemma-2-9b-it/config.json
2024-08-27T05:53:27.882701Z INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model-00002-of-00004.safetensors", "model-00004-of-00004.safetensors", "model-00001-of-00004.safetensors", "model-00003-of-00004.safetensors"]
2024-08-27T05:53:27.882733Z INFO mistralrs_core::pipeline::paths: Loading model-00002-of-00004.safetensors locally at google/gemma-2-9b-it/model-00002-of-00004.safetensors
2024-08-27T05:53:27.882753Z INFO mistralrs_core::pipeline::paths: Loading model-00004-of-00004.safetensors locally at google/gemma-2-9b-it/model-00004-of-00004.safetensors
2024-08-27T05:53:27.882771Z INFO mistralrs_core::pipeline::paths: Loading model-00001-of-00004.safetensors locally at google/gemma-2-9b-it/model-00001-of-00004.safetensors
2024-08-27T05:53:27.882792Z INFO mistralrs_core::pipeline::paths: Loading model-00003-of-00004.safetensors locally at google/gemma-2-9b-it/model-00003-of-00004.safetensors
2024-08-27T05:53:27.882890Z INFO mistralrs_core::pipeline::normal: Loading generation_config.json at google/gemma-2-9b-it
2024-08-27T05:53:27.882910Z INFO mistralrs_core::pipeline::normal: Loading generation_config.json locally at google/gemma-2-9b-it/generation_config.json
2024-08-27T05:53:27.882998Z INFO mistralrs_core::pipeline::normal: Loading tokenizer_config.json at google/gemma-2-9b-it
2024-08-27T05:53:27.883017Z INFO mistralrs_core::pipeline::normal: Loading tokenizer_config.json locally at google/gemma-2-9b-it/tokenizer_config.json
2024-08-27T05:53:27.945489Z INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 6.1
2024-08-27T05:53:27.945511Z INFO mistralrs_core::utils::normal: Skipping BF16 because CC < 8.0
2024-08-27T05:53:27.967209Z INFO mistralrs_core::utils::normal: DType selected is F16.
2024-08-27T05:53:27.967274Z INFO mistralrs_core::pipeline::normal: Model config: Config { attention_bias: false, head_dim: 256, hidden_act: Some(GeluPytorchTanh), hidden_activation: Some(GeluPytorchTanh), hidden_size: 3584, intermediate_size: 14336, num_attention_heads: 16, num_hidden_layers: 42, num_key_value_heads: 8, rms_norm_eps: 1e-6, rope_theta: 10000.0, vocab_size: 256000, sliding_window: 4096, attn_logit_softcapping: Some(50.0), final_logit_softcapping: Some(30.0), query_pre_attn_scalar: 256, max_position_embeddings: 8192 }
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 105/105 [00:09<00:00, 10.73it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 84/84 [00:11<00:00, 13.74it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████| 141/141 [00:11<00:00, 2568.93it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 134/134 [00:11<00:00, 11.24it/s]
2024-08-27T05:53:39.957659Z INFO mistralrs_core::device_map: Model has 42 repeating layers.
2024-08-27T05:53:40.313270Z INFO mistralrs_core::device_map: Loading model according to the following repeating layer mappings:
2024-08-27T05:53:40.313292Z INFO mistralrs_core::device_map: Layer 0: cuda[1]
2024-08-27T05:53:40.313298Z INFO mistralrs_core::device_map: Layer 1: cuda[1]
2024-08-27T05:53:40.313304Z INFO mistralrs_core::device_map: Layer 2: cuda[1]
2024-08-27T05:53:40.313310Z INFO mistralrs_core::device_map: Layer 3: cuda[1]
2024-08-27T05:53:40.313317Z INFO mistralrs_core::device_map: Layer 4: cuda[1]
2024-08-27T05:53:40.313323Z INFO mistralrs_core::device_map: Layer 5: cuda[1]
2024-08-27T05:53:40.313329Z INFO mistralrs_core::device_map: Layer 6: cuda[1]
2024-08-27T05:53:40.313337Z INFO mistralrs_core::device_map: Layer 7: cuda[1]
2024-08-27T05:53:40.313343Z INFO mistralrs_core::device_map: Layer 8: cuda[1]
2024-08-27T05:53:40.313349Z INFO mistralrs_core::device_map: Layer 9: cuda[1]
2024-08-27T05:53:40.313355Z INFO mistralrs_core::device_map: Layer 10: cuda[1]
2024-08-27T05:53:40.313361Z INFO mistralrs_core::device_map: Layer 11: cuda[1]
2024-08-27T05:53:40.313367Z INFO mistralrs_core::device_map: Layer 12: cuda[1]
2024-08-27T05:53:40.313373Z INFO mistralrs_core::device_map: Layer 13: cuda[1]
2024-08-27T05:53:40.313379Z INFO mistralrs_core::device_map: Layer 14: cuda[1]
2024-08-27T05:53:40.313385Z INFO mistralrs_core::device_map: Layer 15: cuda[1]
2024-08-27T05:53:40.313391Z INFO mistralrs_core::device_map: Layer 16: cuda[2]
2024-08-27T05:53:40.313396Z INFO mistralrs_core::device_map: Layer 17: cuda[2]
2024-08-27T05:53:40.313402Z INFO mistralrs_core::device_map: Layer 18: cuda[2]
2024-08-27T05:53:40.313408Z INFO mistralrs_core::device_map: Layer 19: cuda[2]
2024-08-27T05:53:40.313414Z INFO mistralrs_core::device_map: Layer 20: cuda[2]
2024-08-27T05:53:40.313420Z INFO mistralrs_core::device_map: Layer 21: cuda[2]
2024-08-27T05:53:40.313426Z INFO mistralrs_core::device_map: Layer 22: cuda[2]
2024-08-27T05:53:40.313432Z INFO mistralrs_core::device_map: Layer 23: cuda[2]
2024-08-27T05:53:40.313438Z INFO mistralrs_core::device_map: Layer 24: cuda[2]
2024-08-27T05:53:40.313443Z INFO mistralrs_core::device_map: Layer 25: cuda[2]
2024-08-27T05:53:40.313449Z INFO mistralrs_core::device_map: Layer 26: cuda[2]
2024-08-27T05:53:40.313455Z INFO mistralrs_core::device_map: Layer 27: cuda[2]
2024-08-27T05:53:40.313461Z INFO mistralrs_core::device_map: Layer 28: cuda[2]
2024-08-27T05:53:40.313467Z INFO mistralrs_core::device_map: Layer 29: cuda[2]
2024-08-27T05:53:40.313474Z INFO mistralrs_core::device_map: Layer 30: cuda[2]
2024-08-27T05:53:40.313480Z INFO mistralrs_core::device_map: Layer 31: cuda[2]
2024-08-27T05:53:40.313486Z INFO mistralrs_core::device_map: Layer 32: cuda[3]
2024-08-27T05:53:40.313491Z INFO mistralrs_core::device_map: Layer 33: cuda[3]
2024-08-27T05:53:40.313497Z INFO mistralrs_core::device_map: Layer 34: cuda[3]
2024-08-27T05:53:40.313503Z INFO mistralrs_core::device_map: Layer 35: cuda[3]
2024-08-27T05:53:40.313509Z INFO mistralrs_core::device_map: Layer 36: cuda[3]
2024-08-27T05:53:40.313516Z INFO mistralrs_core::device_map: Layer 37: cuda[3]
2024-08-27T05:53:40.313522Z INFO mistralrs_core::device_map: Layer 38: cuda[3]
2024-08-27T05:53:40.313528Z INFO mistralrs_core::device_map: Layer 39: cuda[3]
2024-08-27T05:53:40.313533Z INFO mistralrs_core::device_map: Layer 40: cuda[3]
2024-08-27T05:53:40.313539Z INFO mistralrs_core::device_map: Layer 41: cuda[3]
2024-08-27T05:53:40.372062Z INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 6.1
2024-08-27T05:53:40.372073Z INFO mistralrs_core::utils::normal: Skipping BF16 because CC < 8.0
2024-08-27T05:53:40.377781Z INFO mistralrs_core::utils::normal: DType selected is F16.
2024-08-27T05:53:45.119030Z INFO mistralrs_core::pipeline::isq: Applying in-situ quantization into Q4K to 295 tensors.
2024-08-27T05:53:45.124278Z INFO mistralrs_core::pipeline::isq: Applying ISQ on 80 threads.
2024-08-27T05:54:33.977956Z INFO mistralrs_core::pipeline::isq: Applied in-situ quantization into Q4K to 295 tensors out of 295 total tensors. Took 48.86s
2024-08-27T05:54:33.978167Z INFO mistralrs_core::pipeline::isq: Applying in-situ quantization bias device mapping to 294 biases.
2024-08-27T05:54:33.978461Z INFO mistralrs_core::pipeline::isq: Applying ISQ on 80 threads.
2024-08-27T05:54:33.980828Z INFO mistralrs_core::pipeline::isq: Applied in-situ quantization device mapping. Took 0.00s
2024-08-27T05:54:34.841869Z INFO mistralrs_core::pipeline::chat_template: bos_toks = "", eos_toks = "", "", unk_tok =
2024-08-27T05:54:34.912589Z INFO mistralrs_server: Model loaded.
2024-08-27T05:54:34.918294Z INFO mistralrs_core: Enabling GEMM reduced precision in BF16.
2024-08-27T05:54:34.927413Z INFO mistralrs_core: Enabling GEMM reduced precision in F16.
2024-08-27T05:54:34.928631Z INFO mistralrs_core::cublaslt: Initialized cuBLASlt handle
2024-08-27T05:54:34.928750Z INFO mistralrs_server::interactive_mode: Starting interactive loop with sampling params: SamplingParams { temperature: Some(0.1), top_k: Some(32), top_p: Some(0.1), min_p: Some(0.05), top_n_logprobs: 0, frequency_penalty: Some(0.1), presence_penalty: Some(0.1), stop_toks: None, max_len: Some(4096), logits_bias: None, n_choices: 1 }

hello

thread '' panicked at /home/wosai/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cudarc-0.11.6/src/driver/safe/core.rs:252:76:
called Result::unwrap() on an Err value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
stack backtrace:
  0: 0x561eb8a72738 - ::fmt::h72ae2693fe3679c0
  1: 0x561eb7f3a96b - core::fmt::write::heb2112eefb3480d2
  2: 0x561eb8a3512e - std::io::Write::write_fmt::hb21f507e35bcfafb
  3: 0x561eb8a74629 - std::sys::backtrace::print::h8c7c13a068389915
  4: 0x561eb8a73959 - std::panicking::default_hook::{{closure}}::h165d0776fc5b2025
  5: 0x561eb8a750e5 - std::panicking::rust_panic_with_hook::h56be292a19683b4c
  6: 0x561eb8a74a15 - std::panicking::begin_panic_handler::{{closure}}::h3deeb56cba176ab2
  7: 0x561eb8a74979 - std::sys::backtrace::rust_end_short_backtrace::hc30be025c447bdc1
  8: 0x561eb8a74963 - rust_begin_unwind
  9: 0x561eb7f38b51 - core::panicking::panic_fmt::h419642f979996b15
  10: 0x561eb7f41185 - core::result::unwrap_failed::h0529da0e6cd0d54f
  11: 0x561eb825f509 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice>::h079552940b2be28e
  12: 0x561eb825ec41 - alloc::sync::Arc<T,A>::drop_slow::hc72120ce55330bbd
  13: 0x561eb825eb70 - core::ptr::drop_in_place::h219980205baf0911
  14: 0x561eb825c8ad - alloc::sync::Arc<T,A>::drop_slow::h85386a33c23e275d
  15: 0x561eb8693334 - mistralrs_core::models::gemma2::Model::forward::h589ae9cc57b30a4d
  16: 0x561eb869198d - ::forward::hf8049a3b5a676b8c
  17: 0x561eb82c42d5 - ::forward_inputs::h9c3c7b255d6700fb
  18: 0x561eb82f5494 - mistralrs_core::pipeline::Pipeline::step::{{closure}}::hba58536883e3c927
  19: 0x561eb8788fca - mistralrs_core::engine::Engine::run::{{closure}}::h2fc8bac59768ce2d
  20: 0x561eb87840ea - std::sys::backtrace::rust_begin_short_backtrace::h86ca94e37e66cb93
  21: 0x561eb878338a - core::ops::function::FnOnce::call_once{{vtable.shim}}::hdc48886c0e3332a8
  22: 0x561eb8a772fb - std::sys::pal::unix::thread::Thread::new::thread_start::h4aa16783dfb29b35
  23: 0x7f75b88f4ac3 -
  24: 0x7f75b8986850 -
  25: 0x0 -

thread '' panicked at /home/wosai/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cudarc-0.11.6/src/driver/safe/core.rs:252:76:
called Result::unwrap() on an Err value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
stack backtrace:
  0: 0x561eb8a72738 - ::fmt::h72ae2693fe3679c0
  1: 0x561eb7f3a96b - core::fmt::write::heb2112eefb3480d2
  2: 0x561eb8a3512e - std::io::Write::write_fmt::hb21f507e35bcfafb
  3: 0x561eb8a74629 - std::sys::backtrace::print::h8c7c13a068389915
  4: 0x561eb8a73959 - std::panicking::default_hook::{{closure}}::h165d0776fc5b2025
  5: 0x561eb8a750e5 - std::panicking::rust_panic_with_hook::h56be292a19683b4c
  6: 0x561eb8a74a15 - std::panicking::begin_panic_handler::{{closure}}::h3deeb56cba176ab2
  7: 0x561eb8a74979 - std::sys::backtrace::rust_end_short_backtrace::hc30be025c447bdc1
  8: 0x561eb8a74963 - rust_begin_unwind
  9: 0x561eb7f38b51 - core::panicking::panic_fmt::h419642f979996b15
  10: 0x561eb7f41185 - core::result::unwrap_failed::h0529da0e6cd0d54f
  11: 0x561eb825f509 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice>::h079552940b2be28e
  12: 0x561eb825ec41 - alloc::sync::Arc<T,A>::drop_slow::hc72120ce55330bbd
  13: 0x561eb825eb70 - core::ptr::drop_in_place::h219980205baf0911
  14: 0x561eb825c8ad - alloc::sync::Arc<T,A>::drop_slow::h85386a33c23e275d
  15: 0x561eb869547c - mistralrs_core::models::gemma2::Model::forward::h589ae9cc57b30a4d
  16: 0x561eb869198d - ::forward::hf8049a3b5a676b8c
  17: 0x561eb82c42d5 - ::forward_inputs::h9c3c7b255d6700fb
  18: 0x561eb82f5494 - mistralrs_core::pipeline::Pipeline::step::{{closure}}::hba58536883e3c927
  19: 0x561eb8788fca - mistralrs_core::engine::Engine::run::{{closure}}::h2fc8bac59768ce2d
  20: 0x561eb87840ea - std::sys::backtrace::rust_begin_short_backtrace::h86ca94e37e66cb93
  21: 0x561eb878338a - core::ops::function::FnOnce::call_once{{vtable.shim}}::hdc48886c0e3332a8
  22: 0x561eb8a772fb - std::sys::pal::unix::thread::Thread::new::thread_start::h4aa16783dfb29b35
  23: 0x7f75b88f4ac3 -
  24: 0x7f75b8986850 -
  25: 0x0 -

thread '' panicked at library/core/src/panicking.rs:229:5:
panic in a destructor during cleanup
thread caused non-unwinding panic. aborting.
Aborted (core dumped)
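
For context on the -n "1:16;2:16;3:10" value in the command: judging only by the device_map lines in the log, each ORDINAL:COUNT pair appears to place that many consecutive repeating layers on the given GPU. Below is a minimal sketch of that reading. It is not the mistral.rs parser, and the function name expand_layer_spec is hypothetical, introduced only for illustration.

```rust
/// Expand a layer spec such as "1:16;2:16;3:10" into one GPU ordinal per layer.
/// Hypothetical helper for illustration; this is an assumption about the format
/// based on the log output, not the actual mistral.rs implementation.
fn expand_layer_spec(spec: &str) -> Result<Vec<usize>, String> {
    let mut layers = Vec::new();
    // Each non-empty segment between ';' is expected to look like ORDINAL:COUNT.
    for part in spec.split(';').filter(|p| !p.trim().is_empty()) {
        let (ordinal, count) = part
            .split_once(':')
            .ok_or_else(|| format!("expected ORDINAL:COUNT, got {part:?}"))?;
        let ordinal: usize = ordinal
            .trim()
            .parse()
            .map_err(|e| format!("bad ordinal in {part:?}: {e}"))?;
        let count: usize = count
            .trim()
            .parse()
            .map_err(|e| format!("bad count in {part:?}: {e}"))?;
        // Assign the next `count` consecutive layers to this GPU ordinal.
        layers.extend(std::iter::repeat(ordinal).take(count));
    }
    Ok(layers)
}
```

Calling expand_layer_spec("1:16;2:16;3:10") gives 42 entries, which lines up with the 42 repeating layers in the log: layers 0 to 15 on cuda[1], 16 to 31 on cuda[2], and 32 to 41 on cuda[3]. No repeating layer is assigned to ordinal 0 by this spec.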

EricLBuehler / mistral.rs

Cuda error when running gemma2 #715
