EricLBuehler / mistral.rs


WSL2 Docker error loading llama-3.1 gguf #679

Open · underlines opened 1 month ago

underlines commented 1 month ago

Describe the bug

My environment

Windows 11 Pro, Docker Desktop, WSL2 Ubuntu Engine, latest nvidia driver

CUDA test

I made sure the Docker WSL2 CUDA setup works correctly by executing docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark, as stated in the documentation. So CUDA works inside Docker with WSL2.

Model loading error

docker run --gpus all --rm -v C:\Users\xxx\.cache\lm-studio\models\duyntnet\Meta-Llama-3.1-8B-Instruct-imatrix-GGUF:/model -p 8080:8080 ghcr.io/ericlbuehler/mistral.rs:cuda-90-sha-8a84d05 gguf -m /model -f Meta-Llama-3.1-8B-Instruct-IQ4_NL.gguf

leads to

...
2024-08-12T20:56:20.241100Z  INFO mistralrs_core::pipeline::paths: Loading `Meta-Llama-3.1-8B-Instruct-IQ4_NL.gguf` locally at `/model/Meta-Llama-3.1-8B-Instruct-IQ4_NL.gguf`
2024-08-12T20:56:20.244485Z  INFO mistralrs_core::pipeline::gguf: Loading model `/model` on cuda[0].
Error: path: "/model/Meta-Llama-3.1-8B-Instruct-IQ4_NL.gguf" unknown dtype for tensor 20

Maybe imatrix quants are not supported?
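If imatrix (IQ) quants are in fact unsupported, a possible workaround, only a sketch assuming you have llama.cpp built locally and a full-precision (e.g. f16) GGUF of the model on hand, is to produce a standard K-quant with llama.cpp's llama-quantize tool and point mistral.rs at that file instead (the filenames here are illustrative):

./llama-quantize Meta-Llama-3.1-8B-Instruct-f16.gguf Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf Q4_K_M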

Trying a normal GGUF quant also doesn't seem to work:

docker run --gpus all --rm -v C:\Users\xxx\.cache\lm-studio\models\bartowski\Meta-Llama-3.1-8B-Instruct-GGUF:/model -p 8080:8080 ghcr.io/ericlbuehler/mistral.rs:cuda-90-sha-8a84d05 gguf -m /model -f Meta-Llama-3.1-8B-Instruct-Q6_K_L.gguf

leading to:

...
2024-08-12T20:55:28.177396Z  INFO mistralrs_core::gguf::gguf_tokenizer: GGUF tokenizer model is `gpt2`, kind: `Bpe`, num tokens: 128256, num added tokens: 0, num merges: 280147, num scores: 0
2024-08-12T20:55:28.185104Z  INFO mistralrs_core::gguf::chat_template: Discovered and using GGUF chat template: `...

Error: DriverError(CUDA_ERROR_INVALID_PTX, "a PTX JIT compilation failed") when loading dequantize_block_q8_0_f32

This is a newer quant, produced after the RoPE frequency issue was fixed in llama.cpp.
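A CUDA_ERROR_INVALID_PTX during JIT compilation usually points at a mismatch between the compute capability the image was built for and the GPU actually present; the cuda-90 tag appears to target compute capability 9.0 (Hopper). As a quick check (assuming a driver recent enough to expose the compute_cap query field):

nvidia-smi --query-gpu=name,compute_cap --format=csv

If the reported capability is not 9.0, pulling an image tag built for the matching capability (e.g. cuda-89 for an RTX 40-series card) may avoid the PTX error.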

Port argument error

Also: I can use the Docker argument -p 8080:1234 to map ports. The mistral.rs argument --serve-ip 0.0.0.0 works, but --port 1234 doesn't:

docker run --gpus all --rm -v C:\Users\Jan\.cache\lm-studio\models\bartowski\Meta-Llama-3.1-8B-Instruct-GGUF:/model -p 8080:1234 ghcr.io/ericlbuehler/mistral.rs:cuda-90-sha-8a84d05 --serve-ip 0.0.0.0 --port 1234 gguf -m /model -f Meta-Llama-3.1-8B-Instruct-Q6_K_L.gguf

leads to

error: the argument '--port <PORT>' cannot be used multiple times
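Since the message suggests the container's entrypoint already passes --port once, a workaround sketch (assuming the baked-in port is 8000, as the Dockerfile excerpt further down indicates) is to drop --port from the command line and publish the host port against 8000 instead:

docker run --gpus all --rm -v C:\Users\xxx\.cache\lm-studio\models\bartowski\Meta-Llama-3.1-8B-Instruct-GGUF:/model -p 8080:8000 ghcr.io/ericlbuehler/mistral.rs:cuda-90-sha-8a84d05 --serve-ip 0.0.0.0 gguf -m /model -f Meta-Llama-3.1-8B-Instruct-Q6_K_L.gguf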

Latest commit or version

Using the Docker image ericlbuehler/mistral.rs:cuda-90-sha-8a84d05

choronz commented 3 days ago

Replicated the error "error: the argument '--port <PORT>' cannot be used multiple times"

when running

docker run --gpus all -it --rm --ipc=host -p 8000:8000 -e HUGGING_FACE_HUB_TOKEN=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 -v f:/AIOps/Models/8b:/model ghcr.io/choronz/mistral.rs:cuda-89-latest --serve-ip 127.0.0.1 --port 8000 gguf -m /model -f Llama-3.1-Storm-8B.Q6_K.gguf

Without --port 8000, the model loaded and ran.

It seems the port is still hardcoded in the Dockerfile.cuda-all file:

Line 27: PORT=8000 \
Line 54: ENTRYPOINT ["mistralrs-server", "--port", "8000", "--token-source", "env:HUGGING_FACE_HUB_TOKEN"]
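One possible direction for a fix, only a sketch (it assumes mistralrs-server either has a sane default for --port or that callers always pass it themselves, which I haven't verified against the CLI definitions): remove the hardcoded flag from the ENTRYPOINT so arguments appended by docker run can supply it without colliding:

ENTRYPOINT ["mistralrs-server", "--token-source", "env:HUGGING_FACE_HUB_TOKEN"]

With that change, docker run ... --serve-ip 0.0.0.0 --port 1234 gguf -m /model -f ... would no longer trip the "cannot be used multiple times" check.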

EricLBuehler commented 3 days ago

@choronz please feel free to open a PR if you have a fix!