EricLBuehler / mistral.rs


Slow CUDA inference speed #763

Open · ShelbyJenkins opened 1 month ago

ShelbyJenkins commented 1 month ago

This discussion reports mistral.rs as being faster than llama.cpp: https://github.com/EricLBuehler/mistral.rs/discussions/612

But I'm seeing much slower speeds for the same prompt/settings.

Mistral.rs:

```text
Usage {
    completion_tokens: 501,
    prompt_tokens: 28,
    total_tokens: 529,
    avg_tok_per_sec: 16.980707,
    avg_prompt_tok_per_sec: 76.08695,
    avg_compl_tok_per_sec: 16.27416,
    total_time_sec: 31.153,
    total_prompt_time_sec: 0.368,
    total_completion_time_sec: 30.785,
}
```

llama.cpp timings:

```json
{
  "predicted_ms": 4007.64,
  "prompt_per_token_ms": 0.7041786,
  "predicted_per_token_ms": 8.01528,
  "prompt_ms": 19.717,
  "prompt_per_second": 1420.0944,
  "predicted_n": 500.0,
  "prompt_n": 28.0,
  "predicted_per_second": 124.7617
}
```
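To put the gap in one number, here is the decode throughput derived purely from the figures above (a quick sanity check, not code from either library):

```rust
fn main() {
    // mistral.rs: 501 completion tokens over 30.785 s of completion time
    let mistralrs = 501.0 / 30.785;
    // llama.cpp: 500 predicted tokens over 4007.64 ms
    let llamacpp = 500.0 / (4007.64 / 1000.0);
    println!("mistral.rs decode: {mistralrs:.1} tok/s"); // ~16.3
    println!("llama.cpp decode:  {llamacpp:.1} tok/s"); // ~124.8
    println!("gap: {:.1}x", llamacpp / mistralrs); // ~7.7x
}
```

Prompt processing shows an even larger gap (≈76 vs. ≈1420 tok/s), so both prefill and decode are slow here, not just one phase.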

The code I'm using to init mistral.rs: https://github.com/ShelbyJenkins/llm_client/blob/b1edca89bbdc34b884907fd39be6eedabf10d81b/src/llm_backends/mistral_rs/builder.rs#L110

I'm using the basic completion tests here: https://github.com/ShelbyJenkins/llm_client/blob/b1edca89bbdc34b884907fd39be6eedabf10d81b/src/basic_completion.rs#L158

Testing on Ubuntu, inside an Ubuntu Docker container (FROM nvidia/cuda:12.3.2-cudnn9-devel-ubuntu22.04). I've tried loading all layers onto a single GPU using the dummy device map, and splitting them across both GPUs using the device mapper. The GPUs are 3090s, and testing is done with Phi-3 Mini.
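For context, mistral.rs runs on candle, so the two configurations above come down to which CUDA ordinals the layers are placed on. A minimal sketch in candle terms, assuming candle-core built with the CUDA feature; `Device::new_cuda` is candle's API, and the rest is illustrative rather than the llm_client builder code linked above:

```rust
use candle_core::Device;

fn main() -> candle_core::Result<()> {
    // Single-GPU case (dummy device map): every layer targets ordinal 0,
    // i.e. the first 3090.
    let gpu0 = Device::new_cuda(0)?;

    // Dual-GPU case: the device mapper instead spreads layer ranges
    // across ordinals 0 and 1.
    let gpu1 = Device::new_cuda(1)?;

    println!("devices: {gpu0:?}, {gpu1:?}");
    Ok(())
}
```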

ShelbyJenkins commented 1 month ago

I need to test with the CUDA version specified in the Docker container, and if that doesn't help I will rerun the benchmark following the instructions from the announcement linked above.

ShelbyJenkins commented 1 month ago

Updated to the same Docker image, though not the full Dockerfile. No change in speed.