ggerganov / llama.cpp

LLM inference in C/C++
MIT License
68.56k stars 9.85k forks source link

[soft max] capping the num tasks to 4 is limiting the prompt eval perf #5103

Closed snadampal closed 10 months ago

snadampal commented 10 months ago

Please include information about your system, the steps to reproduce the bug, and the version of llama.cpp that you are using. If possible, please provide a minimal code example that reproduces the bug.

System: AWS Graviton3, c7g.16xl instance with Ubuntu 22.04 llama.cpp version: latest, commit: 6f9939d119b2d004c264952eb510bd106455531e

the following commit is capping the num of tasks to 4. I would like to understand why 4?

commit adf3de4f69ff7e44131222f05f9c7447ac0be3cb (HEAD, tag: b1605)
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Sun Dec 3 15:56:22 2023 +0200

    ggml : fix soft max out-of-bounds access (#4307)

    ggml-ci

Without 4 and just using the src num rows or n_threads for n_tasks, the prompt eval performance is improved by 4% for DOT kernels and 9% for MMLA kernels (PR) n_tasks = MIN(n_threads, ggml_nrows(node->src[0]));

Reproducer: ./main -m /llama.cpp/models/open_llama_13b/ggml-model-q8_0.gguf -c 1015 -n 256 -t 64 --file <input_file.txt>

ggerganov commented 10 months ago

Not really sure where the limit of 4 came from - probably I did some measurements and decided that there is no reason to use larger number of tasks. But it's likely something to be revisited and updated

snadampal commented 10 months ago

thanks. since it was done to take care of out of bounds issue, could you please share some details about the original issue? so that we can try to arrive at some formula to take care of systems with different number of threads.

ggerganov commented 10 months ago

Back then, we had 2 different functions that specified the number of tasks for each op and they had to be kept in sync. If they were not in sync (i.e. return 2 different values), it could cause out-of-bounds access due to small wdata array

Now, we have only one function: ggml_get_n_tasks(), so the original issue is no longer relevant and in theory it should work with any n_task smaller than the number of rows

snadampal commented 10 months ago

great! I have raised this PR to update it.

snadampal commented 10 months ago

the PR is merged, closing the issue.