Closed snadampal closed 10 months ago
Not really sure where the limit of 4 came from - probably I did some measurements and decided that there is no reason to use larger number of tasks. But it's likely something to be revisited and updated
thanks. since it was done to take care of out of bounds issue, could you please share some details about the original issue? so that we can try to arrive at some formula to take care of systems with different number of threads.
Back then, we had 2 different functions that specified the number of tasks for each op and they had to be kept in sync. If they were not in sync (i.e. return 2 different values), it could cause out-of-bounds access due to small wdata
array
Now, we have only one function: ggml_get_n_tasks()
, so the original issue is no longer relevant and in theory it should work with any n_task
smaller than the number of rows
Please include information about your system, the steps to reproduce the bug, and the version of llama.cpp that you are using. If possible, please provide a minimal code example that reproduces the bug.
System: AWS Graviton3, c7g.16xl instance with Ubuntu 22.04 llama.cpp version: latest, commit: 6f9939d119b2d004c264952eb510bd106455531e
the following commit is capping the num of tasks to 4. I would like to understand why 4?
Without 4 and just using the src num rows or n_threads for n_tasks, the prompt eval performance is improved by 4% for DOT kernels and 9% for MMLA kernels (PR)
n_tasks = MIN(n_threads, ggml_nrows(node->src[0]));
Reproducer:
./main -m /llama.cpp/models/open_llama_13b/ggml-model-q8_0.gguf -c 1015 -n 256 -t 64 --file <input_file.txt>