Closed 8XXD8 closed 1 month ago
I'm able to reproduce it, but I'm not sure what the cause is. Can you help trace which commit introduced the regression?
Looks like 0226613853133c081b55bb892a41bb5eacc0bc94 introduces the regression. I believe @max-krasnyansky is working on resolving it.
This should fix the issue:
```diff
diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
index bccb6237..2e8c806c 100644
--- a/ggml/src/ggml.c
+++ b/ggml/src/ggml.c
@@ -20239,6 +20239,7 @@ enum ggml_status ggml_graph_compute(struct ggml_cgraph * cgraph, struct ggml_cpl
             ggml_graph_compute_thread(&threadpool->workers[omp_get_thread_num()]);
         }
     } else {
+        threadpool->n_threads_cur = 1;
         ggml_graph_compute_thread(&threadpool->workers[0]);
     }
#else
```
What happened?
Offloading 31 layers out of 33 with an 8B model produces correct results; with 32 layers, the response is incoherent. With 33 or more offloaded layers, the instruction is ignored with `seed 1`; with any other seed, no response is printed at all. This affects conversational and normal modes alike. `llama-server` functions without problems.

Name and Version
version: 3782 (8a308354) built with clang version 20.0.0git (https://github.com/ROCm/llvm-project.git 487d0fd20dcbb6fbf926333d7b0b355788efb009) for x86_64-unknown-linux-gnu
What operating system are you seeing the problem on?
No response
Relevant log output