shed some light on these results
Hi. It appears you've done significant testing, but there isn't a result in your message. Are you using main? What's your command line? Did you set --temp 0 and an identical seed? https://github.com/ggerganov/llama.cpp/blob/4e9a7f7f7fb6acbddd1462909c8d696e38edbfcc/examples/main/README.md?plain=1#L263
Please note, all that's needed is 1 clear example. Assuming you're running main (without bindings, or something in between you and llama.cpp), here's an example showing how to reproduce an issue:
./main ~/model.gguf --temp 0 --seed 3 -p "2+2="
and compare the output to:
./main ~/model.gguf --temp 0 --seed 3 --ngl 99 -p "2+2="
I would be interested in testing 7B models myself. With what models do you see the biggest difference in quality? What kind of tests do you run them through?
Here is the prompt and method to reproduce the results.
For clarity: GPU only and CPU only. (I can also create a PDF with the results, as each test output is 1,000-2,000 tokens.)
Across all testing of 500+ models (which also includes comparisons between Q and IQ quants of the same model, and against the same model's GPTQ, AWQ, and EXL2 versions), the testing method, parameters, and prompt are exactly the same. This has been maintained over 6+ months of testing.
TEST 1 - IQ tests (low IQ quants used to contrast the differences more sharply):
MythoLogic-L2-13b.i1-IQ1_M.gguf 3.35
GPU: Great, but 3rd person (?). 2nd test -> first person, quality same as 1st GPU run.
CPU/GPU: Excellent, maybe just short of "CPU only".
CPU: Excellent, maybe off the charts.
TimeLess-20B.i1-IQ1_M.gguf 4.98
GPU: great but short. (2) 32 t/s
CPU-GPU: excellent. (1 or +1) 6.5 t/s [context still on GPU?]
CPU: excellent ++ (at or above 1), but short. 4 t/s - 2nd regen better. [context still on GPU?]
CPU offload all: equal to "CPU", 1st short ... 2nd in 3rd person [length better] - 4 t/s [nothing, including context, on GPU]
TimeLess-20B.i1-IQ1_S.gguf 4.61
GPU: great [2] 32 t/s
CPU-GPU: excellent. [1] 9 t/s
CPU: excellent ++, at/above 1. 5.4 t/s
CPU offload all: excellent +++ 5 t/s - short sentences, breaks in conversation, description, everything. [nothing, including context, on GPU]
TEST Group 2 - Regular Qs:
DavidAU/DarkSapling-7B-v1.0-Q6_K-GGUF/darksapling-7b-v1.0.Q6_K.gguf
DavidAU/DarkSapling-7B-v1.1-Q6_K-GGUF/darksapling-7b-v1.1.Q6_K.gguf
DavidAU/DarkSapling-7B-v2.0-Q6_K-GGUF/darksapling-7b-v2.0.Q6_K.gguf
Lewdiculous/KukulStanta-7B-GGUF-IQ-Imatrix/KukulStanta-7B-Q8_0-imat.gguf
TheBloke/Seraph-7B-GGUF/seraph-7b.Q8_0.gguf
bartowski/Tess-7B-v2.0-GGUF/Tess-7B-v2.0-Q8_0.gguf
This test group, when run with GPU only and then CPU only, highlights stark differences in output quality. Especially of note is the first model in this series, which is "twitchy" on GPU generation yet perfectly fine on CPU-only generation. On GPU it goes over context, goes into "repeat" mode at the end of context, and in extreme cases will crash llama.cpp/LMS.
On CPU: no issues. It stops when it should, and the output is coherent and detailed. Likewise for the other 5 models run on CPU only. In fact, just visually speaking (not reading at all), the CPU output of all 6 models is almost the same, whereas the GPU output is all over the place (paragraph issues, prose, spacing, and the like).
Note: I am running Windows 11 with an Nvidia 4060 Ti 16 GB (Nov 2023).
Subjective differences: sentence structure (and variety), word choice, description detail, general overall "there", prose errors, use of specific low-quality words, output length (vs instructions), general creativity, and how well the instructions for suspense, tension, and other elements are followed.
Here is the master test prompt:
Using the following "story idea" below, write the first scene in the novel introducing the young woman. This scene should start in the middle of the action, include dialog, vivid passages, and end on a cliffhanger relevant to the story idea but it should also be unexpected. The scene should be 1000 words long and escalate in conflict and suspense and be written in first person, present tense with the point of view character being the young woman.
Story idea:
In a world ruled by dictatorship, a rebel young woman leads a rebellion against the system. Despite the risks, she fights to overthrow the dictator and restore democracy to her country. The government executes her for treason, but she sticks to her beliefs and is responsible for starting the revolution.
Here is the system role: Below is an instruction that describes a task. Write a response that appropriately completes the request.
Parameters: temp 0.8; top-k 40; repeat penalty 1.1; min-p 0.05; top-p 0.95 (these are the LMS defaults, used for all testing over the last 6 months).
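For reference, roughly the same sampler settings on the llama.cpp main example would look something like this (a sketch only - the model path and prompt file are placeholders):
./main -m ./model.Q6_K.gguf --temp 0.8 --top-k 40 --top-p 0.95 --min-p 0.05 --repeat-penalty 1.1 -f testprompt.txt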
Let me know if a generated PDF would help. Thanks, DAVE
Thanks for clarifying you used LMStudio instead of llama.cpp. No, an LMStudio results PDF doesn't help.
Can you confirm whether there has been a change, in recent llama.cpp updates, in processing at the token level in terms of math calculations, specifically rounding (either in quantize and/or at the "processing" level)?
I understand that "rounding errors" can compound through the LLM. I also understand that if you "cut down" calculations, this can increase t/s at the cost of quality.
What I am detecting is a "drop in nuance", which seems to point to possible changes in the "math" of LLM operations.
Even if there is only a small change in terms of "rounding", this could drop final output quality by 1%? 5%? 10%?
IE: quality (more "math") vs a speed sacrifice (less "math")?
LMStudio uses LLAMA.CPP, so this is an issue.
Can you confirm whether there has been a change, in recent llama.cpp updates, in processing at the token level
@david565656 Has math changed? Probably. It's possible there's a valid issue buried in your LMStudio testing.
LMStudio uses LLAMA.CPP, so this is an issue.
I understand LMStudio is a downstream project. However, not all LMStudio issues are llama.cpp issues. You'd need to provide something testable, specifically related to llama.cpp usage, so that people here can judge the issue.
Thank you for clarifying. Prior to this ticket, I worked extensively with the LMStudio devs to see whether the root of the problem is downstream.
RE: Testing llama.cpp -> test in "chat" (examples) with the above test prompt: 5 gens with GPU only, 5 with CPU only.
I suggest testing at the IQ2 level for higher contrast. The test prompt I use is very difficult for most LLMs to handle, and it is also missing instructions on purpose, to reveal inner LLM workings/issues and training.
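For a CPU-only vs GPU-only comparison on the main example, something along these lines should work (a sketch - the model file name is a placeholder; -ngl 0 keeps all layers on the CPU, -ngl 99 offloads everything to the GPU):
./main -m ./model-IQ2_M.gguf -ngl 0 --temp 0.8 --top-k 40 --top-p 0.95 --min-p 0.05 --repeat-penalty 1.1 -f testprompt.txt
./main -m ./model-IQ2_M.gguf -ngl 99 --temp 0.8 --top-k 40 --top-p 0.95 --min-p 0.05 --repeat-penalty 1.1 -f testprompt.txt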
My concerns are multi-fold:
1 - If this is NOT a llama.cpp and/or LMStudio issue, then this would make a unique enhancement for llama.cpp - one which would result in lower t/s but a marked increase in output quality. Even a 10% offload (to CPU) could be a huge quality improvement, especially if it is targeted at specific layer(s) and/or groups of layers. For MoEs this could be even more profound.
So far every model I have tested with partial CPU offload has shown quality improvements. This varies by model and quant level.
EXAMPLE: If this was applied to the first 3 layers, the last 3 layers, or both -> a game changer is possible. A quality enhancement of even 1% can be worth it generally, and in specific use cases it is a night-and-day difference.
2 - If this is a math issue - llama.cpp, CUDA, LMStudio, Nvidia driver, etc. -> then it should be investigated. As an alternative, an enhancement could be made introducing a new parameter, e.g. "Q", which would affect math precision during inference: 0 for maximum speed (t/s) and general quality, 0.5 for "balanced", and 1 for maximum quality. As a programmer I understand this would not be a small undertaking. This could enhance all models and their output based on use case and/or specific needs per prompt / API call.
3 - I don't want to see "closed source" to win. AI belongs to the people, not corporations. Anything that can be done to improve quality is paramount in my mind.
Prior to this ticket, I worked extensively with the LMStudio devs to see whether the root of the problem is downstream. ... My concerns are multi-fold
@david565656 OK, based on how you dodged my request, I think you will not provide a llama.cpp example.
If you have an enhancement to contribute, then open a PR, and the maintainers will review your contribution.
Sorry, I was not clear, and I am not "dodging" the request:
1 - My suggestion was to test via: https://github.com/ggerganov/llama.cpp/tree/master/examples/server/ . I provided the exact prompts, settings, and parameters in a previous message in this thread.
I cannot run llama.cpp on my local machine due to security issues related to the install procedure noted below (Windows install):
On Windows:
Download the latest fortran version of [w64devkit](https://github.com/skeeto/w64devkit/releases).
Extract w64devkit on your pc.
Run w64devkit.exe.
*** "Fortran" version download/install caused issues.
If you can point me to a colab, I can try it there.
2 - RE: ticket -> enhancements -> Okay, will do.
I provided the exact prompts, settings, and parameters in a previous message in this thread.
@david565656 You're like someone who goes to McDonald's and questions the ingredients of the food they ate at Burger King. You confirmed you don't use llama.cpp, so your prompt and settings don't necessarily apply.
I cannot run llama.cpp on my local machine due to security issues related to the install procedure noted:
https://github.com/ggerganov/llama.cpp/issues/103#issuecomment-1466990359 suggested
cmake -S . -B build/ -D CMAKE_BUILD_TYPE=Release
cmake --build build/ --config Release
Afterwards, the exe files should be in the build/Release folder.
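From there, a quick CPU-only vs fully-offloaded comparison could look like this (a sketch - the model path is a placeholder):
.\build\Release\main.exe -m model.gguf --temp 0 --seed 3 -ngl 0 -p "2+2="
.\build\Release\main.exe -m model.gguf --temp 0 --seed 3 -ngl 99 -p "2+2="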
@jeximo I understand exactly how upstream/downstream works, because I have built them before. LMStudio, WebUI, KoboldCpp, Ollama, and Jan all use llama.cpp. What happens in llama.cpp affects all of these interfaces and systems. THAT is what I was bringing to your attention. An issue had been detected, and I brought it to your attention. The notes re: the Nvidia CUDA Toolkit further confirmed my observations, as did other information online re: differences between GPU and CPU math accuracy. As for Burger King and McDonald's - both share almost the same ingredients and result in the same regrets.
An issue had been detected, and I brought it to your attention.
@david565656 As you don't use llama.cpp, it's impossible to discern that your issue is directly related to it.
I've provided instructions and a link to install llama.cpp on Windows. It's up to you to show a problem.
I installed llama.cpp locally 2 days ago, and published a list of issues/corrections for Win 11 users associated with the install (in a discussion at llama.cpp where other Win 11 users were running into the same issues). I also published workarounds for specific Nvidia GPU issues related to installing llama.cpp on Windows.
This is not a llama.cpp issue per se - installing on Windows is always an issue, and Nvidia's Toolkit and Visual Studio/C compound the install problems greatly, as do errors/bugs in the Nvidia Toolkit - especially upgrades from one version to the next and the related mismatch errors, which compound and create issues using the GPU with llama.cpp.
That being said, users are (wrongly) blaming llama.cpp for faults with Windows, Visual Studio, Mini/Anaconda, and related problems/issues in Python. A comprehensive how-to for installing llama.cpp on Windows (with full troubleshooting) would be very helpful to a lot of users.
SIDE NOTE: I have performed 80+ experiments with quants, imatrix, perplexity, changing the imatrix DAT files (including all quantize.exe options and combos), and related areas, for fine-tuning purposes as well as to learn the full width and breadth of the options available. I may publish/share the results as more data is accumulated.
@David-AU-github please, share your findings
Side Note "Results":
1 - Standard imatrix datasets can have anywhere from a limited to a huge impact on imatrix compressions. In some cases "standard" datasets can actually damage the model rather than "fix it", under certain conditions including specific types of prompts. It is unclear at this time why this occurs, but it does happen. It was a surprise when I was testing for other reasons.
2 - It is strongly suggested to use a stronger imatrix dataset such as wiki.raw for overall stability, especially at IQ3 and lower. In fact, lower quants should have a different imatrix dataset than higher ones, on a model-by-model basis (a sketch of the commands is further below).
3 - Output (the GGML file from convert) should be set to f32 IF the files to be "GGUF'ed" are in f32 (i.e. Orca 2, TinyLlama). Likewise, if a merge model is created in f32/float32, the GGML file should ALSO be in f32, as there are large increases in quality when using quantize.exe.
Example (wiki.raw):
Merge in bfloat16 / GGML f16 (20B), Q4_K_M (non-imatrix): 8.7446 +/- 0.15265
Same merge in float32 / GGML f32 (20B), Q4_K_M (non-imatrix): 8.7074 +/- 0.15199
For a 20B model this means 80 GB of files (f16) vs 160 GB (f32) (GGML + safetensors merge files).
To put that in perspective: this can mean a "night and day" difference in quality output at IQ1/IQ2 (differences in PPL between f16 and f32 will be far higher at this level), and even higher quality at Q6/Q8.
In fact, especially at IQ1, it can mean the difference between the IQ1 version functioning and not functioning.
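For anyone wanting to try this, a rough sketch of the full path (convert at f32, build an imatrix on wiki.raw, then quantize with and without it; file names are placeholders and assume the current convert.py / imatrix / quantize options):
python convert.py ./merged-model-dir --outtype f32 --outfile model-f32.gguf
./imatrix -m model-f32.gguf -f wiki.train.raw -o imatrix.dat
./quantize --imatrix imatrix.dat model-f32.gguf model-IQ3_M.gguf IQ3_M
./quantize model-f32.gguf model-Q4_K_M.gguf Q4_K_M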
The differences in output (text gen) are noticeable too, using TEMP 0 to compare FOR ALL QUANTS, with the differences between neighboring quants shrinking as you move up in "bits".
GPU vs CPU: there is a slight but noticeable difference between generations - again, tested at TEMP 0.
To test output gen differences: Use prompt(s) with NO right answer...
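A simple way to compare two quants deterministically (a sketch - file names are placeholders; TEMP 0 so the runs are repeatable):
./main -m model-f16-Q4_K_M.gguf --temp 0 -f noanswer-prompt.txt > out-f16-q4km.txt
./main -m model-f32-Q4_K_M.gguf --temp 0 -f noanswer-prompt.txt > out-f32-q4km.txt
diff out-f16-q4km.txt out-f32-q4km.txt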
@David-AU-github Thanks! That is pretty insightful. Consider making a post in /r/localllama
Windows 11, 24-core/32-thread processor (Nov 2023, 6 GHz), 64 GB RAM, Nvidia 16 GB card (GeForce RTX 4060 Ti), llama.cpp version Mar 31 2024.
I have noticed some anomalies after testing close to 500 GGUF models over the past 6 months. I have a standardized method of testing models, and record the results (and any issues) and grade them from 1 to 10 (1 being top). This includes models from 1B to 70B in size, with some testing of 103/120 B models. This covers multiple quants as well as Imatrix quants including standard models, MOEs (all sizes, configs) and merged models.
THE ISSUE: Specifically differences between CPU only, GPU/CPU split, and GPU only processing of instructions and output quality.
In some cases, comparing CPU vs GPU: CPU performance, in terms of quality, is much higher than GPU only. In some cases a CPU/GPU split (50/50) is superior to GPU-only quality.
Testing involves getting a GPU baseline, CPU baseline and then GPU/CPU baseline and comparing carefully.
RESULT DIFFERENCES:
In some cases GPU/CPU split quality is 1 to 2 points HIGHER than GPU-only output. In some cases CPU only is 1 to 2 points higher than GPU output. (Note: the prompt is the same, with no change in any parameters.)
To be clear - this is not to imply that GPU performance is not exceptional - it is. I cannot explain the differences in results, but I can detect them. (??)
IE: running a 13B model @ Q8: first GPU only, then CPU only, and then a 50/50 split. I tested this across a spectrum of models, including factors such as age (IE created 4 months ago vs 2 weeks ago - all the same "Q8").
As of this writing, further experimentation is underway to ascertain the "sweet spot" of GPU/CPU "splitting" to optimize t/s and quality, where quality is the priority (IE: 25% CPU with 75% GPU, % of layers offloaded to CPU, etc.).
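One way to sweep the split is simply to step through -ngl values (a sketch - the model path is a placeholder, and the -ngl values should be adjusted to the model's layer count):
for n in 0 10 20 30 41; do ./main -m model-13B-Q8_0.gguf -ngl $n --temp 0 -f testprompt.txt > out-ngl-$n.txt; done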
I was wondering if you could shed some light on these results as it might be the gateway to further LLAMA.CPP quants of a hybrid nature as well as enhanced quality overall - instruction following and output.
I also have to say that, after testing close to 100 imatrix quants (as well as comparing to "old" Qs and GPTQ/AWQ etc.), you guys have knocked it out of the park with this new post-compression quality process.