Closed Crataco closed 11 months ago
Will be added in the next version.
Hi,
I've cloned the development branch of KoboldCpp with:

```
git clone -b concedo_experimental https://github.com/LostRuins/koboldcpp.git koboldcpp-dev
```

but I'm still facing the same problem with `LLAMA_NO_K_QUANTS=1 make`. When I compile KoboldCpp on my PC, it still places `k_quants.o` in the project directory, and `make clean` lists it among the files it removes.
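As a sanity check, one way to tell whether the k-quants code actually made it into the final binary (as opposed to the `.o` file merely being produced) is to look for its symbols with `nm`. Here's a self-contained sketch of that idea; the symbol name `k_quant_fn` and the file names are stand-ins, not the real ones from the repo:

```shell
# Illustrative only: check a binary for symbols from a particular object
# file. All names below are stand-ins, not the real koboldcpp symbols.
printf 'void k_quant_fn(void) {}\n' > kq.c
printf 'int main(void) { return 0; }\n' > main.c
cc -c kq.c -o k_quants_demo.o

# Link one binary with the object, one without
cc main.c k_quants_demo.o -o demo_with
cc main.c -o demo_without

# Count occurrences of the symbol in each binary
nm demo_with | grep -c k_quant_fn       # non-zero: code was linked in
nm demo_without | grep -c k_quant_fn    # zero: code was left out
```

Objects passed directly on the link line are included wholesale, so the symbol shows up in `demo_with` even though nothing references it.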
My phone also has the same error as before. Here's the compilation log, if that helps:
```
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: unknown
I UNAME_M: armv7l
I CFLAGS: -I. -I./include -I./include/CL -I./otherarch -I./otherarch/tools -Ofast -DNDEBUG -std=c11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -pthread -mfpu=neon-fp-armv8 -mfp16-format=ieee -mno-unaligned-access -funsafe-math-optimizations
I CXXFLAGS: -I. -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -Ofast -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-multichar -Wno-write-strings -pthread
I LDFLAGS:
I CC: cc (Debian 12.2.0-14) 12.2.0
I CXX: g++ (Debian 12.2.0-14) 12.2.0
cc -I. -I./include -I./include/CL -I./otherarch -I./otherarch/tools -Ofast -DNDEBUG -std=c11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -pthread -mfpu=neon-fp-armv8 -mfp16-format=ieee -mno-unaligned-access -funsafe-math-optimizations -c ggml.c -o ggml.o
cc -I. -I./include -I./include/CL -I./otherarch -I./otherarch/tools -Ofast -DNDEBUG -std=c11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -pthread -mfpu=neon-fp-armv8 -mfp16-format=ieee -mno-unaligned-access -funsafe-math-optimizations -c otherarch/ggml_v2.c -o ggml_v2.o
cc -I. -I./include -I./include/CL -I./otherarch -I./otherarch/tools -Ofast -DNDEBUG -std=c11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -pthread -mfpu=neon-fp-armv8 -mfp16-format=ieee -mno-unaligned-access -funsafe-math-optimizations -c otherarch/ggml_v1.c -o ggml_v1.o
g++ -I. -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -Ofast -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-multichar -Wno-write-strings -pthread -c expose.cpp -o expose.o
g++ -I. -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -Ofast -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-multichar -Wno-write-strings -pthread -c common/common.cpp -o common.o
g++ -I. -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -Ofast -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-multichar -Wno-write-strings -pthread -c gpttype_adapter.cpp -o gpttype_adapter.o
In file included from ./otherarch/llama_v2-util.h:7,
from ./otherarch/llama_v2.cpp:8,
from gpttype_adapter.cpp:18:
./otherarch/llama-util.h:56:52: warning: ‘format_old’ attribute directive ignored [-Wattributes]
56 | static std::string format_old(const char * fmt, ...) {
| ^
./otherarch/llama_v2-util.h:60:8: warning: attribute ignored in declaration of ‘struct llama_v2_file’ [-Wattributes]
60 | struct llama_v2_file {
| ^~~~~~~~~~~~~
./otherarch/llama_v2-util.h:60:8: note: attribute for ‘struct llama_v2_file’ must follow the ‘struct’ keyword
In file included from gpttype_adapter.cpp:29:
./otherarch/rwkv_v3.cpp:230:21: warning: ‘rwkv_type_to_string’ initialized and declared ‘extern’
230 | extern const char * rwkv_type_to_string[TYPE_COUNT + 1] = {"FP32", "FP16", "Q4_0", "Q4_1", "Q4_1_O", "Q4_2", "Q4_3", "Q5_0", "Q5_1", "Q8_0", "unknown"};
| ^~~~~~~~~~~~~~~~~~~
./otherarch/rwkv_v3.cpp: In function ‘ggml_tensor* rwkv_exp(ggml_context*, ggml_tensor*)’:
./otherarch/rwkv_v3.cpp:470:30: warning: ‘ggml_tensor* ggml_map_unary_f32(ggml_context*, ggml_tensor*, ggml_unary_op_f32_t)’ is deprecated: use ggml_map_custom1 instead [-Wdeprecated-declarations]
470 | return ggml_map_unary_f32(ctx, x, rwkv_exp_impl);
| ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~
In file included from ./llama.h:4,
from ./common/common.h:5,
from ./otherarch/utils.h:10,
from ./otherarch/otherarch.h:14,
from gpttype_adapter.cpp:13:
./ggml.h:1548:51: note: declared here
1548 | GGML_DEPRECATED(GGML_API struct ggml_tensor * ggml_map_unary_f32(
| ^~~~~~~~~~~~~~~~~~
./ggml.h:191:41: note: in definition of macro ‘GGML_DEPRECATED’
191 | # define GGML_DEPRECATED(func, hint) func __attribute__((deprecated(hint)))
| ^~~~
./otherarch/rwkv_v3.cpp: In function ‘ggml_tensor* rwkv_1_minus_x(ggml_context*, ggml_tensor*)’:
./otherarch/rwkv_v3.cpp:474:30: warning: ‘ggml_tensor* ggml_map_unary_f32(ggml_context*, ggml_tensor*, ggml_unary_op_f32_t)’ is deprecated: use ggml_map_custom1 instead [-Wdeprecated-declarations]
474 | return ggml_map_unary_f32(ctx, x, rwkv_1_minus_x_impl);
| ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
./ggml.h:1548:51: note: declared here
1548 | GGML_DEPRECATED(GGML_API struct ggml_tensor * ggml_map_unary_f32(
| ^~~~~~~~~~~~~~~~~~
./ggml.h:191:41: note: in definition of macro ‘GGML_DEPRECATED’
191 | # define GGML_DEPRECATED(func, hint) func __attribute__((deprecated(hint)))
| ^~~~
./otherarch/rwkv_v3.cpp: In function ‘ggml_tensor* rwkv_sigmoid(ggml_context*, ggml_tensor*)’:
./otherarch/rwkv_v3.cpp:478:30: warning: ‘ggml_tensor* ggml_map_unary_f32(ggml_context*, ggml_tensor*, ggml_unary_op_f32_t)’ is deprecated: use ggml_map_custom1 instead [-Wdeprecated-declarations]
478 | return ggml_map_unary_f32(ctx, x, rwkv_sigmoid_impl);
| ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./ggml.h:1548:51: note: declared here
1548 | GGML_DEPRECATED(GGML_API struct ggml_tensor * ggml_map_unary_f32(
| ^~~~~~~~~~~~~~~~~~
./ggml.h:191:41: note: in definition of macro ‘GGML_DEPRECATED’
191 | # define GGML_DEPRECATED(func, hint) func __attribute__((deprecated(hint)))
| ^~~~
./otherarch/rwkv_v3.cpp: In function ‘ggml_tensor* rwkv_max(ggml_context*, ggml_tensor*, ggml_tensor*)’:
./otherarch/rwkv_v3.cpp:482:31: warning: ‘ggml_tensor* ggml_map_binary_f32(ggml_context*, ggml_tensor*, ggml_tensor*, ggml_binary_op_f32_t)’ is deprecated: use ggml_map_custom2 instead [-Wdeprecated-declarations]
482 | return ggml_map_binary_f32(ctx, x, y, rwkv_max_impl);
| ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~
./ggml.h:1560:51: note: declared here
1560 | GGML_DEPRECATED(GGML_API struct ggml_tensor * ggml_map_binary_f32(
| ^~~~~~~~~~~~~~~~~~~
./ggml.h:191:41: note: in definition of macro ‘GGML_DEPRECATED’
191 | # define GGML_DEPRECATED(func, hint) func __attribute__((deprecated(hint)))
| ^~~~
gpttype_adapter.cpp: In function ‘void sample_temperature(llama_token_data_array*, float)’:
gpttype_adapter.cpp:383:33: warning: ‘void llama_sample_temperature(llama_context*, llama_token_data_array*, float)’ is deprecated: use llama_sample_temp instead [-Wdeprecated-declarations]
383 | llama_sample_temperature(nullptr, candidates_p, temp);
| ~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from gpttype_adapter.cpp:20:
llama.cpp:5269:6: note: declared here
5269 | void llama_sample_temperature(struct llama_context * ctx, llama_token_data_array * candidates_p, float temp) {
| ^~~~~~~~~~~~~~~~~~~~~~~~
gpttype_adapter.cpp:388:33: warning: ‘void llama_sample_temperature(llama_context*, llama_token_data_array*, float)’ is deprecated: use llama_sample_temp instead [-Wdeprecated-declarations]
388 | llama_sample_temperature(nullptr, candidates_p, temp);
| ~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
llama.cpp:5269:6: note: declared here
5269 | void llama_sample_temperature(struct llama_context * ctx, llama_token_data_array * candidates_p, float temp) {
| ^~~~~~~~~~~~~~~~~~~~~~~~
gpttype_adapter.cpp: In function ‘ModelLoadResult gpttype_load_model(load_model_inputs, FileFormat, FileFormatExtraMeta)’:
gpttype_adapter.cpp:823:49: warning: ‘int llama_apply_lora_from_file(llama_context*, const char*, float, const char*, int)’ is deprecated: use llama_model_apply_lora_from_file instead [-Wdeprecated-declarations]
823 | int err = llama_apply_lora_from_file(llama_ctx_v4,
| ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~
824 | lora_filename.c_str(),
| ~~~~~~~~~~~~~~~~~~~~~~
825 | 1.0f,
| ~~~~~
826 | lora_base_arg,
| ~~~~~~~~~~~~~~
827 | n_threads);
| ~~~~~~~~~~
llama.cpp:6942:5: note: declared here
6942 | int llama_apply_lora_from_file(struct llama_context * ctx, const char * path_lora, float scale, const char * path_base_model, int n_threads) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
gpttype_adapter.cpp:839:29: warning: ‘int llama_eval(llama_context*, llama_token*, int32_t, int)’ is deprecated: use llama_decode() instead [-Wdeprecated-declarations]
839 | auto er = llama_eval(llama_ctx_v4, tmp.data(), tmp.size(), 0);
| ~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
llama.cpp:7376:5: note: declared here
7376 | int llama_eval(
| ^~~~~~~~~~
gpttype_adapter.cpp: In function ‘generation_outputs gpttype_generate(generation_inputs, generation_outputs&)’:
gpttype_adapter.cpp:1512:38: warning: ‘int llama_eval(llama_context*, llama_token*, int32_t, int)’ is deprecated: use llama_decode() instead [-Wdeprecated-declarations]
1512 | evalres = (llama_eval(llama_ctx_v4, embd.data(), embdsize, n_past)==0);
| ~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
llama.cpp:7376:5: note: declared here
7376 | int llama_eval(
| ^~~~~~~~~~
In file included from /usr/include/c++/12/regex:55,
from model_adapter.h:6,
from gpttype_adapter.cpp:12:
/usr/include/c++/12/bits/stl_vector.h: In function ‘std::vector<_Tp, _Alloc>::vector(std::initializer_list<_Tp>, const allocator_type&) [with _Tp = long long int; _Alloc = std::allocator
```

and here's llama.cpp for comparison:
```
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: unknown
I UNAME_M: armv7l
I CFLAGS: -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -mfpu=neon-fp-armv8 -mfp16-format=ieee -mno-unaligned-access -funsafe-math-optimizations
I CXXFLAGS: -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mfpu=neon-fp-armv8 -mfp16-format=ieee -mno-unaligned-access -funsafe-math-optimizations -Wno-array-bounds -Wno-format-truncation -Wextra-semi
I NVCCFLAGS: -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mfpu=neon-fp-armv8 -mfp16-format=ieee -mno-unaligned-access -funsafe-math-optimizations -Wno-pedantic -Xcompiler "-Wno-array-bounds -Wno-format-truncation -Wextra-semi "
I LDFLAGS:
I CC: cc (Debian 12.2.0-14) 12.2.0
I CXX: g++ (Debian 12.2.0-14) 12.2.0
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -mfpu=neon-fp-armv8 -mfp16-format=ieee -mno-unaligned-access -funsafe-math-optimizations -c ggml.c -o ggml.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mfpu=neon-fp-armv8 -mfp16-format=ieee -mno-unaligned-access -funsafe-math-optimizations -Wno-array-bounds -Wno-format-truncation -Wextra-semi -c llama.cpp -o llama.o
In file included from /usr/include/c++/12/vector:64,
from llama.h:725,
from llama.cpp:2:
/usr/include/c++/12/bits/stl_vector.h: In function ‘std::vector<_Tp, _Alloc>::vector(std::initializer_list<_Tp>, const allocator_type&) [with _Tp = long long int; _Alloc = std::allocator
```

Sorry about that. Can you pull my experimental branch and try again?
The recent Makefile commit seems to have fixed it!
![image](https://github.com/LostRuins/koboldcpp/assets/55674863/ac1b22ba-5b3c-4749-bf82-2aabdfe52f4d)
I tried out several models under 3GB of RAM, making sure to disable as many background apps as I could:

- **[AI Dungeon 2 Classic](https://huggingface.co/Crataco/AI-Dungeon-2-Classic-GGML) q4_0:** started, but gave an OOM error on the first message and crashed on the second.
- **[Pygmalion 1.3B](https://huggingface.co/Crataco/Pygmalion-1.3B-GGML) q4_0:** worked well, though outputs were strange. I blame the quantization, as q5_1 felt better when I ran it on my PC.
- **[RWKV-4 World](https://huggingface.co/Crataco/RWKV-4-World-Series-GGML) q4_0:** generated at 2.4–2.6 t/s and processed the prompt at about 3.2 t/s, but reprocessed the whole chat history every message.
- **[RWKV-4 World](https://huggingface.co/Crataco/RWKV-4-World-Series-GGML) q5_1:** generated at 1.3 t/s and processed the prompt at 1.4 t/s. As expected of higher quants, the results felt slightly better.
- **[TinyLLaMA 1.1B Chat v0.2](https://huggingface.co/kirp/TinyLlama-1.1B-Chat-v0.2-gguf) q5_0:** crashes with a "floating point exception" error, but this might be related to upstream llama.cpp requiring a patch to run TinyLLaMA.

**Notes:**

- Compiling with OpenBLAS gave me a *"cannot enable executable stack as shared object requires: permission denied"* error when starting KoboldCpp. CLBlast depends on the GPU, and GPU acceleration isn't straightforward in Termux, so I'll stick with the slow processing times. I don't have the energy to troubleshoot this further lol
- While KoboldCpp (RWKV-4 World q5_1) was running, I created a separate proot-distro container for SillyTavern using Alpine. But Termux crashes when I go to my home screen, so I probably don't have enough RAM to juggle both.

***
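For what it's worth, that executable-stack error usually means a shared object was built with its `GNU_STACK` segment marked executable, which some environments refuse to load. Here's a sketch of how to reproduce and inspect that flag on a toy library; the real OpenBLAS fix may differ (e.g. rebuilding with `-z noexecstack`, or clearing the flag with `execstack -c` if that tool is available):

```shell
# Reproduce the executable-stack flag on a toy shared object and inspect
# it with readelf (illustrative; not the actual OpenBLAS library).
printf 'int f(void) { return 0; }\n' > demo.c
cc -shared -fPIC -Wl,-z,execstack   demo.c -o demo_execstack.so
cc -shared -fPIC -Wl,-z,noexecstack demo.c -o demo_noexecstack.so

# The flags column shows RWE when the stack is executable, RW otherwise
readelf -lW demo_execstack.so   | grep GNU_STACK
readelf -lW demo_noexecstack.so | grep GNU_STACK
```

A single offending object (often hand-written assembly missing a `.note.GNU-stack` section) is enough to make the linker mark the whole library's stack executable.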
Thank you so much for working on this issue. The k-quant issue is now resolved, and I'm happy to run LLMs on my 32-bit ARM Androids (it's honestly a dream come true).
I'll close this issue, and will reopen it if compiling without k-quants breaks again in the future.
Glad you got it working.
Hello!
I've been trying to get llama.cpp running on my phone (32-bit ARM, 3GB RAM) via proot-distro Debian on Termux (since I've had my fair share of problems with native Termux).
When the compilation gets to the k_quant files, llama.cpp fails to build (see here). But after running `LLAMA_NO_K_QUANTS=1 make`, it succeeds and just barely works, at a speed of ~6-7 seconds per token for OpenLLaMA v2 3B q4_0.

Now, KoboldCpp has always been my preferred frontend for old/weak devices, since I can run tiny non-Llama models, but it lacks the Makefile variable (if that's the right term?) that skips the k_quant step.
I would really appreciate it if this were implemented in KoboldCpp for those of us who can't compile k_quants.
Thanks in advance!
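For reference, the mechanism llama.cpp's `LLAMA_NO_K_QUANTS` flag uses can be sketched like this; it's a simplified illustration of a Makefile variable gating an object file, not the actual llama.cpp Makefile:

```shell
# Minimal demo of a Makefile variable gating an object file, in the
# spirit of llama.cpp's LLAMA_NO_K_QUANTS (simplified illustration only).
cat > Makefile.demo <<'EOF'
OBJS = ggml.o
ifndef LLAMA_NO_K_QUANTS
OBJS += k_quants.o
endif
all: ; @echo $(OBJS)
EOF
make -f Makefile.demo                       # ggml.o k_quants.o
LLAMA_NO_K_QUANTS=1 make -f Makefile.demo   # ggml.o
```

GNU make imports environment variables as make variables, so setting `LLAMA_NO_K_QUANTS=1` on the command line is enough to make the `ifndef` drop `k_quants.o` from the object list.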