songkq opened this issue 10 months ago (status: Open)
Working on it!
666
@simonJJJ Hi. Does the CUDA backend support Nvidia GPUs with compute capability 6.0, e.g., the P100?
I haven't tested on a P100. If ggml supports the P100, it will work.
@simonJJJ Thanks. llama.cpp with llama2-7B works on the P100. Does the server in llama.cpp work with qwen.cpp? https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
@simonJJJ Could you please give some advice on this issue?
cmake -B build_cublas -DGGML_CUBLAS=ON && cmake --build build_cublas -j
[100%] Building CXX object CMakeFiles/qwen.dir/qwen.cpp.o
In file included from /workspace/llm_serve/qwen.cpp/qwen.h:3,
from /workspace/llm_serve/qwen.cpp/qwen.cpp:1:
/workspace/llm_serve/qwen.cpp/tiktoken.h: In lambda function:
/workspace/llm_serve/qwen.cpp/tiktoken.h:33:42: warning: comparison of integer expressions of different signedness: 'int' and 'std::vector<std::pair<int, int> >::size_type' {aka 'long unsigned int'} [-Wsign-compare]
33 | if (start_idx + skip + 2 < parts.size()) {
| ~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~
/workspace/llm_serve/qwen.cpp/qwen.cpp: In member function 'ggml_tensor* qwen::QwenAttention::forward(qwen::ModelContext*, ggml_tensor*, ggml_tensor*, int) const':
/workspace/llm_serve/qwen.cpp/qwen.cpp:368:76: error: invalid conversion from 'ggml_tensor*' to 'int' [-fpermissive]
368 | query_layer = tensor_assign_buffers(ggml_rope_inplace(gctx, query_layer, KQ_pos, rope_dim, 2, n_ctx));
| ^~~~~~
| |
| ggml_tensor*
In file included from /workspace/llm_serve/qwen.cpp/qwen.h:5,
from /workspace/llm_serve/qwen.cpp/qwen.cpp:1:
/workspace/llm_serve/qwen.cpp/third_party/ggml/include/ggml/ggml.h:1238:35: note: initializing argument 3 of 'ggml_tensor* ggml_rope_inplace(ggml_context*, ggml_tensor*, int, int, int, int)'
1238 | int n_past,
| ~~~~~~~~~~~~~~~~~~~~~~^~~~~~
/workspace/llm_serve/qwen.cpp/qwen.cpp:379:72: error: invalid conversion from 'ggml_tensor*' to 'int' [-fpermissive]
379 | key_layer = tensor_assign_buffers(ggml_rope_inplace(gctx, key_layer, KQ_pos, rope_dim, 2, n_ctx));
| ^~~~~~
| |
| ggml_tensor*
In file included from /workspace/llm_serve/qwen.cpp/qwen.h:5,
from /workspace/llm_serve/qwen.cpp/qwen.cpp:1:
/workspace/llm_serve/qwen.cpp/third_party/ggml/include/ggml/ggml.h:1238:35: note: initializing argument 3 of 'ggml_tensor* ggml_rope_inplace(ggml_context*, ggml_tensor*, int, int, int, int)'
1238 | int n_past,
| ~~~~~~~~~~~~~~~~~~~~~~^~~~~~
gmake[2]: *** [CMakeFiles/qwen.dir/build.make:76: CMakeFiles/qwen.dir/qwen.cpp.o] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:762: CMakeFiles/qwen.dir/all] Error 2
gmake: *** [Makefile:136: all] Error 2
Solved by updating submodules.
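For reference, the two "invalid conversion from 'ggml_tensor*' to 'int'" errors suggest the checked-out third_party/ggml is older than what qwen.cpp expects: its ggml_rope_inplace still takes an int n_past as the third argument, while qwen.cpp now passes the KQ_pos positions tensor there. Syncing the submodule to the pinned revision and rebuilding should clear it; a rough sequence (reusing the build directory from the command above) is:
git submodule update --init --recursive
cmake -B build_cublas -DGGML_CUBLAS=ON && cmake --build build_cublas -j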
Yes!
@simonJJJ Thanks. llama.cpp with llama2-7B works on the P100. Does the server in llama.cpp work with qwen.cpp? https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
I don't think it will work. I will modify llama.cpp's server code for qwen.cpp.
@simonJJJ I'm wondering whether we can develop a Triton backend (https://github.com/triton-inference-server/backend) for qwen.cpp, so that qwen.cpp can work with the Triton Inference Server.
@simonJJJ Hi, could you please give some advice on this issue? With an input query length > 4000 and gen_config set to max_length = 8192 and max_context_length = 5000, I get:
GGML_ASSERT: /workspace/qwen.cpp/third_party/ggml/src/ggml.c:5044: tensor != NULL
ggml_new_tensor_impl: not enough space in the scratch memory pool (needed 1366188544, available 1342177280)
Solved by increasing the values of MEM_SIZE and SCRATCH_SIZE in qwen.h.
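For anyone hitting the same assert: the available pool in the message (1342177280 bytes) is exactly 1280 MB, while the graph needed 1366188544 bytes (about 1303 MB), so the scratch pool falls short by roughly 23 MB at this prompt length; longer prompts need more headroom. A minimal sketch of the kind of change meant above, assuming qwen.h defines these as plain compile-time constants (only the names MEM_SIZE and SCRATCH_SIZE come from this thread; the declarations and values below are illustrative, not the actual file):
// qwen.h (sketch only): enlarge the static ggml buffers so long prompts fit
static const size_t MB = 1024 * 1024;
static const size_t MEM_SIZE     = 1024 * MB;  // compute/graph buffer (assumed original value)
static const size_t SCRATCH_SIZE = 2048 * MB;  // scratch pool; the assert shows 1280 MB was not enough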
Updating GGML may be a better solution for long-context inference.
Do you have plans to support Qwen inference on CUDA devices? It seems too slow on a Mac M1.