songkq opened this issue 10 months ago (status: Open)
Working on it!
666
@simonJJJ Hi. Does the CUDA backend support Nvidia GPUs with compute capability 6.0, e.g., the P100?
I haven't tested on a P100. If ggml supports the P100, it will work.
@simonJJJ Thanks. llama.cpp with llama2-7B works on the P100. Does the server in llama.cpp work with qwen.cpp? https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
@simonJJJ Could you please give some advice on this issue?
cmake -B build_cublas -DGGML_CUBLAS=ON && cmake --build build_cublas -j
[100%] Building CXX object CMakeFiles/qwen.dir/qwen.cpp.o
In file included from /workspace/llm_serve/qwen.cpp/qwen.h:3,
from /workspace/llm_serve/qwen.cpp/qwen.cpp:1:
/workspace/llm_serve/qwen.cpp/tiktoken.h: In lambda function:
/workspace/llm_serve/qwen.cpp/tiktoken.h:33:42: warning: comparison of integer expressions of different signedness: 'int' and 'std::vector<std::pair<int, int> >::size_type' {aka 'long unsigned int'} [-Wsign-compare]
33 | if (start_idx + skip + 2 < parts.size()) {
| ~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~
/workspace/llm_serve/qwen.cpp/qwen.cpp: In member function 'ggml_tensor* qwen::QwenAttention::forward(qwen::ModelContext*, ggml_tensor*, ggml_tensor*, int) const':
/workspace/llm_serve/qwen.cpp/qwen.cpp:368:76: error: invalid conversion from 'ggml_tensor*' to 'int' [-fpermissive]
368 | query_layer = tensor_assign_buffers(ggml_rope_inplace(gctx, query_layer, KQ_pos, rope_dim, 2, n_ctx));
| ^~~~~~
| |
| ggml_tensor*
In file included from /workspace/llm_serve/qwen.cpp/qwen.h:5,
from /workspace/llm_serve/qwen.cpp/qwen.cpp:1:
/workspace/llm_serve/qwen.cpp/third_party/ggml/include/ggml/ggml.h:1238:35: note: initializing argument 3 of 'ggml_tensor* ggml_rope_inplace(ggml_context*, ggml_tensor*, int, int, int, int)'
1238 | int n_past,
| ~~~~~~~~~~~~~~~~~~~~~~^~~~~~
/workspace/llm_serve/qwen.cpp/qwen.cpp:379:72: error: invalid conversion from 'ggml_tensor*' to 'int' [-fpermissive]
379 | key_layer = tensor_assign_buffers(ggml_rope_inplace(gctx, key_layer, KQ_pos, rope_dim, 2, n_ctx));
| ^~~~~~
| |
| ggml_tensor*
In file included from /workspace/llm_serve/qwen.cpp/qwen.h:5,
from /workspace/llm_serve/qwen.cpp/qwen.cpp:1:
/workspace/llm_serve/qwen.cpp/third_party/ggml/include/ggml/ggml.h:1238:35: note: initializing argument 3 of 'ggml_tensor* ggml_rope_inplace(ggml_context*, ggml_tensor*, int, int, int, int)'
1238 | int n_past,
| ~~~~~~~~~~~~~~~~~~~~~~^~~~~~
gmake[2]: *** [CMakeFiles/qwen.dir/build.make:76: CMakeFiles/qwen.dir/qwen.cpp.o] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:762: CMakeFiles/qwen.dir/all] Error 2
gmake: *** [Makefile:136: all] Error 2
Solved by updating submodules.
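For reference, the two "invalid conversion from 'ggml_tensor*' to 'int'" errors suggest the checked-out third_party/ggml is older than what qwen.cpp expects: its ggml_rope_inplace still takes an int n_past as the third argument, while qwen.cpp now passes the KQ_pos positions tensor there. Syncing the submodule to the pinned revision and rebuilding should clear it; a rough sequence (reusing the build directory from the command above) is:
git submodule update --init --recursive
cmake -B build_cublas -DGGML_CUBLAS=ON && cmake --build build_cublas -j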
Yes!
@simonJJJ Thanks. llama.cpp with llama2-7B works on the P100. Does the server in llama.cpp work with qwen.cpp? https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
I don't think it will work. I will modify llama.cpp's server code for qwen.cpp.
@simonJJJ I'm wondering whether we can develop a Triton backend (https://github.com/triton-inference-server/backend) for qwen.cpp, so that qwen.cpp can work with the Triton Inference Server.
@simonJJJ Hi, could you please give some advice on this issue? With an input query length > 4000 and gen_config set to max_length = 8192 and max_context_length = 5000, I get:
GGML_ASSERT: /workspace/qwen.cpp/third_party/ggml/src/ggml.c:5044: tensor != NULL
ggml_new_tensor_impl: not enough space in the scratch memory pool (needed 1366188544, available 1342177280)
Solved by increasing the values of MEM_SIZE and SCRATCH_SIZE in qwen.h.
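For anyone hitting the same assert: the available pool in the message (1342177280 bytes) is exactly 1280 MB, while the graph needed 1366188544 bytes (about 1303 MB), so the scratch pool falls short by roughly 23 MB at this prompt length; longer prompts need more headroom. A minimal sketch of the kind of change meant above, assuming qwen.h defines these as plain compile-time constants (only the names MEM_SIZE and SCRATCH_SIZE come from this thread; the declarations and values below are illustrative, not the actual file):
// qwen.h (sketch only): enlarge the static ggml buffers so long prompts fit
static const size_t MB = 1024 * 1024;
static const size_t MEM_SIZE     = 1024 * MB;  // compute/graph buffer (assumed original value)
static const size_t SCRATCH_SIZE = 2048 * MB;  // scratch pool; the assert shows 1280 MB was not enough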
Updating GGML may be a better solution for long-context inference.
Do you have plans to support Qwen inference on CUDA devices? It seems too slow on a Mac M1.