b4rtaz / distributed-llama

Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.
MIT License
1.02k stars 68 forks

feat: vulkan. #91

Closed b4rtaz closed 6 days ago

b4rtaz commented 2 weeks ago

Experimental Vulkan support for matrix multiplication.


To try Distributed Llama with Vulkan support, you need to clone this branch. You also need a Vulkan development environment installed, both to compile the shaders and to build Distributed Llama against the Vulkan library (-lvulkan).

  1. Build Distributed Llama:
    make dllama DLLAMA_VULKAN=1
  2. Run Distributed Llama with the --accelerator 1/1 argument:
    ./dllama inference --accelerator 1/1 \
    --buffer-float-type q80 --prompt "Hello" --steps 128 --nthreads 1 \
    --model models/llama3_8b_q40/dllama_model_llama3_8b_q40.m \
    --tokenizer models/llama3_8b_q40/dllama_tokenizer_llama3_8b_q40.t

The value of this argument defines what fraction of the computation is moved to the GPU: 1/1 means 100%, 1/2 means 50%, and so on.
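To make the fraction concrete, here is a minimal sketch of how a value such as 1/3 could be parsed and turned into a number of matrix rows handled by the GPU. The names `AcceleratorShare`, `parseAcceleratorShare` and `gpuRows` are hypothetical and are not taken from the Distributed Llama sources; splitting by output rows is also an assumption about how the work might be divided.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical type, not part of Distributed Llama: the parsed N/D value.
struct AcceleratorShare {
    unsigned int numerator;
    unsigned int denominator;
};

// Parses a value such as "1/3" into its numerator and denominator.
// Throws std::invalid_argument if the value is malformed.
AcceleratorShare parseAcceleratorShare(const std::string& value) {
    const std::size_t slash = value.find('/');
    if (slash == std::string::npos)
        throw std::invalid_argument("expected a value in the form N/D");
    const unsigned long n = std::stoul(value.substr(0, slash));
    const unsigned long d = std::stoul(value.substr(slash + 1));
    if (d == 0 || n > d)
        throw std::invalid_argument("expected N <= D and D > 0");
    AcceleratorShare share;
    share.numerator = (unsigned int)n;
    share.denominator = (unsigned int)d;
    return share;
}

// Assumption: the work is split by output rows. Returns how many of
// totalRows go to the GPU (rounded down); the remaining rows stay on the CPU.
unsigned int gpuRows(const AcceleratorShare& share, unsigned int totalRows) {
    const unsigned long long scaled = (unsigned long long)totalRows * share.numerator;
    return (unsigned int)(scaled / share.denominator);
}
```

Under these assumptions, `gpuRows(parseAcceleratorShare("1/2"), 4096)` yields 2048 rows for the GPU, and 1/1 assigns all 4096 rows to it.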

The current implementation tries to run inference on the CPU and the GPU simultaneously. You can still control how many CPU threads are used with the --nthreads argument (for example, --nthreads 4). The goal is to find the values of --nthreads and --accelerator (for example, --accelerator 1/3) that give the best speed.

unclemusclez commented 2 weeks ago
unclemusclez@ttv:~/distributed-llama/src/vulkan$ ls
matmul-f32-f32.comp  matmul-q40-f32.spv   matmul-q40-q80.spv
matmul-f32-f32.spv   matmul-q40-q80.comp
matmul-q40-f32.comp  matmul-q40-q80.sp
unclemusclez@ttv:~/distributed-llama/src/vulkan$ cd ../..
unclemusclez@ttv:~/distributed-llama$ make dllama DLLAMA_VULKAN=1
Makefile:55: warning: overriding recipe for target 'funcs-test'
Makefile:29: warning: ignoring old recipe for target 'funcs-test'
g++ -std=c++11 -Werror -O3 -march=native -mtune=native -DDLLAMA_VULKAN -c src/utils.cpp -o utils.o
g++ -std=c++11 -Werror -O3 -march=native -mtune=native -DDLLAMA_VULKAN -c src/quants.cpp -o quants.o
g++ -std=c++11 -Werror -O3 -march=native -mtune=native -DDLLAMA_VULKAN -c src/funcs.cpp -o funcs.o
g++ -std=c++11 -Werror -O3 -march=native -mtune=native -DDLLAMA_VULKAN -c src/commands.cpp -o commands.o
g++ -std=c++11 -Werror -O3 -march=native -mtune=native -DDLLAMA_VULKAN -c src/socket.cpp -o socket.o
g++ -std=c++11 -Werror -O3 -march=native -mtune=native -DDLLAMA_VULKAN -c src/transformer.cpp -o transformer.o
g++ -std=c++11 -Werror -O3 -march=native -mtune=native -DDLLAMA_VULKAN -c src/tasks.cpp -o tasks.o
g++ -std=c++11 -Werror -O3 -march=native -mtune=native -DDLLAMA_VULKAN -c src/llama2-tasks.cpp -o llama2-tasks.o
g++ -std=c++11 -Werror -O3 -march=native -mtune=native -DDLLAMA_VULKAN -c src/grok1-tasks.cpp -o grok1-tasks.o
g++ -std=c++11 -Werror -O3 -march=native -mtune=native -DDLLAMA_VULKAN -c src/mixtral-tasks.cpp -o mixtral-tasks.o
g++ -std=c++11 -Werror -O3 -march=native -mtune=native -DDLLAMA_VULKAN -c src/tokenizer.cpp -o tokenizer.o
g++ -std=c++11 -Werror -O3 -march=native -mtune=native -DDLLAMA_VULKAN -c src/app.cpp -o app.o
g++ -std=c++11 -Werror -O3 -march=native -mtune=native -DDLLAMA_VULKAN -c src/accelerator-vulkan.cpp -o accelerator-vulkan.o
src/accelerator-vulkan.cpp: In constructor 'VulkanContext::VulkanContext()':
src/accelerator-vulkan.cpp:85:37: error: format '%llu' expects argument of type 'long long unsigned int', but argument 3 has type 'vk::DeviceSize' {aka 'long unsigned int'} [-Werror=format=]
   85 |             printf("🌋 heap[%u]: %llu MB\n", h, memoryProperties.memoryHeaps[h].size / (1024 * 1024));
      |                                  ~~~^
      |                                     |
      |                                     long long unsigned int
      |                                  %lu
cc1plus: all warnings being treated as errors
make: *** [Makefile:17: accelerator-vulkan.o] Error 1
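The build stops because `%llu` expects an `unsigned long long`, while `vk::DeviceSize` is a 64-bit unsigned integer that is `unsigned long` on this platform, and `-Werror` promotes the format warning to an error. One portable way to print such a value is the `PRIu64` macro from `<cinttypes>`; this is a minimal sketch, not necessarily the fix applied in the branch. The `DeviceSize` typedef below is a local stand-in for `vk::DeviceSize` so the example compiles without the Vulkan headers, and `formatHeapLine` is a hypothetical helper.

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Local stand-in for vk::DeviceSize (a 64-bit unsigned integer), declared
// here so the sketch does not depend on the Vulkan headers.
typedef std::uint64_t DeviceSize;

// Hypothetical helper: formats a heap size in MB, similar to the failing printf.
std::string formatHeapLine(unsigned int index, DeviceSize sizeBytes) {
    char buffer[64];
    const DeviceSize sizeMb = sizeBytes / (1024 * 1024);
    // PRIu64 expands to the conversion specifier that matches uint64_t on the
    // current platform ("lu" or "llu"), so -Werror=format is satisfied.
    std::snprintf(buffer, sizeof(buffer), "heap[%u]: %" PRIu64 " MB", index, sizeMb);
    return std::string(buffer);
}
```

An alternative with the same effect is to keep `%llu` and cast the argument to `unsigned long long`.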
b4rtaz commented 6 days ago

Unfortunately, the approach taken in this PR is not a good direction. I have to revise it.