Closed by YuMJie 9 months ago
> the critical code that decides whether to place the layer on the CPU or GPU

I guess it's in the `flush` function: https://github.com/SJTU-IPADS/PowerInfer/blob/52ff38f963c1dbb924660909e27793d6796d13c3/llama.cpp#L2929-L2960

> which part of the code is responsible for feeding the input into the model and retrieving the output

It is likely to be in `examples/main/main.cpp` and `llama.cpp` (function `llama_decode`).
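For a rough picture of what that driver loop looks like, here is a minimal sketch in the style of `examples/main/main.cpp`: tokens are fed in through `llama_decode` and logits are read back out for sampling. The helper calls (`llama_batch_get_one`, `llama_get_logits`, `llama_n_vocab`) mirror the llama.cpp C API, but exact signatures differ between versions, so treat this as an approximation rather than the project's actual code.

```cpp
// Simplified generation loop in the spirit of examples/main/main.cpp.
// API names follow llama.cpp's C interface, but signatures vary across
// versions; this is an illustrative sketch, not PowerInfer's exact code.
#include "llama.h"

#include <cstdio>
#include <vector>

static void generate(llama_context * ctx, const llama_model * model,
                     std::vector<llama_token> prompt, int n_predict) {
    int n_past = 0;

    // Feed the whole prompt into the model in one batch ("prefill").
    if (llama_decode(ctx, llama_batch_get_one(prompt.data(), (int) prompt.size(), n_past, 0)) != 0) {
        fprintf(stderr, "llama_decode failed on the prompt\n");
        return;
    }
    n_past += (int) prompt.size();

    for (int i = 0; i < n_predict; ++i) {
        // Retrieve the output: logits for the last evaluated token.
        const float * logits  = llama_get_logits(ctx);
        const int     n_vocab = llama_n_vocab(model);

        // Greedy sampling for brevity; main.cpp uses a configurable sampler chain.
        llama_token next = 0;
        for (llama_token t = 1; t < n_vocab; ++t) {
            if (logits[t] > logits[next]) {
                next = t;
            }
        }

        // Feed the sampled token back in as a single-token batch.
        if (llama_decode(ctx, llama_batch_get_one(&next, 1, n_past, 0)) != 0) {
            fprintf(stderr, "llama_decode failed at step %d\n", i);
            return;
        }
        n_past += 1;
    }
}
```

The real loop additionally manages the context window/KV cache, sampling parameters, and printing of decoded tokens, but the decode-then-sample cycle above is the core of how input goes in and output comes out.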
PowerInfer uses a heuristic policy to determine GPU offloading based on available VRAM. It first offloads tensor by tensor (implemented in `buffered_allocator.flush()`). Then it partially offloads hot neurons in the FFN (in `llm_load_gpu_split`), where the solver is invoked to generate a GPU index indicating which rows/columns should be offloaded (in `llm_load_gpu_split_with_budget`).
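To make the two-stage policy concrete, below is a conceptual sketch under stated assumptions: whole tensors are placed on the GPU until a VRAM budget is exhausted, and the leftover budget is spent on the hottest FFN rows. All names here (`Tensor`, `FFNLayer`, `offload_tensors`, `split_ffn`, `vram_budget`, ...) are hypothetical; in PowerInfer the actual logic lives in `buffered_allocator.flush()`, `llm_load_gpu_split`, and `llm_load_gpu_split_with_budget`, and the row selection comes from the solver rather than the simple greedy rule used here.

```cpp
// Conceptual sketch of the two-stage offloading policy (hypothetical names).
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

struct Tensor {
    size_t bytes  = 0;     // size of the tensor in bytes
    bool   on_gpu = false; // placement decided by the policy
};

struct FFNLayer {
    std::vector<float> activation_freq; // per-neuron "hotness" from profiling
    std::vector<int>   gpu_rows;        // rows/columns chosen for the GPU index
};

// Stage 1: offload whole tensors one by one until the VRAM budget runs out,
// roughly what buffered_allocator.flush() does when it places buffered tensors.
static size_t offload_tensors(std::vector<Tensor> & tensors, size_t vram_budget) {
    size_t used = 0;
    for (auto & t : tensors) {
        if (used + t.bytes <= vram_budget) {
            t.on_gpu = true;
            used += t.bytes;
        }
    }
    return used;
}

// Stage 2: spend the remaining VRAM on the hottest FFN neurons. PowerInfer
// derives this split from a solver-generated GPU index; the greedy rule below
// is only a stand-in to show the shape of the decision.
static void split_ffn(std::vector<FFNLayer> & layers, size_t vram_left, size_t bytes_per_row) {
    std::vector<std::pair<float, std::pair<int, int>>> rows; // (hotness, (layer, row))
    for (int l = 0; l < (int) layers.size(); ++l) {
        for (int r = 0; r < (int) layers[l].activation_freq.size(); ++r) {
            rows.push_back({layers[l].activation_freq[r], {l, r}});
        }
    }
    std::sort(rows.rbegin(), rows.rend()); // hottest rows first

    for (const auto & entry : rows) {
        if (vram_left < bytes_per_row) {
            break;
        }
        layers[entry.second.first].gpu_rows.push_back(entry.second.second);
        vram_left -= bytes_per_row;
    }
}
```

In practice the split is formulated as an optimization over activation statistics and the available VRAM budget (hence `solver.py`), rather than the greedy pick shown here.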
Thanks for your answer and help!
Thanks for your answer and great work!
I noticed there is a file named "solver.py" in PowerInfer, but I didn't see the solver being used in the code. Can you point out the critical code that decides whether to place the layer on the CPU or GPU? Additionally, can you indicate which part of the code is responsible for feeding the input into the model and retrieving the output?