SJTU-IPADS / PowerInfer

High-speed Large Language Model Serving on PCs with Consumer-grade GPUs
MIT License

The critical code for deciding which layer to put on the CPU or GPU #142

Closed YuMJie closed 9 months ago

YuMJie commented 9 months ago

I noticed there is a file named "solver.py" in PowerInfer, but I didn't see the solver being used in the code. Can you point out the critical code that decides whether to place the layer on the CPU or GPU? Additionally, can you indicate which part of the code is responsible for feeding the input into the model and retrieving the output?

Tan-YiFan commented 9 months ago

> the critical code that decides whether to place the layer on the CPU or GPU

I guess it's in the function flush: https://github.com/SJTU-IPADS/PowerInfer/blob/52ff38f963c1dbb924660909e27793d6796d13c3/llama.cpp#L2929-L2960

> which part of the code is responsible for feeding the input into the model and retrieving the output

It is likely to be in examples/main/main.cpp and llama.cpp (function llama_decode).
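For concreteness, below is a minimal, hypothetical sketch (not PowerInfer's actual examples/main/main.cpp, which adds sampling, session, and prompt handling on top) of how a caller might drive llama_decode() to feed tokens in and read logits back, using the llama.cpp-style C API that PowerInfer inherits. The model path and token ids are placeholders.

```cpp
// Hypothetical usage sketch, not PowerInfer's actual main.cpp: feed a batch of
// tokens into the model with llama_decode() and read the output logits back.
// Error handling and tokenization are stripped down for brevity.
#include "llama.h"

#include <cstdio>
#include <vector>

int main() {
    llama_backend_init(/*numa=*/false);

    llama_model_params mparams = llama_model_default_params();
    llama_model *model = llama_load_model_from_file("model.gguf", mparams); // placeholder path

    llama_context_params cparams = llama_context_default_params();
    llama_context *ctx = llama_new_context_with_model(model, cparams);

    // Feeding the input: a (pre-tokenized, placeholder) prompt packed into a batch.
    std::vector<llama_token> prompt = {1 /* BOS */, 15043, 29892};
    llama_batch batch = llama_batch_get_one(prompt.data(), (int32_t) prompt.size(),
                                            /*pos_0=*/0, /*seq_id=*/0);
    if (llama_decode(ctx, batch) != 0) {
        fprintf(stderr, "llama_decode failed\n");
        return 1;
    }

    // Retrieving the output: logits of the last token in the batch, from which
    // the next token can be sampled (greedy argmax here as the simplest case).
    const float *logits = llama_get_logits_ith(ctx, batch.n_tokens - 1);
    const int n_vocab = llama_n_vocab(model);
    int next = 0;
    for (int i = 1; i < n_vocab; ++i) {
        if (logits[i] > logits[next]) next = i;
    }
    printf("next token id: %d\n", next);

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```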

hodlen commented 9 months ago

PowerInfer uses a heuristic policy to determine GPU offloading based on available VRAM. It first offloads the model tensor by tensor (implemented in buffered_allocator.flush()). Then it partially offloads hot neurons in the FFN (in llm_load_gpu_split), where the solver is invoked to generate a GPU index indicating which rows/columns should be offloaded (in llm_load_gpu_split_with_budget).
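To illustrate the shape of that policy, here is a simplified, hypothetical sketch of the two stages; the types and the "hotness" field are illustrative, not PowerInfer's actual data structures. Stage 1 corresponds to buffered_allocator.flush() (whole tensors, greedily, while they fit in the VRAM budget); stage 2 corresponds to llm_load_gpu_split_with_budget() (spend the leftover budget on the hottest FFN rows/columns, which the solver encodes as a GPU index).

```cpp
// Simplified, hypothetical sketch of the two-stage offloading heuristic
// described above; not PowerInfer's real implementation.
#include <algorithm>
#include <cstddef>
#include <vector>

struct Tensor {
    size_t bytes;   // size of the whole tensor
    bool   on_gpu;  // set when the tensor is offloaded
};

struct FFNRow {
    float  hotness; // activation frequency of this neuron (assumed, from profiling)
    size_t bytes;   // size of one FFN row/column
    bool   on_gpu;
};

// Stage 1 (cf. buffered_allocator.flush()): offload whole tensors, in order,
// while they still fit in the VRAM budget. Returns the leftover budget.
size_t offload_tensors(std::vector<Tensor> &tensors, size_t vram_budget) {
    size_t used = 0;
    for (auto &t : tensors) {
        if (used + t.bytes <= vram_budget) {
            t.on_gpu = true;
            used += t.bytes;
        }
    }
    return vram_budget - used;
}

// Stage 2 (cf. llm_load_gpu_split_with_budget()): spend the remaining budget
// on the hottest FFN neurons. In PowerInfer this selection comes from the
// solver as a "GPU index" of rows/columns; a greedy hottest-first pick stands
// in for it here.
void offload_hot_neurons(std::vector<FFNRow> &rows, size_t budget) {
    std::sort(rows.begin(), rows.end(),
              [](const FFNRow &a, const FFNRow &b) { return a.hotness > b.hotness; });
    for (auto &r : rows) {
        if (r.bytes <= budget) {
            r.on_gpu = true;
            budget -= r.bytes;
        }
    }
}
```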

YuMJie commented 9 months ago

> > the critical code that decides whether to place the layer on the CPU or GPU
>
> I guess it's in the function flush:
> https://github.com/SJTU-IPADS/PowerInfer/blob/52ff38f963c1dbb924660909e27793d6796d13c3/llama.cpp#L2929-L2960
>
> > which part of the code is responsible for feeding the input into the model and retrieving the output
>
> It is likely to be in examples/main/main.cpp and llama.cpp (function llama_decode).

Thanks for your answer and help!

> PowerInfer uses a heuristic policy to determine GPU offloading based on available VRAM. It first offloads the model tensor by tensor (implemented in buffered_allocator.flush()). Then it partially offloads hot neurons in the FFN (in llm_load_gpu_split), where the solver is invoked to generate a GPU index indicating which rows/columns should be offloaded (in llm_load_gpu_split_with_budget).

Thanks for your answer and great work!