Closed by YuMJie 9 months ago
> the critical code that decides whether to place the layer on the CPU or GPU

I guess it's in the `flush` function: https://github.com/SJTU-IPADS/PowerInfer/blob/52ff38f963c1dbb924660909e27793d6796d13c3/llama.cpp#L2929-L2960

> which part of the code is responsible for feeding the input into the model and retrieving the output

It is likely to be in `examples/main/main.cpp` and `llama.cpp` (function `llama_decode`).
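For a rough picture of what that driver loop looks like, here is a minimal sketch in the style of `examples/main/main.cpp`: tokens are fed in through `llama_decode` and logits are read back out for sampling. The helper calls (`llama_batch_get_one`, `llama_get_logits`, `llama_n_vocab`) mirror the llama.cpp C API, but exact signatures differ between versions, so treat this as an approximation rather than the project's actual code.

```cpp
// Simplified generation loop in the spirit of examples/main/main.cpp.
// API names follow llama.cpp's C interface, but signatures vary across
// versions; this is an illustrative sketch, not PowerInfer's exact code.
#include "llama.h"

#include <cstdio>
#include <vector>

static void generate(llama_context * ctx, const llama_model * model,
                     std::vector<llama_token> prompt, int n_predict) {
    int n_past = 0;

    // Feed the whole prompt into the model in one batch ("prefill").
    if (llama_decode(ctx, llama_batch_get_one(prompt.data(), (int) prompt.size(), n_past, 0)) != 0) {
        fprintf(stderr, "llama_decode failed on the prompt\n");
        return;
    }
    n_past += (int) prompt.size();

    for (int i = 0; i < n_predict; ++i) {
        // Retrieve the output: logits for the last evaluated token.
        const float * logits  = llama_get_logits(ctx);
        const int     n_vocab = llama_n_vocab(model);

        // Greedy sampling for brevity; main.cpp uses a configurable sampler chain.
        llama_token next = 0;
        for (llama_token t = 1; t < n_vocab; ++t) {
            if (logits[t] > logits[next]) {
                next = t;
            }
        }

        // Feed the sampled token back in as a single-token batch.
        if (llama_decode(ctx, llama_batch_get_one(&next, 1, n_past, 0)) != 0) {
            fprintf(stderr, "llama_decode failed at step %d\n", i);
            return;
        }
        n_past += 1;
    }
}
```

The real loop additionally manages the context window/KV cache, sampling parameters, and printing of decoded tokens, but the decode-then-sample cycle above is the core of how input goes in and output comes out.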
PowerInfer uses a heuristic policy to determine GPU offloading based on available VRAM. It first offloads tensor by tensor (implemented in `buffered_allocator.flush()`). Then it partially offloads hot neurons in the FFN (in `llm_load_gpu_split`), where the solver is invoked to generate a GPU index indicating which rows/columns should be offloaded (in `llm_load_gpu_split_with_budget`).
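To make the two-stage policy concrete, below is a conceptual sketch under stated assumptions: whole tensors are placed on the GPU until a VRAM budget is exhausted, and the leftover budget is spent on the hottest FFN rows. All names here (`Tensor`, `FFNLayer`, `offload_tensors`, `split_ffn`, `vram_budget`, ...) are hypothetical; in PowerInfer the actual logic lives in `buffered_allocator.flush()`, `llm_load_gpu_split`, and `llm_load_gpu_split_with_budget`, and the row selection comes from the solver rather than the simple greedy rule used here.

```cpp
// Conceptual sketch of the two-stage offloading policy (hypothetical names).
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

struct Tensor {
    size_t bytes  = 0;     // size of the tensor in bytes
    bool   on_gpu = false; // placement decided by the policy
};

struct FFNLayer {
    std::vector<float> activation_freq; // per-neuron "hotness" from profiling
    std::vector<int>   gpu_rows;        // rows/columns chosen for the GPU index
};

// Stage 1: offload whole tensors one by one until the VRAM budget runs out,
// roughly what buffered_allocator.flush() does when it places buffered tensors.
static size_t offload_tensors(std::vector<Tensor> & tensors, size_t vram_budget) {
    size_t used = 0;
    for (auto & t : tensors) {
        if (used + t.bytes <= vram_budget) {
            t.on_gpu = true;
            used += t.bytes;
        }
    }
    return used;
}

// Stage 2: spend the remaining VRAM on the hottest FFN neurons. PowerInfer
// derives this split from a solver-generated GPU index; the greedy rule below
// is only a stand-in to show the shape of the decision.
static void split_ffn(std::vector<FFNLayer> & layers, size_t vram_left, size_t bytes_per_row) {
    std::vector<std::pair<float, std::pair<int, int>>> rows; // (hotness, (layer, row))
    for (int l = 0; l < (int) layers.size(); ++l) {
        for (int r = 0; r < (int) layers[l].activation_freq.size(); ++r) {
            rows.push_back({layers[l].activation_freq[r], {l, r}});
        }
    }
    std::sort(rows.rbegin(), rows.rend()); // hottest rows first

    for (const auto & entry : rows) {
        if (vram_left < bytes_per_row) {
            break;
        }
        layers[entry.second.first].gpu_rows.push_back(entry.second.second);
        vram_left -= bytes_per_row;
    }
}
```

In practice the split is formulated as an optimization over activation statistics and the available VRAM budget (hence `solver.py`), rather than the greedy pick shown here.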
Thanks for your answer and help!
Thanks for your answer and great work!
I noticed there is a file named "solver.py" in PowerInfer, but I didn't see the solver being used in the code. Can you point out the critical code that decides whether to place the layer on the CPU or GPU? Additionally, can you indicate which part of the code is responsible for feeding the input into the model and retrieving the output?