mio-19 opened this issue 10 months ago
Same problem here.
Same here. After the initial run, the model loads quickly, but inference relies on the CPU and is slow.
In this scenario, the GPU is indeed utilized for token generation, but the performance bottleneck primarily lies with the CPU. This imbalance causes the GPU to frequently wait for the CPU's computation results, leading to low GPU utilization.
To get the best performance out of PowerInfer, we generally recommend using models that are 2-3x larger than the available VRAM. In such configurations, most of the densely activated tensors can be offloaded to the GPU, while the CPU processes only the sparsely activated tensors, giving a more balanced workload distribution between the two sides.
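For illustration, a run along those lines would look roughly like the sketch below. This is not a verified command for your setup: the model path and numbers are placeholders, and `--vram-budget` is the VRAM-capping option described in the PowerInfer README, so verify the exact flags against your build.

```bash
# Sketch: run a ReLU-sparsified model that is larger than available VRAM.
# PowerInfer keeps the hot (densely activated) weights on the GPU and
# leaves the cold (sparsely activated) weights to the CPU.
# Model path and values are placeholders; check flag names for your build.
./build/bin/main \
  -m ./ReluLLaMA-13B-PowerInfer-GGUF/llama-13b-relu.powerinfer.gguf \
  -n 128 -t 8 \
  --vram-budget 14 \
  -p "Once upon a time"
```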
I'm using a T4 GPU and seeing the same behavior as above. Only 0.1 GB of GPU RAM is in use:
GPU RAM: 0.1 / 15.0 GB
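To check whether the GPU is actually busy during generation (rather than just holding a small buffer), it can help to poll utilization while inference is running. The `nvidia-smi` query options below are standard:

```bash
# Print memory usage and GPU utilization once per second during inference.
nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv -l 1
```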
Prerequisites
Before submitting your issue, please ensure the following:
Expected Behavior
Current Behavior
Output of nvidia-smi after model is loaded
Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except under certain specific conditions.
```
$ lscpu
$ uname -a
```
Failure Information (for bugs)
Please help provide information about the failure / bug.
Steps to Reproduce
Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.
Failure Logs
Command used:
Bottom part of the log: