SJTU-IPADS / PowerInfer

High-speed Large Language Model Serving on PCs with Consumer-grade GPUs
MIT License

Only 12GB of the 24GB VRAM is used and CUDA utilization is under 10%, but CPU usage is 100% and RAM usage is 35GB #160

Closed NerounCstate closed 8 months ago

NerounCstate commented 8 months ago

I tried the following command, but generation is very slow and CPU and RAM usage are very high:

```
.\build\bin\Release\main.exe -m .\ReluLLaMA-70B-PowerInfer-GGUF\llama-70b-relu.q4.powerinfer.gguf -n 128 -t 32 -p "Once upon a time"
```

Checking the output:

```
llm_load_sparse_model_tensors: offloaded layers from VRAM budget(-2147483648 bytes): 81/80
llm_load_sparse_model_tensors: mem required  = 40226.35 MB
llm_load_sparse_model_tensors: VRAM used: 9842.91 MB
```

My 4090's 24GB of VRAM is clearly only about half used.

```
llama_new_context_with_model: compute buffer total size = 14.50 MB
llama_new_context_with_model: VRAM scratch buffer: 12.94 MB
llama_new_context_with_model: total VRAM used: 10015.84 MB (model: 9842.91 MB, context: 172.94 MB)
```

This also shows only about 10GB of VRAM in use.
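One detail worth noting in the log: the reported VRAM budget of -2147483648 bytes is exactly the minimum value of a signed 32-bit integer, which hints that the budget may have been left at a default/sentinel value rather than derived from the GPU's actual memory (a guess from the log output alone, not from PowerInfer's code). A minimal sanity check of the arithmetic:

```python
# -2147483648 is exactly INT32_MIN, a common sentinel/uninitialized value.
INT32_MIN = -2**31
assert -2147483648 == INT32_MIN

# A 4090's 24 GB of VRAM, expressed in bytes, exceeds INT32_MAX,
# so a 32-bit signed field could not hold it anyway.
vram_bytes = 24 * 1024**3
print(vram_bytes)              # 25769803776
print(vram_bytes > 2**31 - 1)  # True
```

If the log on your machine shows the same negative budget, it is likely the same underlying problem tracked in the linked duplicate issue.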

hodlen commented 8 months ago

Duplicate of #159.