Closed — NerounCstate closed this issue 8 months ago
I tried the following command:

```
.\build\bin\Release\main.exe -m .\ReluLLaMA-70B-PowerInfer-GGUF\llama-70b-relu.q4.powerinfer.gguf -n 128 -t 32 -p "Once upon a time"
```

Generation is very slow, and CPU and system-memory usage are very high. Checking the output:

```
llm_load_sparse_model_tensors: offloaded layers from VRAM budget(-2147483648 bytes): 81/80
llm_load_sparse_model_tensors: mem required  = 40226.35 MB
llm_load_sparse_model_tensors: VRAM used: 9842.91 MB
```

Only about half of my RTX 4090's 24 GB of VRAM is actually being used. (Note also the negative VRAM budget, `-2147483648 bytes`, which looks like a signed 32-bit overflow.)

```
llama_new_context_with_model: compute buffer total size = 14.50 MB
llama_new_context_with_model: VRAM scratch buffer: 12.94 MB
llama_new_context_with_model: total VRAM used: 10015.84 MB (model: 9842.91 MB, context: 172.94 MB)
```

This also reports only about 10 GB of VRAM in use.
Duplicate of #159.