@lvhan028 @AllentDan @grimoire @irexyc @RunningLeon @lzhangzz @zhyncs @zhulinJulia24 @tpoisonooo @pppppM @ispobock @wangruohui @Harold-lkk @HIT-cwh Could you please take a look at this issue?
Without the kv-cache, the 40B model needs about 78 GB of memory just to load the weights.
To load the model and run inference, you should use at least two A100s, or use the AWQ-quantized model: https://huggingface.co/OpenGVLab/InternVL2-40B-AWQ
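For the multi-GPU route, here is a minimal sketch of what that looks like with the lmdeploy pipeline API (tp=2 shards the weights across two GPUs; the session_len value and the image path are illustrative placeholders):

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

# Shard the bf16 40B model across two 80 GB GPUs with tensor parallelism.
pipe = pipeline('OpenGVLab/InternVL2-40B',
                backend_config=TurbomindEngineConfig(tp=2, session_len=8192))

image = load_image('path/to/some_image.jpg')  # replace with a real image path or URL
response = pipe(('describe this image', image))
print(response.text)

With tp=2 the roughly 78 GB of bf16 weights are split in half per GPU, leaving headroom for the KV cache on each card.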
import torch
from transformers import AutoModel, AutoTokenizer

# split_model() is the device-map helper from the InternVL2 model card;
# it distributes the model's layers across the available GPUs.
path = '/media/star/disk2/pretrained_model/InternVL2-40B'
device_map = split_model('InternVL2-40B')
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=True,  # quantize the linear layers to 8-bit via bitsandbytes
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
So, after deploying with lmdeploy, shouldn’t the VRAM usage be smaller than when loading the same model with transformers?
When you set load_in_8bit=True, it uses bitsandbytes to quantize the model, so you can load it with less GPU memory. Without load_in_8bit=True, AutoModel.from_pretrained takes up about 77 GB of memory.
As for quantized models, GPTQ/AWQ generally works better than bitsandbytes.
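As a sketch of that route, the prequantized AWQ checkpoint linked above can be served with lmdeploy roughly like this (assuming the pipeline API; model_format='awq' tells the turbomind backend the weights are 4-bit AWQ, and session_len is illustrative):

from lmdeploy import pipeline, TurbomindEngineConfig

# The 4-bit AWQ weights fit comfortably on a single 80 GB A100.
pipe = pipeline('OpenGVLab/InternVL2-40B-AWQ',
                backend_config=TurbomindEngineConfig(model_format='awq',
                                                     session_len=8192))
response = pipe('Hello, who are you?')
print(response.text)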
This issue is marked as stale because it has been marked as invalid or awaiting response for 7 days without any further response. It will be closed in 5 days if the stale label is not removed or if there is no further response.
This issue is closed because it has been stale for 5 days. Please open a new issue if you have similar issues or you have any new updates now.
Describe the bug
I followed the official documentation for InternVL2 (https://internvl.readthedocs.io/en/latest/internvl2.0/deployment.html) and used lmdeploy to load the 40B model, but I got this error:
RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/memory_utils.cu:32
My machine is an A100 80G. What could be the issue? lmdeploy officially supports the InternVL2 model.