[Closed] Xu-Chen closed this issue 2 months ago.
@LRL-ModelCloud has been assigned to this task. The model has been downloaded and the work should be completed soon.
Can you provide a quantized model for DeepSeek V2 Chat? I encountered an OOM error during the quantization process.
@Xu-Chen Which GPU model did you use for the DeepSeek V2 quantization? I want to check whether the OOM is code related or just because DeepSeek V2 is a little special and requires more VRAM.
File "/home/root/.local/lib/python3.10/site-packages/gptqmodel/models/base.py", line 258, in quantize
move_to(module, cur_layer_device)
File "/home/root/.local/lib/python3.10/site-packages/gptqmodel/utils/model.py", line 66, in move_to
obj = obj.to(device)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1173, in to
return self._apply(convert)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 804, in _apply
param_applied = fn(param)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1166, in convert
raise NotImplementedError(
NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device.
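For context (my own note, not from the library docs): the error comes from PyTorch's meta-device handling. Parameters that are still on the meta device have shapes but no data, so Module.to() cannot copy them; Module.to_empty() has to be used to allocate real storage on the target device first. A minimal standalone sketch that reproduces the same error:

import torch

# Parameters created on the "meta" device exist only as metadata (shape/dtype), no storage.
layer = torch.nn.Linear(4, 4, device="meta")
try:
    layer.to("cpu")  # raises NotImplementedError, same message as the traceback above
except NotImplementedError as err:
    print(err)
layer.to_empty(device="cpu")  # works: allocates fresh (uninitialized) tensors on the CPU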
Quantization code:
import torch
from gptqmodel import GPTQModel, QuantizeConfig

quantize_config = QuantizeConfig(
    true_sequential=False,
    bits=4,
    group_size=group_size,
    desc_act=desc_act,
)
# Cap each of the 8 GPUs at 75GB so accelerate splits the model across all of them.
max_memory = {i: "75GB" for i in range(8)}
model = GPTQModel.from_pretrained(
    args.model_id,
    quantize_config,
    trust_remote_code=True,
    device_map="sequential",
    attn_implementation="eager",
    torch_dtype=torch.bfloat16,
    max_memory=max_memory,
)
model.quantize(examples)
Is it not possible to use the GPU to load the model?
GPU: 8 × A800-80GB, RAM: 800GB
If I delete max_memory=max_memory, it can run.
Is there a way to use the GPU to load the model and then perform parallel quantization to improve the quantization speed?
Remove all options and use just the base call. GPTQModel will select the best dtype, and accelerate will automatically handle splitting the model weights across GPUs.
model = GPTQModel.from_pretrained(
    args.model_id,
    quantize_config,
)
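For completeness, a hedged sketch of the full minimal flow under that suggestion; `args.model_id` and `examples` are assumed to be the same objects as in the earlier snippet, the output directory is a placeholder, and the save call follows GPTQModel's AutoGPTQ-lineage API:

from gptqmodel import GPTQModel, QuantizeConfig

# Minimal config: let GPTQModel pick the dtype and let accelerate place weights across GPUs.
quantize_config = QuantizeConfig(bits=4, group_size=128)  # group_size=128 assumed for illustration

model = GPTQModel.from_pretrained(
    args.model_id,           # same CLI argument as in the earlier snippet
    quantize_config,
    trust_remote_code=True,  # DeepSeek V2 ships custom modeling code, so this flag is still needed
)
model.quantize(examples)     # `examples` is the calibration data from the earlier snippet
model.save_quantized("deepseek-v2-chat-gptq-4bit")  # assumed output path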
https://github.com/AutoGPTQ/AutoGPTQ/issues/664