ModelCloud / GPTQModel

An easy-to-use LLM quantization and inference toolkit based on GPTQ algorithm (weight-only quantization).
Apache License 2.0

[FEATURE] DeepSeek V2 Chat Support #48

Closed Xu-Chen closed 2 months ago

Xu-Chen commented 2 months ago


https://github.com/AutoGPTQ/AutoGPTQ/issues/664

Qubitium commented 2 months ago

@LRL-ModelCloud has been assigned to this task. Model has been downloaded and work should be completed soon.

Xu-Chen commented 2 months ago

> @LRL-ModelCloud has been assigned to this task. Model has been downloaded and work should be completed soon.

Can you provide a quantized model for DeepSeek V2 Chat? I encountered an OOM error during the quantization process.

Qubitium commented 2 months ago

@Xu-Chen Which GPU model did you use for the DeepSeek V2 quant? I want to check whether the OOM is code related or just because DeepSeek V2 is a little special and requires more VRAM.

Xu-Chen commented 2 months ago

> @Xu-Chen Which GPU model did you use for the DeepSeek V2 quant? I want to check whether the OOM is code related or just because DeepSeek V2 is a little special and requires more VRAM.

File "/home/root/.local/lib/python3.10/site-packages/gptqmodel/models/base.py", line 258, in quantize
    move_to(module, cur_layer_device)
  File "/home/root/.local/lib/python3.10/site-packages/gptqmodel/utils/model.py", line 66, in move_to
    obj = obj.to(device)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1173, in to
    return self._apply(convert)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 804, in _apply
    param_applied = fn(param)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1166, in convert
    raise NotImplementedError(
NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device.
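For reference, this NotImplementedError comes from PyTorch itself: parameters left on the meta device have a shape and dtype but no storage, so .to() cannot copy them and suggests .to_empty() instead. A minimal standalone reproduction (plain PyTorch, unrelated to the GPTQModel code paths above) looks roughly like:

    import torch
    import torch.nn as nn

    # Modules built under the meta device have parameters without storage.
    with torch.device("meta"):
        layer = nn.Linear(4, 4)

    # layer.to("cpu")  # would raise: "Cannot copy out of meta tensor; no data!"

    # to_empty() allocates uninitialized storage on the target device instead;
    # real weights then have to be loaded into it (e.g. via load_state_dict).
    layer = layer.to_empty(device="cpu")
    print(layer.weight.shape)  # torch.Size([4, 4]), values uninitialized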

Quantization code:

    import torch
    from gptqmodel import GPTQModel, QuantizeConfig

    quantize_config = QuantizeConfig(
        true_sequential=False,
        bits=4,
        group_size=group_size,
        desc_act=desc_act,
    )
    max_memory = {i: "75GB" for i in range(8)}
    model = GPTQModel.from_pretrained(
        args.model_id,
        quantize_config,
        trust_remote_code=True,
        device_map="sequential",
        attn_implementation="eager",
        torch_dtype=torch.bfloat16,
        max_memory=max_memory,
    )
    model.quantize(examples)
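As an aside, the examples argument passed to quantize() is the calibration data; the thread does not show how it was built. One common way to prepare it, sketched here with an illustrative tokenizer call and made-up sample texts, is:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(args.model_id, trust_remote_code=True)

    # A few representative text samples, tokenized into the input_ids /
    # attention_mask dicts that quantize() consumes as calibration data.
    calibration_texts = [
        "gptqmodel is an easy-to-use LLM quantization and inference toolkit.",
        "DeepSeek V2 is a mixture-of-experts large language model.",
    ]
    examples = [tokenizer(text) for text in calibration_texts]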

Is it not possible to use the GPU to load the model?

GPU: 8 × A800-80GB
RAM: 800 GB

Xu-Chen commented 2 months ago

(screenshot of VRAM usage)

Deleting max_memory=max_memory makes it run.

Is there a way to use the GPU to load the model and then perform parallel quantization to improve the quantization speed?

Qubitium commented 2 months ago

Remove all the extra options and use just the base call. GPTQModel will select the best dtype, and accelerate will automatically handle splitting the model weights across your GPUs.

    model = GPTQModel.from_pretrained(
        args.model_id,
        quantize_config,
    )
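For completeness, a minimal end-to-end sketch of that simplified flow might look like the following; the calibration text and output directory are illustrative, and save_quantized is assumed to follow the AutoGPTQ-style API:

    from transformers import AutoTokenizer
    from gptqmodel import GPTQModel, QuantizeConfig

    quantize_config = QuantizeConfig(bits=4, group_size=128)

    tokenizer = AutoTokenizer.from_pretrained(args.model_id, trust_remote_code=True)
    examples = [tokenizer("example calibration text for quantization")]

    # Load with defaults: GPTQModel picks the dtype and accelerate shards the
    # weights across the available GPUs without an explicit max_memory map.
    model = GPTQModel.from_pretrained(
        args.model_id,
        quantize_config,
        trust_remote_code=True,
    )
    model.quantize(examples)
    model.save_quantized("DeepSeek-V2-Chat-gptq-4bit")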