ModelCloud / GPTQModel

An easy-to-use LLM quantization and inference toolkit based on GPTQ algorithm (weight-only quantization).
Apache License 2.0

[BUG] TGI does not support DeepSeekCoderV2-gptq #328

Open Cucunnber opened 1 month ago

Cucunnber commented 1 month ago

Describe the bug

I get the error "Cannot load `gptq` weight for GPTQ -> Marlin repacking, make sure the model is already quantized" when I run inference on the GPTQ-quantized DeepSeekCoderV2 model with Text Generation Inference 2.2.0.

GPU Info

A100-80GB * 4

config.json

{
  "_name_or_path": "/var/mntpkg/deepseek-coder-v2-instruct",
  "architectures": [
    "DeepseekV2ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "configuration_deepseek.DeepseekV2Config",
    "AutoModel": "modeling_deepseek.DeepseekV2Model",
    "AutoModelForCausalLM": "modeling_deepseek.DeepseekV2ForCausalLM"
  },
  "aux_loss_alpha": 0.001,
  "bos_token_id": 100000,
  "eos_token_id": 100001,
  "ep_size": 1,
  "first_k_dense_replace": 1,
  "hidden_act": "silu",
  "hidden_size": 5120,
  "initializer_range": 0.02,
  "intermediate_size": 12288,
  "kv_lora_rank": 512,
  "max_position_embeddings": 163840,
  "model_type": "deepseek_v2",
  "moe_intermediate_size": 1536,
  "moe_layer_freq": 1,
  "n_group": 8,
  "n_routed_experts": 160,
  "n_shared_experts": 2,
  "norm_topk_prob": false,
  "num_attention_heads": 128,
  "num_experts_per_tok": 6,
  "num_hidden_layers": 60,
  "num_key_value_heads": 128,
  "pretraining_tp": 1,
  "q_lora_rank": 1536,
  "qk_nope_head_dim": 128,
  "qk_rope_head_dim": 64,
  "quantization_config": {
    "bits": 4,
    "checkpoint_format": "gptq",
    "damp_percent": 0.005,
    "desc_act": true,
    "dynamic_bits": null,
    "group_size": 128,
    "lm_head": false,
    "meta": {
      "quantizer": "gptqmodel:0.9.10-dev0"
    },
    "model_file_base_name": null,
    "model_name_or_path": null,
    "quant_method": "gptq",
    "static_groups": false,
    "sym": true,
    "true_sequential": true
  },
  "rms_norm_eps": 1e-06,
  "rope_scaling": {
    "beta_fast": 32,
    "beta_slow": 1,
    "factor": 40,
    "mscale": 1.0,
    "mscale_all_dim": 1.0,
    "original_max_position_embeddings": 4096,
    "type": "yarn"
  },
  "rope_theta": 10000,
  "routed_scaling_factor": 16.0,
  "scoring_func": "softmax",
  "seq_aux": true,
  "tie_word_embeddings": false,
  "topk_group": 3,
  "topk_method": "group_limited_greedy",
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "v_head_dim": 128,
  "vocab_size": 102400
}

quantize_config.json

{
  "bits": 4,
  "dynamic_bits": null,
  "group_size": 128,
  "desc_act": true,
  "static_groups": false,
  "sym": true,
  "lm_head": false,
  "damp_percent": 0.005,
  "true_sequential": true,
  "model_name_or_path": "deepseek-coder-v2-instruct-gptq",
  "model_file_base_name": "model",
  "quant_method": "gptq",
  "checkpoint_format": "gptq",
  "meta": {
    "quantizer": "gptqmodel:0.9.10-dev0"
 }

To Reproduce

docker run -d \
--gpus '"device=4,5,6,7"' \
--shm-size 1g \
--name $model_name \
-p ${external_port}:80 -v $model_path:/data/CmwCoder \
-e WEIGHTS_CACHE_OVERRIDE="/data/CmwCoder" \
tgi:2.2.0 \
--weights-cache-override="/data/CmwCoder" \
--model-id "/data/CmwCoder" --num-shard $num_shard \
--max-input-length 14000 \
--max-total-tokens 16000 \
--max-batch-prefill-tokens 14000 \
--trust-remote-code \
--quantize gptq

Model/Datasets

DeepSeekCoderV2-236B-MOE

Screenshots

2024-08-02 03:31:18.315 | INFO     | text_generation_server.utils.import_utils:<module>:75 - Detected system cuda
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type deepseek_v2 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
[rank0]: Traceback (most recent call last):

[rank0]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/gptq/__init__.py", line 153, in get_weights
[rank0]:     qweight = weights.get_tensor(f"{prefix}.qweight")

[rank0]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 212, in get_tensor
[rank0]:     filename, tensor_name = self.get_filename(tensor_name)

[rank0]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 193, in get_filename
[rank0]:     raise RuntimeError(f"weight {tensor_name} does not exist")

[rank0]: RuntimeError: weight model.layers.59.self_attn.q_a_proj.qweight does not exist

[rank0]: During handling of the above exception, another exception occurred:

[rank0]: Traceback (most recent call last):

[rank0]:   File "/opt/conda/bin/text-generation-server", line 8, in <module>
[rank0]:     sys.exit(app())

[rank0]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 118, in serve
[rank0]:     server.serve(

[rank0]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
[rank0]:     asyncio.run(

[rank0]:   File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
[rank0]:     return loop.run_until_complete(main)

[rank0]:   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
[rank0]:     return future.result()

[rank0]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
[rank0]:     model = get_model(

[rank0]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 490, in get_model
[rank0]:     return FlashCausalLM(

[rank0]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 898, in __init__
[rank0]:     model = model_class(prefix, config, weights)

[rank0]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_deepseek_v2_modeling.py", line 764, in __init__
[rank0]:     self.model = DeepseekV2Model(

[rank0]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_deepseek_v2_modeling.py", line 703, in __init__
[rank0]:     [

[rank0]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_deepseek_v2_modeling.py", line 704, in <listcomp>
[rank0]:     DeepseekV2Layer(

[rank0]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_deepseek_v2_modeling.py", line 626, in __init__
[rank0]:     self.self_attn = DeepseekV2Attention(

[rank0]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_deepseek_v2_modeling.py", line 236, in __init__
[rank0]:     weight=weights.get_weights(f"{prefix}.q_a_proj"),

[rank0]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 344, in get_weights
[rank0]:     return self.weights_loader.get_weights(self, prefix)

[rank0]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/gptq/__init__.py", line 155, in get_weights
[rank0]:     raise RuntimeError(

[rank0]: RuntimeError: Cannot load `gptq` weight for GPTQ -> Marlin repacking, make sure the model is already quantized
 rank=0
2024-08-02T04:51:40.980892Z ERROR text_generation_launcher: Shard 0 failed to start
2024-08-02T04:51:40.980912Z  INFO text_generation_launcher: Shutting down shards
2024-08-02T04:51:40.983834Z  INFO shard-manager: text_generation_launcher: Terminating shard rank=2
2024-08-02T04:51:40.984152Z  INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=2
2024-08-02T04:51:40.985398Z  INFO shard-manager: text_generation_launcher: Terminating shard rank=1
2024-08-02T04:51:40.985884Z  INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=1
2024-08-02T04:51:41.008435Z  INFO shard-manager: text_generation_launcher: Terminating shard rank=3
2024-08-02T04:51:41.008681Z  INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=3
2024-08-02T04:51:48.113698Z  INFO shard-manager: text_generation_launcher: shard terminated rank=3
2024-08-02T04:51:48.589678Z  INFO shard-manager: text_generation_launcher: shard terminated rank=2
2024-08-02T04:51:49.291857Z  INFO shard-manager: text_generation_launcher: shard terminated rank=1
Error: ShardCannotStart

Additional context

I don't have any inference problems with GPTQModel.from_quantized().
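
Roughly along these lines (a minimal sketch; the path, prompt, and keyword arguments are placeholders rather than the exact script I ran):

# Sketch: load the quantized checkpoint directly with GPTQModel and run a short generation.
# The model path and prompt below are placeholders.
from gptqmodel import GPTQModel
from transformers import AutoTokenizer

model_path = "deepseek-coder-v2-instruct-gptq"  # placeholder path to the quantized folder

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = GPTQModel.from_quantized(model_path, device="cuda:0", trust_remote_code=True)

inputs = tokenizer("def quicksort(arr):", return_tensors="pt").to("cuda:0")
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))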

Qubitium commented 1 month ago

Looks like TGI has broken loading for sharded gptq models. Note that we do not yet officially support TGI, nor do we have unit tests for TGI.

Please post the output of `ls -lh` on the quantized model folder so we can verify the model is sharded.

Cucunnber commented 1 month ago

> Looks like TGI has broken loading for sharded gptq models. Note that we do not yet officially support TGI, nor do we have unit tests for TGI.
>
> Please post the output of `ls -lh` on the quantized model folder so we can verify the model is sharded.

[screenshot: gptq]

It was not sharded correctly. Besides, after quantization the quantized model folder only contains config.json, configuration_deepseek.py, modeling_deepseek.py, model.safetensors, and quantize_config.json; I had to copy all the tokenizer files manually from the original model folder.
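
As a sanity check, one could also list the tensor names stored in the single model.safetensors file to see whether the model.layers.59.self_attn.q_a_proj.qweight tensor that TGI complains about is actually present. A small sketch, with the folder path as a placeholder:

# Sketch: list the tensors stored in the single (unsharded) checkpoint file
# to check whether the qweight/qzeros/scales tensors TGI expects are present.
from safetensors import safe_open

# Placeholder path to the checkpoint produced by quantization.
path = "deepseek-coder-v2-instruct-gptq/model.safetensors"

with safe_open(path, framework="pt", device="cpu") as f:
    keys = list(f.keys())

print(f"{len(keys)} tensors in total")
# Print the attention tensors of layer 59, where TGI reported the missing qweight.
for name in sorted(k for k in keys if k.startswith("model.layers.59.self_attn")):
    print(name)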

Cucunnber commented 1 month ago

Is this issue related to this error?

[screenshot: gptq]