Cucunnber opened this issue 1 month ago
Looks like TGI has broken loading for sharded GPTQ models. Note that we do not yet officially support TGI, nor do we have unit tests for TGI.
Please post the output of "ls -h" on the quantized model folder so we can verify the model is sharded.
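For reference, a correctly sharded safetensors checkpoint typically contains numbered shard files plus an index, along these lines (the shard count here is only illustrative):

```
config.json
model-00001-of-00004.safetensors
model-00002-of-00004.safetensors
model-00003-of-00004.safetensors
model-00004-of-00004.safetensors
model.safetensors.index.json
quantize_config.json
```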
It was not sharded correctly. Also, after quantization the quantized model folder contained only config.json, configuration_deepseek.py, modelling_deepseek.py, model.safetensors, and quantize_config.json; I had to copy all the tokenizer files manually from the original model folder.
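In case it helps, here is a minimal sketch of re-sharding the checkpoint and copying the tokenizer files over. The import path, the `trust_remote_code` argument, and especially `max_shard_size` on `save_quantized()` are assumptions based on GPTQModel's AutoGPTQ lineage; the paths are placeholders.

```python
from gptqmodel import GPTQModel  # assumed import path for the GPTQModel package
from transformers import AutoTokenizer

base_model_id = "deepseek-ai/DeepSeek-Coder-V2-Instruct"  # placeholder source repo
quantized_dir = "DeepSeekCoderV2-236B-gptq"               # placeholder quantized folder

# Reload the already-quantized checkpoint, then re-save it in shards.
# max_shard_size is an assumption -- verify it exists in your GPTQModel version.
model = GPTQModel.from_quantized(quantized_dir, trust_remote_code=True)
model.save_quantized(quantized_dir + "-sharded", max_shard_size="4GB")

# Save the tokenizer into the same folder so it is self-contained for TGI.
tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
tokenizer.save_pretrained(quantized_dir + "-sharded")
```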
Is this issue related to the error below?
Describe the bug
I get the error "Cannot load `gptq` weight for GPTQ -> Marlin repacking, make sure the model is already quantized" when I run inference on the GPTQ-quantized DeepSeekCoderV2 model with text-generation-inference 2.2.0.

GPU Info
A100-80GB * 4
config.json
quantize_config.json
To Reproduce
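For reference, the serving command was presumably something along these lines; the mount path, model folder name, and shard count are assumptions:

```
docker run --gpus all --shm-size 1g -v /models:/models \
  ghcr.io/huggingface/text-generation-inference:2.2.0 \
  --model-id /models/DeepSeekCoderV2-236B-gptq \
  --quantize gptq --num-shard 4
```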
Model/Datasets
DeepSeekCoderV2-236B-MOE
Screenshots
Additional context
I don't have any inference problems with GPTQModel.from_quantized().
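For comparison, the working direct-inference path looks roughly like this. The paths are placeholders, and the exact GPTQModel API surface (`from_quantized`, `generate`, `device`) is assumed from its AutoGPTQ lineage:

```python
from gptqmodel import GPTQModel  # assumed import path
from transformers import AutoTokenizer

quantized_dir = "DeepSeekCoderV2-236B-gptq"  # placeholder local path

# Load the quantized model directly; the tokenizer files were copied
# into the same folder, so both load from quantized_dir.
model = GPTQModel.from_quantized(quantized_dir, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(quantized_dir, trust_remote_code=True)

inputs = tokenizer("def quicksort(arr):", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```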