Description
This PR introduces a `--quantization awq` option when running the partition script, and an equivalent `option.quantize=awq` setting in serving.properties.

Note on serving tensor_parallel_degree
For Llama-2-7b, the tp_degree is limited to 1 or 2 by the vLLM AWQ implementation: the tp_degree must satisfy

intermediate_size % (tp_degree * group_size) == 0

where:
intermediate_size = 11008
group_size = 128

and group_size is defined in quant_config in djl_serving.huggingface.quantize(). A worked check of this constraint follows.

Validation
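To make the constraint concrete, a small sketch using the values above (the candidate tp values are just the common choices):

```python
# Which tensor_parallel_degree values satisfy the vLLM AWQ sharding constraint?
intermediate_size = 11008  # Llama-2-7b
group_size = 128           # from quant_config

valid = [tp for tp in (1, 2, 4, 8) if intermediate_size % (tp * group_size) == 0]
print(valid)  # -> [1, 2]
```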
Llama-2-7b (Working)
This feature has been tested with Llama-2-7b.

Quantization command:
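A representative invocation (the script path and all flags other than `--quantization awq`, which this PR introduces, are placeholders):

```
python partition.py \
    --model-dir /opt/ml/model/llama-2-7b \
    --save-mp-checkpoint-path /opt/ml/model/llama-2-7b-awq \
    --quantization awq
```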
Serving.properties:
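A minimal sketch of the serving configuration, assuming the vLLM rolling-batch backend (all values other than `option.quantize=awq` are illustrative):

```
engine=Python
option.model_id=/opt/ml/model/llama-2-7b-awq
option.rolling_batch=vllm
option.quantize=awq
option.tensor_parallel_degree=2
```

Note that `tensor_parallel_degree=2` respects the constraint described above.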
The resulting quantized model was loaded and served with the LMI container.
Llama-2-70b (Not passing)
Quantization currently failing with:

```
  model_service.invoke_handler("quantize", inputs)
  File "/tmp/djlserving/cache/djl_python/service_loader.py", line 29, in invoke_handler
    return getattr(self.module, function_name)(inputs)
  File "/tmp/djlserving/cache/djl_python/huggingface.py", line 607, in quantize
    _service.quantize(inputs.get_properties())
  File "/tmp/djlserving/cache/djl_python/huggingface.py", line 555, in quantize
    awq_model.quantize(self.tokenizer, quant_config=quant_config)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/awq/models/base.py", line 186, in quantize
    self.quantizer.quantize()
  File "/usr/local/lib/python3.10/dist-packages/awq/quantize/quantizer.py", line 156, in quantize
    scales_list = [
  File "/usr/local/lib/python3.10/dist-packages/awq/quantize/quantizer.py", line 157, in <listcomp>
    self._search_best_scale(self.modules[i], **layer)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/awq/quantize/quantizer.py", line 277, in _search_best_scale
    best_scales = self._compute_best_scale(
  File "/usr/local/lib/python3.10/dist-packages/awq/quantize/quantizer.py", line 334, in _compute_best_scale
    self.pseudo_quantize_tensor(fc.weight.data)[0] / scales_view
  File "/usr/local/lib/python3.10/dist-packages/awq/quantize/quantizer.py", line 69, in pseudo_quantize_tensor
    assert torch.isnan(w).sum() == 0
AssertionError
```
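The failing assert (`torch.isnan(w).sum() == 0`) means a weight tensor contained NaNs when the AWQ quantizer pseudo-quantized it. As a quick diagnostic, a sketch like the following (model path is a placeholder) can rule out NaN/inf values already present in the fp16 checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM

# Pre-flight scan: report any parameter that already contains NaN/inf
# values before attempting AWQ quantization. Path is illustrative.
model = AutoModelForCausalLM.from_pretrained(
    "/opt/ml/model/llama-2-70b", torch_dtype=torch.float16)
for name, param in model.named_parameters():
    if torch.isnan(param).any() or torch.isinf(param).any():
        print(f"bad values in {name}")
```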
Serving.properties:
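The intended configuration would presumably mirror the 7b one; an illustrative sketch (for Llama-2-70b, intermediate_size = 28672, so 28672 % (8 * 128) == 0 and tensor_parallel_degree=8 would satisfy the constraint):

```
engine=Python
option.model_id=/opt/ml/model/llama-2-70b-awq
option.rolling_batch=vllm
option.quantize=awq
option.tensor_parallel_degree=8
```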