deepjavalibrary / djl-serving

A universal scalable machine learning model deployment solution
Apache License 2.0

AutoAWQ Integration Script #2038

Closed a-ys closed 3 months ago

a-ys commented 4 months ago

Description

This PR introduces:

  1. A "quantize" handler in the djl_serving.huggingface handler that uses AutoAWQ to quantize a model.
  2. A code path in the serving partitioning scripts that allows DIY users to run quantization.
    • This is enabled when a user passes the --quantization awq option to the partition script, or sets option.quantize=awq in serving.properties.
  3. A Neo handler script that allows Neo to run quantization through Neo's expected interface.
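
For reference, here is a minimal sketch of the AutoAWQ calls that a handler like this wraps (the paths and quant_config values are illustrative, not necessarily the handler's defaults):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "/opt/ml/input/data/training"  # illustrative: mounted model directory
output_path = "/opt/djl/output"             # illustrative: quantized checkpoint destination

# Typical AutoAWQ 4-bit config; the actual handler builds its config from serving.properties.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Runs AWQ calibration and quantizes the weights.
model.quantize(tokenizer, quant_config=quant_config)

# Persist the quantized weights and config so they can be loaded for serving.
model.save_quantized(output_path)
tokenizer.save_pretrained(output_path)
```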

Note on serving tensor_parallel_degree

For Llama-2-7b, the serving tp_degree is limited to 1 or 2 by the vLLM AWQ implementation, because the tp_degree must satisfy

(intermediate_size / tp_degree) % group_size == 0

Where intermediate_size is the model's feed-forward intermediate dimension (11008 for Llama-2-7b) and group_size is the AWQ quantization group size (128 by default).
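
As a quick sanity check, a sketch assuming Llama-2-7b's intermediate_size of 11008 and AutoAWQ's default group_size of 128:

```python
intermediate_size = 11008  # Llama-2-7b feed-forward intermediate dimension
group_size = 128           # AutoAWQ default q_group_size

# A tp_degree works for serving only if each shard's slice of the intermediate
# dimension is still divisible by the quantization group size.
valid = [tp for tp in (1, 2, 4, 8) if (intermediate_size // tp) % group_size == 0]
print(valid)  # [1, 2]
```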

Validation

Llama-2-7b (Working)

This feature has been tested with Llama-2-7b. Quantization command:

docker run -it --rm \
        -v ./llama-2-7b:/opt/ml/input/data/training \
        -v ./logs:/opt/djl/logs \
        -v ./output:/opt/djl/output \
        --runtime=nvidia \
        --shm-size=12gb \
        deepjavalibrary/djl-serving:lmi-nightly partition --save-mp-checkpoint-path /opt/djl/output --skip-copy

serving.properties:

engine=MPI
option.tensor_parallel_degree=8
option.quantize=awq

The resulting quantized model was loaded and served with the LMI container.
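
For example, a sketch of serving the quantized output with the LMI container (the port and mount path are illustrative):

```
docker run -it --rm \
        -p 8080:8080 \
        -v ./output:/opt/ml/model \
        --runtime=nvidia \
        --shm-size=12gb \
        deepjavalibrary/djl-serving:lmi-nightly
```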

Llama-2-70b (Not passing)

Quantization is currently failing with:

AssertionError
    model_service.invoke_handler("quantize", inputs)
  File "/tmp/djlserving/cache/djl_python/service_loader.py", line 29, in invoke_handler
    return getattr(self.module, function_name)(inputs)
  File "/tmp/djlserving/cache/djl_python/huggingface.py", line 607, in quantize
    _service.quantize(inputs.get_properties())
  File "/tmp/djlserving/cache/djl_python/huggingface.py", line 555, in quantize
    awq_model.quantize(self.tokenizer, quant_config=quant_config)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/awq/models/base.py", line 186, in quantize
    self.quantizer.quantize()
  File "/usr/local/lib/python3.10/dist-packages/awq/quantize/quantizer.py", line 156, in quantize
    scales_list = [
  File "/usr/local/lib/python3.10/dist-packages/awq/quantize/quantizer.py", line 157, in <listcomp>
    self._search_best_scale(self.modules[i], **layer)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/awq/quantize/quantizer.py", line 277, in _search_best_scale
    best_scales = self._compute_best_scale(
  File "/usr/local/lib/python3.10/dist-packages/awq/quantize/quantizer.py", line 334, in _compute_best_scale
    self.pseudo_quantize_tensor(fc.weight.data)[0] / scales_view
  File "/usr/local/lib/python3.10/dist-packages/awq/quantize/quantizer.py", line 69, in pseudo_quantize_tensor
    assert torch.isnan(w).sum() == 0
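
The assertion that fires here is AutoAWQ's NaN check during pseudo-quantization, so the weights themselves contain NaNs. A quick way to scan a downloaded checkpoint shard before quantizing (a sketch; the shard filename is illustrative):

```python
import torch
from safetensors.torch import load_file

# Scan one checkpoint shard for NaN weights (e.g. from a corrupted or incomplete download).
state_dict = load_file("model-00001-of-00015.safetensors")  # illustrative shard name
for name, tensor in state_dict.items():
    if torch.isnan(tensor).any():
        print(f"NaN values found in {name}")
```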

serving.properties:

engine=MPI
option.tensor_parallel_degree=8
option.quantize=awq

a-ys commented 3 months ago

Update: these last few commits include:

Additionally, Llama-2-70b can now be quantized. The earlier failure was caused by corrupted model weights from an incomplete download.