elastic / eland

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
https://eland.readthedocs.io
Apache License 2.0

Remove dynamic quantization option for PyTorch models at upload #594

Open · joshdevins opened this issue 1 year ago

joshdevins commented 1 year ago

Dynamic quantization of PyTorch models has proven to be a challenge for two reasons.

(1) Dynamic quantization ties the traced TorchScript model to a particular CPU architecture, making it non-portable. For example, tracing the model (via the upload CLI) on an ARM-based M-series Apple processor makes it unusable on an Intel CPU, and vice versa. Tracing this way also means that Intel-specific optimisations cannot be used. The best practice is to trace the model on the same CPU architecture as the target inference processors. GPU support adds further complexity, and eland is not currently capable of tracing with a GPU (for now). (See the sketch after point (2) for what this quantize-and-trace step looks like.)

(2) "Blind" dynamic quantization at upload time could also be considered as an anti-pattern/not a best practice. Quantization can often damage the accuracy of a model and doing quantization blindly, without evaluating the model afterwards, can produce surprising results at inference.

For these reasons, we believe it is safest to remove dynamic quantization as an option. Users who want quantized models can quantize them in PyTorch or transformers directly, and upload the result with eland's Python methods (as opposed to using the CLI), for example as sketched below.
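A minimal sketch of that Python-method upload path, assuming eland 8.x (where TransformerModel and PyTorchModel live under eland.ml.pytorch; keyword names may vary slightly between releases) and a model that has already been quantized and evaluated:

from pathlib import Path

from elasticsearch import Elasticsearch
from eland.ml.pytorch import PyTorchModel
from eland.ml.pytorch.transformers import TransformerModel

es = Elasticsearch("https://localhost:9200", basic_auth=("elastic", "changeme"))

# model_id can be a Hub ID or a local path to a checkpoint the user has
# already quantized and evaluated themselves.
tm = TransformerModel(
    model_id="sentence-transformers/msmarco-MiniLM-L-12-v3",
    task_type="text_embedding",
)

# Trace the model to TorchScript on hardware matching the target cluster.
tmp_dir = "models"
Path(tmp_dir).mkdir(parents=True, exist_ok=True)
model_path, config, vocab_path = tm.save(tmp_dir)

# Import the traced model into Elasticsearch.
ptm = PyTorchModel(es, tm.elasticsearch_model_id())
ptm.import_model(
    model_path=model_path, config_path=None, vocab_path=vocab_path, config=config
)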

davidkyle commented 1 year ago

Dynamic quantisation is controlled by the --quantize parameter to the eland_import_hub_model script. It has always been considered an advanced option and should now be deprecated. The script should emit a warning when the option is used, describing the hardware-incompatibility problem.
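Something along these lines in the CLI would cover it (args.quantize is an assumed name for the parsed flag, and the wording is only illustrative):

import warnings

if args.quantize:  # assumed name of the parsed --quantize flag
    warnings.warn(
        "--quantize is deprecated and will be removed in a future release. "
        "A dynamically quantized model is tied to the CPU architecture it "
        "was traced on and may fail or misbehave on other hardware. "
        "Quantize the model in PyTorch directly and upload it with eland's "
        "Python APIs instead.",
        DeprecationWarning,
    )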

davidkyle commented 1 year ago

To understand exactly what happens when quantising on a different architecture from the one used for evaluation, I used eland_import_hub_model to trace a quantised model on an M1 Mac and upload it to an x86 Linux server for evaluation.

Tracing the model with the --quantize option fails on an M1 Mac with the error:

RuntimeError: Didn't find engine for operation quantized::linear_prepack NoQEngine
Full stack trace:

Traceback (most recent call last):
  File "/usr/local/bin/eland_import_hub_model", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/dist-packages/eland/cli/eland_import_hub_model.py", line 235, in main
    tm = TransformerModel(
  File "/usr/local/lib/python3.9/dist-packages/eland/ml/pytorch/transformers.py", line 630, in __init__
    self._traceable_model.quantize()
  File "/usr/local/lib/python3.9/dist-packages/eland/ml/pytorch/traceable_model.py", line 43, in quantize
    torch.quantization.quantize_dynamic(
  File "/usr/local/lib/python3.9/dist-packages/torch/ao/quantization/quantize.py", line 450, in quantize_dynamic
    convert(model, mapping, inplace=True)
  File "/usr/local/lib/python3.9/dist-packages/torch/ao/quantization/quantize.py", line 535, in convert
    _convert(
  File "/usr/local/lib/python3.9/dist-packages/torch/ao/quantization/quantize.py", line 573, in _convert
    _convert(mod, mapping, True,  # inplace
  File "/usr/local/lib/python3.9/dist-packages/torch/ao/quantization/quantize.py", line 573, in _convert
    _convert(mod, mapping, True,  # inplace
  File "/usr/local/lib/python3.9/dist-packages/torch/ao/quantization/quantize.py", line 573, in _convert
    _convert(mod, mapping, True,  # inplace
  [Previous line repeated 3 more times]
  File "/usr/local/lib/python3.9/dist-packages/torch/ao/quantization/quantize.py", line 575, in _convert
    reassign[name] = swap_module(mod, mapping, custom_module_class_mapping)
  File "/usr/local/lib/python3.9/dist-packages/torch/ao/quantization/quantize.py", line 608, in swap_module
    new_mod = qmod.from_float(mod)
  File "/usr/local/lib/python3.9/dist-packages/torch/ao/nn/quantized/dynamic/modules/linear.py", line 111, in from_float
    qlinear = cls(mod.in_features, mod.out_features, dtype=dtype)
  File "/usr/local/lib/python3.9/dist-packages/torch/ao/nn/quantized/dynamic/modules/linear.py", line 35, in __init__
    super(Linear, self).__init__(in_features, out_features, bias_, dtype=dtype)
  File "/usr/local/lib/python3.9/dist-packages/torch/ao/nn/quantized/modules/linear.py", line 150, in __init__
    self._packed_params = LinearPackedParams(dtype)
  File "/usr/local/lib/python3.9/dist-packages/torch/ao/nn/quantized/modules/linear.py", line 27, in __init__
    self.set_weight_bias(wq, None)
  File "/usr/local/lib/python3.9/dist-packages/torch/ao/nn/quantized/modules/linear.py", line 32, in set_weight_bias
    self._packed_params = torch.ops.quantized.linear_prepack(weight, bias)
  File "/usr/local/lib/python3.9/dist-packages/torch/_ops.py", line 442, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: Didn't find engine for operation quantized::linear_prepack NoQEngine
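The NoQEngine in that error means the PyTorch build has no quantization engine available to pack the int8 weights on this hardware. The engines a given build supports can be checked directly:

import torch

# Quantization engines compiled into this PyTorch build,
# e.g. ['none', 'fbgemm'] on x86, or including 'qnnpack' on builds
# where QNNPACK is available.
print(torch.backends.quantized.supported_engines)

# The engine currently selected for quantized ops; 'none' corresponds
# to the NoQEngine failure above.
print(torch.backends.quantized.engine)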

The models sentence-transformers/msmarco-MiniLM-L-12-v3 and dslim/bert-base-NER were tested:

docker run -it --rm elastic/eland \
    eland_import_hub_model \
      --cloud-id $CLOUD_ID \
      -u elastic -p $CLOUD_PWD \
      --hub-model-id sentence-transformers/msmarco-MiniLM-L-12-v3 \
      --task-type text_embedding \
      --quantize

docker run -it --rm elastic/eland \
    eland_import_hub_model \
      --cloud-id $CLOUD_ID \
      -u elastic -p $CLOUD_PWD \
      --hub-model-id dslim/bert-base-NER \
      --task-type ner \
      --quantize

The 8.9 Docker image, which bundles PyTorch 1.13.1, was used for this test.
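If in doubt, the bundled PyTorch version can be confirmed from the image itself (assuming python is on the image's PATH, as in the official elastic/eland image):

docker run -it --rm elastic/eland \
    python -c 'import torch; print(torch.__version__)'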