Closed RealJustinNi closed 12 months ago
Hi, after installing torch-approx via pip, the unit tests now pass. I also modified src/operators/lut.py at line 57 to change the dtype of the LUT from int16 to int32, which resolved the last error. But I still have a question about QAT using the approximate LUT: training is extremely slow, and my terminal shows no output even though the process keeps running.
Thank you for giving it a try and reporting the issue in the benchmarks. I've just pushed a fix that sets the LUT to the correct size in the benchmarks.
For the approximate training, a slightly lower throughput is to be expected because the LUT kernel implementation is never going to be as efficient as a regular operation. Especially for small models, this should not be too significant, though. What size of model are you trying out?
I have noticed that the observer implementation from PyTorch causes a significant overhead. Can you try running:
# model is any PyTorch model
model.apply(torch.ao.quantization.disable_observer)
https://pytorch.org/docs/stable/generated/torch.ao.quantization.fake_quantize.disable_observer.html
before you start training on your model to see if that fixes the slow training?
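To make the suggestion concrete, here is a minimal sketch of what that call does. It assumes a toy module (`TinyQATModel` is hypothetical and not part of torch-approx) with a single default `FakeQuantize` on its input:

```python
import torch
import torch.nn as nn
from torch.ao.quantization import FakeQuantize, disable_observer


class TinyQATModel(nn.Module):
    """Hypothetical toy model: one linear layer with a fake-quantized input."""

    def __init__(self):
        super().__init__()
        self.act_fq = FakeQuantize()  # default observer and quantization range
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(self.act_fq(x))


model = TinyQATModel()

# One forward pass with observers enabled, so the quantization range is set.
model(torch.rand(8, 4))

# Switch off range tracking in every FakeQuantize module of the model.
model.apply(disable_observer)

fq_modules = [m for m in model.modules() if isinstance(m, FakeQuantize)]
observer_flags = [int(m.observer_enabled[0]) for m in fq_modules]
fake_quant_flags = [int(m.fake_quant_enabled[0]) for m in fq_modules]
print("observer enabled:", observer_flags)
print("fake-quant enabled:", fake_quant_flags)
```

Only the observers are switched off; fake quantization itself stays active. With observers disabled, the scale and zero point stay at whatever values they had at that moment, so this is probably most useful as a quick check of where the time goes.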
PyTorch has just released an entirely new Quantization API (https://pytorch.org/tutorials/prototype/pt2e_quant_ptq.html), which I would hope resolves this.
Thank you so much for your prompt response! I appreciate your assistance.
I trained a VGG-11 network on the CIFAR-10 dataset with a batch size of 2048. The training process was divided into three stages. In the first stage, I trained the model for 2 epochs to establish the baseline speed. In the second stage, I applied model quantization (using wrap_quantizable and qat from your Quick Start) and trained in QAT for an additional 2 epochs. Finally, in the third stage, I used a lookup table for approximate multiplication and trained for another 2 epochs.
It seems that my impatience might have led to some confusion, but the network is indeed running. The training times for the three stages are approximately 1x, 4.8x, and 16.1x, respectively. This indeed suggests that LUT is relatively inefficient!
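For anyone who wants to reproduce this kind of comparison, a small timing helper along these lines might do. It is only a sketch: `timed_stage` and `relative_to_baseline` are hypothetical names, the workloads are placeholders, and the numbers in the last call are made-up durations chosen only to reproduce the ratios quoted above.

```python
import time
from contextlib import contextmanager

stage_seconds = {}


@contextmanager
def timed_stage(name):
    """Record the wall-clock duration of the enclosed block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_seconds[name] = time.perf_counter() - start


def relative_to_baseline(seconds, baseline="fp32"):
    """Express every stage duration as a multiple of the baseline stage."""
    base = seconds[baseline]
    return {name: round(t / base, 1) for name, t in seconds.items()}


# Placeholder workloads; in practice each block would train for 2 epochs.
with timed_stage("fp32"):
    sum(i * i for i in range(10_000))
with timed_stage("qat"):
    sum(i * i for i in range(20_000))

print(stage_seconds)

# Illustrative durations (arbitrary units), not measured values.
example = relative_to_baseline({"fp32": 10.0, "qat": 48.0, "lut": 161.0})
print(example)
```

When timing GPU training this way, calling `torch.cuda.synchronize()` before each timer read is advisable, since CUDA kernels are launched asynchronously.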
Continuing with my exploration, I tested using htp_models_mul8s["accurate"]. Apart from a slightly longer initial epoch, the subsequent epochs took nearly the same time as the full precision training. It's over ten times faster than LUT!
I have two questions to seek your advice. First, is your work intended for implementing int8 arithmetic multiplication on GPU in PyTorch? Is it possible to extend it to FP approximate multiplication, such as mantissa arithmetic multiplication? Second, have you attempted to evaluate the network's accuracy during convergence using this framework?
Thank you very much :)
First, to answer your questions:
htp_models_mul8s["accurate"] is the accurate baseline, i.e. it only models an accurate product function, not an approximate one, so it is expected to be the fastest. If you want a more representative number, you should pick an HTP model with more coefficients, e.g. this one:
https://github.com/etrommer/torch-approx/blob/1a5c9aceb7d716fcf66a148367761cd196533c47/src/torchapprox/operators/htp_models/htp_models_mul8s.py#L39-L47
HTP should still be significantly faster than LUT, though. Its major caveat is that it might not always be applicable, depending on the approximate product function being simulated, whereas LUT is universally applicable as long as the operands are sufficiently small.
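To illustrate why the LUT approach is tied to small operands, here is a rough sketch using NumPy. The layout (rows and columns indexed by the unsigned bit pattern of each signed 8-bit operand) mirrors the LUT printed in the test log in this thread; whether torch-approx builds its tables exactly this way is an assumption, and `build_mul8s_lut`, `lut_multiply` and `lut_bytes` are hypothetical helper names.

```python
import numpy as np


def build_mul8s_lut(product_fn=lambda a, b: a * b):
    """Tabulate a signed 8-bit product function as a 256x256 int32 table."""
    idx = np.arange(256, dtype=np.int32)
    # Interpret each index as a two's-complement int8 value (-128..127).
    signed = np.where(idx < 128, idx, idx - 256)
    a, b = np.meshgrid(signed, signed, indexing="ij")
    return product_fn(a, b).astype(np.int32)


def lut_multiply(lut, a, b):
    """Multiply int8 operands element-wise by table lookup."""
    # Masking with 0xFF maps negative values to their unsigned bit pattern.
    ia = np.asarray(a, dtype=np.int64) & 0xFF
    ib = np.asarray(b, dtype=np.int64) & 0xFF
    return lut[ia, ib]


def lut_bytes(bits, bytes_per_entry=4):
    """Memory needed for a full two-operand table at the given bit width."""
    return (2 ** bits) ** 2 * bytes_per_entry


lut = build_mul8s_lut()
print(lut.shape, lut.dtype)
print(lut_multiply(lut, [-128, 7], [-128, -3]))
print(lut_bytes(8), lut_bytes(16))
```

An approximate multiplier would be simulated by passing a different `product_fn`. The table has one entry per operand pair, so the entry count is squared each time the bit width doubles: 256 KiB of int32 entries for 8-bit operands versus 16 GiB for 16-bit operands.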
[1] https://github.com/TimDettmers/bitsandbytes [2] https://ieeexplore.ieee.org/document/8532287
Thanks for your answers. I apologize for the typo. What I meant to say is that I understand it as approximate multiplication, not exact arithmetic multiplication. For current large models, storage is more expensive than computation. Therefore, 8-bit representation may be more hardware-friendly than 16-bit representation. In fact, I am also researching methods and hardware circuits for approximate computation. Currently, I am working with floating-point numbers rather than integers and exploring approximate-aware neural network training methods.
In summary, your project has inspired me greatly, and I appreciate your continuous responses. Wishing you happiness in your personal life and great progress in research endeavors. ๐•ᴗ•๐
Hi etrommer, I ran into errors when running the unit tests with "poetry run pytest test". I installed poetry in a conda environment (python=3.10.13) and cloned your code. I then installed the packages with "poetry install --with "dev,extras""; the additional dependencies and the pre-commit hooks also installed fine. However, several unit tests fail and all subsequent ones report errors. I also ran the benchmark; apart from a few failures, most of it seems fine. Could you help me solve the errors? Thanks :)
My CUDA version is 11.7. The output logs of the unit tests and the benchmark follow.
============================= test session starts ==============================
platform linux -- Python 3.10.13, pytest-7.4.2, pluggy-1.3.0
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /home/zhaojun/torch-approx
configfile: pyproject.toml
plugins: cov-3.0.0, benchmark-4.0.0
collected 436 items
test/test_approx_layer.py .............................FFFFFFEEEEEEEEEEE [ 10%] EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE [ 27%] EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE [ 37%] test/test_approx_mm.py EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE [ 48%] EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE [ 64%] EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE [ 81%] EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE [ 96%] test/test_dwconv2d.py EEEEEEEEEEEEEE [100%]
==================================== ERRORS ==================================== _ ERROR at setup of test_layer_fwd[cuda-weight_qconfig0-layerconfig6]
test/conftest.py:36:
../anaconda3/envs/approx/lib/python3.10/site-packages/torch/random.py:40: in manual_seed
    torch.cuda.manual_seed_all(seed)
../anaconda3/envs/approx/lib/python3.10/site-packages/torch/cuda/random.py:113: in manual_seed_all
    _lazy_call(cb, seed_all=True)
../anaconda3/envs/approx/lib/python3.10/site-packages/torch/cuda/__init__.py:183: in _lazy_call
    callable()
../anaconda3/envs/approx/lib/python3.10/site-packages/torch/cuda/random.py:111: RuntimeError _ ERROR at setup of test_layer_fwd[cuda-weight_qconfig1-layerconfig0]
[............................... similar errors .............................................] =================================== FAILURES =================================== __ test_layer_fwd[cuda-weight_qconfig0-layer_config0] __
device = 'cuda' layer_config = (<class 'torch.nn.modules.linear.Linear'>, (4, 20), (20, 10), {}) weight_qconfig = functools.partial(<class 'torch.ao.quantization.fake_quantize.FakeQuantize'>, observer=<class 'torch.ao.quantization.observer.MinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric, quant_min=-128, quant_max=127){}
test/test_approx_layer.py:165:
../anaconda3/envs/approx/lib/python3.10/site-packages/torch/nn/modules/module.py:1501: in _call_impl
    return forward_call(*args, **kwargs)
src/torchapprox/layers/approx_wrapper.py:60: in forward
    y_q = self.wrapped(x_q, x_scale, x_zero_point)
../anaconda3/envs/approx/lib/python3.10/site-packages/torch/nn/modules/module.py:1538: in _call_impl
    result = forward_call(*args, **kwargs)
src/torchapprox/layers/approx_layer.py:212: in forward
    y = self.approx_fwd(x, w, quant_params)
src/torchapprox/layers/approx_linear.py:46: in approx_fwd
    y = self.approx_op(x, w, quant_params, self.htp_model)
../anaconda3/envs/approx/lib/python3.10/site-packages/torch/nn/modules/module.py:1501: in _call_impl
    return forward_call(*args, **kwargs)
src/torchapprox/operators/lut.py:82: in forward
    return ApproxGeMM.apply(x, w, self.lut, quant_params, htp_model)
../anaconda3/envs/approx/lib/python3.10/site-packages/torch/autograd/function.py:506: in apply
    return super().apply(*args, **kwargs) # type: ignore[misc]
x = tensor([[0.8243, 0.2120, 0.7301, 0.3219, 0.7536, 0.2120, 0.9263, 0.1413, 0.3454, 0.9970, 0.5495, 0.2512, 0.92... 0.7536, 0.7065, 0.3297, 0.9106, 0.3925, 0.1727, 0.9813, 0.3690, 0.2591, 0.9185, 0.9891]], device='cuda:0') w = tensor([[ 0.0501, -0.2194, -0.0449, -0.2056, -0.1538, -0.0086, 0.1054, -0.0415, 0.0086, -0.0950, -0.1158, ...0225, 0.0881, -0.2074]], device='cuda:0', grad_fn=)
lut = tensor([[ 0, 0, 0, ..., 0, 0, 0],
[ 0, 1, 2, ..., -3, -2, -1],
[ 0, 2, 4, ..., -6, -4, -2]... ..., 9, 6, 3],
[ 0, -2, -4, ..., 6, 4, 2],
[ 0, -1, -2, ..., 3, 2, 1]], dtype=torch.int32)
quant_params = QuantizationParameters(x_scale=tensor([0.0079], device='cuda:0'), x_zero_point=tensor([0], device='cuda:0', dtype=torch.int32), w_scale=tensor([0.0017], device='cuda:0'), w_zero_point=tensor([0], device='cuda:0', dtype=torch.int32))
htp_model = None
src/torchapprox/operators/approxgemm.py:39: RuntimeError __ test_layer_fwd[cuda-weight_qconfig0-layer_config1] __
device = 'cuda' layer_config = (<class 'torch.nn.modules.conv.Conv2d'>, (2, 8, 4, 4), (8, 16, 3), {'groups': 1}) weight_qconfig = functools.partial(<class 'torch.ao.quantization.fake_quantize.FakeQuantize'>, observer=<class 'torch.ao.quantization.observer.MinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric, quant_min=-128, quant_max=127){}
test/test_approx_layer.py:165:
../anaconda3/envs/approx/lib/python3.10/site-packages/torch/nn/modules/module.py:1501: in _call_impl
    return forward_call(*args, **kwargs)
src/torchapprox/layers/approx_wrapper.py:60: in forward
    y_q = self.wrapped(x_q, x_scale, x_zero_point)
../anaconda3/envs/approx/lib/python3.10/site-packages/torch/nn/modules/module.py:1538: in _call_impl
    result = forward_call(*args, **kwargs)
src/torchapprox/layers/approx_conv2d.py:185: in forward
    return ApproxLayer.forward(self, x_q, x_scale, x_zero_point, bias)
src/torchapprox/layers/approx_layer.py:212: in forward
    y = self.approx_fwd(x, w, quant_params)
src/torchapprox/layers/approx_conv2d.py:155: in approx_fwd
    y = ApproxConv2dOp.apply(
../anaconda3/envs/approx/lib/python3.10/site-packages/torch/autograd/function.py:506: in apply
    return super().apply(*args, **kwargs) # type: ignore[misc]
src/torchapprox/operators/conv2d.py:240: in forward
    y_q = _im2col_conv2d(x_q, w_q, conv_args, lut, out_dims)
x_q = tensor([[[[105., 27., 93., 41.], [ 96., 27., 118., 18.], [ 44., 127., 70., 32.], ... [ 18., 102., 123., 77.], [ 99., 57., 5., 16.], [ 80., 83., 114., 84.]]]], device='cuda:0') w_q = tensor([[[[ 29., -125., -26.], [-117., -88., -4.], [ 60., -24., 5.]],
conv_args = Conv2dArgs(in_channels=8, out_channels=16, kernel_size=(3, 3), stride=(1, 1), padding=(0, 0), dilation=(1, 1), groups=1) lut = tensor([[ 0, 0, 0, ..., 0, 0, 0], [ 0, 1, 2, ..., -3, -2, -1], [ 0, 2, 4, ..., -6, -4, -2]... ..., 9, 6, 3], [ 0, -2, -4, ..., 6, 4, 2], [ 0, -1, -2, ..., 3, 2, 1]], dtype=torch.int32) out_dims = (2, 2)
src/torchapprox/operators/conv2d.py:200: RuntimeError
__ test_layer_fwd[cuda-weight_qconfig0-layer_config4] __
device = 'cuda' layer_config = (<class 'torch.nn.modules.conv.Conv2d'>, (2, 8, 4, 4), (8, 16, 3), {'groups': 8}) weight_qconfig = functools.partial(<class 'torch.ao.quantization.fake_quantize.FakeQuantize'>, observer=<class 'torch.ao.quantization.observer.MinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric, quant_min=-128, quant_max=127){}
test/test_approx_layer.py:165:
../anaconda3/envs/approx/lib/python3.10/site-packages/torch/nn/modules/module.py:1501: in _call_impl
    return forward_call(*args, **kwargs)
src/torchapprox/layers/approx_wrapper.py:60: in forward
    y_q = self.wrapped(x_q, x_scale, x_zero_point)
../anaconda3/envs/approx/lib/python3.10/site-packages/torch/nn/modules/module.py:1538: in _call_impl
    result = forward_call(*args, **kwargs)
src/torchapprox/layers/approx_conv2d.py:185: in forward
    return ApproxLayer.forward(self, x_q, x_scale, x_zero_point, bias)
src/torchapprox/layers/approx_layer.py:212: in forward
    y = self.approx_fwd(x, w, quant_params)
src/torchapprox/layers/approx_conv2d.py:155: in approx_fwd
    y = ApproxConv2dOp.apply(
../anaconda3/envs/approx/lib/python3.10/site-packages/torch/autograd/function.py:506: in apply
    return super().apply(*args, **kwargs) # type: ignore[misc]
src/torchapprox/operators/conv2d.py:240: in forward
    y_q = _im2col_conv2d(x_q, w_q, conv_args, lut, out_dims)
x_q = tensor([[[[105., 27., 93., 41.], [ 96., 27., 118., 18.], [ 44., 127., 70., 32.], ... [ 18., 102., 123., 77.], [ 99., 57., 5., 16.], [ 80., 83., 114., 84.]]]], device='cuda:0') w_q = tensor([[[[ 29., -127., -26.], [-119., -89., -5.], [ 61., -24., 5.]]],
conv_args = Conv2dArgs(in_channels=8, out_channels=16, kernel_size=(3, 3), stride=(1, 1), padding=(0, 0), dilation=(1, 1), groups=8) lut = tensor([[ 0, 0, 0, ..., 0, 0, 0], [ 0, 1, 2, ..., -3, -2, -1], [ 0, 2, 4, ..., -6, -4, -2]... ..., 9, 6, 3], [ 0, -2, -4, ..., 6, 4, 2], [ 0, -1, -2, ..., 3, 2, 1]], dtype=torch.int32) out_dims = (2, 2)
src/torchapprox/operators/conv2d.py:200: RuntimeError __ test_layer_fwd[cuda-weight_qconfig0-layer_config5] __
device = 'cuda' layer_config = (<class 'torch.nn.modules.conv.Conv2d'>, (2, 8, 4, 4), (8, 8, 3), {'groups': 8}) weight_qconfig = functools.partial(<class 'torch.ao.quantization.fake_quantize.FakeQuantize'>, observer=<class 'torch.ao.quantization.observer.MinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric, quant_min=-128, quant_max=127){}
test/test_approx_layer.py:165:
../anaconda3/envs/approx/lib/python3.10/site-packages/torch/nn/modules/module.py:1501: in _call_impl
    return forward_call(*args, **kwargs)
src/torchapprox/layers/approx_wrapper.py:60: in forward
    y_q = self.wrapped(x_q, x_scale, x_zero_point)
../anaconda3/envs/approx/lib/python3.10/site-packages/torch/nn/modules/module.py:1538: in _call_impl
    result = forward_call(*args, **kwargs)
src/torchapprox/layers/approx_conv2d.py:185: in forward
    return ApproxLayer.forward(self, x_q, x_scale, x_zero_point, bias)
src/torchapprox/layers/approx_layer.py:212: in forward
    y = self.approx_fwd(x, w, quant_params)
src/torchapprox/layers/approx_conv2d.py:155: in approx_fwd
    y = ApproxConv2dOp.apply(
../anaconda3/envs/approx/lib/python3.10/site-packages/torch/autograd/function.py:506: in apply
    return super().apply(*args, **kwargs) # type: ignore[misc]
src/torchapprox/operators/conv2d.py:237: in forward
    y_q = dwconv2d(x_q, w_q, lut, conv_args.stride, conv_args.padding)
x = <[RuntimeError('CUDA error: an illegal memory access was encountered\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n') raised in repr()] Tensor object at 0x7f91966b69d0>
w = <[RuntimeError('CUDA error: an illegal memory access was encountered\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n') raised in repr()] Tensor object at 0x7f91966b6d40>
lut = <[RuntimeError('CUDA error: an illegal memory access was encountered\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n') raised in repr()] Tensor object at 0x7f91966b4fe0>
stride = (1, 1), padding = (0, 0)
src/torchapprox/operators/backend.py:70: RuntimeError
=============================== warnings summary ===============================
../anaconda3/envs/approx/lib/python3.10/site-packages/torch/utils/cpp_extension.py:25
  /home/zhaojun/anaconda3/envs/approx/lib/python3.10/site-packages/torch/utils/cpp_extension.py:25: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
    from pkg_resources import packaging # type: ignore[attr-defined]
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED test/test_approx_layer.py::test_layer_fwd[cuda-weight_qconfig0-layer_config0]
FAILED test/test_approx_layer.py::test_layer_fwd[cuda-weight_qconfig0-layer_config1]
FAILED test/test_approx_layer.py::test_layer_fwd[cuda-weight_qconfig0-layer_config2]
FAILED test/test_approx_layer.py::test_layer_fwd[cuda-weight_qconfig0-layer_config3]
FAILED test/test_approx_layer.py::test_layer_fwd[cuda-weight_qconfig0-layer_config4]
FAILED test/test_approx_layer.py::test_layer_fwd[cuda-weight_qconfig0-layer_config5]
ERROR test/test_approx_layer.py::test_layer_fwd[cuda-weight_qconfig0-layer_config6]
ERROR test/test_approx_layer.py::test_layer_fwd[cuda-weight_qconfig1-layer_config0]
ERROR test/test_approx_layer.py::test_layer_fwd[cuda-weight_qconfig1-layer_config1]
......
benchmark
benchmarks/test_bench_torchapprox.py .F........................................F........................................F........................................F........................................F........ [ 70%] ................................F....................................... [100%]
=========================== short test summary info ============================
FAILED benchmarks/test_bench_torchapprox.py::test_bench_torchapprox[mobilenet_v2-lut] - AssertionError: LUT needs to be signed 32 Bit Integer
FAILED benchmarks/test_bench_torchapprox.py::test_bench_torchapprox[effcientnet_b0-lut] - AssertionError: LUT needs to be signed 32 Bit Integer
FAILED benchmarks/test_bench_torchapprox.py::test_bench_torchapprox[vgg16-lut] - AssertionError: LUT needs to be signed 32 Bit Integer
FAILED benchmarks/test_bench_torchapprox.py::test_bench_torchapprox[alexnet-lut] - AssertionError: LUT needs to be signed 32 Bit Integer
FAILED benchmarks/test_bench_torchapprox.py::test_bench_torchapprox[resnet18-lut] - AssertionError: LUT needs to be signed 32 Bit Integer
FAILED benchmarks/test_bench_torchapprox.py::test_bench_torchapprox[resnet50-lut] - AssertionError: LUT needs to be signed 32 Bit Integer