FAILED & ERROR when running Unit Tests

RealJustinNi commented 12 months ago

Hi etrommer, I ran into errors when running the unit tests with "poetry run pytest test". I installed poetry in a conda environment (python=3.10.13) and cloned your code. Then I installed the packages with "poetry install --with "dev,extras""; the additional dependencies and the pre-commit hooks installed fine. However, several unit tests fail and all of the subsequent ones report errors. I also ran the benchmark; there are a few failures, but the rest seems fine. Could you help me solve these errors? Thanks :)

My CUDA version is 11.7. The output logs of the unit tests and the benchmark follow.

unit test

============================= test session starts ==============================
platform linux -- Python 3.10.13, pytest-7.4.2, pluggy-1.3.0
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /home/zhaojun/torch-approx
configfile: pyproject.toml
plugins: cov-3.0.0, benchmark-4.0.0
collected 436 items

test/test_approx_layer.py .............................FFFFFFEEEEEEEEEEE [ 10%]
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE [ 27%]
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE [ 37%]
test/test_approx_mm.py EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE [ 48%]
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE [ 64%]
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE [ 81%]
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE [ 96%]
test/test_dwconv2d.py EEEEEEEEEEEEEE [100%]

==================================== ERRORS ====================================
_ ERROR at setup of test_layer_fwd[cuda-weight_qconfig0-layer_config6]

@pytest.fixture(autouse=True)
def fix_seed():
    """
    Run before every test.
    - Fixes random seed to make test reproducible
    - Sets CUDA to blocking to allow for benchmarking of normally asynchronous kernels
    """
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    np.random.seed(42)

  torch.manual_seed(42)

test/conftest.py:36:

../anaconda3/envs/approx/lib/python3.10/site-packages/torch/random.py:40: in manual_seed
    torch.cuda.manual_seed_all(seed)
../anaconda3/envs/approx/lib/python3.10/site-packages/torch/cuda/random.py:113: in manual_seed_all
    _lazy_call(cb, seed_all=True)
../anaconda3/envs/approx/lib/python3.10/site-packages/torch/cuda/__init__.py:183: in _lazy_call
    callable()

def cb():
    for i in range(device_count()):
        default_generator = torch.cuda.default_generators[i]

      default_generator.manual_seed(seed)
E RuntimeError: CUDA error: an illegal memory access was encountered
E Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

../anaconda3/envs/approx/lib/python3.10/site-packages/torch/cuda/random.py:111: RuntimeError
_ ERROR at setup of test_layer_fwd[cuda-weight_qconfig1-layer_config0]

[............................... similar errors .............................................]
=================================== FAILURES ===================================
__ test_layer_fwd[cuda-weight_qconfig0-layer_config0] __

device = 'cuda' layer_config = (<class 'torch.nn.modules.linear.Linear'>, (4, 20), (20, 10), {}) weight_qconfig = functools.partial(<class 'torch.ao.quantization.fake_quantize.FakeQuantize'>, observer=<class 'torch.ao.quantization.observer.MinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric, quant_min=-128, quant_max=127){}

@pytest.mark.parametrize("layer_config", layer_configs)
@pytest.mark.parametrize("weight_qconfig", weight_quant_configs)
def test_layer_fwd(device, layer_config, weight_qconfig):
    input_dims = layer_config[1]
    layer, ref_layer = generate_models(layer_config, device, weight_qconfig)

    x = torch.rand(input_dims, device=device)
    xref = copy.deepcopy(x)

  y = layer(x)

test/test_approx_layer.py:165:

../anaconda3/envs/approx/lib/python3.10/site-packages/torch/nn/modules/module.py:1501: in _call_impl
    return forward_call(*args, **kwargs)
src/torchapprox/layers/approx_wrapper.py:60: in forward
    y_q = self.wrapped(x_q, x_scale, x_zero_point)
../anaconda3/envs/approx/lib/python3.10/site-packages/torch/nn/modules/module.py:1538: in _call_impl
    result = forward_call(*args, **kwargs)
src/torchapprox/layers/approx_layer.py:212: in forward
    y = self.approx_fwd(x, w, quant_params)
src/torchapprox/layers/approx_linear.py:46: in approx_fwd
    y = self.approx_op(x, w, quant_params, self.htp_model)
../anaconda3/envs/approx/lib/python3.10/site-packages/torch/nn/modules/module.py:1501: in _call_impl
    return forward_call(*args, **kwargs)
src/torchapprox/operators/lut.py:82: in forward
    return ApproxGeMM.apply(x, w, self.lut, quant_params, htp_model)
../anaconda3/envs/approx/lib/python3.10/site-packages/torch/autograd/function.py:506: in apply
    return super().apply(*args, **kwargs) # type: ignore[misc]

x = tensor([[0.8243, 0.2120, 0.7301, 0.3219, 0.7536, 0.2120, 0.9263, 0.1413, 0.3454, 0.9970, 0.5495, 0.2512, 0.92... 0.7536, 0.7065, 0.3297, 0.9106, 0.3925, 0.1727, 0.9813, 0.3690, 0.2591, 0.9185, 0.9891]], device='cuda:0') w = tensor([[ 0.0501, -0.2194, -0.0449, -0.2056, -0.1538, -0.0086, 0.1054, -0.0415, 0.0086, -0.0950, -0.1158, ...0225, 0.0881, -0.2074]], device='cuda:0', grad_fn=) lut = tensor([[ 0, 0, 0, ..., 0, 0, 0], [ 0, 1, 2, ..., -3, -2, -1], [ 0, 2, 4, ..., -6, -4, -2]... ..., 9, 6, 3], [ 0, -2, -4, ..., 6, 4, 2], [ 0, -1, -2, ..., 3, 2, 1]], dtype=torch.int32) quant_params = QuantizationParameters(x_scale=tensor([0.0079], device='cuda:0'), x_zero_point=tensor([0], device='cuda:0', dtype=torch.int32), w_scale=tensor([0.0017], device='cuda:0'), w_zero_point=tensor([0], device='cuda:0', dtype=torch.int32)) htp_model = None

@staticmethod
def forward(  # type: ignore
    x: torch.Tensor,
    w: torch.Tensor,
    lut: torch.Tensor,
    quant_params: "QuantizationParameters",
    htp_model: Optional[Callable],
) -> torch.Tensor:
    """
    Approximate forward operation
    """

    x_q = torch.round((x / quant_params.x_scale) + quant_params.x_zero_point)[
        :, None, :
    ]
    w_q = torch.round(
        (w / quant_params.w_scale[:, None]) + quant_params.w_zero_point[:, None]
    ).T

    if htp_model is None:

      y_q = approx(x_q.char(), w_q.char(), lut).float()
E RuntimeError: CUDA error: invalid argument
E Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

src/torchapprox/operators/approxgemm.py:39: RuntimeError
__ test_layer_fwd[cuda-weight_qconfig0-layer_config1] __

device = 'cuda' layer_config = (<class 'torch.nn.modules.conv.Conv2d'>, (2, 8, 4, 4), (8, 16, 3), {'groups': 1}) weight_qconfig = functools.partial(<class 'torch.ao.quantization.fake_quantize.FakeQuantize'>, observer=<class 'torch.ao.quantization.observer.MinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric, quant_min=-128, quant_max=127){}

@pytest.mark.parametrize("layer_config", layer_configs)
@pytest.mark.parametrize("weight_qconfig", weight_quant_configs)
def test_layer_fwd(device, layer_config, weight_qconfig):
    input_dims = layer_config[1]
    layer, ref_layer = generate_models(layer_config, device, weight_qconfig)

    x = torch.rand(input_dims, device=device)
    xref = copy.deepcopy(x)

  y = layer(x)

test/test_approx_layer.py:165:

../anaconda3/envs/approx/lib/python3.10/site-packages/torch/nn/modules/module.py:1501: in _call_impl
    return forward_call(*args, **kwargs)
src/torchapprox/layers/approx_wrapper.py:60: in forward
    y_q = self.wrapped(x_q, x_scale, x_zero_point)
../anaconda3/envs/approx/lib/python3.10/site-packages/torch/nn/modules/module.py:1538: in _call_impl
    result = forward_call(*args, **kwargs)
src/torchapprox/layers/approx_conv2d.py:185: in forward
    return ApproxLayer.forward(self, x_q, x_scale, x_zero_point, bias)
src/torchapprox/layers/approx_layer.py:212: in forward
    y = self.approx_fwd(x, w, quant_params)
src/torchapprox/layers/approx_conv2d.py:155: in approx_fwd
    y = ApproxConv2dOp.apply(
../anaconda3/envs/approx/lib/python3.10/site-packages/torch/autograd/function.py:506: in apply
    return super().apply(*args, **kwargs) # type: ignore[misc]
src/torchapprox/operators/conv2d.py:240: in forward
    y_q = _im2col_conv2d(x_q, w_q, conv_args, lut, out_dims)

x_q = tensor([[[[105., 27., 93., 41.], [ 96., 27., 118., 18.], [ 44., 127., 70., 32.], ... [ 18., 102., 123., 77.], [ 99., 57., 5., 16.], [ 80., 83., 114., 84.]]]], device='cuda:0') w_q = tensor([[[[ 29., -125., -26.], [-117., -88., -4.], [ 60., -24., 5.]],

     [[ -54.,...
     [[  22., -123.,   89.],
      [  25.,   91., -126.],
      [  62., -107.,  -40.]]]], device='cuda:0')

conv_args = Conv2dArgs(in_channels=8, out_channels=16, kernel_size=(3, 3), stride=(1, 1), padding=(0, 0), dilation=(1, 1), groups=1) lut = tensor([[ 0, 0, 0, ..., 0, 0, 0], [ 0, 1, 2, ..., -3, -2, -1], [ 0, 2, 4, ..., -6, -4, -2]... ..., 9, 6, 3], [ 0, -2, -4, ..., 6, 4, 2], [ 0, -1, -2, ..., 3, 2, 1]], dtype=torch.int32) out_dims = (2, 2)

def _im2col_conv2d(
    x_q: torch.FloatTensor,
    w_q: torch.FloatTensor,
    conv_args: Conv2dArgs,
    lut: torch.ShortTensor,
    out_dims: Tuple[int, int],
) -> torch.FloatTensor:
    # Pre-allocate output tensor
    y_q = torch.empty(
        x_q.size(0),
        conv_args.out_channels,
        math.prod(out_dims),
        device=x_q.device,
        dtype=torch.int32,
    )

    w_s8 = w_q.char()
    for group in range(conv_args.groups):
        # Calculate lower and upper channel index for current group
        in_ch_lower, in_ch_upper = _group_limits(
            group, conv_args.groups, conv_args.in_channels
        )
        out_ch_lower, out_ch_upper = _group_limits(
            group, conv_args.groups, conv_args.out_channels
        )

        # Im2Col operation
        x_unfold_s8 = torch.nn.functional.unfold(
            x_q[
                :,
                in_ch_lower:in_ch_upper,
                :,
            ],
            kernel_size=conv_args.kernel_size,
            padding=conv_args.padding,
            stride=conv_args.stride,
            dilation=conv_args.dilation,
        ).char()

        # Reshape weights to 2D
        w_flat_s8 = w_s8[out_ch_lower:out_ch_upper].view(
            conv_args.out_channels // conv_args.groups, -1
        )

        # ApproxGeMM

      y_q[:, out_ch_lower:out_ch_upper] = approx(
            w_flat_s8,
            x_unfold_s8,
            lut,
        )
E RuntimeError: CUDA error: invalid argument
E Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

src/torchapprox/operators/conv2d.py:200: RuntimeError

__ test_layer_fwd[cuda-weight_qconfig0-layer_config4] __

device = 'cuda' layer_config = (<class 'torch.nn.modules.conv.Conv2d'>, (2, 8, 4, 4), (8, 16, 3), {'groups': 8}) weight_qconfig = functools.partial(<class 'torch.ao.quantization.fake_quantize.FakeQuantize'>, observer=<class 'torch.ao.quantization.observer.MinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric, quant_min=-128, quant_max=127){}

@pytest.mark.parametrize("layer_config", layer_configs)
@pytest.mark.parametrize("weight_qconfig", weight_quant_configs)
def test_layer_fwd(device, layer_config, weight_qconfig):
    input_dims = layer_config[1]
    layer, ref_layer = generate_models(layer_config, device, weight_qconfig)

    x = torch.rand(input_dims, device=device)
    xref = copy.deepcopy(x)

  y = layer(x)

test/test_approx_layer.py:165:

../anaconda3/envs/approx/lib/python3.10/site-packages/torch/nn/modules/module.py:1501: in _call_impl
    return forward_call(*args, **kwargs)
src/torchapprox/layers/approx_wrapper.py:60: in forward
    y_q = self.wrapped(x_q, x_scale, x_zero_point)
../anaconda3/envs/approx/lib/python3.10/site-packages/torch/nn/modules/module.py:1538: in _call_impl
    result = forward_call(*args, **kwargs)
src/torchapprox/layers/approx_conv2d.py:185: in forward
    return ApproxLayer.forward(self, x_q, x_scale, x_zero_point, bias)
src/torchapprox/layers/approx_layer.py:212: in forward
    y = self.approx_fwd(x, w, quant_params)
src/torchapprox/layers/approx_conv2d.py:155: in approx_fwd
    y = ApproxConv2dOp.apply(
../anaconda3/envs/approx/lib/python3.10/site-packages/torch/autograd/function.py:506: in apply
    return super().apply(*args, **kwargs) # type: ignore[misc]
src/torchapprox/operators/conv2d.py:240: in forward
    y_q = _im2col_conv2d(x_q, w_q, conv_args, lut, out_dims)

x_q = tensor([[[[105., 27., 93., 41.], [ 96., 27., 118., 18.], [ 44., 127., 70., 32.], ... [ 18., 102., 123., 77.], [ 99., 57., 5., 16.], [ 80., 83., 114., 84.]]]], device='cuda:0') w_q = tensor([[[[ 29., -127., -26.], [-119., -89., -5.], [ 61., -24., 5.]]],

    [[[ -55...
    [[[-119., -126.,   53.],
      [ -20.,  118.,   20.],
      [  50.,   -8., -123.]]]], device='cuda:0')

conv_args = Conv2dArgs(in_channels=8, out_channels=16, kernel_size=(3, 3), stride=(1, 1), padding=(0, 0), dilation=(1, 1), groups=8) lut = tensor([[ 0, 0, 0, ..., 0, 0, 0], [ 0, 1, 2, ..., -3, -2, -1], [ 0, 2, 4, ..., -6, -4, -2]... ..., 9, 6, 3], [ 0, -2, -4, ..., 6, 4, 2], [ 0, -1, -2, ..., 3, 2, 1]], dtype=torch.int32) out_dims = (2, 2)

def _im2col_conv2d(
    x_q: torch.FloatTensor,
    w_q: torch.FloatTensor,
    conv_args: Conv2dArgs,
    lut: torch.ShortTensor,
    out_dims: Tuple[int, int],
) -> torch.FloatTensor:
    # Pre-allocate output tensor
    y_q = torch.empty(
        x_q.size(0),
        conv_args.out_channels,
        math.prod(out_dims),
        device=x_q.device,
        dtype=torch.int32,
    )

    w_s8 = w_q.char()
    for group in range(conv_args.groups):
        # Calculate lower and upper channel index for current group
        in_ch_lower, in_ch_upper = _group_limits(
            group, conv_args.groups, conv_args.in_channels
        )
        out_ch_lower, out_ch_upper = _group_limits(
            group, conv_args.groups, conv_args.out_channels
        )

        # Im2Col operation
        x_unfold_s8 = torch.nn.functional.unfold(
            x_q[
                :,
                in_ch_lower:in_ch_upper,
                :,
            ],
            kernel_size=conv_args.kernel_size,
            padding=conv_args.padding,
            stride=conv_args.stride,
            dilation=conv_args.dilation,
        ).char()

        # Reshape weights to 2D
        w_flat_s8 = w_s8[out_ch_lower:out_ch_upper].view(
            conv_args.out_channels // conv_args.groups, -1
        )

        # ApproxGeMM

      y_q[:, out_ch_lower:out_ch_upper] = approx(
            w_flat_s8,
            x_unfold_s8,
            lut,
        )
E RuntimeError: CUDA error: invalid argument
E Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

src/torchapprox/operators/conv2d.py:200: RuntimeError
__ test_layer_fwd[cuda-weight_qconfig0-layer_config5] __

device = 'cuda' layer_config = (<class 'torch.nn.modules.conv.Conv2d'>, (2, 8, 4, 4), (8, 8, 3), {'groups': 8}) weight_qconfig = functools.partial(<class 'torch.ao.quantization.fake_quantize.FakeQuantize'>, observer=<class 'torch.ao.quantization.observer.MinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric, quant_min=-128, quant_max=127){}

@pytest.mark.parametrize("layer_config", layer_configs)
@pytest.mark.parametrize("weight_qconfig", weight_quant_configs)
def test_layer_fwd(device, layer_config, weight_qconfig):
    input_dims = layer_config[1]
    layer, ref_layer = generate_models(layer_config, device, weight_qconfig)

    x = torch.rand(input_dims, device=device)
    xref = copy.deepcopy(x)

  y = layer(x)

test/test_approx_layer.py:165:

../anaconda3/envs/approx/lib/python3.10/site-packages/torch/nn/modules/module.py:1501: in _call_impl
    return forward_call(*args, **kwargs)
src/torchapprox/layers/approx_wrapper.py:60: in forward
    y_q = self.wrapped(x_q, x_scale, x_zero_point)
../anaconda3/envs/approx/lib/python3.10/site-packages/torch/nn/modules/module.py:1538: in _call_impl
    result = forward_call(*args, **kwargs)
src/torchapprox/layers/approx_conv2d.py:185: in forward
    return ApproxLayer.forward(self, x_q, x_scale, x_zero_point, bias)
src/torchapprox/layers/approx_layer.py:212: in forward
    y = self.approx_fwd(x, w, quant_params)
src/torchapprox/layers/approx_conv2d.py:155: in approx_fwd
    y = ApproxConv2dOp.apply(
../anaconda3/envs/approx/lib/python3.10/site-packages/torch/autograd/function.py:506: in apply
    return super().apply(*args, **kwargs) # type: ignore[misc]
src/torchapprox/operators/conv2d.py:237: in forward
    y_q = dwconv2d(x_q, w_q, lut, conv_args.stride, conv_args.padding)

x = <[RuntimeError('CUDA error: an illegal memory access was encountered\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n') raised in repr()] Tensor object at 0x7f91966b69d0> w = <[RuntimeError('CUDA error: an illegal memory access was encountered\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n') raised in repr()] Tensor object at 0x7f91966b6d40> lut = <[RuntimeError('CUDA error: an illegal memory access was encountered\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n') raised in repr()] Tensor object at 0x7f91966b4fe0> stride = (1, 1), padding = (0, 0)

def dwconv2d(
    x: torch.FloatTensor,
    w: torch.FloatTensor,
    lut: torch.ShortTensor,
    stride: int = 1,
    padding: int = 0,
) -> torch.FloatTensor:
    """
    Approximate 2D Depthwise Convolution
    """
    x = x.char()
    w = w.char()

    assert x.device == w.device
    assert x.is_cuda
    assert (
        x.dtype == w.dtype == torch.int8
    ), "Input operands need to be 8-Bit signed Integer"
    assert lut.dtype == torch.int32, "LUT needs to be 32 bit signed Integer"

    def make_tuple(val):
        if not isinstance(val, tuple):
            return (val, val)
        return val

    stride = make_tuple(stride)
    padding = make_tuple(padding)

    lut = lut.to(x.device)
    small = ta_backend.use_dwconv2d_small(x, w, 1, 1, *stride, *padding)
    if small:
        out = ta_backend.dwconv2d_small(x, w, lut, 1, 1, *stride, *padding, True)
    else:
        out = ta_backend.dwconv2d(x, w, lut, 1, 1, *stride, *padding, *padding, True)

  return out.float()
E RuntimeError: CUDA error: an illegal memory access was encountered
E Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

src/torchapprox/operators/backend.py:70: RuntimeError
=============================== warnings summary ===============================
../anaconda3/envs/approx/lib/python3.10/site-packages/torch/utils/cpp_extension.py:25
  /home/zhaojun/anaconda3/envs/approx/lib/python3.10/site-packages/torch/utils/cpp_extension.py:25: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
    from pkg_resources import packaging # type: ignore[attr-defined]

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED test/test_approx_layer.py::test_layer_fwd[cuda-weight_qconfig0-layer_config0]
FAILED test/test_approx_layer.py::test_layer_fwd[cuda-weight_qconfig0-layer_config1]
FAILED test/test_approx_layer.py::test_layer_fwd[cuda-weight_qconfig0-layer_config2]
FAILED test/test_approx_layer.py::test_layer_fwd[cuda-weight_qconfig0-layer_config3]
FAILED test/test_approx_layer.py::test_layer_fwd[cuda-weight_qconfig0-layer_config4]
FAILED test/test_approx_layer.py::test_layer_fwd[cuda-weight_qconfig0-layer_config5]
ERROR test/test_approx_layer.py::test_layer_fwd[cuda-weight_qconfig0-layer_config6]
ERROR test/test_approx_layer.py::test_layer_fwd[cuda-weight_qconfig1-layer_config0]
ERROR test/test_approx_layer.py::test_layer_fwd[cuda-weight_qconfig1-layer_config1]
......

benchmark

benchmarks/test_bench_torchapprox.py .F........................................F........................................F........................................F........................................F........ [ 70%] ................................F....................................... [100%]

======================================================================= short test summary info =================================================================================================
FAILED benchmarks/test_bench_torchapprox.py::test_bench_torchapprox[mobilenet_v2-lut] - AssertionError: LUT needs to be signed 32 Bit Integer
FAILED benchmarks/test_bench_torchapprox.py::test_bench_torchapprox[effcientnet_b0-lut] - AssertionError: LUT needs to be signed 32 Bit Integer
FAILED benchmarks/test_bench_torchapprox.py::test_bench_torchapprox[vgg16-lut] - AssertionError: LUT needs to be signed 32 Bit Integer
FAILED benchmarks/test_bench_torchapprox.py::test_bench_torchapprox[alexnet-lut] - AssertionError: LUT needs to be signed 32 Bit Integer
FAILED benchmarks/test_bench_torchapprox.py::test_bench_torchapprox[resnet18-lut] - AssertionError: LUT needs to be signed 32 Bit Integer
FAILED benchmarks/test_bench_torchapprox.py::test_bench_torchapprox[resnet50-lut] - AssertionError: LUT needs to be signed 32 Bit Integer

RealJustinNi commented 12 months ago

Hi, I got the unit tests to pass by installing torch-approx via pip. I also modified src/operators/lut.py at line 57 to change the dtype of the LUT from int16 to int32, which resolved the last error as well. However, I still have a question about QAT with the approximate LUT: training is extremely slow, and my terminal shows no output even though the process keeps running.
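For anyone hitting the same "LUT needs to be signed 32 Bit Integer" assertion: the LUT values printed in the failure log (rows starting 0, 1, 2 and ending -3, -2, -1) suggest a 256x256 table indexed by the two's-complement bit pattern of each int8 operand. Below is a minimal pure-Python sketch of such a table for exact multiplication. The layout is inferred from the printed values only, and the helper names (to_signed8, build_exact_lut, lut_mul) are made up for illustration; this is not the TorchApprox source.

```python
def to_signed8(index: int) -> int:
    """Interpret a table index in 0..255 as a two's-complement int8 value."""
    return index - 256 if index >= 128 else index


def build_exact_lut() -> list:
    """Build a 256x256 table holding the exact product of every int8 pair."""
    return [
        [to_signed8(row) * to_signed8(col) for col in range(256)]
        for row in range(256)
    ]


def lut_mul(lut: list, a: int, b: int) -> int:
    """Look up a * b for signed 8-bit operands via their unsigned bit patterns."""
    return lut[a & 0xFF][b & 0xFF]


lut = build_exact_lut()

# Largest magnitude stored anywhere in the exact table.
largest = max(abs(value) for row in lut for value in row)

print(lut[1][:3], lut[1][-3:])   # first and last entries of the row for operand 1
print(lut_mul(lut, -7, 5))       # signed lookup
print(largest)
```

The largest magnitude in an exact 8-bit table is 16384 (from -128 * -128), which would also fit into int16, so the int32 requirement presumably reflects the element type the kernels expect rather than the value range.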

etrommer commented 12 months ago

Thank you for giving it a try and reporting the issue in the benchmarks. I've just pushed a fix that sets the LUT to the correct size in the benchmarks.

For the approximate training, a slightly lower throughput is to be expected because the LUT kernel implementation is never going to be as efficient as a regular operation. Especially for small models, this should not be too significant, though. What size of model are you trying out?

I have noticed that the observer implementation from PyTorch causes a significant overhead. Can you try running:

# model is any PyTorch model
model.apply(torch.ao.quantization.disable_observer)

https://pytorch.org/docs/stable/generated/torch.ao.quantization.fake_quantize.disable_observer.html

before you start training on your model to see if that fixes the slow training?
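To illustrate what disabling an observer changes, here is a toy, pure-Python stand-in for a symmetric min/max observer. It is only a sketch of the idea (range statistics stop being updated, so the scale stays frozen and the per-batch scan is skipped); the class ToyMinMaxObserver and its attributes are hypothetical and do not mirror PyTorch's implementation.

```python
class ToyMinMaxObserver:
    """Toy stand-in for a min/max range observer (not PyTorch's class)."""

    def __init__(self) -> None:
        self.enabled = True
        self.min_val = float("inf")
        self.max_val = float("-inf")
        self.updates = 0

    def observe(self, batch) -> None:
        """Scan a batch and widen the tracked range, unless disabled."""
        if not self.enabled:
            return  # no scan, no update: the tracked range stays frozen
        self.min_val = min(self.min_val, min(batch))
        self.max_val = max(self.max_val, max(batch))
        self.updates += 1

    def scale(self, quant_max: int = 127) -> float:
        """Symmetric per-tensor scale derived from the tracked range."""
        return max(abs(self.min_val), abs(self.max_val)) / quant_max


observer = ToyMinMaxObserver()
observer.observe([0.0, 0.25, 1.0])  # calibration batch with values in [0, 1]
observer.enabled = False            # analogous in spirit to disabling the observer
observer.observe([0.0, 5.0])        # ignored: statistics are frozen
frozen_scale = observer.scale()
print(observer.updates, frozen_scale)
```

With inputs in [0, 1] and quant_max=127, the frozen scale is 1/127, which is consistent with the x_scale of roughly 0.0079 printed in the failure log above.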

PyTorch has just released an entirely new Quantization API (https://pytorch.org/tutorials/prototype/pt2e_quant_ptq.html), which I hope will resolve this.

RealJustinNi commented 12 months ago

Thank you so much for your prompt response! I appreciate your assistance.

I trained a VGG-11 network on the CIFAR-10 dataset with a batch size of 2048. The training process was divided into three stages. In the first stage, I trained the model for 2 epochs to measure the baseline speed. In the second stage, I quantized the model (using wrap_quantizable and qat from your Quick Start) and trained with QAT for an additional 2 epochs. Finally, in the third stage, I used a lookup table for approximate multiplication and trained for another 2 epochs.

It seems that my impatience might have led to some confusion, but the network is indeed running. The training times for the three stages are approximately 1x, 4.8x, and 16.1x, respectively. This indeed suggests that LUT is relatively inefficient!
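To separate the cost of the LUT from the cost of quantization-aware training itself, the three relative times can be compared pairwise. A small sketch using only the numbers reported above (the stage labels are my own shorthand):

```python
# Relative training time per stage, as reported above (baseline = 1x).
relative_time = {"baseline": 1.0, "QAT": 4.8, "QAT + LUT": 16.1}


def slowdown(stage: str, reference: str) -> float:
    """How many times slower `stage` is compared to `reference`."""
    return relative_time[stage] / relative_time[reference]


lut_vs_qat = slowdown("QAT + LUT", "QAT")
print(f"LUT on top of QAT: {lut_vs_qat:.2f}x")
```

So roughly a factor of 3.4 of the total 16.1x is attributable to the LUT stage relative to plain QAT; the remaining factor of 4.8 is already present with QAT alone.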

Continuing with my exploration, I tested using htp_models_mul8s["accurate"]. Apart from a slightly longer initial epoch, the subsequent epochs took nearly the same time as the full precision training. It's over ten times faster than LUT!

I have two questions to seek your advice. First, is your work intended for implementing int8 arithmetic multiplication on GPU in PyTorch? Is it possible to extend it to FP approximate multiplication, such as mantissa arithmetic multiplication? Second, have you attempted to evaluate the network's accuracy during convergence using this framework?

Thank you very much :)

etrommer commented 11 months ago

First to answer your questions:

For accurate int8 training and inference on GPUs, I think there are more feature-rich and optimized frameworks available [1]. The focus of TorchApprox is the simulation of and retraining for non-accurate/custom product functions that are meant to be used for inference on constrained systems (i.e. accelerators/FPGAs/etc.), so I would say that utilizing the int8 capabilities of GPUs is outside the scope of this work.
For the same reason, I have not investigated whether the concept extends to FP16 approximate multipliers, as - in my experience - int8 and below are the preferred quantization for deploying on constrained platforms. I am currently working on a follow-up publication that extends the concept to integer logarithmic multipliers [2], though! This work will also contain an evaluation of retraining throughput vs. achieved accuracy for a number of different retraining methods, including HTP.
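As a rough illustration of what an integer logarithmic multiplier computes, here is a pure-Python sketch of Mitchell's classic logarithm-based approximation for unsigned integers. It is chosen only because it is the best-known member of that family; it is not taken from [2] or from TorchApprox, it ignores signed operands, and the function name mitchell_mul is made up.

```python
def mitchell_mul(a: int, b: int) -> int:
    """Approximate a * b for non-negative ints using Mitchell's method.

    Each operand is written as 2**k * (1 + x) with 0 <= x < 1, and
    log2(1 + x) is approximated by x, so the product becomes a sum.
    """
    if a == 0 or b == 0:
        return 0
    k1 = a.bit_length() - 1  # position of the leading one of a
    k2 = b.bit_length() - 1  # position of the leading one of b
    # (x1 + x2) scaled by 2**(k1 + k2), kept as an exact integer.
    frac_sum = ((a - (1 << k1)) << k2) + ((b - (1 << k2)) << k1)
    base = 1 << (k1 + k2)
    if frac_sum < base:
        # x1 + x2 < 1: result is 2**(k1 + k2) * (1 + x1 + x2)
        return base + frac_sum
    # x1 + x2 >= 1: result is 2**(k1 + k2 + 1) * (x1 + x2)
    return 2 * frac_sum


# Worst-case relative error over all non-zero unsigned 8-bit operand pairs.
worst = max(
    (a * b - mitchell_mul(a, b)) / (a * b)
    for a in range(1, 256)
    for b in range(1, 256)
)
print(mitchell_mul(5, 6), 5 * 6)
print(worst)
```

Powers of two are multiplied exactly, every other result is an underestimate, and the worst-case relative error over 8-bit operands is 1/9 (about 11.1%).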

htp_models_mul8s["accurate"] is the accurate baseline, i.e. it only models an accurate product function, not an approximate one, so it is expected to be the fastest. If you want a more representative number, you should pick an HTP model with more coefficients, e.g. this one: https://github.com/etrommer/torch-approx/blob/1a5c9aceb7d716fcf66a148367761cd196533c47/src/torchapprox/operators/htp_models/htp_models_mul8s.py#L39-L47 HTP should still be significantly faster than LUT, though. Its major caveat is that it might not always be applicable, depending on the approximate product function being simulated, whereas LUT is universally applicable as long as the operands are sufficiently small.
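To make "as long as the operands are sufficiently small" concrete: a full two-operand lookup table needs one entry per operand pair, so its size grows with 2^(2n) for n-bit operands. The sketch below assumes 4 bytes per entry (int32, as in the assertion message earlier in this thread); the function names are hypothetical.

```python
def lut_entries(bits: int) -> int:
    """Number of entries in a full table for two operands of `bits` bits each."""
    return (1 << bits) ** 2


def lut_bytes(bits: int, bytes_per_entry: int = 4) -> int:
    """Memory footprint of the table, assuming int32 entries by default."""
    return lut_entries(bits) * bytes_per_entry


sizes = {bits: lut_bytes(bits) for bits in (4, 8, 12, 16)}
for bits, size in sizes.items():
    print(f"{bits:2d}-bit operands: {size / 2**20:.3f} MiB")
```

Under these assumptions an 8-bit table takes 256 KiB, while a 16-bit table would already need 16 GiB.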

[1] https://github.com/TimDettmers/bitsandbytes
[2] https://ieeexplore.ieee.org/document/8532287

RealJustinNi commented 11 months ago

Thanks for your answers. I apologize for the typo. What I meant to say is that I understand it as approximate multiplication, not exact arithmetic multiplication. For current large models, storage is more expensive than computation. Therefore, 8-bit representation may be more hardware-friendly than 16-bit representation. In fact, I am also researching methods and hardware circuits for approximate computation. Currently, I am working with floating-point numbers rather than integers and exploring approximate-aware neural network training methods.

In summary, your project has inspired me greatly, and I appreciate your continuous responses. Wishing you happiness in your personal life and great progress in research endeavors. ๐•ᴗ•๐

etrommer / torch-approx

FAILED & ERROR when running Unit Tests #18
