fastmachinelearning / hls4ml

Machine learning on FPGAs using HLS
https://fastmachinelearning.org/hls4ml
Apache License 2.0

Quartus Streaming Conv, Pooling & Image layers #656

Closed bo3z closed 1 year ago

bo3z commented 1 year ago

Description

  • Adds support for image-related layers (Conv 1D & 2D, Avg & Max Pooling, Global Pooling, Zero Padding, Upsampling) in io_stream, in a similar manner to Vivado
  • Conv 1D & 2D are implemented using a line buffer, similar to Vivado. The main difference is in how padding is handled for Conv layers: Vivado inserts a separate padding layer, whereas Quartus performs the padding inside the Conv layer itself (see the first sketch after this list). This approach stays in line with the Keras model graph and preserves the total number of layers.
  • Same padding is not supported for Pooling layers.
  • A custom struct was written to act as a shift register in hardware, since Intel HLS does not offer an out-of-the-box shift register (see the second sketch after this list). Any struct with a similar implementation (and meeting certain timing / loop requirements) will be synthesised as a shift register. This can be verified by viewing the synthesis report in report.html > Area Analysis of System
  • Upsampling and Zero Padding layers are written in a largely similar way to Vivado
  • Resource usage and latency results coming soon.
  • Transpose layer to be added soon.
  • Fixes a bug introduced by PR #561 affecting parallel transpose layers
  • It is recommended to review this PR commit by commit: each commit adds a single, self-contained piece of functionality, and the project can be compiled at each commit
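For illustration, here is a minimal sketch (my own assumption, not the hls4ml implementation) of how padding can be folded into the Conv layer itself: positions that fall outside the real input are fed zeros on the fly, so no separate ZeroPadding layer appears in the graph. The name padded_read and the template parameters are hypothetical.

template <class data_T, int IN_WIDTH, int PAD_LEFT, int PAD_RIGHT>
data_T padded_read(const data_T input[IN_WIDTH], int col) {
    // col ranges over the padded width [0, PAD_LEFT + IN_WIDTH + PAD_RIGHT)
    if (col < PAD_LEFT || col >= PAD_LEFT + IN_WIDTH) {
        return data_T(0); // outside the real input: supply the zero padding on the fly
    }
    return input[col - PAD_LEFT]; // inside the real input: pass the stored sample through
}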
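And a minimal sketch of the kind of struct Intel HLS can infer as a shift register (again an illustration under my own assumptions, not the exact code in this PR): every call shifts all elements by one slot in a fully unrolled loop, and that uniform access pattern is what lets the compiler map the array to a shift register.

template <typename T, int DEPTH>
struct ShiftRegister {
    T data[DEPTH];

    // Shift every element one position towards index 0 and insert the
    // newest sample at the back; the fully unrolled, uniform shift is
    // the pattern the compiler can synthesise as a shift register.
    void shift_in(T in) {
        #pragma unroll
        for (int i = 0; i < DEPTH - 1; i++) {
            data[i] = data[i + 1];
        }
        data[DEPTH - 1] = in;
    }

    T read(int pos) const { return data[pos]; }
};

Whether the compiler actually infers a shift register can then be checked in report.html > Area Analysis of System, as noted above.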

Tests

All of the existing tests were expanded to include tests for Quartus in io_stream. No new tests were written. A summary of the tests is given below.

  • test_keras_api.py - Ensures correct parsing of the layers in io_stream and correct syntax (no compilation errors) of Conv 1D & Conv 2D layers.
  • test_cnn_mnist.py, test_cnn_mnist_qkeras.py, test_conv1d.py - Verify the numerical accuracy and compilation of Conv 1D, Conv 2D, Max & Avg Pooling layers.
  • test_upsampling.py and test_zeropadding.py - Ensure numerical accuracy and successful compilation of Zero Padding and Upsampling layers.
  • test_globalpooling.py - Ensures numerical accuracy and successful compilation of Global Pooling layers.

Synthesis results

Below are results obtained through full Quartus synthesis of Conv2D layers for a fixed input (32x32x3) while varying the number of filters and the reuse factors. Other layers were tested for correct synthesis.

[Figure: Quartus synthesis results for Conv2D layers (32x32x3 input) across filter counts and reuse factors]


jmitrevs commented 1 year ago

pytest.activations is failing:

E       AssertionError: 
E       Not equal to tolerance rtol=0.02, atol=0.02
E       
E       Mismatched elements: 8000 / 8000 (100%)
E       Max absolute difference: 1.12238881
E       Max relative difference: 8914.97600568
E        x: array([[0.793945, 0.791992, 0.798828, ..., 0.804688, 0.791016, 0.799805],
E              [0.791016, 0.802734, 0.804688, ..., 0.799805, 0.799805, 0.794922],
E              [0.795898, 0.808594, 0.803711, ..., 0.793945, 0.796875, 0.801758],...
E        y: array([[-0.227973, -0.279667, -0.045713, ...,  0.226889, -0.28958 ,
E                0.031885],
E              [-0.292061,  0.154492,  0.214236, ...,  0.041079, -0.003215,...
test_activations.py:55: AssertionError

Can you see why?

bo3z commented 1 year ago

pytest.activations is failing: (error output quoted above) Can you see why?

This was addressed in PR #655, which has already been merged. The failure comes from the optimisation of the parallel Softsign in #585, which removed unnecessary values from the LUT but required corresponding changes in the indexing logic.
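For context, a hypothetical sketch (my own illustration, not the hls4ml source) of the kind of LUT optimisation involved: softsign(x) = x / (1 + |x|) is odd-symmetric, so the table only needs to store non-negative inputs, but the index logic must then take |x| and restore the sign afterwards. TABLE_SIZE, MAX_INPUT, and the function names are assumptions.

#include <cmath>

constexpr int TABLE_SIZE = 1024;  // assumed table depth
constexpr float MAX_INPUT = 8.0f; // assumed input range covered by the LUT

float softsign_table[TABLE_SIZE];

// Fill the table for x >= 0 only; softsign(-x) = -softsign(x) makes the
// negative half redundant -- the "unnecessary values" removed by the
// optimisation.
void init_softsign_table() {
    for (int i = 0; i < TABLE_SIZE; i++) {
        float x = MAX_INPUT * i / TABLE_SIZE;
        softsign_table[i] = x / (1.0f + x); // softsign(x) for x >= 0
    }
}

float softsign_lut(float x) {
    // Index with |x| and restore the sign afterwards. If the indexing were
    // left unchanged after halving the table, negative inputs would read
    // the wrong entries -- the kind of mismatch a tolerance test exposes.
    int index = static_cast<int>(std::fabs(x) / MAX_INPUT * TABLE_SIZE);
    if (index > TABLE_SIZE - 1) index = TABLE_SIZE - 1;
    float y = softsign_table[index];
    return (x < 0.0f) ? -y : y;
}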

jmitrevs commented 1 year ago

It generally looks good to me so I approved it. I sort of wanted to trigger the pytests again, but couldn't figure out how.

jmitrevs commented 1 year ago

I can merge it later today unless someone wants to check more.

vloncar commented 1 year ago

I need some more time to go through this.

vloncar commented 1 year ago

@jmitrevs All the issues have been resolved. Do you want to take another pass at this, or shall we merge it?

jmitrevs commented 1 year ago

Using a slightly older branch, I noticed that in a project I created, the using stream definition appears in both defines.h and nnet_helpers.h. Is that still the case, and is it needed? (I was hacking the definition in one file and got an error that the two definitions didn't match.)

vloncar commented 1 year ago

I removed the definitions from nnet_helpers.h. All tests (python compile, make and quartus compile) pass.

The only remaining issue with this PR is that the padding routines occasionally fail with a cryptic compiler error: Compiler Error: Multiple reflexive accesses from stream 'layer2_out' is not allowed. This happens for ZeroPadding1D/2D and Conv1D/2D (with same padding) under certain scenarios. It still needs some investigation, potentially with help from Intel, so I wouldn't block the merge just because of that. @jmitrevs?

jmitrevs commented 1 year ago

Just for completeness, this alternate unoptimized 1d padding implementation does not suffer from the error:

template<class data_T, class res_T, typename CONFIG_T>
void zeropad1d_cl(stream<data_T> &data, stream<res_T> &res) {

    // Buffer the full padded output, so each stream is read or written
    // from exactly one loop.
    res_T res_array[CONFIG_T::out_width];

    // Zero the whole output buffer first; the left/right padding stays zero.
    ZeroOutputArray:
    for (int i = 0; i < CONFIG_T::out_width; i++) {
        for (int j = 0; j < CONFIG_T::n_chan; j++) {
            res_array[i][j] = 0;
        }
    }

    // Copy the real input into the buffer, offset by the left padding.
    CopyMain:
    for (int i = 0; i < CONFIG_T::in_width; i++) {
        auto dataval = data.read();
        for (int j = 0; j < CONFIG_T::n_chan; j++) {
            res_array[i + CONFIG_T::pad_left][j] = dataval[j];
        }
    }

    // Stream the padded result out in a single pass.
    StreamOut:
    for (int i = 0; i < CONFIG_T::out_width; i++) {
        res.write(res_array[i]);
    }
}

Nevertheless, it is not clear to me why the version we have fails. I'll leave some time for comments, but if no one objects, we can merge this weekend.