nhanvtran commented 6 years ago

So far we have 2 working modes -- one for LHC trigger (low reuse factor, 1-6, weights in the fabric) and one for "naive" serial mode (see PR #45).

One interesting mode is a very large reuse factor and weights stored in the fabric. This is particularly useful for really big networks and SDAccel-like use cases.

However, playing around with the current code, I found that it's not so trivial to store weights in BRAMs and run with #pragma HLS PIPELINE. HLS always wants to partition the weight array.

@ejk43 you have any ideas here? I thought we could remove the PIPELINE and go back to the days of DATAFLOW pragma...

ejk43 commented 6 years ago

Just found this issue again... A few thoughts...

Serial mode BRAMs are working now for dense layers
Storing weights in BRAMs in parallel mode (using high reuse factors) should be feasible. I may take a look at this. I doubt removing the top-level PIPELINE is a good idea -- it just works so well for so many algorithms-- but it sounds like it should be feasible.
Does this issue only apply to dense layers? Or do you want to consider conv layers too? Any other high-priority layers that would want to save weights in BRAM (maybe LSTM)?

nhanvtran commented 6 years ago

Storing weights in BRAMs in parallel mode (using high reuse factors) should be feasible. I may take a look at this. I doubt removing the top-level PIPELINE is a good idea -- it just works so well for so many algorithms-- but it sounds like it should be feasible.

This would be good to conceptualize. We agree it should be possible -- the question was how clunky it would be. We thought that one would start by making small arrays with length == reuse factor and read off the arrays from BRAMs in that way, but there are complications with pruned models. Maybe you have some more native way to implement this? @GiuseppeDiGuglielmo was also thinking about the memory storage perhaps in other global schemes including this issue so maybe it'll be good to stay in touch here

Does this issue only apply to dense layers? Or do you want to consider conv layers too? Any other high-priority layers that would want to save weights in BRAM (maybe LSTM)?

I think we want to extend this to many architectures but I think we agree that it's most transferrable from dense to conv. This goes with your other comment about serial mode in conv. The LSTM case is interesting ... in the static case it's not as important as the fully unrolled case.

violatingcp commented 6 years ago

For the store in b-ram functionality. I think we can follow the sub-layer setup. Basically when we start splitting the matrices into smaller ones to limit eh multiplications we can store the weights of the elements not being multiplied in block ram. I will run some quick tests to see how this works, but it seems the most natural use for me. Also if this is the case, it should be easy to add this to any of the setups that already using the sublayer tech (dense/LSTM/conv?)

On Tue, Sep 4, 2018 at 11:37 PM Nhan Tran notifications@github.com wrote:

Storing weights in BRAMs in parallel mode (using high reuse factors) should be feasible. I may take a look at this. I doubt removing the top-level PIPELINE is a good idea -- it just works so well for so many algorithms-- but it sounds like it should be feasible.

This would be good to conceptualize. We agree it should be possible -- the question was how clunky it would be. We thought that one would start by making small arrays with length == reuse factor and read off the arrays from BRAMs in that way, but there are complications with pruned models. Maybe you have some more native way to implement this? @GiuseppeDiGuglielmo https://github.com/GiuseppeDiGuglielmo was also thinking about the memory storage perhaps in other global schemes including this issue so maybe it'll be good to stay in touch here

Does this issue only apply to dense layers? Or do you want to consider conv layers too? Any other high-priority layers that would want to save weights in BRAM (maybe LSTM)?

I think we want to extend this to many architectures but I think we agree that it's most transferrable from dense to conv. This goes with your other comment about serial mode in conv. The LSTM case is interesting ... in the static case it's not as important as the fully unrolled case.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/hls-fpga-machine-learning/hls4ml/issues/46#issuecomment-418589199, or mute the thread https://github.com/notifications/unsubscribe-auth/AE7V4WyONKD11LldxE55q34HoEG1SnH5ks5uX0cIgaJpZM4SK9fM .

ejk43 commented 6 years ago

I have a possible implementation here: https://github.com/hls-fpga-machine-learning/hls4ml/commits/ejk/add-brams-to-parallel-mode

This involves specifying a cyclic partition on the "weights" variable, since that is what we want to put into BRAM, and applying a RESOURCE pragma. The partitioning will only apply if there is both reuse_factor > 1 and store_weights_in_bram == true.

I've tested on the KERAS_1layer model so far. I set reuse=16 and store_weights_in_bram=true. The results seem to instantiate the correct number of BRAMs based on the reuse factor, but the estimated fabric usage is identical to the original fabric usage. So overall I'm not convinced this is totally working....

+-----------------+---------+-------+--------+--------+
|       Name      | BRAM_18K| DSP48E|   FF   |   LUT  |
+-----------------+---------+-------+--------+--------+
|DSP              |        -|      -|       -|       -|
|Expression       |        -|      -|       0|       6|
|FIFO             |        -|      -|       -|       -|
|Instance         |       21|     22|    8663|   13514|
|Memory           |        -|      -|       -|       -|
|Multiplexer      |        -|      -|       -|     139|
|Register         |        -|      -|    1372|       -|
+-----------------+---------+-------+--------+--------+
|Total            |       21|     22|   10035|   13659|
+-----------------+---------+-------+--------+--------+
|Available        |     2940|   3600|  866400|  433200|
+-----------------+---------+-------+--------+--------+
|Utilization (%)  |    ~0   |   ~0  |       1|       3|
+-----------------+---------+-------+--------+--------+

One hypothesis for higher-than-expected fabric usage is that the division between the "multiply" operation and the "accumulate" operation may be throwing off synthesis; Perhaps, for example, the same amount of fabric that was previously used as weights lookups in the multiply operation is now used as temporary storage (or pipelining) when waiting for the rest of the multiply operation to complete before going to the accumulate operation. Ideally, each step of the multiply operation would reuse the DSPs and logic, and simply read in the corresponding weights from BRAM, then progress immediately into the accumulator. Unfortunately, if I recombine the multiply and accumulate operations in the C code, synthesis estimates about 1.5x higher than the worst amount of resources I've seen with the loops uncombined, so this might be a false lead?

violatingcp commented 6 years ago

Hi EJ,

Dylan and I were chatting today about basically the same thing and apparently, Javier was chatting with Nhan about this too. I think we are all converging on the same thing. The reuse factor is the most natural way to get this to work.

I have been running tests with a bunch of matrix multiplication. I setup multiplication with a 2D array, where I can partition things into block RAM in a nice way, then I went to the 1D version. The final code I added is this

pragma HLS RESOURCE variable=weights core=ROM_2P_BRAM

pragma HLS ARRAY_RESHAPE variable=weights factor=cycle_factor*2

I had to add the factor of two, I guess this is a feature of the reshape. In any case with this I get the expected performance. I can make the PR tomorrow, or you can check quickly. One thing that we will have to work out is how to get zero suppression when we build the BRAM. This is a bit complicated, we might want to add this in the keras-to-hls step. Not sure.

Anyway Below are the resources,

Thanks Phil

Without BRAM using PIPELINE to mimic BRAM Resources

Summary:

+-----------------+---------+-------+--------+--------+

+-----------------+---------+-------+--------+--------+

|DSP | -| -| -| -|

|Expression | -| -| 0| 2|

|FIFO | -| -| -| -|

|Instance | 13| 46| 12962| 24736|

|Memory | -| -| -| -|

|Multiplexer | -| -| -| 233|

|Register | -| -| 1398| -|

+-----------------+---------+-------+--------+--------+

|Total | 13| 46| 14360| 24971|

+-----------------+---------+-------+--------+--------+

|Available | 2940| 3600| 866400| 433200|

+-----------------+---------+-------+--------+--------+

|Utilization (%) | ~0 | 1| 1| 5|

+-----------------+---------+-------+--------+--------+

Latency (clock cycles):
- Summary:
+-----+-----+-----+-----+---------+

| Latency | Interval | Pipeline|

| min | max | min | max | Type |

+-----+-----+-----+-----+---------+

| 43| 43| 43| 43| none |

+-----+-----+-----+-----+---------+

Note the smaller latency is becuase of zero suppression

vs

With BRAM using UNROLL or PIPELINE (Results are the same)

Summary:

+-----------------+---------+-------+--------+--------+

+-----------------+---------+-------+--------+--------+

|DSP | -| -| -| -|

|Expression | -| -| 0| 2|

|FIFO | -| -| -| -|

|Instance | 58| 46| 18551| 19218|

|Memory | -| -| -| -|

|Multiplexer | -| -| -| 305|

|Register | -| -| 1416| -|

+-----------------+---------+-------+--------+--------+

|Total | 58| 46| 19967| 19525|

+-----------------+---------+-------+--------+--------+

|Available | 2940| 3600| 866400| 433200|

+-----------------+---------+-------+--------+--------+

|Utilization (%) | 1| 1| 2| 4|

+-----------------+---------+-------+--------+--------+

Latency (clock cycles):
- Summary:
+-----+-----+-----+-----+---------+

| Latency | Interval | Pipeline|

| min | max | min | max | Type |

+-----+-----+-----+-----+---------+

| 61| 61| 61| 61| none |

+-----+-----+-----+-----+---------+

with cyclic partition

Summary:

+-----------------+---------+-------+--------+--------+

+-----------------+---------+-------+--------+--------+

|DSP | -| -| -| -|

|Expression | -| -| 0| 2|

|FIFO | -| -| -| -|

|Instance | 58| 46| 17419| 20389|

|Memory | -| -| -| -|

|Multiplexer | -| -| -| 305|

|Register | -| -| 1416| -|

+-----------------+---------+-------+--------+--------+

|Total | 58| 46| 18835| 20696|

+-----------------+---------+-------+--------+--------+

|Available | 2940| 3600| 866400| 433200|

+-----------------+---------+-------+--------+--------+

|Utilization (%) | 1| 1| 2| 4|

+-----------------+---------+-------+--------+--------+

On Tue, Sep 11, 2018 at 9:29 PM EJ Kreinar notifications@github.com wrote:

I have a possible implementation here: https://github.com/hls-fpga-machine-learning/hls4ml/commits/ejk/add-brams-to-parallel-mode

This involves specifying a cyclic partition on the "weights" variable, since that is what we want to put into BRAM, and applying a RESOURCE pragma. The partitioning will only apply if there is both reuse_factor > 1 and store_weights_in_bram == true.

I've tested on the KERAS_1layer model so far. I set reuse=16 and store_weights_in_bram=true. The results seem to instantiate the correct number of BRAMs based on the reuse factor, but the estimated fabric usage is identical to the original fabric usage. So overall I'm not convinced this is totally working....

+-----------------+---------+-------+--------+--------+ | Name | BRAM_18K| DSP48E| FF | LUT | +-----------------+---------+-------+--------+--------+ |DSP | -| -| -| -| |Expression | -| -| 0| 6| |FIFO | -| -| -| -| |Instance | 21| 22| 8663| 13514| |Memory | -| -| -| -| |Multiplexer | -| -| -| 139| |Register | -| -| 1372| -| +-----------------+---------+-------+--------+--------+ |Total | 21| 22| 10035| 13659| +-----------------+---------+-------+--------+--------+ |Available | 2940| 3600| 866400| 433200| +-----------------+---------+-------+--------+--------+ |Utilization (%) | ~0 | ~0 | 1| 3| +-----------------+---------+-------+--------+--------+

One hypothesis for higher-than-expected fabric usage is that the division between the "multiply" operation and the "accumulate" operation may be throwing off synthesis; Perhaps, for example, the same amount of fabric that was previously used as weights lookups in the multiply operation is now used as temporary storage (or pipelining) when waiting for the rest of the multiply operation to complete before going to the accumulate operation. Ideally, each step of the multiply operation would reuse the DSPs and logic, and simply read in the corresponding weights from BRAM, then progress immediately into the accumulator. Unfortunately, if I recombine the multiply and accumulate operations in the C code, synthesis estimates about 1.5x higher than the worst amount of resources I've seen with the loops uncombined, so this might be a false lead?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/hls-fpga-machine-learning/hls4ml/issues/46#issuecomment-420480774, or mute the thread https://github.com/notifications/unsubscribe-auth/AE7V4f-SZfADwc0tGzdxOJUa5WI_1388ks5uaGN2gaJpZM4SK9fM .

nhanvtran commented 6 years ago

hi @ejk43 @violatingcp @GiuseppeDiGuglielmo @jmduarte

I've been playing with this branch ejk/add-brams-to-parallel-mode and I'm pretty sure we can use the block partitioning as an alternative solution to sublayer (and also can store weights in BRAM!)

I've been messing with this block of code: https://github.com/hls-fpga-machine-learning/hls4ml/blob/ejk/add-brams-to-parallel-mode/nnet_utils/nnet_layer.h#L62-L84

Replacing it with the code below (and a proper reuse factor) gets rid of the dreaded partition error (testing with 3-layer, uncompressed) and still retains the nice features of reuse factor. Then it's just an extra flag to store the weight in BRAM.

If we can get this type of construction to work with pruning -- I think it's basically the recipe for getting to larger designs seamlessly.

Of course we should check the automated testing that it's not breaking other stuff...

    if (CONFIG_T::io_type == io_parallel){
        // For parallel inputs:
        //   - completely partition arrays -- target fabric
        //   - if we have an unroll factor, limit number of multipliers
        #pragma HLS PIPELINE II=CONFIG_T::reuse_factor

        // #pragma HLS ARRAY_PARTITION variable=weights complete // remove this line for now, it breaks compression sometimes
        #pragma HLS ARRAY_PARTITION variable=biases complete
        // #pragma HLS ARRAY_PARTITION variable=mult complete
        #pragma HLS ARRAY_PARTITION variable=acc complete

        int block_factor = CONFIG_T::n_in*CONFIG_T::n_out;
        block_factor = block_factor / CONFIG_T::reuse_factor;
        #pragma HLS ARRAY_RESHAPE variable=weights block factor=block_factor
        #pragma HLS ARRAY_RESHAPE variable=mult block factor=block_factor
        if (CONFIG_T::store_weights_in_bram){
            #pragma HLS RESOURCE variable=weights core=ROM_2P_BRAM
        }

        int multiplier_limit  = ceil(float(CONFIG_T::n_in*CONFIG_T::n_out) / float(CONFIG_T::reuse_factor)) - floor(float(CONFIG_T::n_zeros) / float(CONFIG_T::reuse_factor));
        #pragma HLS ALLOCATION instances=mul limit=multiplier_limit operation

ejk43 commented 6 years ago

Awesome!! That's great to hear. Looks like the magic involves using ARRAY_RESHAPEs with the "block" type. Just curious, any idea why "block" is the correct answer here? I'm pretty confident serial mode needs cyclic partitions, so I wonder why it would be different for the parallel version.

Do you see any issues if the CONFIG_T::n_in*CONFIG_T::n_out / CONFIG_T::reuse_factor is not an integer result?

nhanvtran commented 6 years ago

Awesome!! That's great to hear. Looks like the magic involves using ARRAY_RESHAPEs with the "block" type. Just curious, any idea why "block" is the correct answer here? I'm pretty confident serial mode needs cyclic partitions, so I wonder why it would be different for the parallel version.

Hm, I think that "cyclic" could do the same thing here if I remember what Phil said though I didn't test it explicitly. Somehow block partitions make more sense in my head.

Do you see any issues if the CONFIG_T::n_in*CONFIG_T::n_out / CONFIG_T::reuse_factor is not an integer result?

So strangely I tried the usual construction of ceil( float(CONFIG_T::n_in)*float(CONFIG_T::n_out) / float(CONFIG_T::reuse_factor)) and this actually returned an error, so I chalked it up to a feature of block partitions. This needs to be understood better. The one thing I did observe is that with that code snippet, it gives us the features we want in terms of II and DSP usage for both 1 layer and 3 layer for a range of reuse factors

GiuseppeDiGuglielmo commented 6 years ago

@nhanvtran, @violatingcp

Let's first consider the code in: https://github.com/hls-fpga-machine-learning/hls4ml/blob/ejk/add-brams-to-parallel-mode/nnet_utils/nnet_layer.h This is the code Phil pushed during my visit as an explanatory example.

The first step is to provide an implementation of the mat-mult (conv) on compressed data, as Phil proposed. I want to dicuss it here to be sure we are on the same page.

Currently the code is something like:

// mult and weights are bi-dimensional matrices 
// that we flatten in arrays of size n_in * n_out
for i = 0 .. n_in-1
  for j = 0 .. n_out-1
    index = i * n_out + j
    mult[index] = data[i] * weights[index]

We can flatten everything in registers and the synthesis does the zero suppression for us. In that case the total number of DSP is

n_DSP = ((n_in * n_out) - n_zeros) / reuse

The reuse factor allows us to unroll in time or space (logic) the computation.

If we store zeros in BRAMs the synthesis is not able to zero suppress anymore. A first attempt to solve the problem is to explicitely store non-zero-weights in the BRAMs together with their indeces. If we want to store weights and indeces in a single 32bit word, and the weights are fixed point on 18 bits, we have 14 bits for the index that is 2^14 = 16384 total weights. It is a relatively fair amount of weights in the case of a single-word-index-weight schema (let's call it coupled). In the future, we may consider a decoupled scheme where each index and word are stored separately with more than 14 bits for the indeces.

A preliminary idea (Phil), is to rewrite the code as:

// BRAM stores compressed indeces and weights in a coupled schema
for i = 0 .. ((n_in * n_out) - n_zeros)-1
    U = BRAM[i]
    index = U.index
    weight = U.weight
    j = index % n_out
    matrix[index] = weight * input[j]

Phil, please note that this code is slightly different than the one you drafted on the board.

A couple of considerations:

The modulo operation may be expensive in resources (it depends on the index size), but I need it to determin the index of the input array (whose size is n_in) while index is for matrix (whose size is n_in*n_out;
I have to think about how to play with the reuse factor and the unroll in time and space.

Please, let me know your opinions. In the meantime, will look at @nhanvtran 's code few posts above.

nhanvtran commented 6 years ago

@GiuseppeDiGuglielmo

I think what you describe above is the right approach (nothing sticks out to me but maybe you'll encounter something along the way)

I have to think about how to play with the reuse factor and the unroll in time and space.

Please, let me know your opinions. In the meantime, will look at @nhanvtran 's code few posts above.

The code I posted is similar in spirit to Phil's. The only thing to add is that I don't know if you can naturally do the zero suppression if you do block partition even if you flatten everything in registers. However, with your approach of suppressing the zeroes manually and indexing, this should handle both weights + indices in BRAM or in registers, right?

GiuseppeDiGuglielmo commented 6 years ago

@nhanvtran

Yes, the compressed approach will work for both BRAM and registers. I will keep you posted.

violatingcp commented 6 years ago

Hi guys,

This captures the picture and I think it will work. One thing is that 16384 is not that many weights in the scale of things. If I calculate the max number of weights its 18k/32b x N BRAMS ~ 1.5 Million, which would mean we need 21 bits to cover the full range. There are probably ways to do this better, but I think we should keep this in mind as we go to larger networks.

Thanks, Phil

On Wed, Oct 3, 2018 at 4:13 PM GDG notifications@github.com wrote:

@nhanvtran https://github.com/nhanvtran

Yes, the compressed approach will work for both BRAM and registers. I will keep you posted.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/hls-fpga-machine-learning/hls4ml/issues/46#issuecomment-426784279, or mute the thread https://github.com/notifications/unsubscribe-auth/AE7V4fnQ4JpUyC1sk5EA3vgMDgZ1DbqBks5uhRpQgaJpZM4SK9fM .

GiuseppeDiGuglielmo commented 6 years ago

@violatingcp

I agree to have in mind both that

DNNs with a larger weight space exist
and we may have more BRAM space than we can index with only 14 bits.

More in general, I think it depends on the goals of our work from now on. Many of the results of hls4ml both leverage and aim at

the lowest possible latency (no BRAMs)
with relatively compact NN (you design it for a specific application)
playing with a reuse-factor knob that I never saw so extensively used

If we remove any of those items we may fall back in what other frameworks or people do - either as research or tool development.

In any case, I keep going.

nhanvtran commented 6 years ago

@GiuseppeDiGuglielmo

DNNs with a larger weight space exist

and we may have more BRAM space than we can index with only 14 bits.

More in general, I think it depends on the goals of our work from now on. Many of the results of hls4ml both leverage and aim at

the lowest possible latency (no BRAMs)

with relatively compact NN (you design it for a specific application)

playing with a reuse-factor knob that I never saw so extensively used

If we remove any of those items we may fall back in what other frameworks or people do - either as research or tool development.

It's an interesting point -- I think it's not too bad to have some overlap with other tools though there is a certain spirit in which we do things for sure. We could benchmark our approach against others too.

Speaking in the abstract, I think we aspire do things a little more generically as well (like others only working on SDSoC) and having more ML architectures (BDT, LSTM, ...)

Anyway, I agree with you to start with what you outline and then we can think about scaling up.

There's a more general consideration that 18-bit + 14-bit is just one number we picked out of a hat. We could do 16+16-bit but what if the user wanted to use 30-bit? That would only leave 2-bits for indexing. How do we handle it? With another BRAM? Is there a way to handle this "seamlessly"?

GiuseppeDiGuglielmo commented 6 years ago

[moved from #101]

We have a first functional version of the compressed flow. This is non-optimized yet in terms of HLS, but I organized the code with HLS in mind and it allowed me to warm up with the framework.

In particular,

We have the weights automatically generated in a compressed format (I went through the Python writers), for example:

// w1.h
//Numpy array shape (16, 64)
//Min 0.000000000000
//Max 0.426174551249
//Number of zeros 1021
//Number of non-zeros 3 
compressed_weight_default_t w1[3] = {{ 896, 0.338742017746 }, { 909, 0.395725727081 }, { 996, 0.426174551249 }};

The file firmware/parameters.h is extended with:

typedef ap_uint<14> index_default_t;
typedef struct compressed_weight_default_t { index_default_t index; weight_default_t weight; } compressed_weight_default_t;

etc.

We have now a nnet_utils/nnet_compressed_layer.h that behaves more or less as I described:

// BRAM stores compressed indeces and weights in a coupled schema
for i = 0 .. ((n_in * n_out) - n_zeros)-1
U = BRAM[i]
index = U.index
weight = U.weight
j = index / n_out
matrix[index] = weight * input[j]

I will work on the HLS optimization part and let you know.

Let me know if you prefer me to push on the branch ejk/add-brams-to-parallel-mode or create a separate branch for the compression or do a PR etc

violatingcp commented 6 years ago

Hi @gdg,

Just keep pushing to this branch. It makes it easy.

Phil

On Thu, Oct 4, 2018 at 3:39 PM GDG notifications@github.com wrote:

[moved from #101 https://github.com/hls-fpga-machine-learning/hls4ml/issues/101]

We have a first functional version of the compressed flow. This is non-optimized yet in terms of HLS, but I organized the code with HLS in mind and it allowed me to warm up with the framework.

In particular,

We have the weights automatically generated in a compressed format (I went through the Python writers), for example:

// w1.h //Numpy array shape (16, 64) //Min 0.000000000000 //Max 0.426174551249 //Number of zeros 1021 //Number of non-zeros 3 compressed_weight_default_t w1[3] = {{ 896, 0.338742017746 }, { 909, 0.395725727081 }, { 996, 0.426174551249 }};

The file firmware/parameters.h is extended with:

typedef ap_uint<14> index_default_t; typedef struct compressed_weight_default_t { index_default_t index; weight_default_t weight; } compressed_weight_default_t;

etc.

We have now a nnet_utils/nnet_compressed_layer.h that behaves more or less as I described:

// BRAM stores compressed indeces and weights in a coupled schema for i = 0 .. ((n_in n_out) - n_zeros)-1 U = BRAM[i] index = U.index weight = U.weight j = index / n_out matrix[index] = weight input[j]

I will work on the HLS optimization part and let you know.

Let me know if you prefer me to push on the branch ejk/add-brams-to-parallel-mode or create a separate branch for the compression or do a PR etc

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/hls-fpga-machine-learning/hls4ml/issues/46#issuecomment-427144138, or mute the thread https://github.com/notifications/unsubscribe-auth/AE7V4e5c92DrRbryIyqDgbWaKdRiMAbkks5uhmP1gaJpZM4SK9fM .

GiuseppeDiGuglielmo commented 5 years ago

Let me summarize the current situation for further discussions.

1. ceil - segmentation fault

Do not use a function to generate the value of a pragma parameter. The function stack allocation may generate a segmentation fault. This is the case of the ceil function:

int multiplier_limit = ceil(CONFIG_T::n_nonzeros/(float)CONFIG_T::reuse_factor);
#pragma HLS ARRAY_RESHAPE variable=weights block factor=multiplier_limit

A possible solution is to use a pre-processor macro:

#define DIV_ROUNDUP(n,d) ((n+d-1)/d)

int multiplier_limit = DIV_ROUNDUP(CONFIG_T::n_nonzeros, CONFIG_T::reuse_factor);
#pragma HLS ARRAY_RESHAPE variable=weights block factor=multiplier_limit

See: https://github.com/hls-fpga-machine-learning/hls4ml/blob/28bdeab68a77a4e1a61222564610620afdc47af1/nnet_utils/nnet_compressed_layer.h#L28

2. results for 3-layer model

The following results are from various configurations (keras-config.yml) that use the 3-layer model and nnet_compressed_layer.h.

KerasJson: example-keras-model-files/KERAS_3layer.json
KerasH5:   example-keras-model-files/KERAS_3layer_95pruned_retrained_weights.h5

Previously (e85c33fd4ef1df3a281d8abe9e94fd66d16530f4), I had two values for each non-zero weight (or intermediate result):

index (14 bit)
value (18 bit)

but now I modified the compressed weight encoding to use three values:

number of row (7 bit)
number of column (7 bit)
value (fixed 18 bit)

The generated paramters.h is:

typedef ap_fixed<18,8> weight_default_t;
typedef ap_uint<7> index_default_t;
typedef struct compressed_weight_default_t { index_default_t row_index; index_default_t col_index; weight_default_t weight; } compressed_weight_default_t;

This solution does not require any math (modulo or division operations) to identify the number of row or the number of colums with respect to the single-index solution (index = r * ncols + c, r = index / ncols, c = index % ncols). Please note that this is a tradeoff: the two-index solution saves in terms of logic and latency, but the single-index solution requires less bits than encoding separately the number of row and column of each weight.

Finally, two data structures are stored in the compressed format:

the input weights (weights)
the intermediate matrix-multiplication results (mult) It is an open question if both have to be stored as BRAMs, and if both have to be packed (see later).

For more details about the implementation see: https://github.com/hls-fpga-machine-learning/hls4ml/blob/ejk/add-brams-to-parallel-mode/nnet_utils/nnet_compressed_layer.h I am concentrating on the io_parallel only at the moment.

Given these details, I did a little of design-space exploration.

2.1. compression and registers

In a first set of experiments, I used compression,io_type = nnet::io_parallel, and store_weights_in_bram = false. This is just to prove the compression can be used with registers, although the main goal is to use compression and BRAM together.

Reuse Factor	Latency	Interval	BRAMs	DSPs	FFs	LUTs
1	16	1	13	47	3193	6400
2	18	2	13	29	2928	6976
4	24	4	13	15	3233	7018
8	32	8	13	8	3251	6915
16	45	15	13	5	3344	6834
32	37	26	13	4	3450	6954

As expected, by increasing the reuse factor, we pay the price of a higher latency, but we save in terms of DSPs. Everythings seems to scale fairly well.

A final note, for these experiments, HLS is pretty fast, while HLS time is significantly bigger for the next sets of experiments.

See: https://github.com/hls-fpga-machine-learning/hls4ml/blob/28bdeab68a77a4e1a61222564610620afdc47af1/nnet_utils/nnet_compressed_layer.h#L80-L81

2.2. compression and BRAMs

In a second set of experiments, I used compression,io_type = nnet::io_parallel, store_weights_in_bram = true. The three values (row_index, col_index, weight) are stored separately.

I mapped only weights on BRAMs. We have the the trend we want to see: reuse factor goes up, II is the same as reuse factor, latency goes up, DSPs go down.

Reuse Factor	Latency	Interval	BRAMs	DSPs	FFs	LUTs
1	25	1	13	55	16567	21225
2	28	2	37	29	11956	25009
4	33	4	27	15	9625	25035
8	43	8	23	8	11455	24035
16	58	16	22	5	11547	24529
32	74	32	22	4	11402	24962

I mapped both weights and mult on BRAMs. In this case, reuse factor goes up, DSPs go down, but the latency is roughtly constant for all of the experiments.

Reuse Factor	Latency	Interval	BRAMs	DSPs	FFs	LUTs
1	70	1	13	55	110812	37807
2	73	2	48	29	69171	41618
4	73	4	31	15	41793	45134
8	73	8	26	8	30727	39137
16	73	16	25	5	27367	36234
32	74	32	25	4	26738	36223

For low values of reuse factor, HLS time is really long, while HLS becomes faster for higher values of reuse factor. I am wondering how HLS will behave with really big layers. This is something that could be a road block and should be investigated more.

2.3. compression, BRAMs and data_pack

The goal here is to stored the three values (row_index, col_index, weight) in a single 32-bit word.

typedef struct compressed_weight_default_t {
  index_default_t row_index;
  index_default_t col_index;
  weight_default_t weight;
} compressed_weight_default_t;

I want to fetch those three values in a single (memory) access. The previous solution required three concurrent accesses to three different memories and that could be definitively extra logic and more BRAMs. The easiest solution is to use the pragma HLS data_pack. This sets the data fields of the struct into a single scalar with a wider word width.

See: https://github.com/hls-fpga-machine-learning/hls4ml/blob/28bdeab68a77a4e1a61222564610620afdc47af1/nnet_utils/nnet_compressed_layer.h#L70-L79

In the future, we can play around with the sizes of the two indexes to optimize the use of BRAMs. Currently, I naively use 7 bits for each index: 7 + 7 + 18 = 32. Note that a 7-bit index allows us to address the rows or colums of a weight matrix up to 128x128 (2^7 = 128).

I mapped only weights on BRAMs in the packed format. As above: reuse factor goes up, II is the same as reuse factor, the latency is still constant, DSPs go down. As expected, this solution uses less BRAMs than having the three values of the struct non packed.

Reuse Factor	Latency	Interval	BRAMs	DSPs	FFs	LUTs
1	70	1	13	55	110876	37839
2	73	2	54	29	69346	41604
4	73	4	32	15	41753	45029
8	73	8	24	8	30579	38954
16	73	21	18	5	27183	35959
32	74	20	17	4	26806	35803

(Updated d1d399df3d99b38ba9033f6c91768901a7bd76b0)

I mapped both weights and mult on BRAMs in the packed format. In this case, the latency is still constant, but the II is non equal to the reuse factor. I assume that there is some issue in packing and writing the three values in a single clock cycle. This requires further investigation too.

Reuse Factor	Latency	Interval	BRAMs	DSPs	FFs	LUTs
1	69	1	32	55	112499	37851
2	75	6	102	29	75277	41351
4	73	18	55	15	33509	42969
8	73	14	35	8	30037	37136
16	76	27	26	5	29438	35494
32	89	48	23	4	29279	35652

Future directions:

I want to understand better if and how packed format can be useful;
I may need bigger models to see if the approach scale (I am worried with the synthesis time);
I have to investigate io_serial.

Please let me know your comments.

GiuseppeDiGuglielmo commented 5 years ago

A really preliminary implementation for io_serial (fe982386a2d2acdfae9487131934ea4eba94360f).

I used compression, io_type = nnet::io_serial, store_weights_in_bram = true, and I mapped only weights on BRAMs in the packed format.

I temporarely disabled DATAFLOW because of the error:

Running pass 'Scalar propagation for dataflow pipeline' on module '/extras/giuseppe/research/projects/hls4ml/hls4ml-brams.git/keras-to-hls/my-hls-test/myproject_prj/solution1/.autopilot/db/a.o.1.bc'.                                                  
Abnormal program termination (11)

Reuse Factor	Latency	Interval	BRAMs	DSPs	FFs	LUTs
1	64	1	13	55	88519	34797
2	71	2	68	29	64679	38238
4	73	8	42	15	43243	41075
8	73	8	28	8	31712	39413
16	73	32	22	5	29382	37732
32	74	32	20	4	28979	37736

I am not satisfied yet with these results. FFs and LUTs seems in general too many. We can "control" DSP pretty well with the pragma HLS ALLOCATION instances=mul. I cannot "control" yet well enough the latency and BRAMs.

@nhanvtran, @violatingcp, @ejk43 do you have any comments?

I may need two new models:

1 layer with sparse weights (let's say 128x128 and something like 128/256 non-zeros); do you have anything like this ready? I can hardcode my own "fake model" in case.
a relatively big model to understand the scalability of the approach.

violatingcp commented 5 years ago

Hi @GiuseppeDiGuglielmo

This is great and this looks like the right approach! Let me reply to the studies below

Let me summarize the current situation for further discussions.

1. ceil - segmentation fault

Do not use a function to generate the value of a pragma parameter. The function stack allocation may generate a segmentation fault. This is the case of the ceil function:
int multiplier_limit = ceil(CONFIG_T::n_nonzeros/(float)CONFIG_T::reuse_factor);
#pragma HLS ARRAY_RESHAPE variable=weights block factor=multiplier_limit
A possible solution is to use a pre-processor macro:
#define DIV_ROUNDUP(n,d) ((n+d-1)/d)

int multiplier_limit = DIV_ROUNDUP(CONFIG_T::n_nonzeros, CONFIG_T::reuse_factor);
#pragma HLS ARRAY_RESHAPE variable=weights block factor=multiplier_limit
See:

hls4ml/nnet_utils/nnet_compressed_layer.h

Line 28 in 28bdeab

define DIV_ROUNDUP(n,d) ((n + d - 1) / d)

**This approach seems fine. I had issues with the ceil as well.

2. results for 3-layer model

The following results are from various configurations (keras-config.yml) that use the 3-layer model and nnet_compressed_layer.h.
KerasJson: example-keras-model-files/KERAS_3layer.json
KerasH5:   example-keras-model-files/KERAS_3layer_95pruned_retrained_weights.h5
Previously (e85c33f), I had two values for each non-zero weight (or intermediate result):

index (14 bit)

value (18 bit)

but now I modified the compressed weight encoding to use three values:

number of row (7 bit)

number of column (7 bit)

value (fixed 18 bit)

** This makes the code cleaner. We have generally been using 1D arrays because HLS seems to prefer them. However, 2D definitely makes it easier to read.

The generated paramters.h is:
typedef ap_fixed<18,8> weight_default_t;
typedef ap_uint<7> index_default_t;
typedef struct compressed_weight_default_t { index_default_t row_index; index_default_t col_index; weight_default_t weight; } compressed_weight_default_t;
This solution does not require any math (modulo or division operations) to identify the number of row or the number of colums with respect to the single-index solution (index = r * ncols + c, r = index / ncols, c = index % ncols). Please note that this is a tradeoff: the two-index solution saves in terms of logic and latency, but the single-index solution requires less bits than encoding separately the number of row and column of each weight.

Finally, two data structures are stored in the compressed format:

the input weights (weights)

the intermediate matrix-multiplication results (mult) It is an open question if both have to be stored as BRAMs, and if both have to be packed (see later).

** Storing the matrix-multiplication adds the option of having weights go off-chip. This is something we have avoided since in the literature it leads to large latencies, but it definitely does not hurt to have the option though.

For more details about the implementation see: https://github.com/hls-fpga-machine-learning/hls4ml/blob/ejk/add-brams-to-parallel-mode/nnet_utils/nnet_compressed_layer.h I am concentrating on the io_parallel only at the moment.

** Great. The io_serial case seems less of a difficulty and below I see you have looked at that too 👍

Given these details, I did a little of design-space exploration.

2.1. compression and registers

In a first set of experiments, I used compression,io_type = nnet::io_parallel, and store_weights_in_bram = false. This is just to prove the compression can be used with registers, although the main goal is to use compression and BRAM together.

Reuse Factor Latency Interval BRAMs DSPs FFs LUTs 1 16 1 13 47 3193 6400 2 18 2 13 29 2928 6976 4 24 4 13 15 3233 7018 8 32 8 13 8 3251 6915 16 45 15 13 5 3344 6834 32 37 26 13 4 3450 6954 As expected, by increasing the reuse factor, we pay the price of a higher latency, but we save in terms of DSPs. Everythings seems to scale fairly well.

** Note the 47 DSPs here and the small number of FFs/LUTs. This is a result of the fact that the zero multiplications get compiled away. Do you know why the latency drops when going form Reuse 16 to 32, is this just a typo?

A final note, for these experiments, HLS is pretty fast, while HLS time is significantly bigger for the next sets of experiments.

See: hls4ml/nnet_utils/nnet_compressed_layer.h

Lines 80 to 81 in 28bdeab

pragma HLS ARRAY_PARTITION variable=weights complete

pragma HLS ARRAY_PARTITION variable=mult complete

** Are you saying that complete partitions make the HLS synthesis fast? We know that complete partition leads to a large use of memory and at some point breaks.

2.2. compression and BRAMs

In a second set of experiments, I used compression,io_type = nnet::io_parallel, store_weights_in_bram = true. The three values (row_index, col_index, weight) are stored separately.

I mapped only weights on BRAMs. We have the the trend we want to see: reuse factor goes up, II is the same as reuse factor, latency goes up, DSPs go down.

Reuse Factor Latency Interval BRAMs DSPs FFs LUTs 1 25 1 13 55 16567 21225 2 28 2 37 29 11956 25009 4 33 4 27 15 9625 25035 8 43 8 23 8 11455 24035 16 58 16 22 5 11547 24529 32 74 32 22 4 11402 24962

** So ideally removing the zero multiplications should give us 47 DSPs and the same latency. Are you removing these elements? Also why are the FF/LUTs much higher? Is it the lack of zero suppression

I mapped both weights and mult on BRAMs. In this case, reuse factor goes up, DSPs go down, but the latency is roughtly constant for all of the experiments.

Reuse Factor Latency Interval BRAMs DSPs FFs LUTs 1 70 1 13 55 110812 37807 2 73 2 48 29 69171 41618 4 73 4 31 15 41793 45134 8 73 8 26 8 30727 39137 16 73 16 25 5 27367 36234 32 74 32 25 4 26738 36223 For low values of reuse factor, HLS time is really long, while HLS becomes faster for higher values of reuse factor. I am wondering how HLS will behave with really big layers. This is something that could be a road block and should be investigated more.

** Ok so now the limitation is coming from reading and writing the multiplications to BRAM. Its not completely clear why this should be constant. Is it the reading of the mult array at every step? Do you know where the 73 clocks come from? Is there a formula for this? Maybe I am not visualizing the mapping correctly. Also why are the FF/LUTs much higher?

2.3. compression, BRAMs and data_pack

The goal here is to stored the three values (row_index, col_index, weight) in a single 32-bit word.
typedef struct compressed_weight_default_t {
  index_default_t row_index;
  index_default_t col_index;
  weight_default_t weight;
} compressed_weight_default_t;
I want to fetch those three values in a single (memory) access. The previous solution required three concurrent accesses to three different memories and that could be definitively extra logic and more BRAMs. The easiest solution is to use the pragma HLS data_pack. This sets the data fields of the struct into a single scalar with a wider word width.

See: hls4ml/nnet_utils/nnet_compressed_layer.h

Lines 70 to 79 in 28bdeab
      if (CONFIG_T::store_weights_in_bram) { 
pragma HLS ARRAY_RESHAPE variable=weights block factor=multiplier_limit

pragma HLS RESOURCE variable=weights core=ROM_2P_BRAM

pragma HLS ARRAY_RESHAPE variable=mult block factor=multiplier_limit

pragma HLS RESOURCE variable=mult core=RAM_2P_BRAM
 // Pack the row_index, col_index, and weight in a single 32-bit memory element 
pragma HLS data_pack variable=weights struct_level

pragma HLS data_pack variable=mult struct_level
      } else { 
In the future, we can play around with the sizes of the two indexes to optimize the use of BRAMs. Currently, I naively use 7 bits for each index: 7 + 7 + 18 = 32. Note that a 7-bit index allows us to address the rows or colums of a weight matrix up to 128x128 (2^7 = 128).

I mapped only weights on BRAMs in the packed format. As above: reuse factor goes up, II is the same as reuse factor, the latency is still constant, DSPs go down. As expected, this solution uses less BRAMs than having the three values of the struct non packed.

** This is a nice approach.

Reuse Factor Latency Interval BRAMs DSPs FFs LUTs 1 70 1 13 55 110876 37839 2 73 2 54 29 69346 41604 4 73 4 32 15 41753 45029 8 73 8 24 8 30579 38954 16 73 21 18 5 27183 35959 32 74 20 17 4 26806 35803 (Updated d1d399d)

** How do you explain the differences in teh pipeline interval? Also, do you have the results without putting mult in the bram. It would be nice to see if we can keep the smaller latencies with this compression. Furthermore can we get the zeros removed.

I mapped both weights and mult on BRAMs in the packed format. In this case, the latency is still constant, but the II is non equal to the reuse factor. I assume that there is some issue in packing and writing the three values in a single clock cycle. This requires further investigation too.

Reuse Factor Latency Interval BRAMs DSPs FFs LUTs 1 69 1 32 55 112499 37851 2 75 6 102 29 75277 41351 4 73 18 55 15 33509 42969 8 73 14 35 8 30037 37136 16 76 27 26 5 29438 35494 32 89 48 23 4 29279 35652 Future directions:

I want to understand better if and how packed format can be useful;

I may need bigger models to see if the approach scale (I am worried with the synthesis time);

I have to investigate io_serial.

Please let me know your comments.

** This is a great start. If we can put the zero suppression, I think we should start looking a much bigger networks @jmduarte has network he can add to this branch.

nhanvtran commented 5 years ago

@GDG Really awesome results to see! Especially to see the control of the interval

Following @violatingcp 's comments, just a few more:

I have very basic high level question. It was really nice to see results in 2.1 compression and registers! Why is it that compared to 2.2 and 2.3, 2.1 has the least resource usage for FFs and LUTs? I would expect those to go down as you store more weight and multiplications in the BRAMs.
- Also a follow-up question to make sure: for the 2.1 you are using the block partitioning? and you got similar results comparing with complete partitioning?
- with 2.1, are you already implementing the sparse matrix multiply with the indices or just using "natural" compression of HLS?
I'm not surprised that small reuse factors have long synthesis times while large reuse factors (effectively less partitioning) is faster. What is "long" here? Like days? hours?
- for 2.2 I was naively expecting to only store the weights in BRAMs though it's nice to see this compared to the case where you also store the multiplications in the BRAMs too
- For the FF and LUT usage -- we have seen in the past that those have been greatly over estimated by HLS. I wonder if this contributes to the FF/LUT usage comparing 2.2 and *2.1
Echoing @violatingcp , for 2.3 it's strange to see the data_pack causing issues with the interval. My initial reaction is that losing some control of interval is a symptom of something bad?

Overall -- it's really exciting to already see the performance of 2.1 and 2.2 and this is a major step forward for the tool. I agree -- Trying with the 100x100 layer and 200x200 layers from @jmduarte are a nice next test.

GiuseppeDiGuglielmo commented 5 years ago

@violatingcp

Storing the matrix-multiplication adds the option of having weights go off-chip. This is something we have avoided since in the literature it leads to large latencies, but it definitely does not hurt to have the option though.

Please let me undestand better your concerns.

Issue 1

BRAMs are on-chip memory elements, they are synchronous and thus slower than accessing registers, but the access time is still better than DRAM (which is off-chip, on the board).

With Xilinx FPGAs, we can leverage the following memory hierarchy:

on-chip memory
- register (1 clock to write, ~200ps delay to read)
- BRAM (1 clock to write, 1 clock to read)
- URAM (1 clock to write, 1 clock to read) --> UltraRAM in UltraScale+
off-chip memory
- DRAM (at least 10-20 clock to write/read)

Registers are definitely better for ultra-low latency designs. BRAM and URAM have some tradeoffs, but will allow us to store a bigger model without paying the penalty of going off-chip (1) throught the interconnect and (2) up to the DRAM.

There are few more details on UltraRAMs here:

I did not use yet URAM for HLS designs, but ideally we could stay alway on chip, store the bulk of the model in URAM, and use BRAM and registers for caching data.

Issue 2

The buffer mult stores the compressed matrix multiplication as an intermediate result of compute_compressed_layer: https://github.com/hls-fpga-machine-learning/hls4ml/blob/28bdeab68a77a4e1a61222564610620afdc47af1/nnet_utils/nnet_compressed_layer.h#L43

The size of mult is the same of the model (CONFIG_T::n_nonzeros). In our near future, the models will be big: 100K-1M elements, maybe less if the model is compressed and has high sparsity.

At that point, I think it should be impossible for us to store the intermediate results in only registers.

Please let me know your thoughts.

Note the 47 DSPs here and the small number of FFs/LUTs. This is a result of the fact that the zero multiplications get compiled away. Do you know why the latency drops when going form Reuse 16 to 32, is this just a typo?

I double checked the results from my notes and re-run the experiments. The Latency = 37 for RF = 32 is a typo. It i s 57.

Reuse Factor	Latency	Interval	BRAMs	DSPs	FFs	LUTs
1	16	1	13	47	3193	6400
2	18	2	13	29	2928	6976
4	24	4	13	15	3233	7018
8	32	8	13	8	3251	6915
16	45	15	13	5	3344	6834
32	57	26	13	4	3450	6954

If we look at the first line of the table, it is obvious that when HLS uses registers and maximum parallelism (RF = 1), it produces less additional logic for the intermediate results. HLS directly wires DSP and arithmetic logic to the registers. If we look to all of the other lines of the table, we can observe that the HLS results are consistently growing or decreasing.

But let's keep in mind that HLS scheduling is a heuristic. The more complex is the HLS problem, the more the scheduling may be non perfect. This is why in a design-space exploration, we may observe "local perturbations" of those trends.

Are you saying that complete partitions make the HLS synthesis fast? We know that complete partition leads to a large use of memory and at some point breaks.

The link to the complete partition is for the section on compression and registers.

The note on the bigger synthesis time is for the following sections on synthesis with BRAMs.

So ideally removing the zero multiplications should give us 47 DSPs and the same latency. Are you removing these elements? Also why are the FF/LUTs much higher? Is it the lack of zero suppression?

Ideally, given a RF value, we should have the same amount of DSPs for both the register-only implementation and the BRAM implementation. This is true in all of the above result tables, except for RF = 1. As you noticed we have 55 DSP while we should expect 47.

I double checked and it is not a typo.
I increased the HLS effort (config_schedule -effort high), but it did not help

I will try to pin point the reason of that.

The higher value fo FF/LUTs is what really bothers me more. It comes with the use of BRAM, but I should find a way to first explain it and then limit it.

@violatingcp

Ok so now the limitation is coming from reading and writing the multiplications to BRAM. Its not completely clear why this should be constant. Is it the reading of the mult array at every step? Do you know where the 73 clocks come from? Is there a formula for this? Maybe I am not visualizing the mapping correctly. Also why are the FF/LUTs much higher?

How do you explain the differences in teh pipeline interval? Also, do you have the results without putting mult in the bram. It would be nice to see if we can keep the smaller latencies with this compression. Furthermore can we get the zeros removed.

All of the other questions are currently under investigation.

@nhanvtran

I have very basic high level question. It was really nice to see results in 2.1 compression and registers! Why is it that compared to 2.2 and 2.3, 2.1 has the least resource usage for FFs and LUTs? I would expect those to go down as you store more weight and multiplications in the BRAMs.

As above, this is a main concern. I am wondering if there is some sort of intermediate caching between BRAM and logic.

Also a follow-up question to make sure: for the 2.1 you are using the block partitioning? and you got similar results comparing with complete partitioning?

For the results in 2.1. compression and registers, I use complete partition

#pragma HLS ARRAY_PARTITION variable=weights complete
#pragma HLS ARRAY_PARTITION variable=mult complete

I am not sure block partition could have being useful for the weights in this case.

with 2.1, are you already implementing the sparse matrix multiply with the indices or just using "natural" compression of HLS?

I am using sparse matrix multiply with the indices. I am not leveraging the "natural compression" of HLS.

I'm not surprised that small reuse factors have long synthesis times while large reuse factors (effectively less partitioning) is faster. What is "long" here? Like days? hours?

In the case of the 3 layer model, for low reuse factors (RF= 1 or RF=2), the synthesis times are like 20 minutes. With registers, the synthesis time is less than a minute (~30 seconds).

for 2.2 I was naively expecting to only store the weights in BRAMs though it's nice to see this compared to the case where you also store the multiplications in the BRAMs too

As at the beginning of this post, I think the intermediate results could / should be stored in BRAM otherwise we cannot attack bigger models.

For the FF and LUT usage -- we have seen in the past that those have been greatly over estimated by HLS. I wonder if this contributes to the FF/LUT usage comparing 2.2 and *2.1

This is a really good point.

These are some results

2.1. compression and registers

Synthesis	Reuse Factor	Latency	Interval	BRAMs	DSPs	FFs	LUTs
HLS	1	16	1	13	47	3193	6400
Logic	1	16	1	13	47	2061 (-36%)	1259 (-81%)

The logic synthesis (Vivado) reduces the significantly the fabric resources if we compare with HLS reported results.

2.2. compression and BRAMs where I mapped both weights and mult on BRAMs and no data_pack

Synthesis	Reuse Factor	Latency	Interval	BRAMs	DSPs	FFs	LUTs
HLS	1	70	1	13	55	110812	37807
Logic	1	70	1	13	55	7914 (-93%)	6586 (-83%)

In this case, the logic synthesis reduces even more significantly the resources.

BUT, we did not achieve yet less FFs/LUTs for the BRAM implementation (2.2) w.r.t. the implementation that stores the weights into registers (2.1).

As I said above, I will keep working on it.

Echoing @violatingcp , for 2.3 it's strange to see the data_pack causing issues with the interval. My initial reaction is that losing some control of interval is a symptom of something bad?

I will work on it.

In general, we have a good starting point. There are some issues that I will address and I will keep you posted.

@nhanvtran , as you suggested I may branch and PR, but before let's have one of our bi-weekly meeting and let's see if I can close at least some of those open questions.

GiuseppeDiGuglielmo commented 5 years ago

I branched master to gdg/compression

https://github.com/hls-fpga-machine-learning/hls4ml/tree/gdg/compression

This may simplify a future PR. I will keep working on it. First thing compression + registers + block partition.

Please, can you add the larger models to this?

GiuseppeDiGuglielmo commented 5 years ago

@nhanvtran

I went back to my compressed register-only implementation and I am still convinced that block reshaping (or partitioning) has to be used only with BRAM mapping and does not affect the results when the weights (and any other internal structures) are not mapped on BRAMs.

For example, if we replace the following lines and the store_weights_in_bram = false:

https://github.com/hls-fpga-machine-learning/hls4ml/blob/7b18c1d840f19d12df6af41f79008ca18f425484/nnet_utils/nnet_compressed_layer.h#L86-L87

with

#pragma HLS ARRAY_RESHAPE variable=weights block factor=multiplier_limit
#pragma HLS ARRAY_RESHAPE variable=mult block factor=multiplier_limit

the results are the same.

I may have misunderstood your request. Can you please explain it again?

nhanvtran commented 5 years ago

Hi @GDG Thanks for the test

In fact, what you are finding is exactly what I wanted to see 👍 , i.e. that the results are unchanged going to block partition (also at higher resuse factors, is this confirmed?)

The first issue that we had in going to bigger models is the memory issue in verions 2018.1 and earlier. Then in 2018.2, we started to get the issue that we cannot complete partition arrays of size larger than 1024. We currently get around that issue by increasing this limit for the 3-layer model: https://github.com/hls-fpga-machine-learning/hls4ml/blob/master/hls-template/build_prj.tcl#L24

However, this is just a band-aid, for even larger unrolls (> 4096) we will continue to run into issues. By going to block partition, we can get past this restriction and I think this is even more friendly to the HLS compiler (you reported that higher reuse factor synthesized faster).

So -- in summary -- the results should be functionally the same, but should allow us to synthesize bigger models. Then to go to even bigger models, we have to start putting the weights into the BRAMs, which is the next step. Let me know if this makes sense

GiuseppeDiGuglielmo commented 5 years ago

@nhanvtran,

yes the results are the same for both ARRAY_PARTITION complete and ARRAY_RESHAPE block:

Reuse Factor	Latency	Interval	BRAMs	DSPs	FFs	LUTs
1	16	1	13	47	3193	6400
2	18	2	13	29	2928	6976
4	24	4	13	15	3233	7018
8	32	8	13	8	3251	6915
16	45	15	13	5	3344	6834
32	57	26	13	4	3450	6954

I got the idea of using ARRAY_RESHAPE block as a band-aid of some restriction on ARRAY_PARTITION complete. Thank you :-)

Let's try bigger models for both registers (to see if there is any further issue) and BRAMs!

thesps commented 5 years ago

I tried adding array partitioning to the resource reuse I've implemented for pooling in #117 with #pragma hls array_partition block factor = .... There is also something like the multiplier limit, but for the pooling function (#pragma hls allocation). I actually see a different resource usage with the same reuse factor depending on the array partition factor, in a not intuitive way (and it uses no BRAMs, only LUTs and FFs), you can see a table on the PR.

What is the multiplier_limit in this case just above? If the partition/reshape factor is larger than array size, it could just be the same as complete partitioning by another name.

nhanvtran commented 5 years ago

@GiuseppeDiGuglielmo if you are comfortable with the block fix, can we put this in as a first PR just so that we can solve this intermediate problem -- this is a full fix for big networks with weights-as-registers, right?

@thesps the multiplier limit is defined here: https://github.com/hls-fpga-machine-learning/hls4ml/blob/gdg/compression/nnet_utils/nnet_compressed_layer.h#L69

For your findings, let me go to #117

nhanvtran commented 5 years ago

Ah one quick thought -- before merging this intermediate fix, we should test if it works on big models. @jmduarte @benjaminkreis are the 100x100,200x200,500x500 models available?

GiuseppeDiGuglielmo commented 5 years ago

@nhanvtran, in the meantime I get bigger models, I want to check the other branch (io_serial). https://github.com/hls-fpga-machine-learning/hls4ml/blob/3ae1edb7507339265f3e894409e0bff99ba7847b/nnet_utils/nnet_compressed_layer.h#L96 We want both the modes io_serial and io_parallel with registers and reshape block, right?

If the code works fine with big models, I will not delay too much the PR.

@thesps, as @nhanvtran pointed multiplier_limit = DIV_ROUNDUP(CONFIG_T::n_nonzeros, CONFIG_T::reuse_factor). Please, notice that n_nonzeros is the size of the array to be block-reshaped.

nhanvtran commented 5 years ago

@nhanvtran, in the meantime I get bigger models, I want to check the other branch (io_serial). hls4ml/nnet_utils/nnet_compressed_layer.h

It would be good to understand io_serial. My first thought is that it doesn’t need this fix because it’s using cyclic partitioning. However, someone recently reported that io_serial ran into some issues for very big models so I wanted to understand this. It was being discussed in #118

GiuseppeDiGuglielmo commented 5 years ago

In the later afternoon I will look into it and let you know.

GiuseppeDiGuglielmo commented 5 years ago

Before sharing the results of _ioserial + compression + registers, I would like to discuss the results of the existing implementation (nnet_utils/nnet_layer.h which uses registers and io_serial and does not have an explicit compression). To be sure I am on the right track.

All of the results are on the 3-layer model.

1) Original Code

I first considered the repository code that uses a workaround for the ceil:

if (CONFIG_T::io_type == io_serial) {
        // Only reduce cycle_factor if n_out is evenly divisible by reuse_factor
        // Otherwise, HLS wont be happy
        int cycle_factor = CONFIG_T::n_out;
        float reused_cycle = CONFIG_T::n_out / CONFIG_T::reuse_factor;
        if (reused_cycle == ceil(reused_cycle)){
            // Dont use "ceil" here; as of 2018.2, HLS crashes mysteriously
            cycle_factor = cycle_factor / CONFIG_T::reuse_factor;
        }
        #pragma HLS ARRAY_PARTITION variable=weights cyclic factor=cycle_factor
        #pragma HLS ARRAY_PARTITION variable=mult cyclic factor=cycle_factor
        #pragma HLS ARRAY_PARTITION variable=acc complete
        #pragma HLS DATAFLOW
        #pragma HLS STREAM variable=mult depth=1
        #pragma HLS STREAM variable=acc depth=1
}

These are the results:

Reuse Factor	Latency	Interval	BRAMs	DSPs	FFs	LUTs
1	106	88	2	24	4808	18184
2	183	165	7	33	5470	17067
4	298	261	12	30	5837	16352

For values of RF greater than 4 (for some layers factor=0), the synthesis fails:

ERROR: [XFORM 203-103] Cannot partition array 'w4.V' : incorrect partition factor 0.
ERROR: [HLS 200-70] Pre-synthesis failed.

2) Use macro DIV_ROUNDUP for ceil

if (CONFIG_T::io_type == io_serial) {
        // Replace ceil function with home-made macro prevents Vivado 2018.2 segfault.
        int cycle_factor = DIV_ROUNDUP(CONFIG_T::n_out, CONFIG_T::reuse_factor);
        #pragma HLS ARRAY_PARTITION variable=weights cyclic factor=cycle_factor
        #pragma HLS ARRAY_PARTITION variable=mult cyclic factor=cycle_factor
        #pragma HLS ARRAY_PARTITION variable=acc complete
        #pragma HLS DATAFLOW
        #pragma HLS STREAM variable=mult depth=1
        #pragma HLS STREAM variable=acc depth=1
}

This solution removes the above workaround and uses DIV_ROUNDUP for ceiling. Moreover, DIV_ROUNDUP(a,b) returns 1 when b > a as ceil(a/b) is supposed to return.

The results are as follows:

Reuse Factor	Latency	Interval	BRAMs	DSPs	FFs	LUTs
1	106	88	2	24	4808	18184
2	184	176	5	33	6458	17832
4	298	261	13	30	5918	16571
8	554	517	13	17	6039	16225
16	1066	1029	9	9	6273	16387
32	2090	2053	7	5	6867	17488

In general, we have less DSP with a higher RF. The BRAM usage trend is a little weird.

Do you have any comment on these results (both 1. and 2.)?

Are the results less appealing than the _ioparallel solution? I copied and pasted here:

Reuse Factor	Latency	Interval	BRAMs	DSPs	FFs	LUTs
1	16	1	13	47	3193	6400
2	18	2	13	29	2928	6976
4	24	4	13	15	3233	7018
8	32	8	13	8	3251	6915
16	45	15	13	5	3344	6834
32	57	26	13	4	3450	6954

Do you have cases when the _ioserial solution provide better trade-offs w.r.t. the _ioparallel solution?

jmduarte commented 5 years ago

Ah one quick thought -- before merging this intermediate fix, we should test if it works on big models. @jmduarte @benjaminkreis are the 100x100,200x200,500x500 models available?

@nhanvtran here are the additional big dense models: PR #119

16->100->100->100->100->100->5 (same as current "big" model)
16->200->200->200->200->200->5 (previous "big" model)
16->500->500->500->500->500->5 (new)

ejk43 commented 5 years ago

Hi @GiuseppeDiGuglielmo, Nice results! Good to see the clear presentation here.

As a general comment on io_serial, we're assuming inputs and outputs are clocked into and out of a logic block one sample at a time. This is in contrast to the io_parallel approach, which presents ALL inputs at a single clock cycle. With this "clock in/clock out" approach for io_serial, reuse_factor = 1 means that every input multiplies all of it's weights at the time it is input into the layer. So a 32x10 layer will perform 10 multiplies when presented with the first input, 10 multiplies for the second input, and so on. For reuse_factor = 2, the layer will perform 5 multiplies when presented with the first input, then reuse those DSPs to calculate the next 5 multiplies (for an II of 2).

To address your specific questions/results:

This solution removes the above workaround and uses DIV_ROUNDUP for ceiling. Moreover, DIV_ROUNDUP(a,b) returns 1 when b > a as ceil(a/b) is supposed to return.

Great fix! I like the DIV_ROUNDUP macro. It's been a while, but I believe I only ever tested reuse factor as an even divisor of the total number of outputs and I definitely did not try reuse factors > number outputs. This looks like a good change we should pull into master whenever possible.

Do you have any comment on these results (both 1. and 2.)? Do you have cases when the io_serial solution provide better trade-offs w.r.t. the io_parallel solution?

It's a little tricky to comment on specific findings since I would typically want to tune the io_serial reuse factors for each layer individually based on the neighboring layer sizing and desired latency for the IP core as a whole.. It's plausible the reuse_factors swept here create poor combinations of reuse vs latency for the layer sizes in the 3-layer model-- which might be inflating DSPs or BRAMs. I may try a few of these cases to understand exactly where the DSPs and BRAMs are getting thrown off.

That said, it sure looks to me like io_parallel gives better results for this particular test case. It's peculiar that BRAMs start low and peak at a reuse of 8. And it's especially discouraging that FFs and LUTs end up with 2-3x higher usage across the board -- this seems like a real problem somewhere compared to io_parallel.

The premise of io_serial is that if the user already has a serial interface to the IP core, say streamed via DMA or something, then the IP core should be able to take advantage of this streaming architecture to perform more efficiently than if we're optimizing for low-latency (with all inputs presented to the core in one cycle). Agreed?

On the other hand, your apples-to-apples comparison here seems to suggest the user is perhaps better off accumulating serial samples into a chunk of data, presenting the data chunk to the IP core in io_parallel mode, and then reading out the results. I wonder if this is true across all plausible layer instantiations? Perhaps a layer with a very large number of inputs would synthesize better in serial mode? Notionally, I've been thinking of using serial mode for a network like (1024x64, 64x64, 64x64, 64x8) or something, which would have reuse factors of like (1, 16, 16, 8), so the downstream layers can have larger reuse factors than the first layer which is absorbing the full-rate input data at II=1 without increasing overall latency.

Definitely open to ideas -- and would love to see a serial mode that hits the fabric usage of parallel mode :)

GiuseppeDiGuglielmo commented 5 years ago

@jmduarte Thank you for sharing!

Do you have a pruned version of those models? I am currently running them as a stress test, but highly pruned models would fit better my implementation that - for each weight - store its value and two indices.

GiuseppeDiGuglielmo commented 5 years ago

@ejk43,

Thank you for your explanation about the io_serial mode!

The premise of io_serial is that if the user already has a serial interface to the IP core, say streamed via DMA or something, then the IP core should be able to take advantage of this streaming architecture to perform more efficiently than if we're optimizing for low-latency (with all inputs presented to the core in one cycle). Agreed?

Yes.

I need some time to think about it and come up with an efficient implementation. I will use the bigger models for tuning the io_parallel, while I may work with small models to investigate the io_serial.

GiuseppeDiGuglielmo commented 5 years ago

@jmduarte

16->100->100->100->100->100->5 (same as current "big" model)

16->200->200->200->200->200->5 (previous "big" model)

16->500->500->500->500->500->5 (new)

Do you have the pruned versions for these three models? For the 3-layer model, there are both a 70% and a 95% pruned version. Can you do the same for your big models?

Thank you!

jmduarte commented 5 years ago

@GiuseppeDiGuglielmo do you care about keeping the performance being the same? Or you just want to test the framework with some fraction of weights set to 0?

If you just want a version of these models with 70% and 95% of the weight set to 0, then I can produce that very fast. See here for two pruned versions of the 16->500->500->500->500->500->5 model: https://www.dropbox.com/sh/iup7aodxhb8zlel/AADGTLt5RN5A8lQyHUpgu0nba?dl=0

Note the histogram of distribution of weights: pruned_model_70perc_weight_histogram_logx

In this case, I just pruned the bottom 70% (or 95%) of the weights based on their magnitude relative to the max weight in the respective layer. This will definitely make the performance worse as I haven't done anything special when training the model in the first place.

To keep the performance stable, I would need to re-train the models with L1 regularization then iteratively prune away unneeded weights. I can do this too, but it will take a bit longer.

Thanks,

Javier

GiuseppeDiGuglielmo commented 5 years ago

@jmduarte, for the moment I do not need keeping the performance the same, as you said, I just need a fraction of the weights being zeros.

If you can do the same of the "500 network" for the 200 and 100 ones, it will be great.

Thank you also for clarifying the process!

jmduarte commented 5 years ago

@GiuseppeDiGuglielmo Here are the pruned versions of the other models:

100: https://www.dropbox.com/sh/4oxpvblrzpus1kz/AACnhrqsI93e6q9dLWIq7mxGa?dl=0 200: https://www.dropbox.com/sh/jb3he4jtud6wn54/AACRpKMoQs11feYRSgtdby9Ya?dl=0 500: https://www.dropbox.com/sh/iup7aodxhb8zlel/AADGTLt5RN5A8lQyHUpgu0nba?dl=0

Thanks, Javier

GiuseppeDiGuglielmo commented 5 years ago

@jmduarte

Thank you!

fastmachinelearning / hls4ml

store_in_bram functionality #46

pragma HLS RESOURCE variable=weights core=ROM_2P_BRAM

pragma HLS ARRAY_RESHAPE variable=weights factor=cycle_factor*2

define DIV_ROUNDUP(n,d) ((n + d - 1) / d)

pragma HLS ARRAY_PARTITION variable=weights complete

pragma HLS ARRAY_PARTITION variable=mult complete

pragma HLS ARRAY_RESHAPE variable=weights block factor=multiplier_limit

pragma HLS RESOURCE variable=weights core=ROM_2P_BRAM

pragma HLS ARRAY_RESHAPE variable=mult block factor=multiplier_limit

pragma HLS RESOURCE variable=mult core=RAM_2P_BRAM

pragma HLS data_pack variable=weights struct_level

pragma HLS data_pack variable=mult struct_level