dynamic fixed point - Githubissues

nhanvtran commented 6 years ago

test out the gains of this modified data type

GiuseppeDiGuglielmo commented 6 years ago

The goal is to check the pros and cons of a custom fixed-point representation that has dynamic precision. Can you point to me the example-prjs I should try first?

Thank you

nhanvtran commented 6 years ago

Hi @GiuseppeDiGuglielmo

The 1 layer FC model is here: https://github.com/hls-fpga-machine-learning/hls4ml/blob/master/keras-to-hls/example-keras-model-files/KERAS_1layer.json https://github.com/hls-fpga-machine-learning/hls4ml/blob/master/keras-to-hls/example-keras-model-files/KERAS_1layer_weights.h5

zhenbinwu commented 6 years ago

@GiuseppeDiGuglielmo Can you please send me your example? I am very interested in your method. I would like to sort it out before the Xilinx meeting. Thanks a lot!

GiuseppeDiGuglielmo commented 6 years ago

@zhenbinwu

I will integrate the dfixed type in the hls4ml flow on the layer @nhanvtran suggested.

My plan is - at the beginning - to use the dfixed with hardcoded NBIT and IBIT. The QoR will be the same as the ap_fixed. I will then figure out if and how to add an additional port to the KERAS_1layer to control the precision of the dfixed at runtime. The integration in my project was easy, but it may take me some time for integrating it in hls4ml. I count to do it at the beginning of the next week (before Xilinx meeting for sure).

This is the code, please feel free to experiment and let us know.

Put this in your headers:

// types.h
#ifndef INC_TYPES_H
#define INC_TYPES_H

#if defined(FIX16)
#include "ap_int.h"
#define NBIT 16
#define IBIT 8
typedef ap_fixed<NBIT, IBIT> FPDATA;
const char PRECISION_STR[] = "Fixed Point 16";

#elif defined(DFIX16)
#include "ap_int.h"
#define NBIT 16
#define IBIT 8
typedef ap_int<NBIT> FPDATA;
const char PRECISION_STR[] = "Dynamic Fixed Point 16";

FPDATA dfix_add(FPDATA a, FPDATA b);
FPDATA dfix_mul(FPDATA a, FPDATA b); // Hardcoded precision. It is the same as ap_fixed.
FPDATA dfix_mul(FPDATA a, FPDATA b, unsigned char ibit); // Dynamic precision.
#endif
#endif  // INC_TYPES_H

The implementation of the add and mul. This is sufficient for a CNN.

// types.c
#include "data_types.h"

#if defined(DFIX16)

FPDATA dfix_add(FPDATA a, FPDATA b) {
#pragma HLS INLINE
    return a + b;
}

// Hardcoded precision. Same QoR of Xilinx ap_fixed<NBIT, IBIT>
FPDATA dfix_mul(FPDATA a, FPDATA b) {
#pragma HLS INLINE
    ap_int<NBIT*2> extended_a = a;
    ap_int<NBIT*2> extended_b = b;
    ap_int<NBIT*2> extended_result = extended_a * extended_b;
    return (extended_result >> (NBIT-IBIT));
}

// Dynamic precision. We can control it from the top module interface or in some inner logic of the module using it.
FPDATA dfix_mul(FPDATA a, FPDATA b, unsigned char ibit) {
#pragma HLS INLINE
    ap_int<NBIT*2> extended_a = a;
    ap_int<NBIT*2> extended_b = b;
    ap_int<NBIT*2> extended_result = extended_a * extended_b;
    return (extended_result >> (NBIT-ibit));
}
#endif

Finally, these are some conversion functions that I use in the testbench (not in the synthesizable code). For test and validation, I usually convert

from float to ap_fixed, and then from ap_fixed to the dfixed (see below)
from dfixed to ap_fixed (see below), and then from ap_fixed to float

#if defined(DFIX16)
void to_dfixed(FPDATA &to, ap_fixed<NBIT, IBIT> from) {
  to = from.range(NBIT-1, 0);
}

void from_dfixed(ap_fixed<NBIT, IBIT> &to, FPDATA from) {
  to.range(NBIT-1, 0) = from;
}
#endif

P.S. You may have to adapt a little the code above for your experiments. In particular, this solution does not allow you to control the quantization and overflow mode as ap_fixed does. This is definitely something to discuss about with Xilinx. We can also come up with some custom solutions if you really need a certain type of overflow or quantization.

zhenbinwu commented 6 years ago

Thanks for the example! May I ask what does range() do? I can't find its description online for ap_fixed.

I tried to test your code in https://github.com/zhenbinwu/HLSArena/tree/master/DynamicFixed. But I am having problem to create reasonable result.

Dynamic Fixed Point 16: a 1 , b 2
add 3 mul 0 newmul -7.625
Fixed Point 16: a 1 , b 
add 3 mul 0 newmul -71.625

where mul is directly from the function and newmul is after conversion.

What am I missing? Thank you very much!

GiuseppeDiGuglielmo commented 6 years ago

@zhenbinwu,

The functions from_dfixed and to_dfixed convert a fixed-point representation into a variable of ap_int data type and vice versa. I need this conversion to integer because the dfix_mul and dfix_add functions operate on ap_int rather than ap_fixed.

Let's say we have:

ap_fixed<8,4> var_fxd = 1.5;

the bit representation of the value stored in var_fxd is 00011000, where the first four bits (0001) encodes the integer part 1 and the remaining four bits (1000) encode the fractional part (0.5). I am using the range function to return the raw bit representations 00011000 and then assign it to an unsigned for example:

ap_int<8> var_int = var_fxd.range(7,0);

Please, note that ap_int provides a function to_int() but it does only return the integer part of the fixed-point variable.

Let me know if it does make sense.

I will have a look at your code and post an answer here ASAP.

GiuseppeDiGuglielmo commented 6 years ago

@zhenbinwu ,

I edited a little your code. I have a working example of the "dynamic fixed point" (C-simulation, synthesis, and C-RTL-cosimulation). Can I push it or the HLSArena repository?

Sorry for the delay, Giuseppe

zhenbinwu commented 6 years ago

@GiuseppeDiGuglielmo Thanks for the explanation! The HLSArena is a temp repo for some tests. I added you as collaborator so you can push.

GiuseppeDiGuglielmo commented 6 years ago

I pushed the code, these are some results for the toy example:

Arch        TargetClk   EstimatedClk    Latency DSPs    FFs     LUTs
FIX32       2.00        1.595           5       4       269     94
DFIX32      2.00        1.595           5       4       253     78
DFIX32-VP   2.00        1.595           6       4       336     503

FIX32 uses ap_fixed<32,16>;
DFIX32 uses the proposed integer representation, but the precision is hard-coded;
DFIX32-VP uses the proposed integer representation, but with a variable precision ibit=16.

GiuseppeDiGuglielmo commented 6 years ago

@nhanvtran, @zhenbinwu

A short update regarding the example project in https://github.com/hls-fpga-machine-learning/hls4ml/tree/master/keras-to-hls I generated the project with:

python keras-to-hls.py -c keras-config.yml

and momentarily and for testing purpose I manually edited all of the library dependencies such that the tesbench and my-hls-test/firmware/myproject.cpp could use transparently the custom fixed point.

    layer1_t layer1_out[OUT_HEIGHT_1*OUT_WIDTH_1*N_FILT_1];
    #pragma HLS ARRAY_PARTITION variable=layer1_out complete dim=0
    layer1_t conv2d_layer1_out[OUT_HEIGHT_1][OUT_WIDTH_1][N_FILT_1];
    #pragma HLS ARRAY_PARTITION variable=conv2d_layer1_out complete dim=0
    nnet::conv_2d<input_t, layer1_t, config1>(data, conv2d_layer1_out, w1, b1);
    layer1_t logits1[OUT_HEIGHT_1*OUT_WIDTH_1*N_FILT_1];
    #pragma HLS ARRAY_PARTITION variable=logits1 complete dim=0
    nnet::flatten<layer1_t, OUT_HEIGHT_1, OUT_WIDTH_1, N_FILT_1>(conv2d_layer1_out, logits1);
    nnet::relu<layer1_t, layer1_t, relu_config1>(logits1, layer1_out);

    result_t logits2[N_OUTPUTS];
    #pragma HLS ARRAY_PARTITION variable=logits2 complete dim=0
    nnet::compute_layer<layer1_t, result_t, config2>(layer1_out, logits2, w2, b2);
    nnet::softmax<result_t, result_t, softmax_config2>(logits2, res);

The function conv2d, relu, and _computelayer were easy to support. I have to work a little more on the softmax. I may not have time before this Friday, Sept 7. I hope the example in the HLSArena is sufficient for our discussion.

I will keep you posted.

GiuseppeDiGuglielmo commented 6 years ago

I found some time and ported the following library functions to the user-defined fixed-point:

nnet::conv_2d
nnet::relu
nnet::compute_layer
nnet::softmax

The experiment configuration file is:

KerasJson: example-keras-model-files/KERAS_conv2d_model.json
KerasH5:   example-keras-model-files/KERAS_conv2d_model_weights.h5
OutputDir: my-hls-test
ProjectName: myproject
XilinxPart: xc7vx690tffg1927-2
ClockPeriod: 5

IOType: io_parallel # options: io_serial/io_parallel
ReuseFactor: 1
DefaultPrecision: ap_fixed<18,8>

I used Vivado 2017.2.

ap_fixed and the custom fixed (with hardcoded precision) provide the same QoR and are functionally equivalent:

Arch        TargetClk   EstimatedClk   Latency DSPs       FFs        LUTs
FIX18_8     5.00        4.33           16      1803       144747     75190
DFIX18_8    5.00        4.33           16      1803       144747     75190

@zhenbinwu,

If we are happy with this, I can refine library code and implement the support into the code generator. At that point we may consider to evaluate the variable precision overhead that we observed in the HLSArena code for DFIX32-FP. Let me know your opinion.
I think we can address the resources blow up for ap_fixed saturation mode by implementing own saturation mode on top of the functions in the HLSArena. What do you think?

nhanvtran commented 6 years ago

@GiuseppeDiGuglielmo, thanks for these tests

It's great to see that moving to DFIX reproduces the fixed point exactly. In a previous example, the resources with DFIX-VP helps with the saturation but introduces higher resources. I have two higher level questions

we've often seen LUTs overestimated by quite a lot in HLS vs. Vivado implementation. I wonder if that will make up the diference between DFIX/FIX vs. DFIX-VP
there is an interesting trade-off of DFIX-VP vs. just increasing the bits in DFIX by a few. That might be interesting to understand the resource benefits. This probably depends on how close you are to the DSP bit precision boundaries.

zhenbinwu commented 6 years ago

@GiuseppeDiGuglielmo Sorry for getting back to this late!

I took a great interest in your example, since it can convert ap_int to ap_fixed correctly. This problem bothers me quite a while until I learned the magic function range() from you! With your example, I finally got the multipumping code to work with ap_fixed. It is now in the HLSArena. It helps with reducing the resource usage. Thank you very much for your example!

Regarding the dynamic floating point, it is not clear to me yet. I assume we can manage the dynamic decimal point for high precision or large integer value. From my understanding, the high precision won't be needed for ML (as floating point to 8bit example). For the larger integer value, hopefully saturation mode could work automatically. But we will see.

BTW, how do you decide the dynamic IBIT? Is it during run time or compile time? Thanks

GiuseppeDiGuglielmo commented 6 years ago

@nhanvtran, I may run some experiments and get back with some answers to your questions.

@zhenbinwu, I am glad that you found the example useful :-)

The dynamic-fixed point was meant to be used in the context of a single conv2d accelerator that is used for the various and different layers of a CNN. In that case, the ibit is passed to the conv2d together with other configuration parameters of each layer. The value of ibit can be statically computed for a specific CNN (compile time).

In the context of HLS4ML, the dynamic fixed point may be useful if we add additional logic to control the ibit value. The extra logic overhead has to be lower than the use of more traditional ap_fixed and its overflow and quantization modes.

If you guys have more idea of an application of it for HLS4ML, I will be glad to hear it.

zhenbinwu commented 6 years ago

@GiuseppeDiGuglielmo I see. Yeah, I think dynamic fixed point is definitely interesting for hls4ml. With careful design, we can split some operation as ap_int type operation, which can be carried out in FF/LUT. This might help with the implementation design for managing resource.

We have been talking about different ap_fixed type for different layer. Maybe it can be achieved easier with the dynamic-fixed point. Anyway, I think it is an interesting application.

zhenbinwu commented 5 years ago

Hi @GiuseppeDiGuglielmo, I implemented the overflow for the from_dfixed() from your dynamic fixed point example. You can find the code here. It should work, but does cause extra resource and latency in my test. Just to let you know to avoid duplicate effort. I will show more on next meeting.

GiuseppeDiGuglielmo commented 5 years ago

@zhenbinwu, thank you for sharing! I will give it a run before the next meeting, but it looks like you already squeeze it. In case, I will let you know.

jeinstei commented 5 years ago

I'm a bit worried about timing when using fixed-point. I know that we've used the VHDL fixed-point library and some others, but wouldn't hls4ml need to take in to account the dynamic variation in cycles if using a dynamic fixed point?

GiuseppeDiGuglielmo commented 5 years ago

@jeinstei sorry for the late reply,

I am not sure what you mean with a dynamic variation in cycles. Given a fixed word size, I dynamic precision should not introduce a variable latency. Can you please make me an example?

We proposed the dynamic fixed point for the situation of a single accelerator (e.g. conv2d) that serves the multiple layers of a DNN (e.g. CNN). The goal is to reduce as much as we can the word size and having a variable precision (integer vs. fractional part).

// Dynamic precision. We can control it from the top module interface or in some inner logic of the module using it.
FPDATA dfix_mul(FPDATA a, FPDATA b, unsigned char ibit) { // ibit is a free variable of the system
#pragma HLS INLINE
    ap_int<NBIT*2> extended_a = a;
    ap_int<NBIT*2> extended_b = b;
    ap_int<NBIT*2> extended_result = extended_a * extended_b;
    return (extended_result >> (NBIT-ibit));
}

You can imagine the life cycle of the system* that use this dynamic precision: the software configures and runs the first layer with a certain precision, then it configures and runs the second layer etc.

The system is heterogeneous because of the combination of FPGA and SW.

In HLS4ML, the entire model (logic and weights) fits the FPGA: "HLS4ML unrolls the model in space" (while the approach above is an "unrolling in time"). Again, HLS4ML is (as far as I have seen) purely in FPGA. Theoretically, each layer may have its own hardcoded precision (ap_fixed) without necessary using the "dynamic fixed point".

fastmachinelearning / hls4ml

dynamic fixed point #92