ferrandi / PandA-bambu

PandA-bambu public repository
GNU General Public License v3.0

Improve resource allocation and scheduling #171

Open RaulMurillo opened 1 year ago

RaulMurillo commented 1 year ago

Hello everyone,

I am trying to improve the performance of the following kernel:

#define N_I  20
#define N_J  20

void kernel_hadamard(int ni, int nj, 
                 float alpha, float beta,
                 float C[N_I][N_J], float A[N_I][N_J], float B[N_I][N_J])
{
    int i, j;

    for (i = 0; i < ni; i++)
    {
        for (j = 0; j < nj; j++)
        {
            float tmp_a = alpha * A[i][j];
            float tmp_b = beta * B[i][j];
            C[i][j] = tmp_a * tmp_b;
        }
    }
}

I am synthesizing with -O3 for better performance. The report says the design uses 2 DSP units. The problem is that when I synthesize just a single multiplication, as follows,

float my_mult(float x, float y)
{
    return x*y;
}

it also uses 2 DSPs. So I guess Bambu is not instantiating multiple multipliers in parallel for the previous kernel, even though there is no data dependency between the computations of tmp_a and tmp_b, nor between successive loop iterations.

My intuition is that if more multiplier instances were used in parallel (which would require more DSPs), the kernel latency should improve dramatically. Is there any way (directives, pragmas, Bambu options, etc.) to achieve this?

Ansaya commented 1 year ago

Hi Raul,

To generate the architecture you expect (with multiple multiplier units), the internal loop must be unrolled. You can do this with pragmas for the frontend compiler you selected, as sketched below. You should also consider the number of memory channels your design can exploit for parallel load/store operations: you may fully unroll the two for loops, but with only one memory channel this will not improve the execution latency.
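For example, with a Clang-based frontend the annotation could look like this (a sketch; the pragma spelling depends on the compiler you selected, e.g. #pragma GCC unroll 20 for GCC):

    for (i = 0; i < ni; i++)
    {
        /* Fully unroll the inner loop so that N_J independent
           multiplications are exposed to the scheduler. */
        #pragma unroll(N_J)
        for (j = 0; j < nj; j++)
        {
            float tmp_a = alpha * A[i][j];
            float tmp_b = beta * B[i][j];
            C[i][j] = tmp_a * tmp_b;
        }
    }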

RaulMurillo commented 1 year ago

Thanks Michele,

Loop unrolling is already done by the -O3 optimization flag, but I included explicit pragmas anyway. I compile with the Bambu options

--channels-type=MEM_ACC_NN --memory-allocation-policy=EXT_PIPELINED_BRAM --channels-number=8

I could not obtain lower latency, even after modifying the number of channels. Also, the number of DSPs remains the same, so I guess I'm missing something and no parallel multiplications are being done.

Any other ideas or suggestions?

Ansaya commented 1 year ago

Hi Raul,

This is quite unusual to hear. What is the full command line you are using to call bambu? Which frontend compiler have you chosen? I tried synthesizing the code you shared, and after adding #pragma unroll(N_J) to the internal loop of kernel_hadamard I could observe the generated kernel latency change with the number of channels specified. I started from a baseline with the following command line:

bambu hadamard.c --top-fname=kernel_hadamard --generate-tb=hadamard.c --simulate --compiler=I386_CLANG16 -O2 --channels-number=1 --channels-type=MEM_ACC_11 --memory-allocation-policy=EXT_PIPELINED_BRAM

which gave me a latency of 2462 cycles. Applying a full unroll to the internal loop, i.e. adding #pragma unroll(N_J) before it, I get down to 1286 cycles, since the single memory channel still limits the number of parallel load/store operations. Using --channels-number=8 --channels-type=MEM_ACC_NN on top of that, I got down to 346 cycles.

You may also want to try pipelined floating-point units, which you can enable with the -p=__float command line option.

Finally, if you would like to look at the state transition graph and check the generated scheduling, use the --print-dot option: it generates many .dot graphs in the HLS_output/dot directory, with a folder for each synthesized function containing the state transition graph file HLS_STGraph.dot (along with others).
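For completeness, the 346-cycle run above corresponds to a command along these lines (a sketch: the same flags as the baseline, with the channel options changed and the unroll pragma already in the source; append -p=__float or --print-dot as needed):

bambu hadamard.c --top-fname=kernel_hadamard --generate-tb=hadamard.c --simulate --compiler=I386_CLANG16 -O2 --channels-number=8 --channels-type=MEM_ACC_NN --memory-allocation-policy=EXT_PIPELINED_BRAM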

RaulMurillo commented 1 year ago

Thanks for the explanation, Michele.

First of all, I am using Bambu version 0.9.6 (I modified this version to support additional floating-point units, but I am not using that modification in this example, so I guess it is not the source of the problem). I usually compile with GCC as follows:

bambu -v3 -O2 hadamard.c  --top-fname=kernel_hadamard \
      --generate-tb=test_file.xml \
      --print-dot --pretty-print=a.c \
      --compiler=I386_GCC8 \
      --simulate \
      --channels-number=1 --channels-type=MEM_ACC_11 --memory-allocation-policy=EXT_PIPELINED_BRAM

which gave me a latency of 4462 cycles. Adding #pragma GCC unroll 20 before the internal loop had no effect with this compiler. I also tried --compiler=I386_CLANG7, which gave 4442 cycles, while adding loop unrolling gave 4486 cycles. I also got no additional performance when increasing the number of memory channels.

Do you think this is related to the version of Bambu/compiler I am using?

fabrizioferrandi commented 1 year ago

Have you looked at the BB_FCFG.dot file generated by Bambu when --print-dot is passed? There you will see what the compiler does once you add the unrolling pragma; we rely on the GCC/CLANG frontend for such transformations. Another observation: --channels-type=MEM_ACC_11 allows only one memory operation per cycle, which may limit performance even if you fully unroll your design.
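The generated graphs can be rendered with Graphviz; assuming the per-function folder layout under HLS_output/dot described earlier (the exact path may differ), something like:

dot -Tpdf HLS_output/dot/kernel_hadamard/BB_FCFG.dot -o BB_FCFG.pdf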

fabrizioferrandi commented 1 year ago

Just an additional note: try the --disable-function-proxy option to allow as many floating-point units as the unrolling needs.
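For example, appended to the earlier baseline (a sketch):

bambu hadamard.c --top-fname=kernel_hadamard --generate-tb=hadamard.c --simulate --compiler=I386_CLANG16 -O2 --channels-number=8 --channels-type=MEM_ACC_NN --memory-allocation-policy=EXT_PIPELINED_BRAM --disable-function-proxy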

RaulMurillo commented 1 year ago

Let me clarify a few things so that we are all on the same page:

Under this setup, I did multiple trials (without real success). Here are some findings:

I still don't know why loop unrolling is not giving results as good as the ones @Ansaya got (a 50% cycle reduction from unrolling alone, and much more with the memory improvements). By the way, @fabrizioferrandi, I looked at the BB_FCFG.dot file with and without the unrolling pragma, and the pragma definitely changes the compiler's behavior.

Ansaya commented 1 year ago

Hi Raul, I forgot to mention the following: by default, Bambu generates a single instance of a function even when there are multiple call points, and floating-point units are treated as functions as well, so they are not duplicated. In your case, you have multiple calls to the floating-point multiplier module, which is still a single functional unit. To duplicate it, you may pass the --fp-format=inline-math flag, which generates a dedicated functional unit for each call point.
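For example, adding it to the command line you shared earlier (a sketch):

bambu -v3 -O2 hadamard.c --top-fname=kernel_hadamard --generate-tb=test_file.xml --compiler=I386_GCC8 --simulate --channels-number=1 --channels-type=MEM_ACC_11 --memory-allocation-policy=EXT_PIPELINED_BRAM --fp-format=inline-math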

RaulMurillo commented 1 year ago

Hi Michele, thank you for the explanation. That is exactly what I supposed was happening. The flag you mentioned yields much better performance, but still not as much as expected. Including --fp-format=inline-math in the previous Bambu options, I got 3282 cycles and 18 DSPs (instead of 2). However, using #pragma unroll(N_J) and 1 memory channel produced 3286 cycles and 128 DSPs, so no real benefit in terms of delay. Even when increasing the number of channels and changing their type, the cycle count only drops to 2886, so it seems Bambu is actually not leveraging the loop unrolling.

May I ask for the full command line you used to reach 346 cycles? Did you make any modifications to the source code apart from the unroll pragma?

RaulMurillo commented 1 year ago

Another problem I found: when I include the --flopoco option (which the real experiment I am working on requires), the benefits of the --fp-format=inline-math flag disappear.
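For reference, the combination in question looks like this (a sketch of the setup above with both flags together):

bambu -v3 -O2 hadamard.c --top-fname=kernel_hadamard --generate-tb=test_file.xml --compiler=I386_GCC8 --simulate --channels-number=8 --channels-type=MEM_ACC_NN --memory-allocation-policy=EXT_PIPELINED_BRAM --flopoco --fp-format=inline-math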