ferrandi / PandA-bambu

PandA-bambu public repository
GNU General Public License v3.0

Improve resource allocation and scheduling #171

Open RaulMurillo opened 1 year ago

RaulMurillo commented 1 year ago

Hello everyone,

I am trying to improve the performance of the following kernel:

#define N_I  20
#define N_J  20

void kernel_hadamard(int ni, int nj, 
                 float alpha, float beta,
                 float C[N_I][N_J], float A[N_I][N_J], float B[N_I][N_J])
{
    int i, j;

    for (i = 0; i < ni; i++)
    {
        for (j = 0; j < nj; j++)
        {
            float tmp_a = alpha * A[i][j];
            float tmp_b = beta * B[i][j];
            C[i][j] = tmp_a * tmp_b;
        }
    }
}

I am synthesizing with -O3 for better performance. The report says the design uses 2 DSP units. The problem is that when I synthesize just a single multiplication, as follows,

float my_mult(float x, float y)
{
    return x*y;
}

it also uses 2 DSPs. So I guess Bambu is not instantiating multiple multipliers in parallel for the previous kernel, even though there is no data dependency between the computations of tmp_a and tmp_b, nor between successive loop iterations.

My intuition is that if more multiplier instances were used in parallel (which would require more DSPs), the kernel latency should improve dramatically. Is there any way (directives, pragmas, Bambu options, etc.) to achieve this?

Ansaya commented 1 year ago

Hi Raul,

To generate the architecture you expect (with multiple multiplier units), the internal loop must be unrolled. You can do this with pragmas for the frontend compiler you selected, as sketched below. You should also consider the number of memory channels your design can exploit for parallel load/store operations: you may fully unroll the two for loops, but with only one memory channel this will not improve the execution latency.
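For example, with a Clang-based frontend the annotation could look like this (a sketch; the pragma spelling depends on the compiler you selected, e.g. #pragma GCC unroll 20 for GCC):

    for (i = 0; i < ni; i++)
    {
        /* Fully unroll the inner loop so that N_J independent
           multiplications are exposed to the scheduler. */
        #pragma unroll(N_J)
        for (j = 0; j < nj; j++)
        {
            float tmp_a = alpha * A[i][j];
            float tmp_b = beta * B[i][j];
            C[i][j] = tmp_a * tmp_b;
        }
    }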

RaulMurillo commented 1 year ago

Thanks Michele,

Loop unrolling is already done by the -O3 optimization flag, but I included explicit pragmas anyway. I compile with the Bambu options

--channels-type=MEM_ACC_NN --memory-allocation-policy=EXT_PIPELINED_BRAM --channels-number=8

I could not obtain lower latency, even after modifying the number of channels. Also, the number of DSPs remains the same, so I guess I'm missing something and no parallel multiplications are being done.

Any other ideas or suggestions?

Ansaya commented 1 year ago

Hi Raul,

This is quite unusual to hear. What is the full command line you are using to call bambu? Which frontend compiler have you chosen? I tried synthesizing the code you shared, and after adding #pragma unroll(N_J) to the internal loop of kernel_hadamard I could observe the generated kernel latency change with the number of channels specified. I started from a baseline with the following command line:

bambu hadamard.c --top-fname=kernel_hadamard --generate-tb=hadamard.c --simulate --compiler=I386_CLANG16 -O2 --channels-number=1 --channels-type=MEM_ACC_11 --memory-allocation-policy=EXT_PIPELINED_BRAM

which gave me a latency of 2462 cycles. Applying a full unroll to the internal loop, i.e. adding #pragma unroll(N_J) before it, I get down to 1286 cycles, since the single memory channel still limits the number of parallel load/store operations. Using --channels-number=8 --channels-type=MEM_ACC_NN on top of that, I got down to 346 cycles.

You may also want to try pipelined floating-point units, which you can enable with the -p=__float command line option.

Finally, if you would like to look at the state transition graph and check the generated scheduling, use the --print-dot option: it generates many .dot graphs in the HLS_output/dot directory, with a folder for each synthesized function containing the state transition graph file HLS_STGraph.dot (along with others).
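For completeness, the 346-cycle run above corresponds to a command along these lines (a sketch: the same flags as the baseline, with the channel options changed and the unroll pragma already in the source; append -p=__float or --print-dot as needed):

bambu hadamard.c --top-fname=kernel_hadamard --generate-tb=hadamard.c --simulate --compiler=I386_CLANG16 -O2 --channels-number=8 --channels-type=MEM_ACC_NN --memory-allocation-policy=EXT_PIPELINED_BRAM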

RaulMurillo commented 1 year ago

Thanks for the explanation, Michele.

First of all, I am using Bambu version 0.9.6 (I modified this version to support additional floating-point units, but I am not using that modification in this example, so I guess it is not the source of the problem). I usually compile with GCC as follows:

bambu -v3 -O2 hadamard.c  --top-fname=kernel_hadamard \
      --generate-tb=test_file.xml \
      --print-dot --pretty-print=a.c \
      --compiler=I386_GCC8 \
      --simulate \
      --channels-number=1 --channels-type=MEM_ACC_11 --memory-allocation-policy=EXT_PIPELINED_BRAM

which gave me a latency of 4462 cycles. Adding #pragma GCC unroll 20 before the internal loop had no effect with this compiler. I also tried --compiler=I386_CLANG7, which gave 4442 cycles, while adding loop unrolling gave 4486 cycles. I also got no additional performance when increasing the number of memory channels.

Do you think this is related to the version of Bambu/compiler I am using?

fabrizioferrandi commented 1 year ago

Have you looked at the BB_FCFG.dot file generated by Bambu when --print-dot is passed? There you will see what the compiler does once you add the unrolling pragma; we rely on the GCC/CLANG frontend for such transformations. Another observation: --channels-type=MEM_ACC_11 allows only one memory operation per cycle, which may limit performance even if you fully unroll your design.
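The generated graphs can be rendered with Graphviz; assuming the per-function folder layout under HLS_output/dot described earlier (the exact path may differ), something like:

dot -Tpdf HLS_output/dot/kernel_hadamard/BB_FCFG.dot -o BB_FCFG.pdf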

fabrizioferrandi commented 1 year ago

Just an additional note: try the --disable-function-proxy option to allow as many floating-point units as the unrolling needs.
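For example, appended to the earlier baseline (a sketch):

bambu hadamard.c --top-fname=kernel_hadamard --generate-tb=hadamard.c --simulate --compiler=I386_CLANG16 -O2 --channels-number=8 --channels-type=MEM_ACC_NN --memory-allocation-policy=EXT_PIPELINED_BRAM --disable-function-proxy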

RaulMurillo commented 1 year ago

Let me clarify a few things so that we are all on the same page:

Under this setup, I did multiple trials (without real success). Here are some findings:

I still don't know why loop unrolling is not giving results as good as the ones @Ansaya got (a 50% cycle reduction from unrolling alone, and much more with the memory improvements). By the way, @fabrizioferrandi, I looked at the BB_FCFG.dot file with and without the unrolling pragma, and the pragma definitely changes the compiler's behavior.

Ansaya commented 1 year ago

Hi Raul, I forgot to mention the following: by default, Bambu generates a single instance of a function even when there are multiple call points, and floating-point units are treated as functions as well, so they are not duplicated. In your case, you have multiple calls to the floating-point multiplier module, which is still a single functional unit. To duplicate it, you may pass the --fp-format=inline-math flag, which generates a dedicated functional unit for each call point.
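For example, adding it to the command line you shared earlier (a sketch):

bambu -v3 -O2 hadamard.c --top-fname=kernel_hadamard --generate-tb=test_file.xml --compiler=I386_GCC8 --simulate --channels-number=1 --channels-type=MEM_ACC_11 --memory-allocation-policy=EXT_PIPELINED_BRAM --fp-format=inline-math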

RaulMurillo commented 1 year ago

Hi Michele, thank you for the explanation. That is exactly what I supposed was happening. The flag you mentioned yields much better performance, but still not as much as expected. Including --fp-format=inline-math in the previous Bambu options, I got 3282 cycles and 18 DSPs (instead of 2). However, using #pragma unroll(N_J) and 1 memory channel produced 3286 cycles and 128 DSPs, so no real benefit in terms of delay. Even when increasing the number of channels and changing their type, the cycle count only drops to 2886, so it seems Bambu is actually not leveraging the loop unrolling.

May I ask for the full command line you used to reach 346 cycles? Did you make any modifications to the source code apart from the unroll pragma?

RaulMurillo commented 1 year ago

Another problem I found: when I include the --flopoco option (which the real experiment I am working on requires), the benefits of the --fp-format=inline-math flag disappear.
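For reference, the combination in question looks like this (a sketch of the setup above with both flags together):

bambu -v3 -O2 hadamard.c --top-fname=kernel_hadamard --generate-tb=test_file.xml --compiler=I386_GCC8 --simulate --channels-number=8 --channels-type=MEM_ACC_NN --memory-allocation-policy=EXT_PIPELINED_BRAM --flopoco --fp-format=inline-math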