Closed by calebmkim 12 months ago
Awesome, thanks for digging into this @calebmkim! The `k` dimension should affect some of this as well, but the extra registers allocated for `idx` are definitely a problem. Maybe another thing to do is use the `-b resources` backend to get a sense of how many registers we end up using after compilation, and see whether a particular pass is to blame.

Indeed; sounds great! Thanks for noticing the thing about reading/writing to memories. We certainly want to "equalize" these between the Calyx and HLS versions for fairness, and your proposal (make the HLS version match the Calyx version rather than the other way around) seems about right to me. One caveat is that, IIRC, the HLS tool is unlikely to work when "exposing" arrays partitioned with `#pragma HLS array_partition` or similar to the outside world; it may be necessary to "manually" partition these arrays to include them in the external interface. This may be why the example code doesn't expose partitioned arrays and instead internally copies from flat external arrays into internal partitioned arrays. If mucking around with the HLS version proves too annoying, there is always the alternative of adding the array-copy stage on the Calyx side to achieve fairness from the opposite direction.
About the desire for combinational logic in `static if` guards: we could do a very hacky thing and advance https://github.com/cucapra/calyx/issues/1699 by simply turning off the problematic stability check that we believe to be flawed anyway. Then the `static if` could use unconditional continuous assignments, as you proposed in https://github.com/cucapra/calyx/issues/1699#issuecomment-1744889205.
Luckily, the Vitis output divides its cycle estimates into sections, so we can see exactly how long it thinks the actual computation takes. Unfortunately (for us), their matrix-multiply algorithm takes ~3x fewer cycles, and this ratio scales consistently. Examining their setup, I don't think it would be too difficult to change the Calyx implementation to copy it and thereby reduce our cycle count.

However, their setup is a bit weird: if you look at their algorithm (copied below to make things easier to look at), their "PEs" (i.e., the things accumulating the output values) aren't really passing data to each other: they just access the (banked) input memories to get values. I think this is part of the reason their register usage is a lot lower than ours: we use registers to pass values among PEs, whereas their "PEs" always just read from memory.
```cpp
systolic1:
  for (int k = 0; k < a_col; k++) {
#pragma HLS LOOP_TRIPCOUNT min = c_size max = c_size
  systolic2:
    for (int i = 0; i < MAX_SIZE; i++) {
#pragma HLS UNROLL
    systolic3:
      for (int j = 0; j < MAX_SIZE; j++) {
#pragma HLS UNROLL
        // Get previous sum
        int last = (k == 0) ? 0 : localC[i][j];
        // Update current sum
        // Handle boundary conditions
        int a_val = (i < a_row && k < a_col) ? localA[i][k] : 0;
        int b_val = (k < b_row && j < b_col) ? localB[k][j] : 0;
        int result = last + a_val * b_val;
        // Write back results
        localC[i][j] = result;
      }
    }
  }
```
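For reference (and since the loop nest above is easy to misread), here's a plain-C++ sketch of the same output-stationary schedule with the pragmas stripped; `MAX_SIZE = 4` is an assumption to match the 4x4 experiments, and the function name is made up:

```cpp
#include <cassert>

// Mirror of the HLS kernel above, minus pragmas. The k loop is outermost,
// so each (i, j) "PE" accumulates one partial product per k iteration
// (output-stationary); k == 0 resets the accumulator.
constexpr int MAX_SIZE = 4;

void systolic_ref(const int A[MAX_SIZE][MAX_SIZE],
                  const int B[MAX_SIZE][MAX_SIZE],
                  int C[MAX_SIZE][MAX_SIZE],
                  int a_row, int a_col, int b_row, int b_col) {
  for (int k = 0; k < a_col; k++)
    for (int i = 0; i < MAX_SIZE; i++)
      for (int j = 0; j < MAX_SIZE; j++) {
        int last = (k == 0) ? 0 : C[i][j];
        // Boundary conditions: out-of-range reads contribute 0.
        int a_val = (i < a_row && k < a_col) ? A[i][k] : 0;
        int b_val = (k < b_row && j < b_col) ? B[k][j] : 0;
        C[i][j] = last + a_val * b_val;
      }
}
```

Despite the different loop order, this computes exactly the same result as a textbook i-j-k matrix multiply; only the accumulation schedule differs.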
Hmm… that's an interesting observation. This is probably already encoded in your summary above, but to be explicit about it, I note that the `localC` array is "completely" partitioned:

```cpp
// local matrices localA and localB have been partitioned in dimensions
// 1 and 2 respectively. local matrix C has been partitioned completely
```

Meaning that it actually behaves like $n^2$ individual registers instead of a real memory. That, of course, covers only the "output-stationary" results; you are right that the accesses to A and B seem to use direct connections to banked input memories.
The thing that's confusing me is this: the A and B arrays are each partitioned into `MAX_SIZE` banks, which seems to mean the total bandwidth is only `MAX_SIZE + MAX_SIZE` elements read per cycle. But there are `MAX_SIZE * MAX_SIZE` PEs. How can they all be reading a value on every cycle? Are we missing something, and the arrays are actually partitioned completely (so they're also just $n^2$ registers)?
Here is a drawing of what I think is going on:
Just to be clear, the values I've drawn in the output array C (e.g., `a * d`) are the values being accumulated into the existing "PE value" on each cycle, and I've drawn the partitions for both A and B. Essentially, each value that we access in A and B is broadcast across an entire row/column of C (which is completely partitioned such that it is essentially just $n^2$ registers, as you note).
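A quick way to convince ourselves this resolves the bandwidth puzzle: on a given `k` iteration, the `MAX_SIZE * MAX_SIZE` PEs only ever touch `MAX_SIZE` distinct elements of A (column `k`) and `MAX_SIZE` distinct elements of B (row `k`). A counting sketch (the helper name is made up):

```cpp
#include <set>
#include <utility>

constexpr int MAX_SIZE = 4;

// For a single k step of the kernel, count how many *distinct* memory
// locations the MAX_SIZE * MAX_SIZE PEs actually read from A and B.
int distinct_reads_per_step(int k) {
  std::set<std::pair<int, int>> a_reads, b_reads;
  for (int i = 0; i < MAX_SIZE; i++)
    for (int j = 0; j < MAX_SIZE; j++) {
      a_reads.insert({i, k});  // PE (i, j) reads localA[i][k]
      b_reads.insert({k, j});  // PE (i, j) reads localB[k][j]
    }
  // Only MAX_SIZE distinct elements of each array: one per bank, each
  // broadcast across a whole row (A) or column (B) of PEs.
  return static_cast<int>(a_reads.size() + b_reads.size());
}
```

So one read per bank per cycle suffices; the fan-out is wiring, not extra ports.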
(2) It takes `MAX_SIZE` cycles to write to memory after the PEs are finished. Of course, if the output memory is not partitioned along any dimension, then writing is going to take `MAX_SIZE * MAX_SIZE` cycles anyway, so it doesn't matter. Lmk if (2) doesn't make sense.
The dataflow doesn't match, so we gave up on this comparison.
What we have
I got the systolic array HLS example working and have resource estimates for a 4x4 example; however, you have to manually look inside the results folder for the resource numbers (i.e., no automatic extraction).
What we want
This requires two things:
- The HLS example should be more similar to Calyx
- Making Calyx systolic arrays better so the comparison looks better
Looking at the initial resource estimates (this could definitely be wrong, since I'm just glancing quickly at one example), the LUTs and DSPs look comparable, but it looks like we use a lot of registers. I think this is partly because the systolic array writes to a register to check whether the loop's `idx` is between certain values. This goes back to the issue raised here: if we could have combinational logic as `static if` guards, I think we could reduce our register usage by a decent amount: instead of needing a register to tell whether the index is between (for example) 3 and 7, we could just use combinational logic.
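To illustrate the point in C++ terms (hypothetical helper, made-up bounds): the guard is a pure function of the index, so in hardware it is just two comparators and an AND gate, with no state to register:

```cpp
// "Is idx in [lo, hi)" as pure combinational logic: no register needed,
// the result is recomputed from idx every cycle.
bool in_range(int idx, int lo, int hi) {
  return lo <= idx && idx < hi;
}
```

The current scheme instead stores the result of this check in a register and updates it as the index advances, which is where the extra flip-flops come from.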