Closed by calebmkim 12 months ago
Awesome, thanks for digging into this @calebmkim! The `k` dimension should affect some of this as well, but the extra registers allocated for `idx` are definitely a problem. Maybe another thing to do is use the `-b resources` backend to get a sense of how many registers we end up using after compilation, and see whether a particular pass is to blame.

Indeed; sounds great! Thanks for noticing the thing about reading/writing to memories. We certainly want to "equalize" these between the Calyx and HLS versions for fairness, and your proposal (make the HLS version match the Calyx version rather than the other way around) seems about right to me. One caveat is that, IIRC, the HLS tool is unlikely to work when "exposing" arrays partitioned with `#pragma HLS array_partition` or similar to the outside world; it may be necessary to "manually" partition these arrays to include them in the external interface. This may be why the example code doesn't expose partitioned arrays and instead internally copies from flat external arrays into internal partitioned arrays. If mucking around with the HLS version proves too annoying, there is always the alternative of adding the array-copy stage on the Calyx side to achieve fairness from the opposite direction.
About the desire for combinational logic in `static if` guards: we could do a very hacky thing and advance https://github.com/cucapra/calyx/issues/1699 by simply turning off the problematic stability check that we believe to be flawed anyway. Then the `static if` could use unconditional continuous assignments, as you proposed in https://github.com/cucapra/calyx/issues/1699#issuecomment-1744889205.
Luckily, the Vitis output divides its cycle estimates into sections, so we can see exactly how long it thinks the actual computation takes. Unfortunately (for us), their matrix-multiply algorithm takes ~3x fewer cycles, and this ratio scales consistently. Examining their setup, I don't think it would be too difficult to change the Calyx implementation to copy it and thereby reduce our cycle count.

However, their setup is a bit weird: if you look at their algorithm (copied below to make things easier to look at), their "PEs" (i.e., the things accumulating the output values) aren't really passing data to each other: they just access the (banked) input memories to get values. I think this is part of the reason their register usage is a lot lower than ours: we use registers to pass values among PEs, whereas their "PEs" always just read from memory.
```cpp
systolic1:
  for (int k = 0; k < a_col; k++) {
#pragma HLS LOOP_TRIPCOUNT min = c_size max = c_size
  systolic2:
    for (int i = 0; i < MAX_SIZE; i++) {
#pragma HLS UNROLL
    systolic3:
      for (int j = 0; j < MAX_SIZE; j++) {
#pragma HLS UNROLL
        // Get previous sum
        int last = (k == 0) ? 0 : localC[i][j];
        // Update current sum
        // Handle boundary conditions
        int a_val = (i < a_row && k < a_col) ? localA[i][k] : 0;
        int b_val = (k < b_row && j < b_col) ? localB[k][j] : 0;
        int result = last + a_val * b_val;
        // Write back results
        localC[i][j] = result;
      }
    }
  }
```
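For reference (and since the loop nest above is easy to misread), here's a plain-C++ sketch of the same output-stationary schedule with the pragmas stripped; `MAX_SIZE = 4` is an assumption to match the 4x4 experiments, and the function name is made up:

```cpp
#include <cassert>

// Mirror of the HLS kernel above, minus pragmas. The k loop is outermost,
// so each (i, j) "PE" accumulates one partial product per k iteration
// (output-stationary); k == 0 resets the accumulator.
constexpr int MAX_SIZE = 4;

void systolic_ref(const int A[MAX_SIZE][MAX_SIZE],
                  const int B[MAX_SIZE][MAX_SIZE],
                  int C[MAX_SIZE][MAX_SIZE],
                  int a_row, int a_col, int b_row, int b_col) {
  for (int k = 0; k < a_col; k++)
    for (int i = 0; i < MAX_SIZE; i++)
      for (int j = 0; j < MAX_SIZE; j++) {
        int last = (k == 0) ? 0 : C[i][j];
        // Boundary conditions: out-of-range reads contribute 0.
        int a_val = (i < a_row && k < a_col) ? A[i][k] : 0;
        int b_val = (k < b_row && j < b_col) ? B[k][j] : 0;
        C[i][j] = last + a_val * b_val;
      }
}
```

Despite the different loop order, this computes exactly the same result as a textbook i-j-k matrix multiply; only the accumulation schedule differs.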
Hmm… that's an interesting observation. This is probably already encoded in your summary above, but to be explicit about it, I note that the `localC` array is "completely" partitioned:

```cpp
// local matrices localA and localB have been partitioned in dimensions
// 1 and 2 respectively. local matrix C has been partitioned completely
```

Meaning that it actually behaves like $n^2$ individual registers instead of a real memory. That, of course, covers only the "output-stationary" results; you are right that the accesses to A and B seem to use direct connections to banked input memories.
The thing that's confusing me is this: the A and B arrays are each partitioned into `MAX_SIZE` banks, which seems to mean the total bandwidth is only `MAX_SIZE + MAX_SIZE` elements read per cycle. But there are `MAX_SIZE * MAX_SIZE` PEs. How can they all be reading a value on every cycle? Are we missing something, and the arrays are actually partitioned completely (so they're also just $n^2$ registers)?
Here is a drawing of what I think is going on:
Just to be clear, the values I've drawn in the output array C (e.g., `a * d`) are the values being accumulated into the existing "PE value" on each cycle, and I've drawn the partitions for both A and B. Essentially, each value that we access in A and B is broadcast across an entire row/column of C (which is completely partitioned such that it is essentially just $n^2$ registers, as you note).
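A quick way to convince ourselves this resolves the bandwidth puzzle: on a given `k` iteration, the `MAX_SIZE * MAX_SIZE` PEs only ever touch `MAX_SIZE` distinct elements of A (column `k`) and `MAX_SIZE` distinct elements of B (row `k`). A counting sketch (the helper name is made up):

```cpp
#include <set>
#include <utility>

constexpr int MAX_SIZE = 4;

// For a single k step of the kernel, count how many *distinct* memory
// locations the MAX_SIZE * MAX_SIZE PEs actually read from A and B.
int distinct_reads_per_step(int k) {
  std::set<std::pair<int, int>> a_reads, b_reads;
  for (int i = 0; i < MAX_SIZE; i++)
    for (int j = 0; j < MAX_SIZE; j++) {
      a_reads.insert({i, k});  // PE (i, j) reads localA[i][k]
      b_reads.insert({k, j});  // PE (i, j) reads localB[k][j]
    }
  // Only MAX_SIZE distinct elements of each array: one per bank, each
  // broadcast across a whole row (A) or column (B) of PEs.
  return static_cast<int>(a_reads.size() + b_reads.size());
}
```

So one read per bank per cycle suffices; the fan-out is wiring, not extra ports.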
(2) It takes `MAX_SIZE` cycles to write to memory after the PEs are finished. Of course, if the output memory is not partitioned along any dimension, then writing is going to take `MAX_SIZE * MAX_SIZE` cycles anyway, so it doesn't matter. Lmk if (2) doesn't make sense.
The dataflow doesn't match, so we gave up on this comparison.
What we have
I got the systolic array HLS example working and have resource estimates for a 4x4 example; however, you have to manually look inside the results folder for the resource numbers (i.e., no automatic extraction).
What we want
This requires two things:
- The HLS example should be more similar to Calyx
- Making Calyx systolic arrays better so the comparison looks better
Looking at the initial resource estimates (this could definitely be wrong, since I'm just glancing quickly at one example), the LUTs and DSPs look comparable, but it looks like we use a lot of registers. I think this is partly because the systolic array writes to a register to check whether the loop's `idx` is between certain values. This goes back to the issue raised here: if we could have combinational logic as `static if` guards, I think we could reduce our register usage by a decent amount: instead of needing a register to tell whether the index is between (for example) 3 and 7, we could just use combinational logic.
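To illustrate the point in C++ terms (hypothetical helper, made-up bounds): the guard is a pure function of the index, so in hardware it is just two comparators and an AND gate, with no state to register:

```cpp
// "Is idx in [lo, hi)" as pure combinational logic: no register needed,
// the result is recomputed from idx every cycle.
bool in_range(int idx, int lo, int hi) {
  return lo <= idx && idx < hi;
}
```

The current scheme instead stores the result of this check in a register and updates it as the index advances, which is where the extra flip-flops come from.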