cornell-zhang / heterocl

HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Heterogeneous Computing
https://cornell-zhang.github.io/heterocl/
Apache License 2.0

Support for non-blocking read and write #282

Open hecmay opened 3 years ago

hecmay commented 3 years ago

The current on-chip data movement primitive always creates a blocking FIFO channel between consumers and producers. We should also provide a non-blocking option. It is easy to replace the blocking FIFO read/write APIs with their non-blocking counterparts in HLS code generation, but functional correctness can no longer be guaranteed, since a non-blocking FIFO access may not succeed...

// blocking case
int read = ch.read();

// non-blocking case
int read;
ch.read_nb(read);

In practical use cases, non-blocking read or write operations usually come with an if-then-else condition (i.e., if the FIFO read or write succeeds, do something; otherwise, jump to the else branch). However, this programming style tightly couples the communication patterns with the algorithm, which is the opposite of what we want...

hls::stream<int> ch;
ch.write(data);

int temp;
if (ch.read_nb(temp)) {
   // FIFO not empty...
} else {
   // empty FIFO...
}
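To make the success/failure semantics concrete, here is a minimal software sketch in Python. The `Fifo` class is a toy model written for illustration only; it is not the real `hls::stream`, and returning an `(ok, value)` tuple from `read_nb` is an assumption made to fit Python.

```python
from collections import deque


class Fifo:
    """Toy software model of a bounded FIFO (hypothetical, not hls::stream)."""

    def __init__(self, depth):
        self.depth = depth
        self.items = deque()

    def write_nb(self, value):
        # Return False instead of stalling when the FIFO is full.
        if len(self.items) >= self.depth:
            return False
        self.items.append(value)
        return True

    def read_nb(self):
        # Return (ok, value); value is None when the FIFO is empty.
        if not self.items:
            return False, None
        return True, self.items.popleft()


ch = Fifo(depth=1)
first_write = ch.write_nb(7)    # succeeds: the FIFO has room
second_write = ch.write_nb(8)   # fails: the FIFO is full, 8 is not stored
ok1, v1 = ch.read_nb()          # succeeds and returns 7
ok2, v2 = ch.read_nb()          # fails: the FIFO is empty
```

The caller has to branch on the returned flag in every case, which is where the coupling with the algorithm comes from.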

Here comes the problem: how should we handle the non-blocking data movement while still maintaining the decoupled optimization interface? One possible solution is to insert an IF statement right after the FIFO access:

int read;
bool val = ch.read_nb(read);
if (val) {
    // ... following consumer stages 
}

However, this can cause problems when the non-blocking read happens inside a loop nest, e.g., when the value read is written to an array indexed by the loop iterators: a failed read would leave that element unwritten. Any advice on how we should handle it? @zhangzhiru @seanlatias
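The loop-nest concern can be reproduced with a small Python simulation. This is only a sketch under assumptions: `consume` and the `arrivals` schedule (loop iteration mapped to the values the producer delivers at that iteration) are hypothetical names, and a `deque` stands in for the FIFO.

```python
from collections import deque


def consume(arrivals, n):
    """Do one non-blocking read per iteration and store hits in out[i]."""
    channel = deque()
    out = [None] * n
    for i in range(n):
        # Values the producer delivers at iteration i (possibly none).
        channel.extend(arrivals.get(i, []))
        if channel:                      # models a successful read_nb
            out[i] = channel.popleft()
        # On failure the iteration is spent and out[i] stays unwritten.
    return out


# Producer keeps pace with the consumer: every element lands in place.
on_time = consume({0: [1], 1: [2], 2: [3], 3: [4]}, 4)

# Producer is one iteration late for the second value.
late = consume({0: [1], 2: [2], 3: [3, 4]}, 4)
```

In the late case the result is `[1, None, 2, 3]`: one slot is never written, the later values are shifted by one index, and the last value is still sitting in the channel when the loop ends.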

hecmay commented 3 years ago

The Insider compiler requires the program to be constructed with blocking write and non-blocking read. The computation is only performed after the non-blocking read succeeds: https://github.com/zainryan/INSIDER-System/blob/master/apps/device/knn/kernels/app_knn.cpp

I think I can use the workaround mentioned above to solve the issue. The users should be responsible for the correctness if they want to use non-blocking read and write.

zhangzhiru commented 3 years ago

However, this programming style highly decouples the communication patterns with algorithms, which is the opposite of what we want.

@Hecmay what do you mean by this coding style decouples communication from algorithm? Isn't it the opposite?

hecmay commented 3 years ago

Sorry, it's a typo. Yeah, I mean that non-blocking operations in real-world applications cannot be easily decoupled from the algorithm.

zhangzhiru commented 3 years ago

For nonblocking accesses, we will have to explicitly add new APIs similar to nb_read, nb_write.

Any quick workaround to make Insider use blocking read instead?

hecmay commented 3 years ago

Maybe we can have a try. From the Insider README:

due to the potential bug in Xilinx Vivado HLS, please insists using non-blocking read read_nb and blocking write write to operate Insider queues

zhangzhiru commented 3 years ago

Is there a mention of the Vivado HLS version#? Anyway, hope the problem has already been resolved.

hecmay commented 3 years ago

Using separate read_nb() or write_nb() APIs in HeteroCL does not seem to be the right approach, since they are tightly coupled with the algorithm, and we want a decoupled communication interface that makes no assumptions about the communication patterns before scheduling.

I think it is still possible to decouple non-blocking ops from the algorithmic spec, though there are some limitations, admittedly. Here is a simple program for illustration: tensor B reads from tensor A, and tensor C does not depend on tensor A or B. We want to create non-blocking read and write channels between A and B.

def kernel(inputs):
    A = hcl.compute(shape, func(inputs), "A")
    B = hcl.compute(shape, func(A), "B")
    C = hcl.compute(shape, lambda *args: 0, "C")
  1. Non-blocking read. We can add an option in the API to specify which operations should run when the non-blocking read fails. In this case, tensor C is computed when the read fails. In the generated HLS code, we can use an if-then-else statement to express this logic.
# python schedule
s.to(inputs, kernel.A, read_nb=True, read_fail=[kernel.C])

// Generated HLS C code
float A_data;
if (A_channel.read_nb(A_data)) {
    // all the dependent stages attached here...
} else {
    // insert the stages specified by the `read_fail` option
    // compiler needs to check that there is no data dependency on A in this scope
}
  2. Non-blocking write. If the write_fail option is left empty, the else branch is removed, meaning that nothing is executed when the non-blocking write fails.
# python schedule
s.to(inputs, kernel.A, write_nb=True, write_fail=[kernel.C])

if (A_channel.write_nb(data)) {
    // all the dependent stages attached here 
} else {
    // insert the stages specified by the `write_fail` option
    // compiler needs to check that there is no data dependency on A in this scope
}
  3. When there is a mix of non-blocking read and write operations, we can create nested if-then-else statements. This approach minimizes the coupling between communication customization and algorithmic specification, but it is not very intuitive from a programmer's perspective. I do not think this is a blocking issue, though, since the non-blocking control logic in realistic applications is usually not very complicated.
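The nested control flow for a mixed non-blocking read and write can be sketched as a small Python model. This is not generated HeteroCL output: `step`, `out_depth`, and the log strings are hypothetical, `deque` objects stand in for the FIFOs, and stage B is reduced to `a + 1` purely for illustration.

```python
from collections import deque


def step(in_ch, out_ch, out_depth, log):
    """One invocation of the lowered code: read_nb, then write_nb."""
    if in_ch:                               # non-blocking read succeeded
        a = in_ch.popleft()
        b = a + 1                           # stage B, depends on A
        if len(out_ch) < out_depth:         # non-blocking write succeeded
            out_ch.append(b)
            log.append("B written")
        else:
            log.append("write_fail stages")  # b is dropped in this model
    else:
        log.append("read_fail stages")       # e.g. stage C, independent of A


log = []
in_ch, out_ch = deque([1]), deque()
step(in_ch, out_ch, 1, log)   # read ok, write ok
step(in_ch, out_ch, 1, log)   # input empty: read fails
in_ch.append(5)
step(in_ch, out_ch, 1, log)   # read ok, but output FIFO is full
```

Each fallback branch may only contain stages that do not depend on the failed access, which is the check the compiler would have to perform.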

zhangzhiru commented 3 years ago

Interesting idea. You're basically trying to support an exception handler in .to(). In fact, kernel C should not be explicitly called in the original program. I thought about supporting try-catch to handle nonblocking accesses.

There are some pros and cons -- compared to having explicit nonblocking reads/writes, we cannot directly use the empty/full conditions, meaning that there is no easy way to do AND/OR of the conditions of multiple channels.
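As a rough illustration of what explicit status checks make possible, the Python sketch below combines the emptiness of two channels with AND and OR. The function names `join_step` and `merge_step` are hypothetical, and `deque` truthiness stands in for a "not empty" test on a FIFO.

```python
from collections import deque


def join_step(ch_a, ch_b):
    """AND: consume one item from each channel only when both have data."""
    if ch_a and ch_b:
        return ch_a.popleft() + ch_b.popleft()
    return None


def merge_step(ch_a, ch_b):
    """OR: consume from whichever channel has data, preferring ch_a."""
    if ch_a or ch_b:
        return (ch_a or ch_b).popleft()
    return None


a, b = deque([1]), deque()
r1 = join_step(a, b)      # b is empty, so nothing is consumed
r2 = merge_step(a, b)     # a has data, so its item is consumed
a.append(3)
b.append(2)
r3 = join_step(a, b)      # both have data: 3 + 2
```

A per-channel failure handler sees only one channel's status at a time, so conditions like these would need some additional mechanism in the exception style.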

Let's do more brainstorming here. I'm leaning towards supporting the exception style of declarative code and allowing read_nb in the imperative code.