UCLA-VAST / tapa

TAPA is a dataflow HLS framework that features fast compilation, an expressive programming model, and high-frequency FPGA accelerators.
https://tapa.rtfd.io
MIT License

Questions of measuring the latency and using the FIFO stream #114

Open tangyuelm opened 1 year ago

tangyuelm commented 1 year ago

Hello,

I have two more questions and hope to get your support. The first is about latency measurement. For example, when I run your bandwidth example, I first try the traditional std::chrono::high_resolution_clock and get a latency of around 1.24e9 ns when n = 1024*1024, which seems abnormal. Then I use the approach from the vadd example and get a latency of around 1.16e7 ns. The two approaches are shown below. They look similar, yet the results differ. Could you explain why?

// double kernel_time_in_sec = 0;
// std::chrono::duration<double> kernel_time(0);
// auto kernel_start = std::chrono::high_resolution_clock::now();
int64_t kernel_time_ns = tapa::invoke(
    Bandwidth, FLAGS_bitstream,
    tapa::read_write_mmaps<float, kBankCount>(chan).vectorized(),
    n, flags);

// Stop timer
// auto kernel_end = std::chrono::high_resolution_clock::now();
// kernel_time = std::chrono::duration<double>(kernel_end - kernel_start);
// kernel_time_in_sec = kernel_time.count();
// clog << "Execution time = " << kernel_time_in_sec << " s; " << kernel_time_in_sec * 1e9 << " ns" << endl;
clog << "kernel time: " << kernel_time_ns * 1e-9 << " s; " << kernel_time_ns << " ns" << endl;

The second question is about using the FIFO. I used a stream to define a FIFO for dataflow, as shown below: tapa::stream<tapa::vec_t<float, 16>, 32> OFMstream("OFMstream"); Based on the tutorial, the depth needs to be pre-defined. However, during CSim I found that increasing the depth causes data loss, while decreasing the depth causes a long simulation time. I then set the depth to 32, which works well in CSim. However, when I test it on board, the data-loss issue happens again. Do you know what could cause such a problem? In the original HLS, the depth does not need to be assigned, so I am wondering how to find a suitable depth in TAPA?

Best Wishes, Yue

tangyuelm commented 1 year ago

I also tried the jacobi example. When I use the original shape with width = 100 and height = 100, the CSim succeeds. Then, when I try a different shape (e.g. width = 10 and height = 10), the CSim fails.

Licheng-Guo commented 1 year ago

I suggest you try the vadd example and start from there to implement your design.

You need to set the FIFO depth appropriately for your specific design. You could refer to our FPGA'21 paper for some tips.

We've never seen a situation where increasing the FIFO depth causes data loss. Could you share how to reproduce the issue?

@Blaok knows more about the timer and how to run the Jacobi example correctly.

Blaok commented 1 year ago

I have two more questions and hope to get your support. The first is about latency measurement. For example, when I run your bandwidth example, I first try the traditional std::chrono::high_resolution_clock and get a latency of around 1.24e9 ns when n = 1024*1024, which seems abnormal. Then I use the approach from the vadd example and get a latency of around 1.16e7 ns. The two approaches are shown below. They look similar, yet the results differ. Could you explain why?

The time returned by tapa::invoke depends on the device implementation. For OpenCL devices, this is the time between CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END for the enqueueNDRangeKernel events, which means it does not include time spent on host-device communication. The time measured using std::chrono is the end-to-end time, which includes host-device communication time.
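For example, a rough sketch following the bandwidth example's host code (the wall_start/wall_end/wall_time_ns names are made up for illustration): the gap between the two numbers is roughly the host-device communication and setup overhead.

// Rough sketch: end-to-end wall-clock time around tapa::invoke vs. the kernel
// time it returns; the difference is approximately the host-device overhead.
auto wall_start = std::chrono::high_resolution_clock::now();
int64_t kernel_time_ns = tapa::invoke(
    Bandwidth, FLAGS_bitstream,
    tapa::read_write_mmaps<float, kBankCount>(chan).vectorized(), n, flags);
auto wall_end = std::chrono::high_resolution_clock::now();

int64_t wall_time_ns =
    std::chrono::duration_cast<std::chrono::nanoseconds>(wall_end - wall_start)
        .count();
clog << "end-to-end: " << wall_time_ns << " ns; kernel only: " << kernel_time_ns
     << " ns; overhead: " << wall_time_ns - kernel_time_ns << " ns" << endl;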

The second question is about using the FIFO. I used a stream to define a FIFO for dataflow, as shown below: tapa::stream<tapa::vec_t<float, 16>, 32> OFMstream("OFMstream"); Based on the tutorial, the depth needs to be pre-defined. However, during CSim I found that increasing the depth causes data loss, while decreasing the depth causes a long simulation time. I then set the depth to 32, which works well in CSim. However, when I test it on board, the data-loss issue happens again. Do you know what could cause such a problem?

Are you using non-blocking writes? Otherwise, there should not be any "loss of data".
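For reference, a minimal sketch of the difference (the task and stream names below are made up): a blocking write() stalls the producer until the FIFO has space, whereas a non-blocking try_write() on a full FIFO simply returns false, and the element is lost unless the caller checks the return value and retries.

// Hypothetical producer task; Producer, q, and v are illustrative names.
void Producer(tapa::ostream<float>& q) {
  for (int i = 0; i < 1024; ++i) {
    float v = static_cast<float>(i);
    q.write(v);  // blocking: stalls when the FIFO is full, nothing is dropped
    // Non-blocking alternative: the element is silently lost on a full FIFO
    // unless the return value is checked and the write is retried, e.g.:
    // while (!q.try_write(v)) {}
  }
}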

In the original HLS, the depth does not need to be assigned, so I am wondering how to find a suitable depth in TAPA?

If I remember correctly, Vitis HLS just gives you a default depth of 1, which is not a good idea in general, because such FIFOs cannot be fully pipelined and are just as expensive as ones with depth 32.
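As a side-by-side sketch (the stream names are illustrative): in Vitis HLS the depth is attached to a stream via a pragma, while in TAPA it is the second template argument of tapa::stream, so it has to be chosen explicitly at the declaration.

// Vitis HLS: depth is set through a pragma on the stream.
// hls::stream<float> fifo("fifo");
// #pragma HLS stream variable=fifo depth=32

// TAPA: depth is part of the type; pick it explicitly per stream.
tapa::stream<float, 2> shallow("shallow");  // hard to pipeline fully
tapa::stream<float, 32> deep("deep");       // more slack between producer and consumer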

Then, when I try a different shape (e.g. width = 10 and height = 10), the CSim fails.

The jacobi code does not implement tiling and the width must be exactly 100.

tangyuelm commented 1 year ago

Hi @Blaok and @Licheng-Guo,

Thanks for your reply and the clear explanation. I now understand the latency measurement. For the FIFO, I followed the jacobi example and used .write() and .read(), so they should be blocking writes and reads. However, I also used async_mmap, since I followed the bandwidth example for 512-bit DMA. I converted it to synchronous reads and writes, and now the CSim seems correct. I hope the on-board implementation succeeds as well.
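A minimal sketch of what such a synchronous version can look like (the task, argument, and stream names are made up): a plain tapa::mmap can be indexed like an array, and the blocking stream operations keep the dataflow free of dropped elements.

// Illustrative loader: reads 512-bit vectors through a synchronous tapa::mmap
// and forwards them with blocking writes.
void LoadIFM(tapa::mmap<tapa::vec_t<float, 16>> ifm,
             tapa::ostream<tapa::vec_t<float, 16>>& ifm_q, uint64_t n) {
  for (uint64_t i = 0; i < n; ++i) {
    ifm_q.write(ifm[i]);  // blocking write; stalls instead of losing data
  }
}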

Best Wishes, Yue

tangyuelm commented 1 year ago

Dear Authors,

I have used the AutoBridge tool to compile my original CNN accelerator, which was previously implemented on one SLR. It has a PE with an unrolling factor of 8*32. It works well without manually specifying SLR connections.

However, when I increase the unrolling factor to 16*64, it reports that AutoBridge fails to find a solution. The detailed information from the .log file is attached below. It seems that I used only 78.9% of the BRAMs and 60.9% of the DSPs in total, so I guess the design is not that large. As for the individual function modules, the PE is inside the ACC function, which makes it hard for me to partition manually. Could you provide some suggestions to solve this problem? Thank you.

Best, Yue

autobridge-Sep-20-2022-07%3A00.log

Licheng-Guo commented 1 year ago

You have to break your task into smaller tasks. For example, https://github.com/UCLA-VAST/tapa/blob/release/regression/cnn/tapa/src/tapa_kernel.cpp
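As a rough illustration of what "smaller tasks" means at the top level (the task and stream names below are made up, not taken from the linked example): each stage becomes its own leaf task connected by streams, so AutoBridge can place and floorplan the stages independently.

// Hypothetical top-level task: a monolithic ACC split into Load / PE / Store
// leaf tasks connected by FIFOs, which AutoBridge can floorplan separately.
void Acc(tapa::mmap<tapa::vec_t<float, 16>> ifm,
         tapa::mmap<tapa::vec_t<float, 16>> ofm, uint64_t n) {
  tapa::stream<tapa::vec_t<float, 16>, 32> ifm_q("ifm_q");
  tapa::stream<tapa::vec_t<float, 16>, 32> ofm_q("ofm_q");

  tapa::task()
      .invoke(Load, ifm, ifm_q, n)    // DRAM -> FIFO
      .invoke(PE, ifm_q, ofm_q, n)    // compute stage
      .invoke(Store, ofm_q, ofm, n);  // FIFO -> DRAM
}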

tangyuelm commented 1 year ago

Hello Licheng,

Thank you for the reply and example. I will try to break the task.

Best, Yue

logosAllen commented 1 year ago

I suggest you try the vadd example and start from there to implement your design.

You need to set the FIFO depth appropriately for your specific design. You could refer to our FPGA'21 paper for some tips.

We've never seen a situation where increasing the FIFO depth causes data loss. Could you share how to reproduce the issue?

@Blaok knows more about the timer and how to run the Jacobi example correctly.

Hi, could you tell me which paper "FPGA'21" refers to? AutoSA or AutoBridge?

I am looking for a flow similar to the Vitis HLS dataflow co-simulation read/write blocking-time analysis in the second lab of XD099: https://xilinx.github.io/Vitis-Tutorials/2022-1/build/html/docs/Hardware_Acceleration/Feature_Tutorials/03-dataflow_debug_and_optimization/fifo_sizing_and_deadlocks.html

What is the difference between the idea in the "FPGA'21" paper and the second lab of XD099? Thanks.