UCLA-VAST / minimap2-acceleration

Hardware Acceleration of Long Read Pairwise Overlapping in Genome Sequencing: Open Source Repository
http://vast.cs.ucla.edu/sites/default/files/publications/minimap2-acc-approved.pdf
MIT License

Unable to reproduce FPGA speed-up #9

Open kisarur opened 2 years ago

kisarur commented 2 years ago

Hello!

I was able to successfully reproduce your FPGA work on an AWS F1 instance and obtain performance results to compare against a pure-software baseline. For the software baseline, I created a 14-threaded version that executes Minimap2's chaining function, mm_chain_dp (taken from the same Minimap2 version used in your testbed), on an Intel Xeon CPU.
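Roughly, the baseline works like the sketch below (ChainCall and run_chain_dp_on are simplified placeholders for one dumped mm_chain_dp invocation and the wrapper that replays it, not the exact code I used):

#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

struct ChainCall { /* anchors and parameters of one dumped mm_chain_dp call */ };

void run_chain_dp_on(ChainCall & /*call*/) {
    // In the real baseline this reconstructs the arguments and invokes
    // minimap2's mm_chain_dp() on them.
}

void run_baseline(std::vector<ChainCall> &calls, int num_threads = 14) {
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            // The dumped calls are independent, so each thread just pulls the next index.
            for (std::size_t i = next++; i < calls.size(); i = next++)
                run_chain_dp_on(calls[i]);
        });
    }
    for (auto &w : workers) w.join();
}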

To obtain the above performance results, I dumped the input/output of 30,000 function calls using the command below with the testbed provided with your work.

./minimap2 -ax map-pb c_elegans40x.fastq c_elegans40x.fastq --chain-dump-in in-c_elegans-30k.txt --chain-dump-out out-c_elegans-30k.txt --chain-dump-limit=30000 > /dev/null

The FPGA kernel time I obtained for the above input dump (in-c_elegans-30k.txt) was 1.6356766 seconds. For the same input data, my 14-threaded pure software version could complete its work in 6.649 seconds.

According to the above timing results, the speed-up over the 14-threaded software is nearly 4x. However, your publication reports a 28x speed-up against a 14-threaded software reference.

Could you please let me know if I have missed anything, or whether there are any specific considerations you took into account when doing the performance comparisons?

Thank you very much!

Kind regards, Kisaru

dotkrnl commented 2 years ago

Hi Kisaru,

I checked my notes on the raw experimental data we obtained for this particular test example: our FPGA kernel execution time was 0.30 sec, compared to your 1.64 sec. I could not find the raw CPU data, since we experimented with the whole data set in the paper; however, our highly optimized CPU kernel took 4.24 sec on this particular input. This roughly matches the results in our paper. Also, your software version might have run on a different CPU than our experiment environment (E5-2680v4), which could cause some discrepancy.

The huge gap in the FPGA kernel time might be caused by recent updates to the Vitis/Vivado HLS software; we haven't performed experiments on the recent versions. Could you please check whether the achieved II in the report is one? Also, what is the achieved operating frequency? Could you please post the vivado_hls.log/vitis_hls.log so we can look into this issue?

Thanks, Jason

kisarur commented 2 years ago

Hi Jason,

Thank you very much for your quick reply! I have used "FPGA Developer AMI v1.6.1" with "AWS EC2 FPGA Development Kit v1.4.15a" so that I could use the same Xilinx 2018.3 toolset used in your work.

Here I'm sending you the two files device_chain_kernel_vivado_hls.log and xocc_kernel-hw.log (available under kernel/hls/rpt/2018.3/kernel-hw/kernel-hw/ and kernel/hls/rpt/2018.3/kernel-hw/ respectively), which should help you identify the issue.

On a side note, since there was an error compiling memory_scheduler.cpp with GCC 4.8.5 (the default compiler on the AWS F1 machine) related to initializing variable-sized arrays, I had to make a few small modifications in memory_scheduler.cpp to initialize the arrays to zero with a loop. For example, I changed

qspan_t avg_qspan[PE_NUM] = {0};

to

qspan_t avg_qspan[PE_NUM];
for (int i = 0; i < PE_NUM; i++) { 
    avg_qspan[i] = 0;
}

However, I don't think the above change would have impacted the performance of the hardware kernel, since it's a very minor modification and the time for the memory scheduler/descheduler is not taken into account when evaluating the performance anyway.

Also, the 1.64 sec I mentioned was the kernel's end-to-end time reported after a successful FPGA kernel execution, which includes the data transfer time as well (which I believe is what you reported in the paper too).

Just to be sure that I'm experimenting with the same 30,000 function calls, it would be great if you could send me the total number of anchors (or the total number of lines) available in the in-c_elegans-30k.txt you used.

Thank you very much again for all the help in identifying this issue!

Best, Kisaru

dotkrnl commented 2 years ago

Hi Kisaru,

Thanks for the information. The problem looks more complex than I thought -- you indeed achieved II = 1. Unfortunately, Amazon is no longer sponsoring us with AWS credits, so I might not be able to rerun the experiments right away. I will try to get some credits if possible. In the meantime, there are some factors you can take into account to understand the performance difference:

  1. I double-checked the code we released here and compared it with our internal codebase. The source code here is the version with 8 PEs. The raw experimental data for 8 PEs was 0.40 sec (0.395976), compared to 0.30 sec (0.296977) with 16 PEs. To reproduce the results, you would need to duplicate the PE array and put the data into different memory channels. The kernel open-sourced here corresponds to the FPGA(8PE,transformed) configuration in Fig. 13.
  2. Note that the data transfer and the computation can be double-buffered and overlapped. The experimental results in Fig. 13 do consider this overlapping, so the time is not simply sum(transfer in, compute, transfer out). If you want an intuitive comparison, use max(transfer in, compute, transfer out) instead; see the sketch below.
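For intuition, here is a first-order sketch of that timing model (ignoring pipeline fill/drain and other overheads), not our actual measurement code:

#include <algorithm>

// Without overlap, the three phases serialize.
double unoverlapped_time(double t_in, double t_compute, double t_out) {
    return t_in + t_compute + t_out;
}

// With double buffering across many batches, the steady-state throughput is
// bounded by the slowest stage, so the total time approaches the max.
double overlapped_estimate(double t_in, double t_compute, double t_out) {
    return std::max({t_in, t_compute, t_out});
}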

Thanks, Jason

kisarur commented 2 years ago

Hi Jason,

Thank you very much for looking into this further and trying to reproduce my issue on AWS!

In response to the factors you pointed out,

  1. I see. Even with 8 PEs, if I were able to get 0.4 sec, the speed-up over my 14-threaded software would be 16x, which looks awesome. However, as I'm interested in getting the best speed-up possible, I would highly appreciate a way to generate a 16-PE version. If it can be easily enabled in the code and I've missed it, please give me instructions on how to do that.

  2. To get the timing for the dataset, I was using the kernel end-to-end time reported after an FPGA run. For example, here's the timing info printed for a sample run with in-c_elegans-30k.txt.

    ****** kernel took 0.657135 seconds to transfer in data
    ***** kernel took 1.207487 seconds to transfer in and execute
    ***** kernel took 1.636299 seconds for end-to-end

I extracted the 1.64 seconds from this, expecting that this is the timing considered in the paper. Instead, I think I should take max(transfer in, compute, transfer out) = max(0.66, 0.55, 0.43) = 0.66 sec as the timing for this dataset. However, I believe I cannot achieve this performance straight away with the current implementation, since double buffering is not implemented in it yet. Is that correct?

Thank you!

Best, Kisaru

dotkrnl commented 2 years ago

Hi Kisaru,

  1. By duplicating load_compute_store in device_chain_kernel, adding the corresponding parameters to device_chain_kernel, and mapping them to different memory channels, you can implement the 16-PE version. Sorry, but I could not find the original source code for those experiments. However, the implementation should not take too much time. For example:
void device_chain_kernel(
        block_io *anchors_0,
        block_io *returns_0,
        block_io *anchors_1,
        block_io *returns_1
        /* ... remaining anchors_*/returns_* ports and the scalar arguments
           (n, max_dist_x, max_dist_y, bw) as in the released kernel ... */) {
#pragma HLS interface m_axi port=anchors_0 offset=slave bundle=gmem_0
#pragma HLS interface m_axi port=returns_0 offset=slave bundle=gmem_0
#pragma HLS interface m_axi port=anchors_1 offset=slave bundle=gmem_1
#pragma HLS interface m_axi port=returns_1 offset=slave bundle=gmem_1
    // One load_compute_store instance per anchors/returns pair;
    // each pair is bundled to its own AXI interface / memory channel.
    load_compute_store(anchors_0, returns_0, n, max_dist_x, max_dist_y, bw);
    load_compute_store(anchors_1, returns_1, n, max_dist_x, max_dist_y, bw);
    // ... one call per additional PE-array copy ...
}

Remember to assign the DDR banks to the bundle: https://www.xilinx.com/html_docs/xilinx2017_4/sdaccel_doc/tom1504034303746.html

  2. Double buffering can be implemented in a straightforward way; Xilinx has some tutorials on this. A rough host-side sketch of the ping-pong pattern follows.
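In the sketch below, Batch and the enqueue_*/wait_for_set helpers are placeholders for the corresponding OpenCL host calls (for example clEnqueueMigrateMemObjects and clEnqueueTask on two sets of device buffers); this is not our actual host code.

#include <cstddef>
#include <vector>

struct Batch {};  // one batch of anchors prepared by the memory scheduler

// Placeholder asynchronous operations on buffer set 0 or 1.
void enqueue_transfer_in(const Batch &, int /*buf_set*/) {}  // host -> device
void enqueue_kernel(int /*buf_set*/) {}                      // launch the chaining kernel
void enqueue_transfer_out(int /*buf_set*/) {}                // device -> host
void wait_for_set(int /*buf_set*/) {}                        // block until the set is idle

void run_double_buffered(const std::vector<Batch> &batches) {
    for (std::size_t i = 0; i < batches.size(); ++i) {
        const int set = i % 2;   // alternate between the two buffer sets
        wait_for_set(set);       // the set's previous batch must have finished
        enqueue_transfer_in(batches[i], set);
        enqueue_kernel(set);
        enqueue_transfer_out(set);
        // While this set is busy, the other set's batch is still in flight, so
        // transfer-in, compute, and transfer-out of adjacent batches overlap.
    }
    wait_for_set(0);
    wait_for_set(1);
}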

Thanks, Jason