doonny / PipeCNN

An OpenCL-based FPGA Accelerator for Convolutional Neural Networks
Apache License 2.0

De1-SoC Performance Different from Example Given ! #143

Open mingyi136 opened 4 years ago

mingyi136 commented 4 years ago

Hi @doonny, I have run inference on the DE1-SoC board with VEC_SIZE=8 and LANE_NUM=8 (other parameters unchanged).

However, the total kernel runtime is 236.344 ms instead of the 149.988 ms given in the example. Are there any additional changes in parameters or code compared to the old version?
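For context, VEC_SIZE and LANE_NUM are compile-time constants baked into the bitstream, so changing them means editing the defines and rebuilding conv.aocx. A minimal sketch of the two defines in question (the file name device/hw_param.cl and the comments are my assumptions based on the PipeCNN sources, not confirmed in this thread):

```c
/* Sketch only: parameter names come from this thread; the exact file
 * (device/hw_param.cl) and the meaning noted in the comments are assumptions. */
#define VEC_SIZE  8   /* input-channel vectorization width of the Conv kernel */
#define LANE_NUM  8   /* number of parallel output-feature lanes */
```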

Here is my inference result:

***************************************************
PipeCNN: An OpenCL-Based FPGA Accelerator for CNNs
***************************************************

Platform: Intel(R) FPGA SDK for OpenCL(TM)
Totally 1 device(s) are found
  Using Device 0: de1soc_sharedonly : Cyclone V SoC Development Kit
Device OpenCL Version: OpenCL 1.0 Intel(R) FPGA SDK for OpenCL(TM), Version 16.1
Device Max Compute Units: 1
Device Max WorkGroup Size: 2147483647
Device Max WorkItem Size: 2147483647
Device Global Memory Size: 512 MBytes
Device Local Memory Size: 16 KBytes
Device Max Clock Freq: 1000 Mhz

Loading kernel/binary from file conv.aocx
Reprogramming device [0] with handle 1

61063552 total weights read
1024 total output reference read

154587 bytes image data read from binary files

Executing Layer 1:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 8  (global size: 55, 55, 96)

Launching single work-item kernel Pool

Launching kernel lrn with local size: 1, 1, 12  (global size: 27, 27, 12)

Executing Layer 2:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 8  (global size: 27, 27, 256)

Launching single work-item kernel Pool

Launching kernel lrn with local size: 1, 1, 32  (global size: 13, 13, 32)

Executing Layer 3:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 8  (global size: 13, 13, 384)

Executing Layer 4:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 8  (global size: 13, 13, 384)

Executing Layer 5:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 8  (global size: 13, 13, 256)

Launching single work-item kernel Pool

Executing Layer 6:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 8  (global size: 1, 1, 4096)

Executing Layer 7:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 8  (global size: 1, 1, 4096)

Executing Layer 8:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 8  (global size: 1, 1, 1024)

Copyed all batched results from fc_1 buffers.
Selected item = 0 from the combined batch results in fc buffers

Start verifying results ...

Check Pass !!!

The inference result is n02123045 tabby, tabby cat   (the prob is 56.00)

PipeCNN exited !!!

-------------------

Performance Summary

Kernel runtime summary:
  Layer-1:
    MemRd: 70.738 ms
    Conv : 70.578 ms
    Pool : 70.346 ms
    MemWr: 70.486 ms
    Lrn  : 1.383 ms
  Layer-2:
    MemRd: 56.435 ms
    Conv : 56.304 ms
    Pool : 56.106 ms
    MemWr: 56.241 ms
    Lrn  : 0.456 ms
  Layer-3:
    MemRd: 39.022 ms
    Conv : 38.899 ms
    Pool : 0.000 ms
    MemWr: 38.827 ms
    Lrn  : 0.000 ms
  Layer-4:
    MemRd: 28.978 ms
    Conv : 28.854 ms
    Pool : 0.000 ms
    MemWr: 28.788 ms
    Lrn  : 0.000 ms
  Layer-5:
    MemRd: 19.408 ms
    Conv : 19.272 ms
    Pool : 19.081 ms
    MemWr: 19.209 ms
    Lrn  : 0.000 ms
  Layer-6:
    MemRd: 14.490 ms
    Conv : 14.371 ms
    Pool : 0.000 ms
    MemWr: 14.262 ms
    Lrn  : 0.000 ms
  Layer-7:
    MemRd: 6.538 ms
    Conv : 6.423 ms
    Pool : 0.000 ms
    MemWr: 6.344 ms
    Lrn  : 0.000 ms
  Layer-8:
    MemRd: 1.758 ms
    Conv : 1.642 ms
    Pool : 0.000 ms
    MemWr: 1.562 ms
    Lrn  : 0.000 ms

Total kernel runtime 236.344 ms
Batch size = 1, average process time per batch: 236.344 ms

Total runtime: 0.241783s
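A quick sanity check on the summary above: because PipeCNN's kernels execute as a concurrent pipeline, the reported total is close to the sum of the per-layer Conv times alone, not the sum of every kernel's time (the numbers are copied from the log above; the pipelining interpretation is my reading, not confirmed in this thread):

```python
# Per-layer Conv kernel times (ms), copied from the performance summary above.
conv_ms = [70.578, 56.304, 38.899, 28.854, 19.272, 14.371, 6.423, 1.642]

total = sum(conv_ms)
print(f"Sum of Conv times: {total:.3f} ms")  # ~236.343 ms, matching the
                                             # reported total of 236.344 ms
```

This also shows where the time goes: layers 1 and 2 alone account for more than half of the total, so any slowdown there dominates the end-to-end number.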
mingyi136 commented 4 years ago

Referring to this issue: https://github.com/doonny/PipeCNN/issues/46#issue-296168982

The total kernel runtime of 149.988 ms on the DE1-SoC board seems to have been achieved with VEC_SIZE=8 and LANE_NUM=8, too.

mingyi136 commented 4 years ago

@doonny I have regenerated both run.exe and conv.aocx using the 2018 version of the PipeCNN repository (VEC_SIZE=8 and LANE_NUM=8).

This time I managed to get a total kernel runtime of 157.928 ms. Just wondering, why does the latest version of the PipeCNN repository give slower inference performance on the DE1-SoC board?

doonny commented 4 years ago

May I ask which version of the SDK you are using for compilation?

sergio14890 commented 4 years ago

@mingyi136 where did you download the BSP for the DE1-SoC board?

mingyi136 commented 4 years ago

> May I ask which version of the SDK you are using for compilation?

@doonny I compiled conv.aocx using OpenCL SDK 17.1 on Windows, whereas run.exe was compiled using OpenCL SDK 16.1 on the DE1-SoC board (Linux).

sergio14890 commented 4 years ago

> May I ask which version of the SDK you are using for compilation?
>
> @doonny I compiled conv.aocx using OpenCL SDK 17.1 on Windows, whereas run.exe was compiled using OpenCL SDK 16.1 on the DE1-SoC board (Linux).

Ah okay, I tried compiling with OpenCL SDK 17.1, but I got this error: https://github.com/doonny/PipeCNN/issues/135

Do you have a license for the 16.1 SDK?

mingyi136 commented 4 years ago

@sergio14890, I downloaded the Linux SD card image (which includes OpenCL 16.1) from here: https://software.intel.com/content/www/us/en/develop/topics/fpga-academic/learn/tutorials.html and used it to compile run.exe.

conv.aocx, on the other hand, was compiled using OpenCL 17.1, pointing to the DE1-SoC BSP (OpenCL 16.0), which is available here: https://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=English&No=836&PartNo=4

doonny commented 4 years ago

The latest code is optimized for SDK v19.1; some of its features are not supported by older versions such as v16.1 and v17.1. We suggest upgrading the SDK to v19.1.