fastmachinelearning / hls4ml

Machine learning on FPGAs using HLS
https://fastmachinelearning.org/hls4ml
Apache License 2.0

Discussion - Inlined Conv slows down latency significantly (up to x15 - x20) #800

Closed bo3z closed 1 year ago

bo3z commented 1 year ago

Description

  • While testing some code unrolling for the hls4ml Optimisation API, I noticed that inlining in Conv2D can allocate unnecessary RAM.
  • When tested on the current version of Conv2D (line buffer, streaming, Resource strategy, RF > 1), there is a significant difference in latency (between 3x and nearly 20x).
  • It is still unclear what causes this bug and whether it is also present for (i) the Latency strategy, (ii) RF = 1 and (iii) encoded convolution, but it certainly seems to be a bug for RF > 1 with the Resource strategy. Opening this as a discussion until further synthesis results are obtained.
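To make the change under discussion concrete, here is a minimal sketch of the kind of compute function involved, with the inline directive (which the PR removes) shown commented out. This is illustrative only: the function name, shapes, and types are hypothetical stand-ins, not the actual hls4ml template code.

```cpp
#include <cassert>

// Illustrative stand-in for a streaming Conv2D multiply-accumulate kernel
// (Resource strategy, RF > 1). Not the real hls4ml source.
template <int N_IN, int N_OUT>
void conv_mult_accumulate(const float window[N_IN],
                          const float weights[N_IN * N_OUT],
                          const float biases[N_OUT],
                          float res[N_OUT]) {
    // The PR removes an HLS inline directive like the one below;
    // with it present, latency was observed to grow by up to ~20x.
    // #pragma HLS INLINE

    for (int j = 0; j < N_OUT; j++) {
        // #pragma HLS PIPELINE  (II depends on the reuse factor in the real code)
        float acc = biases[j];
        for (int i = 0; i < N_IN; i++) {
            acc += window[i] * weights[i * N_OUT + j];
        }
        res[j] = acc;
    }
}
```

With the directive removed, the synthesis tool keeps this function as its own RTL module with its own pipeline, rather than merging it into the caller's pipeline region.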


Tests

  • Below are report files from a full Vivado synthesis and CoSim analysis of the SVHN paper model, with RF = 9.
  • Files suffixed _master correspond to the current line-buffer, Resource-strategy, streaming Conv2D implementation.
  • Files suffixed _no_pragma correspond to the implementation with the inline keyword removed, as per the PR.
  • Inspecting the report files, the models are clearly equivalent (in terms of HLS config and architecture): they use the same number of DSPs and BRAMs and a similar utilisation of LUTs & FFs. However, latencies differ by up to 20x.
  • Source of report files: https://cernbox.cern.ch/s/DK4v2KUTiBmFvYN


jmitrevs commented 1 year ago

The changes are pretty minimal and should do no harm. If they fix an issue, I am not opposed to merging them, even if we don't fully understand why.

jmitrevs commented 1 year ago

I think I'll go ahead and merge this since it causes no problems and seems to help. We can still try to understand why this is, though.

bo3z commented 1 year ago

I spoke briefly with @vloncar about this. I think that when the RTL modules are inlined, the two pipelined designs conflict with each other and the compiler gets confused. I'm still not sure whether this change would slow down the Latency strategy (if it does, it would be by one clock cycle) - worth investigating further.
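The hypothesis above can be sketched as two separately pipelined stages. Everything here is hypothetical (function names, shapes, initiation intervals): the point is only that without inlining each callee is scheduled as its own pipelined module, whereas an HLS INLINE directive on a callee merges it into the caller's pipeline region and forces the scheduler to reconcile both pipelines at once.

```cpp
#include <cassert>

// Stage with a fast pipeline (e.g. II=1). Hypothetical example code.
static void stage_a(const int in[4], int mid[4]) {
    // #pragma HLS PIPELINE II=1
    for (int i = 0; i < 4; i++) mid[i] = in[i] * 2;
}

// Stage with a slower, resource-shared pipeline (e.g. II=3 when RF > 1).
static void stage_b(const int mid[4], int out[4]) {
    // #pragma HLS PIPELINE II=3
    for (int i = 0; i < 4; i++) out[i] = mid[i] + 1;
}

void top(const int in[4], int out[4]) {
    // #pragma HLS DATAFLOW
    int mid[4];
    // Without INLINE on the callees, each remains a separate pipelined
    // module; inlining stage_b into top would merge the two pipeline
    // regions, which is the suspected source of the latency blow-up.
    stage_a(in, mid);
    stage_b(mid, out);
}
```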