fastmachinelearning / hls4ml

Machine learning on FPGAs using HLS
https://fastmachinelearning.org/hls4ml
Apache License 2.0

Discussion - Inlined Conv slows down latency significantly (up to x15 - x20) #800

Closed bo3z closed 1 year ago

bo3z commented 1 year ago

Description

  • While testing some code unrolling for the hls4ml Optimisation API, I noticed that inlining in Conv2D can allocate unnecessary RAM.
  • When tested on the current version of Conv2D (line buffer, streaming, Resource strategy, RF > 1), there is a significant difference in latency (between 3x and nearly 20x).
  • It is still unclear what causes this bug and whether it is also present for (i) the Latency strategy, (ii) RF = 1 and (iii) encoded convolution, but it certainly seems to be a bug for RF > 1 with the Resource strategy. Opening this as a discussion until further synthesis results are obtained.
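To make the change under discussion concrete, here is a minimal sketch of the kind of compute function involved, with the inline directive (which the PR removes) shown commented out. This is illustrative only: the function name, shapes, and types are hypothetical stand-ins, not the actual hls4ml template code.

```cpp
#include <cassert>

// Illustrative stand-in for a streaming Conv2D multiply-accumulate kernel
// (Resource strategy, RF > 1). Not the real hls4ml source.
template <int N_IN, int N_OUT>
void conv_mult_accumulate(const float window[N_IN],
                          const float weights[N_IN * N_OUT],
                          const float biases[N_OUT],
                          float res[N_OUT]) {
    // The PR removes an HLS inline directive like the one below;
    // with it present, latency was observed to grow by up to ~20x.
    // #pragma HLS INLINE

    for (int j = 0; j < N_OUT; j++) {
        // #pragma HLS PIPELINE  (II depends on the reuse factor in the real code)
        float acc = biases[j];
        for (int i = 0; i < N_IN; i++) {
            acc += window[i] * weights[i * N_OUT + j];
        }
        res[j] = acc;
    }
}
```

With the directive removed, the synthesis tool keeps this function as its own RTL module with its own pipeline, rather than merging it into the caller's pipeline region.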


Tests

  • Below are report files from a full Vivado synthesis and CoSim analysis of the SVHN paper model, with RF = 9.
  • Files suffixed _master correspond to the current line-buffer, Resource-strategy, streaming Conv2D implementation.
  • Files suffixed _no_pragma correspond to the implementation with the inline keyword removed, as per the PR.
  • Inspecting the report files, the models are clearly equivalent (in terms of HLS config and architecture): they use the same number of DSPs and BRAMs and a similar utilisation of LUTs & FFs. However, latencies differ by up to 20x.
  • Source of report files: https://cernbox.cern.ch/s/DK4v2KUTiBmFvYN


jmitrevs commented 1 year ago

The changes are pretty minimal and should do no harm. If they fix an issue, I am not opposed to merging them, even if we don't fully understand why.

jmitrevs commented 1 year ago

I think I'll go ahead and merge this since it causes no problems and seems to help. We can still try to understand why this is, though.

bo3z commented 1 year ago

I spoke briefly with @vloncar about this. I think that when the RTL modules are inlined, the two pipelined designs conflict with each other and the compiler gets confused. I'm still not sure whether this change would slow down the Latency strategy (if it does, it would be by one clock cycle) - worth investigating further.
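The hypothesis above can be sketched as two separately pipelined stages. Everything here is hypothetical (function names, shapes, initiation intervals): the point is only that without inlining each callee is scheduled as its own pipelined module, whereas an HLS INLINE directive on a callee merges it into the caller's pipeline region and forces the scheduler to reconcile both pipelines at once.

```cpp
#include <cassert>

// Stage with a fast pipeline (e.g. II=1). Hypothetical example code.
static void stage_a(const int in[4], int mid[4]) {
    // #pragma HLS PIPELINE II=1
    for (int i = 0; i < 4; i++) mid[i] = in[i] * 2;
}

// Stage with a slower, resource-shared pipeline (e.g. II=3 when RF > 1).
static void stage_b(const int mid[4], int out[4]) {
    // #pragma HLS PIPELINE II=3
    for (int i = 0; i < 4; i++) out[i] = mid[i] + 1;
}

void top(const int in[4], int out[4]) {
    // #pragma HLS DATAFLOW
    int mid[4];
    // Without INLINE on the callees, each remains a separate pipelined
    // module; inlining stage_b into top would merge the two pipeline
    // regions, which is the suspected source of the latency blow-up.
    stage_a(in, mid);
    stage_b(mid, out);
}
```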