chili-chips-ba / openCologne

Spicing up the first and only EU FPGA chip with a flashy new board, loaded with a suite of engaging demos and examples. https://www.chili-chips.xyz/open-cologne
https://nlnet.nl/project/openCologne
BSD 3-Clause "New" or "Revised" License

Timing constraints and 'report_timing' in CologneChip proprietary PNR?! #18

Closed chili-chips-ba closed 2 months ago

chili-chips-ba commented 4 months ago

Here is a good question from @TurboVega:

"... does anyone know how (command line option, maybe) to get more detailed information out of the P/R tool, such that we can determine exactly what "paths" are responsible for the maximum clock rates computed by the tool? In its output, it just labels the clocks with generated names, as it does when referring to various FPGA internal components, making it hard to know what Verilog source entities are involved. Knowing what parts of the source affect long paths helps in optimizing speed..."

pu-cc commented 4 months ago

Due to mapping and various optimizations during implementation in P&R, it is not possible to keep all signals and names for cross-referencing. However, registers remain identical, and can be found in the *.crf output. The file is generated automatically after P&R if the +crf flag is set.

Here is an example: You find the critical path information with highlighted start and end in the P&R log: image

The CPE names are made up of the component number (_a) and the CPE part (/1) or (/2). In this example, the starting flip-flop has component number 110, part 2 (110/2 or _a110/OUT2). You find it in the CRF file as follows: image

In the post-synthesis netlist (*_synth.v), it has the instance name _3208_, and you will find the flip-flop with reference to downsampler_inst.generalcounter[15] in your code: image

Similarly, we also find the target flip-flop (100/1 or _a100/OUT1) in the CRF: image

In the post-synthesis netlist, you find it as instance _3199_: image

To optimize your critical path, you can now examine the path between the generalcounter[{15,6}] registers in your code and shorten it if necessary.
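The name mapping described above (log reference 110/2, CRF name _a110/OUT2) can be sketched in Python. This is a minimal sketch, not a CologneChip tool: the function names are hypothetical, and because the CRF layout appears here only as a screenshot, the lookup is a plain substring search that assumes nothing about the file beyond the name appearing in it.

```python
import re
from typing import List


def log_to_crf_name(cpe: str) -> str:
    """Convert a P&R-log CPE reference such as '110/2' to the CRF form '_a110/OUT2'."""
    match = re.fullmatch(r"(\d+)/([12])", cpe.strip())
    if match is None:
        raise ValueError(f"not a CPE reference: {cpe!r}")
    component, part = match.groups()
    return f"_a{component}/OUT{part}"


def find_in_crf(crf_text: str, cpe: str) -> List[str]:
    """Return every line of the CRF text that mentions the given CPE output.

    Assumption: the CRF is plain text and contains the '_a<N>/OUT<part>' name
    literally; no other knowledge of the file format is used.
    """
    name = log_to_crf_name(cpe)
    return [line for line in crf_text.splitlines() if name in line]
```

With the two flip-flops from the example, `log_to_crf_name("110/2")` gives `_a110/OUT2` and `log_to_crf_name("100/1")` gives `_a100/OUT1`. The matching lines then lead to the netlist instance names, which still have to be looked up in *_synth.v by hand.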

TurboVega commented 4 months ago

Thanks very much for your reply.

chili-chips-ba commented 4 months ago

@pu-cc good tips 💯

Still, how do we do random timing queries (such as report_timing) in the CologneChip framework?

Is there a document that describes scripts and procedures to use if they are not based on the de-facto industry standard SDC?!

chili-chips-ba commented 4 months ago

@pu-cc - How do we go about specifying timing constraints for GateMate?

From earlier experience (see this), nextpnr is not very timing-savvy, if at all (@MikeReznikov for additional comment).

While we expect CologneChip proprietary P_R tool to be better than nextpnr in terms of timing awareness, this is to seek additional info on that topic.

chili-chips-ba commented 2 months ago

@pu-cc , @DadoCCAG -- Your answers to the above questions have become uber-critical at this point!

We are seeing that PicoRV32, which is the essential element of our TetriSaraj application, does not work properly at 100MHz. There are timing violations in hardware. They are not reported, which is expected, as we currently don't have any clock constraints in the build.

While we have blindly reduced the PicoRV32 clock to 10MHz to "make it work" (or at least appear to) without any timing constraints, we don't know for a fact whether that's sufficiently slow.

Builds without timing constraints are not acceptable in the long run. Moreover, inability to specify timing constraints is simply a showstopper for commercial / professional projects and settings.

TarikHamedovic commented 2 months ago

I went through the GateMate documentation and found this line:

Furthermore, the netlist is passed to the Place & Route tool for architecture-specific implementation and bitstream generation. A netlist converter generates a generic netlist from the Yosys or legacy netlist. The first steps of Place & Route comprise procedures for speed or area optimization before mapping. After placement and routing, the static timing analysis (STA) might lead to further optimization steps and makes the Place & Route software an iterative process of constraint-driven re-placement and re-routing steps to finally achieve user requirements.

It says that an STA runs after P&R, but pages 80-86 of the GateMate FPGA Datasheet list no option for specifying a clock constraint, as other FPGA vendors provide. The workflow diagram below does not mention clock constraints either.

image

chili-chips-ba commented 2 months ago

... this calls for some questions:
1) What criteria are used for *constraints-driven placement* and *constraints-driven routing* in the situation when even the elementary clock period cannot be specified?!
2) What's the scope of *STA implementation step* in this context, w/o timing constraints whatsoever?

chili-chips-ba commented 2 months ago

@pu-cc it's interesting that your own PicoRV32 constraints for GateMate are also alluding to 10MHz clock. Granted, even your CCF has it only as a comment, as opposed to the actual clock constraint. image

Is it that you simply "feel comfortable" with 10MHz, based on your extensive empirical trial-and-error?! Note that PicoRV32 in both Xilinx and Gowin ports of TetriSaraj runs reliably at 100MHz+.

@DadoCCAG, in order for us to compare eduBOS5 GateMate timing performance to that of Xilinx and Gowin, we absolutely need to have a reliable way for specifying timing constraints, i.e. validating timing closure.

pu-cc commented 2 months ago

Is it that you simply "feel comfortable" with 10MHz [...]

No, not at all. Let me briefly address the most important points:

Placement takes place using the quadratic placement algorithm. After all signals have been routed, p_r always runs an STA. This can also be seen in the log file:

[...]
Static Timing Analysis

Skew violation report using only 80% delay of data path
[...]

STA takes the current placement as a basis and calculates the maximum achievable frequency for all clocks, as I have shown in my first answer. Each clock reports a maximum clock frequency and its critical path.

Moreover, STA checks for clock skew and applies measures to reduce it.

Once the STA has finished, it should be ensured that the timing for the clock specified in the report is achieved.

In my experiments, PicoRV32 and VexRiscv reached about 30-50 MHz (worst corner).

chili-chips-ba commented 2 months ago

@pu-cc given that the necessary timing information is available in the P_R database, what would it take to bring the flow from its current reactive* timing closure methodology up to something that, at least on the surface, resembles the mainstream proactive approach?!

Here is an idea:

1) allow declaration of the basic clock constraint in the CCF
2) provide a post-processing script that would extract all Fmax reports from the P_R log and compare them to the declared input clock frequencies, flagging violations when below, and displaying the extent of headroom when met
3) in the next phase, build on top of it to add support for generated clocks
4) eventually add the ability to parse the database and support the report_timing command
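The comparison in step 2 can be sketched as follows. This is only an illustration under stated assumptions: the function names and the clock name `clk` are hypothetical, and the Fmax values are passed in as a dictionary, because the exact wording of the Fmax lines in the P_R log is not shown in this thread and would have to be parsed separately.

```python
def slack_ns(target_mhz: float, fmax_mhz: float) -> float:
    """Requested clock period minus the shortest achievable period, in ns."""
    return 1000.0 / target_mhz - 1000.0 / fmax_mhz


def check_clocks(targets_mhz: dict, reported_fmax_mhz: dict) -> list:
    """Compare declared clock targets against Fmax values from the STA report.

    targets_mhz:        {clock name: requested frequency in MHz}
    reported_fmax_mhz:  {clock name: Fmax in MHz, taken from the P_R log}
    """
    report = []
    for clock, target in targets_mhz.items():
        # A KeyError here means STA reported no Fmax for a declared clock.
        fmax = reported_fmax_mhz[clock]
        report.append({
            "clock": clock,
            "target_mhz": target,
            "fmax_mhz": fmax,
            "slack_ns": slack_ns(target, fmax),
            "met": fmax >= target,
        })
    return report


if __name__ == "__main__":
    # 30 MHz is the low end of the worst-corner range quoted earlier in the thread.
    for target in (100.0, 10.0):
        for row in check_clocks({"clk": target}, {"clk": 30.0}):
            status = "MET" if row["met"] else "VIOLATED"
            print(f"{row['clock']}: target {row['target_mhz']} MHz, "
                  f"Fmax {row['fmax_mhz']} MHz, "
                  f"slack {row['slack_ns']:+.2f} ns -> {status}")
```

With an Fmax of 30 MHz, a 100 MHz target misses by roughly 23.3 ns, while a 10 MHz target is met with roughly 66.7 ns of headroom, which is consistent with the behaviour seen on hardware.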


(*) The current P_R is apparently not timing-driven. We understand that it uses a quadratic placement algorithm.