chili-chips-ba / openCologne

Spicing up the first and only EU FPGA chip with a flashy new board, loaded with a suite of engaging demos and examples. https://www.chili-chips.xyz/open-cologne
https://nlnet.nl/project/openCologne
BSD 3-Clause "New" or "Revised" License
44 stars 2 forks source link

Validation and introduction of CDC Linter into project QA workflow #26

Closed chili-chips-ba closed 2 months ago

chili-chips-ba commented 3 months ago

https://github.com/YosysHQ/yosys/issues/4493#issuecomment-2245725950 https://github.com/BerkeleyLab/Bedrock/blob/master/build-tools/cdc_snitch.md

TarikHamedovic commented 3 months ago

I added CDC Checking to our OPL3 example here, and I'm in the process of adding it into our workflow. Before you run it, you need to clone this repository . And export the path to where you cloned it using TOOL_ROOT. You can run the CDC check simply with:

make cdc

It gives the following output:

Warnings: 51 unique messages, 69 total
End of script. Logfile hash: 8d7feaae3b, CPU: user 7.15s system 0.15s, MEM: 194.10 MB peak
Yosys 0.40+22 (git sha1 fa0c5c1d4, clang++ 14.0.0-1ubuntu1.1 -fPIC -Os)
Time spent: 24% 47x opt_expr (1 sec), 17% 1x techmap (1 sec), ...
python3 /home/user/FPGA/tools/Bedrock/build-tools/cdc_snitch.py /home/user/openCologne/4.Advanced--4--Yamaha-OPL3-FM-Synth/3.build/cdc_check.json -o /home/user/openCologne/4.Advanced--4--Yamaha-OPL3-FM-Synth/3.build/cdc_check.txt
json from Yosys 0.40+22 (git sha1 fa0c5c1d4, clang++ 14.0.0-1ubuntu1.1 -fPIC -Os)
modules: opl3
driverless 746 host_if.dout_sync.in[0]
driverless 748 host_if.dout_sync.in[1]
driverless 750 host_if.dout_sync.in[2]
driverless 752 host_if.dout_sync.in[3]
driverless 754 host_if.dout_sync.in[4]
driverless 756 host_if.dout_sync.in[5]
driverless 758 host_if.dout_sync.in[6]
driverless 760 host_if.dout_sync.in[7]
OK1: 1579  CDC: 0  OKX: 56  BAD: 0
chili-chips-ba commented 3 months ago

It feels good to know that CDC_SNITCH has also identified UNDRIVEN_INPUTS, which we initially missed with VerilatoršŸ‘.

@gtaylormb do you agree that there are 56 intentional clock crossings in OPL3 design? @TarikHamedovic is there a way to list all elements in the OKX:56 category for Greg to review?!

image

@ldoolitt - Is it fair to try to give OK1 and OKX another spin, by inferring from them that there are cca 1579+56=1635 flops in this design? How exactly are these two counts constructed?

image

ldoolitt commented 3 months ago

It feels good to know that CDC_SNITCH has also identified UNDRIVEN_INPUTS, which we initially missed with Verilator

That has nothing to do with cdc_snitch (not capitalized), and everything to do with processing the input HDL with yosys.

is there a way to list all elements in the OKX:56 category for Greg to review?!

grep "^OKX:" /home/user/openCologne/4.Advanced--4--Yamaha-OPL3-FM-Synth/3.build/cdc_check.txt

Is it fair to try to give OK1 and OKX another spin, by inferring from them that there are cca 1579+56=1635 flops in this design?

I don't understand the question.

How exactly are these two counts constructed?

It's the count of (yosys-synthesized) DFF that match each of the described patterns.

I'm working with another team-member at LBNL to better document the -o file emitted by cdc_snitch.py. But it is plain text, and is supposed to be pretty self-explanatory. Take a look.

ldoolitt commented 3 months ago

I'm working with another team-member at LBNL to better document the -o file emitted by cdc_snitch.py.

Check out latest bedrock master, commit 87b8a100. We added a section to cdc_snitch.md titled "Reading the output".

As is typical for open source projects, getting comprehensible documentation in the hands of users can be a challenge. We welcome comments, questions, and even patches!

chili-chips-ba commented 3 months ago

Does cdc_snitch count only end-points, i.e. the recipients of the CDC'd signals?!

In other words, is that the reason for discrepancy between total number of flops in the OPL3 design (2842) and 1579+56=1635 flop metric that cdc_snitch reported? Could we from it deduce that 2842 - 1635 = 1207 flops are driven directly from primary inputs, and therefore excluded from CDC analysis?

synth.log

     CC_ADDF                       620
     CC_BRAM_20K                     9
     CC_BUFG                         2
     CC_DFF                       2842
     CC_IBUF                        17
     CC_LUT1                        98
     CC_LUT2                       547
     CC_LUT3                       671
     CC_LUT4                       556
     CC_MULT                         1
     CC_MX4                        696
     CC_OBUF                        62
ldoolitt commented 3 months ago

Does cdc_snitch count only end-points, i.e. the recipients of the CDC'd signals?!

Yes.

Could we from it deduce that 2842 - 1635 = 1207 flops are driven directly from primary inputs, and therefore excluded from CDC analysis?

No, that doesn't make sense.

I'd love to give a more complete answer based on my own experiments with your code base. After 20 min of fussing around, I'm still tripped up by your build system. After I git clone, how do I get to the point where "make cdc" will work? Yes, I know how to set CDC_DIR.

$ make cdc
make: *** No rule to make target '/home/user/openCologne/4.Advanced--4--Yamaha-OPL3-FM-Synth/3.build/../1.hw/modules/misc/src/afifo.v', needed by 'cdc_ys'.  Stop.

We can take this to PM, if you want.

ldoolitt commented 3 months ago

If I bypass the build system, and start running yosys by hand, I get stuck at

1. Executing Verilog-2005 frontend: ../1.hw/modules/channels/sim/control_operators_tb.sv
Parsing SystemVerilog input from `../1.hw/modules/channels/sim/control_operators_tb.sv' to AST representation.
../1.hw/modules/channels/sim/control_operators_tb.sv:45: ERROR: syntax error, unexpected TOK_ID

This is with Yosys 0.40 (git sha1 a1bb0255d, g++ 10.2.1-6 -fPIC -Os). What's the minimum version if yosys that openCologne needs?

chili-chips-ba commented 3 months ago

Hi @ldoolitt, you are right -- While project Makefile contains a lot of useful info about tools that need to be installed, it is still WIP. Consequently, some dependencies are missing. When it comes to issue you ran into, it assumes that you will first manually run the preparatory steps. SV2V is one of them.

@TarikHamedovic will reach out to you in a PM for additional detail...

TarikHamedovic commented 3 months ago

Hello,

The problem should be fixed now, if you clone our repository and run:

make cdc

We ran cdc on the generated Verilog from sv2v. I have a script that you can run ./create_rtl_filelist.sh -v where placing the '-v' parameter you get all of the Verilog files. The problem is that base yosys is very bad with SystemVerilog, the error you saw was that yosys can't make sense of SystemVerilog packages. With the latest commit I add Verilog generation before the make cdc command so it should be fixed now.

ldoolitt commented 3 months ago

Trying to stay on-topic:

(1) The BAD:0 conclusion looks correct. The design considers and handles CDC concerns well. In particular, since it describes the whole chip, it includes registers capturing each input pin. In the cdc_snitch documentation (section "I/O"), you'll see that case doesn't require the addition of a clock-domain-setting skin to get useful results.

(2) For study purposes, I can apply a patch openCologne_cdc.txt to your Verilog (as of commit ed7fd7d8), marking the known CDC locations. After that, the summary is OK1: 1579 CDC: 26 OKX: 30 BAD: 0 where the OKX are all resets:

$ grep "^OKX" cdc_check.txt
OKX  590 reset_sync.r0:R clk clk inputs ( 1 x ic_n )
OKX  591 reset_sync.r1:R clk clk inputs ( 1 x ic_n )
OKX  592 reset_sync.reset:R clk clk inputs ( 1 x ic_n )
OKX  598 host_if.afifo.rgray_cross[0]:R clk clk_host inputs ( 1 x ic_n )
OKX  600 host_if.afifo.rgray_cross[1]:R clk clk_host inputs ( 1 x ic_n )
OKX  602 host_if.afifo.rgray_cross[2]:R clk clk_host inputs ( 1 x ic_n )
OKX  604 host_if.afifo.rgray_cross[3]:R clk clk_host inputs ( 1 x ic_n )
OKX  606 host_if.afifo.rgray_cross[4]:R clk clk_host inputs ( 1 x ic_n )
OKX  608 host_if.afifo.rgray_cross[5]:R clk clk_host inputs ( 1 x ic_n )
OKX  610 host_if.afifo.rgray_cross[6]:R clk clk_host inputs ( 1 x ic_n )
OKX  611 host_if.afifo.wr_rgray[0]:R clk clk_host inputs ( 1 x ic_n )
OKX  612 host_if.afifo.wr_rgray[1]:R clk clk_host inputs ( 1 x ic_n )
OKX  613 host_if.afifo.wr_rgray[2]:R clk clk_host inputs ( 1 x ic_n )
OKX  614 host_if.afifo.wr_rgray[3]:R clk clk_host inputs ( 1 x ic_n )
OKX  615 host_if.afifo.wr_rgray[4]:R clk clk_host inputs ( 1 x ic_n )
OKX  616 host_if.afifo.wr_rgray[5]:R clk clk_host inputs ( 1 x ic_n )
OKX  617 host_if.afifo.wr_rgray[6]:R clk clk_host inputs ( 1 x ic_n )
OKX  676 host_if.afifo.wr_addr[0]:R clk clk_host inputs ( 1 x ic_n )
OKX  678 host_if.afifo.wr_addr[1]:R clk clk_host inputs ( 1 x ic_n )
OKX  680 host_if.afifo.wr_addr[2]:R clk clk_host inputs ( 1 x ic_n )
OKX  682 host_if.afifo.wr_addr[3]:R clk clk_host inputs ( 1 x ic_n )
OKX  684 host_if.afifo.wr_addr[4]:R clk clk_host inputs ( 1 x ic_n )
OKX  686 host_if.afifo.wr_addr[5]:R clk clk_host inputs ( 1 x ic_n )
OKX  618 host_if.afifo.wgray[0]:R clk clk_host inputs ( 1 x ic_n )
OKX  620 host_if.afifo.wgray[1]:R clk clk_host inputs ( 1 x ic_n )
OKX  622 host_if.afifo.wgray[2]:R clk clk_host inputs ( 1 x ic_n )
OKX  624 host_if.afifo.wgray[3]:R clk clk_host inputs ( 1 x ic_n )
OKX  626 host_if.afifo.wgray[4]:R clk clk_host inputs ( 1 x ic_n )
OKX  628 host_if.afifo.wgray[5]:R clk clk_host inputs ( 1 x ic_n )
OKX  630 host_if.afifo.wr_addr[6]:R clk clk_host inputs ( 1 x ic_n )

Note all the :R in the signal names. The asynchronous reset logic in your design is typical for ASIC, but is not something that shows up in a typical FPGA design, and is not currently fully handled by cdc_snitch. I guess I'm fuzzy on how it really works on an ASIC, without inviting glitches at startup, if the clocks are running when the reset phase ends.

(3) The total register count is at least partially dependent on how yosys processes the design. If you were using a closed-source vendor tool for your final synthesis, that would be where this thread stops. But because you are also using yosys, you can make a couple of modifications to your yosys flow, incorporating elements of cdc_snitch_proc.ys, add the write_json step, and post-process that result with cdc_snitch.py. Then the register counts should match your expectations exactly. I don't know if any tweaks will be needed to cdc_snitch.py. If you don't get quick success, send me the .json file and I'll see what I can do.

gtaylormb commented 3 months ago

I guess I'm fuzzy on how it really works on an ASIC, without inviting glitches at startup, if the clocks are running when the reset phase ends.

https://github.com/gtaylormb/opl3_fpga/blob/master/fpga/modules/clks/src/reset_sync.sv

Assertion is asynchronous driven by the global reset (or synchronous reset from another clock domain); deassertion is synchronous to the local clock and occurs on the output after input reset deassertion propagates 3 flops (so no glitches), which also act as synchronizers to filter out any metastability. This output reset is then used as a synchronous reset to drive all the resets on that particular clock domain.

ldoolitt commented 3 months ago

deassertion is synchronous to the local clock

That would make sense if there were a single local clock. But as you see from the list above, the same ic_n signal is used to reset DFF in both the clk and clk_host domains.

chili-chips-ba commented 3 months ago

It seems that ic_n is assumed to be in the clk_host domain from the get-go, hence sync'd only to clk domain. Proper I/O constraints for CDC ("skin" in cdc_snitch) lingo would have taken care of it.

Regarding register counts expectations, it would help to understand why we need to do these extra steps. What was the problem in the first place, and how are we attending to them by doing this JSON juggle?!

ldoolitt commented 3 months ago

It seems that ic_n is assumed to be in the clk_host domain from the get-go, hence sync'd only to clk domain. Proper I/O constraints for CDC ("skin" in cdc_snitch) lingo would have taken care of it.

Sorry, that still makes no sense to me. There's no mechanism to sync ic_n to both clk and clk_host. But it's being used to reset signals in both domains. That's a recipe for glitches.

Regarding register counts expectations, it would help to understand why we need to do these extra steps.

What extra steps? Your original question was (as I understand it) why two different yosys processing chains gave two different register counts. My answer is to use the same chain (your actual synthesis flow) for CDC-checking, and ignore the weird not-bound-to-physical-reality generic chain I wrote for cdc_snitch.

chili-chips commented 3 months ago

One can sync a signal to as many clocks as he needs to, such as by using reset_sync block multiple times, once for each clock domain of interest. Also curious what @gtaylormb, the RTL author, has to say about this question.

We did not realize that the stock _cdcsnitch was based on "...the weird not-bound-to-physical-reality generic chain...".

More generally, why bother reporting those metrics if they are not bound to physical reality?

gtaylormb commented 3 months ago

Sorry, that still makes no sense to me. There's no mechanism to sync ic_n to both clk and clk_host. But it's being used to reset signals in both domains. That's a recipe for glitches.

When you say ic_n is being used to reset signals in the clk domain, are you referring to these?

OKX  590 reset_sync.r0:R clk clk inputs ( 1 x ic_n )
OKX  591 reset_sync.r1:R clk clk inputs ( 1 x ic_n )
OKX  592 reset_sync.reset:R clk clk inputs ( 1 x ic_n )

These are the reset synchronizer. All internal logic otherwise driven off clk is using the output of the reset synchronizer for reset.

ldoolitt commented 3 months ago

@chili-chips:

cdc_snitch was based on "...the weird not-bound-to-physical-reality generic chain...".

OK, maybe "weird" is a poor choice of words. It is the default yosys techmap step, with optimizations carefully configured to make sure dual-port memories stay whole. This is a fine and correct logical synthesis, and demonstrated useful to detect CDC anti-patterns. It's not tuned and validated for a specific technology, e.g., ASIC or 4-LUT. Please read cdc_snitch_proc.ys itself, and the yosys techmap docs for more info.

ldoolitt commented 3 months ago

@gtaylormb :

When you say ic_n is being used to reset signals in the clk domain, are you referring to these? OKX 590 reset_sync.r0:R clk clk inputs ( 1 x ic_n ) OKX 591 reset_sync.r1:R clk clk inputs ( 1 x ic_n ) OKX 592 reset_sync.reset:R clk clk inputs ( 1 x ic_n ) These are the reset synchronizer. All internal logic otherwise driven off clk is using the output of the reset synchronizer for reset.

Got it. Thanks for the explanation! Repeating my (new) understanding back to you: ic_n is constructed (outside chip.v) as synchronous to clk_host. So the 27 OKX above found in afifo are really OK1. Then the three remaining entries in reset_sync are intentional OKX in a dedicated CDC circuit.

The defect in cdc_snitch is that it has no mechanism to mark reset inputs with its magic_cdc attribute -- only data inputs. That behavior should be documented. Or changed, if you think of a clear way to use attributes to describe the situation.

chili-chips commented 3 months ago

@ldoolitt - Commercial CDC tools have a facility (typically constraints in Tcl format) for individual assignment of clock domain to every input port. In other words, if you are thinking of an enhancement (change for better), there is a well paved path to follow.

@pu-cc it is interesting that Yosys techmap strategy selected by CologneChip has resulted in 2842 DFFs, whereas Larry's ("...with optimizations carefully configured to make sure dual-port memories stay whole ... not tuned and validated for a specific technology, e.g., ASIC or 4-LUT...") resulted in the mere 1635(!)

The first knee-jerk reaction is that the CologneChip tuning of Yosys is costing OPL3 FPGA 1207 DFFs. But, we are sure there is more to it. What are we missing?

pu-cc commented 3 months ago

@pu-cc it is interesting that Yosys techmap strategy selected by CologneChip has resulted in 2842 DFFs, whereas Larry's ("...with optimizations carefully configured to make sure dual-port memories stay whole ... not tuned and validated for a specific technology, e.g., ASIC or 4-LUT...") resulted in the mere 1635(!)

You will find the answer in the synthesis log in https://github.com/chili-chips-ba/openCologne/issues/26#issuecomment-2254210868.

Memory mapping occurs in pass no. 33.28. Block RAMs in FPGAs typically only feature synchronous reads. For asynchronous reads, LUT-RAM is required. This is not a feature in GateMate, and so the LUT-RAM is ā€œemulatedā€ using FFs.

[...]
33.28. Executing MEMORY_LIBMAP pass (mapping memories to cells).
using FF mapping for memory opl3.$flatten\channels.\control_operators.\operator.\envelope_generator.\ksl_add_rom.$auto$proc_rom.cc:155:do_switch$2642
mapping memory opl3.$flatten\channels.\control_operators.\operator.\phase_generator.\exp_lut_inst.$auto$proc_rom.cc:155:do_switch$2634 via $__CC_BRAM_TDP_
mapping memory opl3.$flatten\channels.\control_operators.\operator.\phase_generator.\log_sine_lut_inst.$auto$proc_rom.cc:155:do_switch$2646 via $__CC_BRAM_TDP_
using FF mapping for memory opl3.channels.ch_abcd_cnt_mem.bankgen[0].genblk1.mem_bank.ram
using FF mapping for memory opl3.channels.ch_abcd_cnt_mem.bankgen[1].genblk1.mem_bank.ram
[...]
ldoolitt commented 3 months ago

@chili-chips-ba

a facility (typically constraints in Tcl format) for individual assignment of clock domain to every input port

It's conceptually easy for cdc_snitch to take in such a mapping. Tcl format wouldn't be my first choice.

The existing documented strategy is to have an additional layer of Verilog that pins clock domains of inputs and outputs. In your use case, where only one of nine input ports needs special handling, supporting some kind of special-case map would certainly simplify the flow.

Feel free to supply a patch, or propose an input syntax.

chili-chips-ba commented 3 months ago

Additional layer of user-supplied Verilog is impractical for the IP-style blocks, i.e. the ones we don't directly own. It is also burdensome.

A much better approach would be to add another Python script for: 1) reading standard XDC/SDC constraints to extract clock domain, async timing groups and Input-to-Clock affiliation from 2) and converting them to whatever format _cdcsnitch works with internally for specifying that info

That is exactly how commercial CDC/RDC tools operate....

chili-chips-ba commented 2 months ago

@ldoolitt any follow on thoughts from your side on how you'd like to approach the question of preparing _cdcsnitch constraints?!

chili-chips-ba commented 2 months ago

@ldoolitt please advise if you are OK with closing this issue.