Closed chili-chips-ba closed 2 months ago
I added CDC Checking to our OPL3 example here, and I'm in the process of adding it into our workflow.
Before you run it, you need to clone this repository . And export the path to where you cloned it using TOOL_ROOT
.
You can run the CDC check simply with:
make cdc
It gives the following output:
Warnings: 51 unique messages, 69 total
End of script. Logfile hash: 8d7feaae3b, CPU: user 7.15s system 0.15s, MEM: 194.10 MB peak
Yosys 0.40+22 (git sha1 fa0c5c1d4, clang++ 14.0.0-1ubuntu1.1 -fPIC -Os)
Time spent: 24% 47x opt_expr (1 sec), 17% 1x techmap (1 sec), ...
python3 /home/user/FPGA/tools/Bedrock/build-tools/cdc_snitch.py /home/user/openCologne/4.Advanced--4--Yamaha-OPL3-FM-Synth/3.build/cdc_check.json -o /home/user/openCologne/4.Advanced--4--Yamaha-OPL3-FM-Synth/3.build/cdc_check.txt
json from Yosys 0.40+22 (git sha1 fa0c5c1d4, clang++ 14.0.0-1ubuntu1.1 -fPIC -Os)
modules: opl3
driverless 746 host_if.dout_sync.in[0]
driverless 748 host_if.dout_sync.in[1]
driverless 750 host_if.dout_sync.in[2]
driverless 752 host_if.dout_sync.in[3]
driverless 754 host_if.dout_sync.in[4]
driverless 756 host_if.dout_sync.in[5]
driverless 758 host_if.dout_sync.in[6]
driverless 760 host_if.dout_sync.in[7]
OK1: 1579 CDC: 0 OKX: 56 BAD: 0
It feels good to know that CDC_SNITCH has also identified UNDRIVEN_INPUTS, which we initially missed with Verilatorš.
@gtaylormb do you agree that there are 56 intentional clock crossings in OPL3 design?
@TarikHamedovic is there a way to list all elements in the OKX:56
category for Greg to review?!
@ldoolitt - Is it fair to try to give OK1 and OKX another spin, by inferring from them that there are cca 1579+56=1635 flops in this design? How exactly are these two counts constructed?
It feels good to know that CDC_SNITCH has also identified UNDRIVEN_INPUTS, which we initially missed with Verilator
That has nothing to do with cdc_snitch (not capitalized), and everything to do with processing the input HDL with yosys.
is there a way to list all elements in the OKX:56 category for Greg to review?!
grep "^OKX:" /home/user/openCologne/4.Advanced--4--Yamaha-OPL3-FM-Synth/3.build/cdc_check.txt
Is it fair to try to give OK1 and OKX another spin, by inferring from them that there are cca 1579+56=1635 flops in this design?
I don't understand the question.
How exactly are these two counts constructed?
It's the count of (yosys-synthesized) DFF that match each of the described patterns.
I'm working with another team-member at LBNL to better document the -o file emitted by cdc_snitch.py. But it is plain text, and is supposed to be pretty self-explanatory. Take a look.
I'm working with another team-member at LBNL to better document the -o file emitted by cdc_snitch.py.
Check out latest bedrock master, commit 87b8a100. We added a section to cdc_snitch.md titled "Reading the output".
As is typical for open source projects, getting comprehensible documentation in the hands of users can be a challenge. We welcome comments, questions, and even patches!
Does cdc_snitch count only end-points, i.e. the recipients of the CDC'd signals?!
In other words, is that the reason for discrepancy between total number of flops in the OPL3 design (2842) and 1579+56=1635 flop metric that cdc_snitch reported? Could we from it deduce that 2842 - 1635 = 1207 flops are driven directly from primary inputs, and therefore excluded from CDC analysis?
CC_ADDF 620
CC_BRAM_20K 9
CC_BUFG 2
CC_DFF 2842
CC_IBUF 17
CC_LUT1 98
CC_LUT2 547
CC_LUT3 671
CC_LUT4 556
CC_MULT 1
CC_MX4 696
CC_OBUF 62
Does cdc_snitch count only end-points, i.e. the recipients of the CDC'd signals?!
Yes.
Could we from it deduce that 2842 - 1635 = 1207 flops are driven directly from primary inputs, and therefore excluded from CDC analysis?
No, that doesn't make sense.
I'd love to give a more complete answer based on my own experiments with your code base. After 20 min of fussing around, I'm still tripped up by your build system. After I git clone, how do I get to the point where "make cdc" will work? Yes, I know how to set CDC_DIR.
$ make cdc
make: *** No rule to make target '/home/user/openCologne/4.Advanced--4--Yamaha-OPL3-FM-Synth/3.build/../1.hw/modules/misc/src/afifo.v', needed by 'cdc_ys'. Stop.
We can take this to PM, if you want.
If I bypass the build system, and start running yosys by hand, I get stuck at
1. Executing Verilog-2005 frontend: ../1.hw/modules/channels/sim/control_operators_tb.sv
Parsing SystemVerilog input from `../1.hw/modules/channels/sim/control_operators_tb.sv' to AST representation.
../1.hw/modules/channels/sim/control_operators_tb.sv:45: ERROR: syntax error, unexpected TOK_ID
This is with Yosys 0.40 (git sha1 a1bb0255d, g++ 10.2.1-6 -fPIC -Os)
.
What's the minimum version if yosys that openCologne needs?
Hi @ldoolitt, you are right -- While project Makefile contains a lot of useful info about tools that need to be installed, it is still WIP. Consequently, some dependencies are missing. When it comes to issue you ran into, it assumes that you will first manually run the preparatory steps. SV2V
is one of them.
@TarikHamedovic will reach out to you in a PM for additional detail...
Hello,
The problem should be fixed now, if you clone our repository and run:
make cdc
We ran cdc on the generated Verilog from sv2v
.
I have a script that you can run ./create_rtl_filelist.sh -v
where placing the '-v' parameter you get all of the Verilog files.
The problem is that base yosys is very bad with SystemVerilog, the error you saw was that yosys can't make sense of SystemVerilog packages.
With the latest commit I add Verilog generation before the make cdc
command so it should be fixed now.
Trying to stay on-topic:
(1) The BAD:0 conclusion looks correct. The design considers and handles CDC concerns well. In particular, since it describes the whole chip, it includes registers capturing each input pin. In the cdc_snitch
documentation (section "I/O"), you'll see that case doesn't require the addition of a clock-domain-setting skin to get useful results.
(2) For study purposes, I can apply a patch openCologne_cdc.txt to your Verilog (as of commit ed7fd7d8), marking the known CDC locations. After that, the summary is OK1: 1579 CDC: 26 OKX: 30 BAD: 0
where the OKX are all resets:
$ grep "^OKX" cdc_check.txt
OKX 590 reset_sync.r0:R clk clk inputs ( 1 x ic_n )
OKX 591 reset_sync.r1:R clk clk inputs ( 1 x ic_n )
OKX 592 reset_sync.reset:R clk clk inputs ( 1 x ic_n )
OKX 598 host_if.afifo.rgray_cross[0]:R clk clk_host inputs ( 1 x ic_n )
OKX 600 host_if.afifo.rgray_cross[1]:R clk clk_host inputs ( 1 x ic_n )
OKX 602 host_if.afifo.rgray_cross[2]:R clk clk_host inputs ( 1 x ic_n )
OKX 604 host_if.afifo.rgray_cross[3]:R clk clk_host inputs ( 1 x ic_n )
OKX 606 host_if.afifo.rgray_cross[4]:R clk clk_host inputs ( 1 x ic_n )
OKX 608 host_if.afifo.rgray_cross[5]:R clk clk_host inputs ( 1 x ic_n )
OKX 610 host_if.afifo.rgray_cross[6]:R clk clk_host inputs ( 1 x ic_n )
OKX 611 host_if.afifo.wr_rgray[0]:R clk clk_host inputs ( 1 x ic_n )
OKX 612 host_if.afifo.wr_rgray[1]:R clk clk_host inputs ( 1 x ic_n )
OKX 613 host_if.afifo.wr_rgray[2]:R clk clk_host inputs ( 1 x ic_n )
OKX 614 host_if.afifo.wr_rgray[3]:R clk clk_host inputs ( 1 x ic_n )
OKX 615 host_if.afifo.wr_rgray[4]:R clk clk_host inputs ( 1 x ic_n )
OKX 616 host_if.afifo.wr_rgray[5]:R clk clk_host inputs ( 1 x ic_n )
OKX 617 host_if.afifo.wr_rgray[6]:R clk clk_host inputs ( 1 x ic_n )
OKX 676 host_if.afifo.wr_addr[0]:R clk clk_host inputs ( 1 x ic_n )
OKX 678 host_if.afifo.wr_addr[1]:R clk clk_host inputs ( 1 x ic_n )
OKX 680 host_if.afifo.wr_addr[2]:R clk clk_host inputs ( 1 x ic_n )
OKX 682 host_if.afifo.wr_addr[3]:R clk clk_host inputs ( 1 x ic_n )
OKX 684 host_if.afifo.wr_addr[4]:R clk clk_host inputs ( 1 x ic_n )
OKX 686 host_if.afifo.wr_addr[5]:R clk clk_host inputs ( 1 x ic_n )
OKX 618 host_if.afifo.wgray[0]:R clk clk_host inputs ( 1 x ic_n )
OKX 620 host_if.afifo.wgray[1]:R clk clk_host inputs ( 1 x ic_n )
OKX 622 host_if.afifo.wgray[2]:R clk clk_host inputs ( 1 x ic_n )
OKX 624 host_if.afifo.wgray[3]:R clk clk_host inputs ( 1 x ic_n )
OKX 626 host_if.afifo.wgray[4]:R clk clk_host inputs ( 1 x ic_n )
OKX 628 host_if.afifo.wgray[5]:R clk clk_host inputs ( 1 x ic_n )
OKX 630 host_if.afifo.wr_addr[6]:R clk clk_host inputs ( 1 x ic_n )
Note all the :R
in the signal names.
The asynchronous reset logic in your design is typical for ASIC, but is not something that shows up in a typical FPGA design, and is not currently fully handled by cdc_snitch
. I guess I'm fuzzy on how it really works on an ASIC, without inviting glitches at startup, if the clocks are running when the reset phase ends.
(3) The total register count is at least partially dependent on how yosys processes the design. If you were using a closed-source vendor tool for your final synthesis, that would be where this thread stops. But because you are also using yosys, you can make a couple of modifications to your yosys flow, incorporating elements of cdc_snitch_proc.ys
, add the write_json
step, and post-process that result with cdc_snitch.py
. Then the register counts should match your expectations exactly. I don't know if any tweaks will be needed to cdc_snitch.py
. If you don't get quick success, send me the .json file and I'll see what I can do.
I guess I'm fuzzy on how it really works on an ASIC, without inviting glitches at startup, if the clocks are running when the reset phase ends.
https://github.com/gtaylormb/opl3_fpga/blob/master/fpga/modules/clks/src/reset_sync.sv
Assertion is asynchronous driven by the global reset (or synchronous reset from another clock domain); deassertion is synchronous to the local clock and occurs on the output after input reset deassertion propagates 3 flops (so no glitches), which also act as synchronizers to filter out any metastability. This output reset is then used as a synchronous reset to drive all the resets on that particular clock domain.
deassertion is synchronous to the local clock
That would make sense if there were a single local clock. But as you see from the list above, the same ic_n
signal is used to reset DFF in both the clk
and clk_host
domains.
It seems that ic_n
is assumed to be in the clk_host
domain from the get-go, hence sync'd only to clk
domain. Proper I/O constraints for CDC ("skin" in cdc_snitch) lingo would have taken care of it.
Regarding register counts expectations, it would help to understand why we need to do these extra steps. What was the problem in the first place, and how are we attending to them by doing this JSON juggle?!
It seems that ic_n is assumed to be in the clk_host domain from the get-go, hence sync'd only to clk domain. Proper I/O constraints for CDC ("skin" in cdc_snitch) lingo would have taken care of it.
Sorry, that still makes no sense to me. There's no mechanism to sync ic_n
to both clk
and clk_host
. But it's being used to reset signals in both domains. That's a recipe for glitches.
Regarding register counts expectations, it would help to understand why we need to do these extra steps.
What extra steps? Your original question was (as I understand it) why two different yosys processing chains gave two different register counts. My answer is to use the same chain (your actual synthesis flow) for CDC-checking, and ignore the weird not-bound-to-physical-reality generic chain I wrote for cdc_snitch
.
One can sync a signal to as many clocks as he needs to, such as by using reset_sync block multiple times, once for each clock domain of interest. Also curious what @gtaylormb, the RTL author, has to say about this question.
We did not realize that the stock _cdcsnitch was based on "...the weird not-bound-to-physical-reality generic chain...".
More generally, why bother reporting those metrics if they are not bound to physical reality?
Sorry, that still makes no sense to me. There's no mechanism to sync ic_n to both clk and clk_host. But it's being used to reset signals in both domains. That's a recipe for glitches.
When you say ic_n
is being used to reset signals in the clk
domain, are you referring to these?
OKX 590 reset_sync.r0:R clk clk inputs ( 1 x ic_n )
OKX 591 reset_sync.r1:R clk clk inputs ( 1 x ic_n )
OKX 592 reset_sync.reset:R clk clk inputs ( 1 x ic_n )
These are the reset synchronizer. All internal logic otherwise driven off clk
is using the output of the reset synchronizer for reset.
@chili-chips:
cdc_snitch was based on "...the weird not-bound-to-physical-reality generic chain...".
OK, maybe "weird" is a poor choice of words. It is the default yosys techmap
step, with optimizations carefully configured to make sure dual-port memories stay whole. This is a fine and correct logical synthesis, and demonstrated useful to detect CDC anti-patterns. It's not tuned and validated for a specific technology, e.g., ASIC or 4-LUT. Please read cdc_snitch_proc.ys
itself, and the yosys techmap docs for more info.
@gtaylormb :
When you say ic_n is being used to reset signals in the clk domain, are you referring to these? OKX 590 reset_sync.r0:R clk clk inputs ( 1 x ic_n ) OKX 591 reset_sync.r1:R clk clk inputs ( 1 x ic_n ) OKX 592 reset_sync.reset:R clk clk inputs ( 1 x ic_n ) These are the reset synchronizer. All internal logic otherwise driven off clk is using the output of the reset synchronizer for reset.
Got it. Thanks for the explanation!
Repeating my (new) understanding back to you: ic_n
is constructed (outside chip.v) as synchronous to clk_host
. So the 27 OKX above found in afifo
are really OK1. Then the three remaining entries in reset_sync
are intentional OKX in a dedicated CDC circuit.
The defect in cdc_snitch is that it has no mechanism to mark reset inputs with its magic_cdc
attribute -- only data inputs. That behavior should be documented. Or changed, if you think of a clear way to use attributes to describe the situation.
@ldoolitt - Commercial CDC tools have a facility (typically constraints in Tcl format) for individual assignment of clock domain to every input port. In other words, if you are thinking of an enhancement (change for better), there is a well paved path to follow.
@pu-cc it is interesting that Yosys techmap strategy selected by CologneChip has resulted in 2842 DFFs, whereas Larry's ("...with optimizations carefully configured to make sure dual-port memories stay whole ... not tuned and validated for a specific technology, e.g., ASIC or 4-LUT...") resulted in the mere 1635(!)
The first knee-jerk reaction is that the CologneChip tuning of Yosys is costing OPL3 FPGA 1207 DFFs. But, we are sure there is more to it. What are we missing?
@pu-cc it is interesting that Yosys techmap strategy selected by CologneChip has resulted in 2842 DFFs, whereas Larry's ("...with optimizations carefully configured to make sure dual-port memories stay whole ... not tuned and validated for a specific technology, e.g., ASIC or 4-LUT...") resulted in the mere 1635(!)
You will find the answer in the synthesis log in https://github.com/chili-chips-ba/openCologne/issues/26#issuecomment-2254210868.
Memory mapping occurs in pass no. 33.28. Block RAMs in FPGAs typically only feature synchronous reads. For asynchronous reads, LUT-RAM is required. This is not a feature in GateMate, and so the LUT-RAM is āemulatedā using FFs.
[...]
33.28. Executing MEMORY_LIBMAP pass (mapping memories to cells).
using FF mapping for memory opl3.$flatten\channels.\control_operators.\operator.\envelope_generator.\ksl_add_rom.$auto$proc_rom.cc:155:do_switch$2642
mapping memory opl3.$flatten\channels.\control_operators.\operator.\phase_generator.\exp_lut_inst.$auto$proc_rom.cc:155:do_switch$2634 via $__CC_BRAM_TDP_
mapping memory opl3.$flatten\channels.\control_operators.\operator.\phase_generator.\log_sine_lut_inst.$auto$proc_rom.cc:155:do_switch$2646 via $__CC_BRAM_TDP_
using FF mapping for memory opl3.channels.ch_abcd_cnt_mem.bankgen[0].genblk1.mem_bank.ram
using FF mapping for memory opl3.channels.ch_abcd_cnt_mem.bankgen[1].genblk1.mem_bank.ram
[...]
@chili-chips-ba
a facility (typically constraints in Tcl format) for individual assignment of clock domain to every input port
It's conceptually easy for cdc_snitch
to take in such a mapping. Tcl format wouldn't be my first choice.
The existing documented strategy is to have an additional layer of Verilog that pins clock domains of inputs and outputs. In your use case, where only one of nine input ports needs special handling, supporting some kind of special-case map would certainly simplify the flow.
Feel free to supply a patch, or propose an input syntax.
Additional layer of user-supplied Verilog is impractical for the IP-style blocks, i.e. the ones we don't directly own. It is also burdensome.
A much better approach would be to add another Python script for: 1) reading standard XDC/SDC constraints to extract clock domain, async timing groups and Input-to-Clock affiliation from 2) and converting them to whatever format _cdcsnitch works with internally for specifying that info
That is exactly how commercial CDC/RDC tools operate....
@ldoolitt any follow on thoughts from your side on how you'd like to approach the question of preparing _cdcsnitch constraints?!
@ldoolitt please advise if you are OK with closing this issue.
https://github.com/YosysHQ/yosys/issues/4493#issuecomment-2245725950 https://github.com/BerkeleyLab/Bedrock/blob/master/build-tools/cdc_snitch.md