MJoergen / C64MEGA65

Commodore 64 core for the MEGA65 based on the MiSTer FPGA C64 core
GNU General Public License v3.0

HyperRAM timing issue #132

Closed: MJoergen closed this issue 4 months ago

MJoergen commented 5 months ago

Commit 7ea592b shows no critical warnings and no timing errors. However, the core image fails on the machine of GitHub user @Schwefelholz. The symptom is no image / vertical stripes on the HDMI output, which indicates a lock-up in the HyperRAM controller.

Furthermore, on my own R3 machine I see a different symptom: Blinking pixels on the HDMI output, which indicates random bit errors in the data path.

Additionally, some builds (where unrelated parts are changed) seem to work, and others don’t.

Note: I'm still using a PBLOCK for the HyperRAM. When I build without the PBLOCK, the design sometimes fails with the same symptoms as above.

So, all in all, I suspect the HyperRAM controller is under-constrained.

My hypothesis is that the problem is in the receive path (from HyperRAM to FPGA) because this is the most complicated (i.e. least understood) part of the HyperRAM controller.

The two main files involved are hyperram_io.vhd and common.xdc. These two files combined handle all the Clock Domain Crossings. Outside the file hyperram_io.vhd, everything is synchronous to the 100 MHz hr_clk_x1.

MJoergen commented 5 months ago

Here are some observations based on the Implemented Design of this (failing) commit 7ea592b.

First, the 16 nets hr_dq_in[*] (corresponding to assignment of ctrl_dq_ddr_in in line 220 in hyperram_io.vhd) are shown here:

[screenshot: routing of the 16 hr_dq_in[*] nets]

These 16 nets are all Clock Domain Crossings (from hr_rwds to hr_clk_x1). I find it troubling that the paths are long and uneven; this should be constrained better. Part of the problem, I believe, is the set_clock_groups constraint in common.xdc lines 30-33, where the clocks hr_rwds and hr_clk_x1 are declared asynchronous.

These long, uneven paths could probably explain the blinking pixels I see on my machine. However, this alone does not explain the lockup of the HyperRAM controller, because the lockup does not depend on the data path.

Second, the net hr_rwds_in_delay is shown here:

[screenshot: routing of hr_rwds_in_delay]

This signal (hr_rwds_in_delay) is used as the clock for the 8 IDDR blocks. However, it is also used as the clock for the signal hr_toggle, and as a data input (i.e. a Clock Domain Crossing) to the signal ctrl_rwds_in. One hypothesis for the HyperRAM lockup is that this signal is not sampled correctly.
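
For reference, here is a minimal VHDL sketch of the structure described above. All names, widths and the IDELAY tap value are assumptions based on this description, not copied from hyperram_io.vhd, and the IDELAYCTRL required by the IDELAYE2 is omitted.

```vhdl
library ieee;
  use ieee.std_logic_1164.all;

library unisim;
  use unisim.vcomponents.all;

entity rwds_read_path_sketch is
  port (
    hr_clk_x1      : in  std_logic;                     -- 100 MHz controller clock
    hr_rwds        : in  std_logic;                     -- RWDS input (from pad)
    hr_dq          : in  std_logic_vector(7 downto 0);  -- DQ inputs (from pads)
    ctrl_dq_ddr_in : out std_logic_vector(15 downto 0); -- hr_clk_x1 domain
    ctrl_rwds_in   : out std_logic                      -- hr_clk_x1 domain
  );
end entity rwds_read_path_sketch;

architecture sketch of rwds_read_path_sketch is
  signal hr_rwds_in_delay : std_logic;
  signal hr_dq_in         : std_logic_vector(15 downto 0); -- RWDS domain (IDDR outputs)
  signal hr_toggle        : std_logic := '0';
begin

  -- Delay RWDS into the middle of the DQ data-valid window.
  -- (The IDELAYCTRL and its reference clock are omitted here.)
  i_delay_rwds : IDELAYE2
    generic map (
      IDELAY_TYPE  => "FIXED",
      DELAY_SRC    => "IDATAIN",
      IDELAY_VALUE => 21                -- assumed tap value
    )
    port map (
      C => '0', CE => '0', INC => '0', LD => '0', LDPIPEEN => '0',
      REGRST => '0', CINVCTRL => '0', CNTVALUEIN => (others => '0'),
      DATAIN => '0', IDATAIN => hr_rwds,
      DATAOUT => hr_rwds_in_delay, CNTVALUEOUT => open
    );

  -- Eight IDDRs clocked by the delayed RWDS; their 16 outputs are hr_dq_in[*].
  gen_iddr : for i in 0 to 7 generate
    i_iddr : IDDR
      generic map (DDR_CLK_EDGE => "SAME_EDGE")
      port map (
        C  => hr_rwds_in_delay, CE => '1', D => hr_dq(i), R => '0', S => '0',
        Q1 => hr_dq_in(i), Q2 => hr_dq_in(i + 8)
      );
  end generate gen_iddr;

  -- The delayed RWDS also clocks hr_toggle ...
  p_toggle : process (hr_rwds_in_delay)
  begin
    if rising_edge(hr_rwds_in_delay) then
      hr_toggle <= not hr_toggle;
    end if;
  end process p_toggle;

  -- ... while hr_clk_x1 samples both the 16 data nets and the delayed RWDS
  -- itself. These are the clock domain crossings discussed above; mis-sampling
  -- of hr_rwds_in_delay is the suspected cause of the lockup.
  p_cdc : process (hr_clk_x1)
  begin
    if rising_edge(hr_clk_x1) then
      ctrl_dq_ddr_in <= hr_dq_in;
      ctrl_rwds_in   <= hr_rwds_in_delay;
    end if;
  end process p_cdc;

end architecture sketch;
```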

Specifically, the flip-flop ctrl_rwds_in_reg is at the end of one of the long arrows in the picture above and hence also has a long path. The detailed path delay report from Vivado is shown below (note the propagation delay of 2.474 ns through the IDELAYE2 and the final net delay of 0.954 ns from the IDELAYE2 to the FDRE):

[screenshots: Vivado path delay report]

The corresponding path in device view:

[device view screenshot]

sy2002 commented 5 months ago

@amb5l FYI - please read Michael's "diary" and feel free to share any thoughts that come to mind.

sy2002 commented 5 months ago

Thoughts from @amb5l via Skype:

Michael’s reasoning - suspecting the read path - is sound IMO. Perhaps the HyperRAM timing simulation model is limited and doesn’t produce a range of HR_CK to HR_RWDS delays…?

Another random thought: have you looked closely at when the drive direction on RWDS changes? Is any clkx1 synchronous logic completely prevented from seeing RWDS transitions during writes?

MJoergen commented 5 months ago

To proceed, I first remove the set_clock_groups constraint (common.xdc lines 30-33). This gives one setup violation (from hr_rwds to hr_clk_x1) and one hold violation (from hr_rwds to hr_ck).

Since the latter hold violation is probably harmless (because the clocks are mostly unrelated), I then add the line:

set_clock_groups -name cg_async -asynchronous -group [get_clocks hr_ck] -group [get_clocks hr_rwds]

This leaves me (in commit 9a6a9a1) with a single setup violation (from hr_rwds to hr_clk_x1) of -0.239 ns for the input to ctrl_rwds_in_reg:

[screenshots: setup timing report for ctrl_rwds_in_reg]

However, notice in the timing report above that the net delay from the IDELAYE2 to the FDRE is now 3.703 ns, while the propagation delay within the IDELAYE2 is reduced to 2.037 ns. Here I am mostly concerned with the increased net delay.

The following picture shows that Vivado deliberately chooses a long, roundabout route.

[screenshot: routed path in device view]

This is not desired. I would like the path to be as short as possible.

I initially thought the long path was chosen to satisfy the hold time, but surprisingly, there is a lot of slack on the hold-time requirement:

[screenshots: hold timing report]

The 16 nets hr_dq_in[*] (corresponding to assignment of ctrl_dq_ddr_in in line 220 in hyperram_io.vhd) are shown here:

[screenshot: routing of the hr_dq_in[*] nets]

So, they appear to be placed (in this build) somewhat closer to the IDDR buffers.

The timing report for one of these paths is:

[screenshots: timing report]

MJoergen commented 5 months ago

From the simulation model (in https://github.com/MJoergen/HyperRAM) I generate the following images showing the signals at the HyperRAM device. In the testbench I assume a 1 ns propagation delay from the FPGA to the HyperRAM and back.

The first image shows a write: [waveform screenshot]

The second image shows a read: [waveform screenshot]

In the last image, the time from the rising edge of hr_ck to the rising edge of hr_rwds (here 4 ns) is a parameter that can be adjusted in the simulation model.
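
For illustration, a minimal testbench fragment modeling these delays could look like the sketch below. The constant and signal names are assumptions; the actual simulation model in the HyperRAM repository is more elaborate (it only drives RWDS during the read data phase, for example).

```vhdl
library ieee;
  use ieee.std_logic_1164.all;

entity tb_delay_sketch is
end entity tb_delay_sketch;

architecture sim of tb_delay_sketch is
  constant C_PCB_DELAY  : time := 1 ns; -- FPGA <-> HyperRAM propagation delay (each way)
  constant C_CK_TO_RWDS : time := 4 ns; -- adjustable CK-to-RWDS delay of the device

  signal fpga_ck     : std_logic := '0'; -- hr_ck as driven by the FPGA
  signal device_ck   : std_logic;        -- hr_ck as seen at the HyperRAM
  signal device_rwds : std_logic;        -- RWDS as driven by the HyperRAM
  signal fpga_rwds   : std_logic;        -- RWDS as seen back at the FPGA
begin

  -- 100 MHz hr_ck generated by the FPGA side of the testbench.
  fpga_ck <= not fpga_ck after 5 ns;

  -- Board propagation delay, modeled in each direction with transport delays
  -- so that pulses shorter than the delay are not swallowed.
  device_ck <= transport fpga_ck     after C_PCB_DELAY;
  fpga_rwds <= transport device_rwds after C_PCB_DELAY;

  -- In this sketch RWDS simply follows hr_ck with an adjustable delay; the
  -- real model only drives RWDS during the read data phase.
  device_rwds <= transport device_ck after C_CK_TO_RWDS;

end architecture sim;
```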

amb5l commented 5 months ago

Looking at the latest commit (9a6a9a):

  1. hyperram_io.vhd contains VHDL blocks. I must confess that I have never seen these before! Does Vivado synthesize them correctly?
  2. It appears that the path from the FPGA HyperRAM RWDS pin to the HyperRAM controller system clock domain looks like this (in hyperram.vhd):

MJoergen commented 5 months ago

@amb5l :

  1. It seems Vivado interprets blocks the same way I interpret them :smile: In other words, it works for me. I quite like the possibility of scoping "local" signals.
  2. I tried with an XPM FIFO, but that did not work. I found the reason to be that the XPM FIFO requires a free-running clock, but the RWDS signal from the HyperRAM is not free-running. So I'll have to "roll my own" FIFO. However, I thought our timing analysis ensured that setup-and-hold requirements are satisfied on this path. Specifically, that the hr_rwds_in_delay into the flip-flop is stable around the rising edge of the hr_clk_x1.
  3. Agree.
amb5l commented 5 months ago

> 2. I tried with an XPM FIFO, but that did not work. I found the reason to be that the XPM FIFO requires a free-running clock, but the RWDS signal from the HyperRAM is not free-running. So I'll have to "roll my own" FIFO. However, I thought our timing analysis ensured that setup-and-hold requirements are satisfied on this path. Specifically, that the hr_rwds_in_delay into the flip-flop is stable around the rising edge of the hr_clk_x1.

There are 2 primary drivers of variability in the arrival time of RWDS edges during the read data phase:

I don't think timing analysis can model this; IIRC this is why we decided to pretend that hr_rwds is asynchronous, up to and including adding a FIFO for the data it is clocking.

I don't see any way for static timing analysis to look at the relationship between hr_rwds_in_delay and hr_clk_x1.

I believe that you will need to create a small, shallow FIFO for read data, and an improved CDC circuit to generate ctrl_rwds_in.
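
For reference, one possible shape of such a circuit is sketched below: a 4-deep FIFO written on the delayed RWDS clock and read on hr_clk_x1, with the write pointer crossed over in gray code, plus a plain two-stage synchronizer for ctrl_rwds_in. All names, the depth and the flow control are assumptions for illustration; this is not MJoergen's actual implementation.

```vhdl
library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;

entity rwds_read_fifo_sketch is
  port (
    rwds_clk     : in  std_logic;                      -- hr_rwds_in_delay
    rwds_data    : in  std_logic_vector(15 downto 0);  -- IDDR outputs (RWDS domain)
    clk_x1       : in  std_logic;                      -- hr_clk_x1
    rd_en        : in  std_logic;
    rd_data      : out std_logic_vector(15 downto 0);
    rd_valid     : out std_logic;
    ctrl_rwds_in : out std_logic
  );
end entity rwds_read_fifo_sketch;

architecture sketch of rwds_read_fifo_sketch is
  type ram_t is array (0 to 3) of std_logic_vector(15 downto 0);
  signal ram     : ram_t;
  signal wr_ptr  : unsigned(2 downto 0)         := (others => '0'); -- binary, RWDS domain
  signal wr_gray : std_logic_vector(2 downto 0) := (others => '0'); -- gray,   RWDS domain
  signal gray_m  : std_logic_vector(2 downto 0) := (others => '0'); -- sync stage 1
  signal gray_s  : std_logic_vector(2 downto 0) := (others => '0'); -- sync stage 2
  signal rd_ptr  : unsigned(2 downto 0)         := (others => '0'); -- binary, clk_x1 domain
  signal rwds_m  : std_logic := '0';

  function to_gray(b : unsigned) return std_logic_vector is
  begin
    return std_logic_vector(b xor ('0' & b(b'high downto 1)));
  end function to_gray;

  function from_gray(g : std_logic_vector) return unsigned is
    variable b : unsigned(g'range);
  begin
    b(g'high) := g(g'high);
    for i in g'high - 1 downto 0 loop
      b(i) := b(i + 1) xor g(i);
    end loop;
    return b;
  end function from_gray;
begin

  -- Write side: every word captured by the IDDRs is pushed on the RWDS clock.
  p_write : process (rwds_clk)
  begin
    if rising_edge(rwds_clk) then
      ram(to_integer(wr_ptr(1 downto 0))) <= rwds_data;
      wr_ptr  <= wr_ptr + 1;
      wr_gray <= to_gray(wr_ptr + 1);
    end if;
  end process p_write;

  -- Read side: the gray-coded write pointer is synchronized into clk_x1, so
  -- at most one bit changes per crossing and the fill level is always consistent.
  p_read : process (clk_x1)
  begin
    if rising_edge(clk_x1) then
      gray_m <= wr_gray;
      gray_s <= gray_m;

      rd_valid <= '0';
      if rd_en = '1' and from_gray(gray_s) /= rd_ptr then
        rd_data  <= ram(to_integer(rd_ptr(1 downto 0)));
        rd_ptr   <= rd_ptr + 1;
        rd_valid <= '1';
      end if;

      -- A simple improved CDC for ctrl_rwds_in: a two-stage synchronizer that
      -- samples the delayed RWDS as data.
      rwds_m       <= rwds_clk;
      ctrl_rwds_in <= rwds_m;
    end if;
  end process p_read;

end architecture sketch;
```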

sy2002 commented 4 months ago

@MJoergen Can we close this one?

MJoergen commented 4 months ago

Yes. Indeed, I have implemented exactly

> I believe that you will need to create a small, shallow FIFO for read data, and an improved CDC circuit to generate ctrl_rwds_in.

as suggested in the previous comment.

sy2002 commented 4 months ago

@amb5l As you can see above - you again nailed it! Thank you for that :-)