Have you experienced odd instability issues with Gowin FPGAs / Sipeed Tang Nano 4K, 9K and 20K?

juj commented 1 year ago

Hi Apicula authors,

I'd like to cross-reference your experience regarding a board stability issue that I am seeing to affect Gowin's FPGAs when using Gowin's own tools.

Check out https://github.com/juj/gowin_flipflop_drainer/ and https://www.reddit.com/r/FPGA/comments/101pagf/sipeed_tang_nano_4k_9k_gowin_fpgas_become/ for details.

I am wondering if anything similar has been happening in your experience?

pepijndevos commented 1 year ago

Very interesting. We have been having a particularly persistent issue with bigger designs breaking down, but I doubt it's related.

An obvious test would be to run the flipflop drainer on Apicula and see what happens.

It's also possible to synthesize a design with yosys and pnr with vendor tools. In particular I'd be curious to see what happens when synthesizing with yosys with the -nodffe -noalu options but PnRing with the vendor tools rather than Apicula.

In our particular issue, these options improve reliability a lot, and we don't know why. We suspect some timing problem or very insidious PnR bug, so it'd be interesting to see if these options have any effect on vendor tools.

For the very adventurous, there are also these Nextpnr alpha and beta options to tweak that adjust the density of PnR, which might help to prove the interference theory. I have seen some weak evidence that messing with those sometimes makes the failure go away.

But note that in our case they are pretty hard failures, not occasional glitches. Still, who knows if there is a connection.

juj commented 1 year ago

Hi @pepijndevos , so great to read your insights.

> We have been having a particularly persistent issue with bigger designs breaking down, but I doubt it's related.

This description actually matches what I am seeing in our real project quite closely. We have a project that works seemingly well when the FPGA utilization is around 30%-40%, but when we include more sub-components that push utilization higher, things begin to fall apart, even when timing closure should be achieved.

When all of our features are active, we have about 90% utilization rate of the FPGA, and even though we should have timing closure, the code is still wholly unstable, and we have been struggling to find fault in our timing constraints, or other electrical design or other considerations. Removing most of the nonessential features, and just running with any one of the subfeatures active, we find that subfeature to be stable and pass testing - but just are unable to activate all of them at the same time.

Video sync stability is the main/most sensitive issue that we are seeing, but other aspects of FPGA computation are also failing, just not as often or as immediately as video sync deteriorates.

Typically, making random trivial one-liner changes more or less anywhere in the project Verilog files gives a random chance of the build starting to work.

The gowin_flipflop_drainer repository has been my best attempt so far to capture the issue in a reproducible repository. I used an HDMI output as a test since that is where we most easily see the issues happen.

There is this one interesting behavior I have repeatedly observed, which I find to be counterintuitive and cannot currently explain.

The hypothesis we got from Gowin was that the flip flops in the FPGA would be generating noise in the clock route of the chip, which in turn would manifest as clock jitter. The solution would then be to use a set_clock_uncertainty timing constraint to account for this kind of extra timing jitter.

However, in our tests, we found that neither a) playing with set_clock_uncertainty nor b) "over-optimizing" the design timing constraints (meaning e.g. that we would set timing constraints to close at e.g. 200 MHz even if we would run at only 100 MHz) helps at all.

In fact, when eyeballing, the achieved timing Fmax does not seem to correlate at all with the issues. If we have +100MHz of timing slack, the design might be faulty in stress tests, and also even the opposite, sometimes having a git commit with a negative slack of -20MHz or so, we might have passing stress tests. I.e. timing analysis does not seem to correlate with the issue.

But the one thing we repeatedly have found to correlate with improved stability, is to minimize the amount of FPGA resources used, but even more specifically, minimize the number of flip flop registers that are used, at the expense of achieving worse timing.

As an example story, in https://github.com/juj/gowin_flipflop_drainer/blob/main/src/hdmi.v#L1-L221 I implement this massive 19 clock cycle long pipeline that performs TMDS encoding. This encoding runs at the fastest video pixel clock speeds. It is unit tested with cocotb against a reference TMDS encoding function implementation, written in C++ against the DVI-D spec PDF page 29, to test that it provides the correct TMDS output on all possible inputs. It is excessively pipelined to provide as much timing slack as possible on the Sipeed Tang Nano 9K with Gowin speed grade C6/I5.
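For readers unfamiliar with the function being pipelined, here is a minimal Python sketch of the TMDS video-data encoding as the DVI 1.0 specification describes it: a transition-minimizing stage followed by a DC-balancing stage that tracks a running disparity. This is not the repository's C++ reference or its Verilog; the names `tmds_encode` and `tmds_decode` are made up for illustration, and control-period symbols are left out.

```python
def tmds_encode(d: int, cnt: int) -> tuple[int, int]:
    """Encode one 8-bit value; returns (10-bit symbol, new running disparity)."""
    bits = [(d >> i) & 1 for i in range(8)]
    n1 = sum(bits)
    use_xnor = n1 > 4 or (n1 == 4 and bits[0] == 0)

    # Stage 1: transition minimization, producing q_m[7:0] plus flag bit q_m[8].
    q_m = [bits[0]]
    for i in range(1, 8):
        x = q_m[i - 1] ^ bits[i]
        q_m.append(x ^ 1 if use_xnor else x)
    q_m8 = 0 if use_xnor else 1

    # Stage 2: DC balancing; decide whether to invert the low 8 bits.
    n1 = sum(q_m)
    n0 = 8 - n1
    if cnt == 0 or n1 == n0:
        invert = q_m8 == 0
        cnt += (n0 - n1) if q_m8 == 0 else (n1 - n0)
    elif (cnt > 0 and n1 > n0) or (cnt < 0 and n0 > n1):
        invert = True
        cnt += 2 * q_m8 + (n0 - n1)
    else:
        invert = False
        cnt += -2 * (1 - q_m8) + (n1 - n0)

    low = sum(((b ^ 1) if invert else b) << i for i, b in enumerate(q_m))
    # Bit 9 records the inversion, bit 8 records XOR (1) vs XNOR (0).
    return (int(invert) << 9) | (q_m8 << 8) | low, cnt


def tmds_decode(q: int) -> int:
    """Inverse of tmds_encode for data symbols (used here only as a self-check)."""
    if (q >> 9) & 1:
        q ^= 0xFF
    bits = [(q >> i) & 1 for i in range(8)]
    use_xor = (q >> 8) & 1
    d = bits[0]
    for i in range(1, 8):
        x = bits[i] ^ bits[i - 1]
        d |= (x if use_xor else x ^ 1) << i
    return d
```

A software model like this is what an exhaustive test would compare the hardware pipeline against, feeding every 8-bit input for each reachable disparity value.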

This implementation has not turned out to be particularly stable. It is included in gowin_flipflop_drainer, where the instability shows, even though it performs well in timing closure in Gowin's timing analyzer reports.

Then we got a hold of Gowin's faster speed grade C7/I6 chips, and repeated the same tests with that variant. Stability of the design was not improved. However, with the faster speed grade, I was now able to tear down a lot of that long pipelined TMDS encoding, and in the end I shortened it down to just 5 clock cycles from the original 19. Timing analysis would perform much worse at Fmax, although still just fast enough to close our timing constraints. Same cocotb unit tests were run to ensure that the new implementation computed the exact same function.

Surprisingly, in test runs, I found that this 5 clock cycle implementation was much more stable than the original 19 clock cycles one. With this implementation, we actually got our first "completely stripped down from nonessential features"-build to pass all stress tests.

This kind of observation is odd, because it looks like worsening timing behavior by introducing more combinational logic is helping stability, instead of the opposite. Which makes me believe that clock jitter would not be the root problem.

I'd love to try out Apicula, although I am currently on Windows and it seems Apicula would work best on a Linux system? I may have to set one up to give it a go.

Would you be able to say, glancing at the code in https://github.com/juj/gowin_flipflop_drainer/tree/main/src , does it look like the primitives that it currently uses would be expected to work with Apicula e.g. on Sipeed Tang Nano 4K or 9K board?

pepijndevos commented 1 year ago

It's not as heavily tested on Windows, but... should hopefully work. You might give Yowasp or OSS CAD Suite a try for easy installation.

So PLL support is fairly experimental on our end, and some of the IO primitives like DDR and SERDES are also kinda new or not supported. At a glance I did not see immediately what you are using.

But the Yosys+vendor combination should support all the primitives. Not a lot of people use that combination, but it should support all the PLL and IO primitives the vendor knows about. This is the path I'm most interested in seeing if you see any difference with the -nodffe -noalu flags. That would be a strong indicator the bugs are related, and maybe not even entirely our fault.

juj commented 1 year ago

Thanks. Gave OSS Cad Suite a try, though unfortunately I got stuck with some kind of error:

juj commented 1 year ago

> At a glance I did not see immediately what you are using.

Gowin_flipflop_drainer is using:

if targeting Tang Nano 4K, a PLLVR block,
if targeting Tang Nano 9K, an rPLL block (this is the Gowin-9 version of PLLVR, almost identical, with just tiny feature changes),
CLKDIV, to generate the 1/5 pixel clock domain from the 5x serial HDMI clock domain,
OSER10, to perform 1:5 DDR serialization from 1x video pixel clock to 5x DVI-D serialized pixel clock,
on Tang Nano 4K, a TLVDS_OBUF module, to implement LVDS encoding, although this can be replaced with an ELVDS_OBUF drop-in if desired,
on Tang Nano 9K, ELVDS_OBUF, since Sipeed chose to route the HDMI LVDS output to pins that do not support True LVDS. (Despite the name, True LVDS is actually worse than Emulated LVDS for HDMI output: the TLVDS voltage levels match the HDMI spec worse than the ELVDS levels do.)

All other code is generic Verilog.
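To make the clock relationship in that list concrete, here is a small sketch of the arithmetic, assuming the arrangement described above: the serial clock runs at 5x the pixel clock, and the 10-bit TMDS symbol is shifted out on both clock edges. The function name and dictionary keys are made up for illustration; frequencies are in kHz to keep the arithmetic exact.

```python
def dvi_clocks_khz(pixel_clock_khz: int) -> dict[str, int]:
    """Derive the serial clock and per-lane bit rate from the pixel clock."""
    return {
        "pixel_clock_khz": pixel_clock_khz,
        # Fast clock fed to the serializer; CLKDIV divides it by 5 for the pixel domain.
        "serial_clock_khz": 5 * pixel_clock_khz,
        # 10 TMDS bits per pixel per lane, i.e. two bits per serial clock cycle (DDR).
        "lane_bit_rate_kbps": 10 * pixel_clock_khz,
    }


# Pixel clocks mentioned elsewhere in this thread.
for khz in (65_880, 108_000, 118_800):
    print(dvi_clocks_khz(khz))
```

So a 118.8 MHz pixel clock implies a 594 MHz serial clock and 1188 Mbit/s on each TMDS lane.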

yrabbit commented 1 year ago

Unfortunately, OSER10 and the entire OSERx family still only operate on the dying GW1N-1 chip. Support for 4k and 9k is planned to appear not earlier than the middle of March.

CLKDIV and ELVDS - no information about the timing of support, although work in the latter direction was underway.

juj commented 1 year ago

Understood. Thanks for working on it, this has amazing potential!

yrabbit commented 1 year ago

A quick check with this line by compiling and further PnR of the vendor IDE showed that there is no way to put 233 registers anywhere:(

yosys -p "read_verilog -sv top.v pll.v hdmi.v flipflop_drainer.v display_signal.v board_config.v; synth_gowin -vout out.vg"

[image: shot-0]

yrabbit commented 1 year ago

I noticed that you place registers in IO cells. Have you tried turning that off? While researching OSERx I noticed that the high frequency for these primitives is delivered by special wires that run along the sides of the chip just across all the IO blocks, and maybe if you remove the registers from there...

Just kidding :)

[image: shot-1]

juj commented 1 year ago

> A quick check with this line by compiling and further PnR of the vendor IDE showed that there is no way to put 233 registers anywhere:(

The workload is tuned to maximize utilization achieved on Gowin 1.9.8.10 on Windows. Gowin attempts to be deterministic with its PnR within a specific version; I am not sure what conditions would cause it to pass or fail.

You can adjust the workload size smoothly by modifying one line of code at https://github.com/juj/gowin_flipflop_drainer/blob/main/src/flipflop_drainer.v#L7-L19

Try e.g. replacing that with out <= ^a800; or out <= ^a700;.

See the paragraph in README:

In flipflop_drainer.v one can vary the number of nonsense adders that is used in the design by adjusting
the register aXXX that is referenced in the addition. There is a smooth ramp of < K adders: stable signal,
K < x < M adders: unstable signal, > M adders: black image, no sync at all, i.e. the more adders, the more
likely the video sync will glitch, and at some point, video sync will vanish completely.
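To sweep the workload size without hand-editing, a small helper like the following could rewrite that one line before each build. This is only a sketch: it assumes the file contains exactly one statement of the form `out <= ^aNNN;` as quoted above, and the function name `set_adder_count` is made up for illustration.

```python
import re

# Matches the single workload-selecting statement, e.g. "out <= ^a800;".
_DRAIN_LINE = re.compile(r"out\s*<=\s*\^a\d+\s*;")


def set_adder_count(verilog_text: str, n: int) -> str:
    """Return the source text with the referenced register changed to a<n>."""
    new_text, hits = _DRAIN_LINE.subn(f"out <= ^a{n};", verilog_text)
    if hits != 1:
        raise ValueError(f"expected exactly one 'out <= ^aNNN;' line, found {hits}")
    return new_text


# Example: produce the two variants suggested above from one source string.
source = "always @(posedge clk) out <= ^a900;\n"
for n in (700, 800):
    print(set_adder_count(source, n), end="")
```

Failing loudly when the pattern is missing or ambiguous avoids silently building an unmodified design and misreading the result.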

> I noticed that you place registers in IO cells. Have you tried turning that off? While researching OSERx I noticed that the high frequency for these primitives is delivered by special wires that run along the sides of the chip just across all the IO blocks, and maybe if you remove the registers from there...

I actually haven't, that will be an interesting test to try. I'll give this a go to see how it behaves.

juj commented 1 year ago

Tried now adjusting the "Place register * to IOB" options, these did not have an effect.

Also gave another go at the Place Option, Route Option, Route Maxfan and Run Timing Driven settings. Those are something that I have tried before, now repeated the test with these, and they don't have an observable effect either.

However, one thing that I now find does have an effect: a few weeks ago I got a new Tang Nano 4K from Sipeed by mail.

It does perform slightly better than the old one.

Old one is a C6/I5 speed grade:

[image: old_nano4k]

New one is a C7/I6 speed grade:

[image: new_nano4k]

Old Nano 4K starts glitching at 1024x768@70Hz @ 65.88 MHz pixel clock, whereas I see the new Nano 4K to start glitching only at 1280x1024@60Hz @ 108.00 MHz pixel clock.

In both cases timing closure is good, e.g. with a ~+27MHz margin for the C6/I5 speed grade. And in both cases removing the flip_flop_drainer module from the build with line

https://github.com/juj/gowin_flipflop_drainer/blob/f1ec5dc9b3e5ebb605238975dd7eb59f81a5f366/src/flipflop_drainer.v#L18

fixes both boards up so they output a stable video at 1600x1200@57Hz @ 118.80 MHz pixel clock.

(btw there is a separate branch tang_nano_4k in that repository that I created to ease testing on the Nano 4K. The main branch is configured for Nano 9K out of the box)

yrabbit commented 1 year ago

I was able to build; the "-noalu -nowidelut -nodffe" options do not fix the situation. Picture with artifacts. GW1N-9k c6/i5

out <= ^a700;

out-9k.fs.gz out-9k-nodffe-noalu-nowidelut.fs.gz
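For anyone wanting to repeat this comparison, the two synthesis runs differ only in the synth_gowin flags. A minimal sketch that builds both yosys command lines, reusing the file list from the command posted earlier in this thread, could look like this. The output file names and the helper name `yosys_argv` are made up for illustration, and the manual netlist edits mentioned in this thread are not covered.

```python
SOURCES = [
    "top.v", "pll.v", "hdmi.v",
    "flipflop_drainer.v", "display_signal.v", "board_config.v",
]


def yosys_argv(vout: str, extra_flags: tuple[str, ...] = ()) -> list[str]:
    """Build the argv for one yosys synthesis run (does not execute anything)."""
    synth = " ".join(["synth_gowin", *extra_flags, "-vout", vout])
    script = f"read_verilog -sv {' '.join(SOURCES)}; {synth}"
    return ["yosys", "-p", script]


baseline = yosys_argv("out.vg")
reduced = yosys_argv("out-nodffe-noalu-nowidelut.vg",
                     ("-nodffe", "-noalu", "-nowidelut"))
print(baseline)
print(reduced)
```

Either list can be passed to `subprocess.run` on a machine with yosys installed; the resulting netlists would then go through the vendor PnR as described above.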

juj commented 1 year ago

Thanks for testing!

Do you also see the same effect that if you set out <= ^a0; then the video stability issues vanish?

yrabbit commented 1 year ago

Do the images I sent you also show artifacts? Maybe I have a bad cable to the TV :)

I'll try ^a0, it will take some time - there are a certain number of manual edits.

juj commented 1 year ago

Trying these .fs files out on Sipeed Tang Nano 9K, I see that out-9k.fs does not produce a video sync at all: my ASUS ProArt PA248QV display remains black. However, out-9k-nodffe-noalu-nowidelut.fs does produce a sync, and most of the image is good, although there are individual columns of glitchy, flickering pixel noise that repeat maybe every 16 or 32 pixels.

yrabbit commented 1 year ago

I see.

0) We do have some bugs; 1) my TV (not the monitor) is totally unsuitable for testing - it both recovers synchronization and the nature of artifacts is different.

juj commented 1 year ago

I should maybe clarify that the failure I described above (individual vertical stripes of glitching pixels that periodically repeat on different x coordinate columns) is quite similar to a failure mode I have observed before on this test case. So on my end, these builds show a similar-looking problem to the one I am seeing when using Gowin's toolchain.

The nodffe-noalu-nowidelut option does seem to do something good for the signal, since it does then at least produce output video, rather than keeping the video completely blank.

edmundhumenberger commented 1 year ago

Official Gowinsemi representative said:

"We have people reviewed this post a while ago. The conclusion is the Tang Nano boards did not properly bring the True LVDS Ios to the HDMI/DVI port. The user case is using an emulated LVDS which is not the best performance IO for such application. When the video resolution increase, they are just not up to the tasks."

juj commented 1 year ago

Thanks for the reply here. This is something that I am well aware of, and it is disappointing to hear that they ignored my follow-up email.

When they wrote their original report to me, they did state "the issue is with TLVDS vs ELVDS", so I diligently tested the effect of True LVDS vs Emulated LVDS, and then reported back to them that actually the behavior of True LVDS in the given test case was found to be even worse than with Emulated LVDS.

I found it surprising that they even brought this up as an "issue". The test case I had provided to them did use TLVDS by default on Tang Nano 4K:

https://github.com/juj/gowin_flipflop_drainer/blob/67c1ae054ca9e180f37fd98c710739d6f0d41e7c/src/hdmi.v#L256-L259

Their report had stated that they had only Tang Nano 4K to test, and the above code I had given to them did use TLVDS and not ELVDS in those tests.

The above code utilizes ELVDS only on Tang Nano 9K, which is because Sipeed has wired 9K in a way that using TLVDS for HDMI output is not possible. (it is not available on the HDMI pins).

From Reddit I have read that the reason Sipeed would have done this is that Gowin's TLVDS implementation was found to provide unsuitable voltage levels for HDMI output use cases, so that some displays might not be compatible, and ELVDS allowed changing the voltage levels to be more appropriate. We have observed the same voltage level difference in our own tests, but I do not know enough to say how much that would actually affect compatibility.

In any case, I did reply to Gowin's report as a follow-up that the test case does use TLVDS, and actually utilizing ELVDS and not TLVDS was observed in practice to provide better stability on Tang Nano 4K, complete opposite to what their report was claiming - but they never replied back again on any of this.

In summary:

the issue occurs on both TLVDS and ELVDS, i.e. it is independent of which LVDS method is used,
the test case they were given did use TLVDS on Tang Nano 4K like they recommended, and ELVDS only on Tang Nano 9K (since it does not allow TLVDS),
after being prompted, I tested switching Tang Nano 4K to use ELVDS instead of TLVDS, and it actually improved video sync stability instead of worsening it, in contrast to what Gowin stated

I have gotten the silent treatment from Gowin after this, unfortunately. One of their sales representatives did reply briefly afterwards, and suggested that I try using set_clock_uncertainty to help the issue. I have tested that time and time again in the past year, but I never see it having any effect on the issue, even if I set a massive ~1.5ns of clock uncertainty on the signal lines.

As an anecdotal data point:

In our own tests since, I have found some remedy in our own actual design by "blacklisting" certain PLL frequencies. We use a 27 MHz oscillator, and based on input video, generate a varying video output pixel clock frequency between 25 MHz - 118.8 MHz. I.e. the max rated PLL pixel clock for video by Gowin is 118.8 MHz. (I have tried overclocking up to 148.5 MHz).

I find that banning PLL output frequencies between 100.8 - 111.6 MHz helps video signal stability in our case. That is, we only allow >= 113.4 MHz and <= 99.9 MHz video pixel clocks. With that blacklist, we have seen drastically fewer issues in practice.

What is peculiar is that I can overclock the board to 145.8 MHz pixel clock and have it be stable, but then lower the video pixel clock to e.g. 102.6 MHz, and all the signal stability issues come back, depending on random luck.
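As a concrete reading of the blacklist described above, here is a minimal sketch of the check, assuming the rule is exactly "allow pixel clocks at or below 99.9 MHz and at or above 113.4 MHz". The names are made up for illustration, and frequencies are in kHz so the comparisons are exact.

```python
# Thresholds taken from the comment above, expressed in kHz.
ALLOWED_MAX_LOW_KHZ = 99_900     # highest allowed clock below the banned band
ALLOWED_MIN_HIGH_KHZ = 113_400   # lowest allowed clock above the banned band


def pixel_clock_allowed(khz: int) -> bool:
    """True if the pixel clock lies outside the empirically banned band."""
    return khz <= ALLOWED_MAX_LOW_KHZ or khz >= ALLOWED_MIN_HIGH_KHZ


for khz in (99_900, 102_600, 111_600, 113_400, 118_800, 145_800):
    print(khz / 1000, "MHz:", "allowed" if pixel_clock_allowed(khz) else "banned")
```

Under this rule the stable overclock at 145.8 MHz passes, while 102.6 MHz, where the instability returned, is rejected.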

However I don't know if this "there are suspect PLL frequencies" issue is the same as the https://github.com/juj/gowin_flipflop_drainer/ test case in particular, so I have tried to not conflate this issue with the general conversation/test case repro in juj/gowin_flipflop_drainer.

juj commented 1 year ago

Another communication problem we had was that Gowin said their DVI TX IP Core (IPUG938.pdf) has a timing limitation: it is only specced to work up to 80 MHz. When they were seeing the issue occur on Tang Nano 4K at 83.7 MHz, they stated that this is faster than what the DVI TX IP block would support.

I tried to explain to them that a) the test case does not use Gowin's DVI TX IP Core, so any timing limitations that IP block might have do not apply. The test case uses Gowin's OSER10 and LVDS blocks, and their documentation (GW1NR series of FPGA products data sheet DS117-2.9.7E, 01/12/2023) rates the OSER10 block up to 120 MHz and the LVDS block up to 100 MHz, both faster than the 83.7 MHz at which they acknowledged they were also seeing the issue; and b) we have also been able to reproduce the issue at slower pixel clocks of 39.96 MHz and 65.88 MHz, both with TLVDS.

Unfortunately, we did not receive a reply from them afterwards.
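The argument above can be laid out as a small table check. This sketch uses only the numbers quoted in this comment, and compares the rated limits against the pixel clock the same way the comment does; the names are made up for illustration.

```python
# Rated limits as quoted in the comment above (MHz).
RATED_LIMIT_MHZ = {"DVI TX IP core": 80.0, "LVDS": 100.0, "OSER10": 120.0}

# Pixel clocks at which the issue was reported in this comment (MHz).
FAILING_PIXEL_CLOCKS_MHZ = [39.96, 65.88, 83.7]


def exceeded_limits(pixel_clock_mhz: float) -> list[str]:
    """Names of the blocks whose quoted limit lies below the given pixel clock."""
    return [name for name, limit in RATED_LIMIT_MHZ.items()
            if pixel_clock_mhz > limit]


for f in FAILING_PIXEL_CLOCKS_MHZ:
    print(f, "MHz exceeds:", exceeded_limits(f) or "nothing")
```

Only the 83.7 MHz case exceeds any quoted limit, and that limit belongs to the DVI TX IP core, which the test case does not use.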

edmundhumenberger commented 1 year ago

juj do you see this behaviour also on official gowinsemi development boards? Gowin representative is pretty confident that the board design is faulty. The only way to get Gowinsemi involved again is to demo the fault on their board.

juj commented 1 year ago

I unfortunately do not have one of Gowin's devboards to test. :(

edmundhumenberger commented 1 year ago

Can you get one?

https://www.gowinsemi.com/en/support/devkits/37/

juj commented 1 year ago

I do not at the moment operate under a registered company that would be able to order from electronics wholesalers (I am however looking into organizing that to change in the future).

So Mouser is the only company that serves retail customers, but unfortunately they are giving "restricted availability" to the boards that I see have an HDMI port, and the prices that they do list for some devkits (without an HDMI port) would be too high (177 eur and 452 eur).

I could try asking Gowin if they would be able to send me one directly. About a year and a half ago a friend of mine did, to which they politely replied that their policy was not to send devkits to individuals, but maybe this situation would be different.

edmundhumenberger commented 1 year ago

Send me your postal address to office@symbioticeda.com with the board you want to have. I will figure out something.

Which one would you prefer:

http://www.gowinsemi.com.cn/clients_view.aspx?TypeId=21&Id=747 http://www.gowinsemi.com.cn/clients_view.aspx?TypeId=21&Id=709

juj commented 1 year ago

Thanks for the very kind offer. I sent you an email now for a follow-up.

YosysHQ / apicula

Have you experienced odd instability issues with Gowin FPGAs / Sipeed Tang Nano 4K, 9K and 20K? #169