Possible comparator bug in simple screen module, Tang Nano 9K

vvuk commented 1 year ago

Using OSS CAD Suite, with:

Yosys 0.24+1 (git sha1 7ad7b550c, x86_64-apple-darwin20.2-clang 10.0.0-4ubuntu1 -fPIC -Os)
Apycula              0.5.2.dev3+g16a91ef
GHDL 3.0.0-dev (2.0.0.r515.g780165a75) [Dunoon edition]

the attached simple SPI display module ends up sending one additional byte during initialization (the first byte of the initialization sequence is sent one more time at the end) if the condition on line 142 is:

                    if init_byte_index < oled_init'length then

if the < is changed to /= there is no issue. This does not reproduce in simulation (simple testbench also attached). It also does not reproduce with Gown's tools (same source, with the less-than comparison <). Observed on device using a logic analyzer.

I'm not sure if this bug report belongs here, yosys, nextpnr, or elsewhere; but given the behaviour difference compared to Gowin's tools I figured it's worth filing.

screen_tb.vhd.txt screen.vhd.txt tangnano9k.cst.txt

I've attached the cst file in use as well to reproduce; the pins in question are also listed below, and no display should be needed to observe:

IO_LOC "io_cs" 36; IO_LOC "io_dc" 39; IO_LOC "io_rst" 25; IO_LOC "io_sdin" 26; IO_LOC "io_sclk" 27;

which are 2 down from the top left at:

yrabbit commented 1 year ago

The first thing I would do is disable ALU generation in yosys by adding the "-noalu" key to the synth_gowin command. Sort of:

yosys Screen -p 'ghdl screen.vhd; synth_gowin -noalu -json screen.json'

!!! I apologise for any possible inaccuracies in the above commands. I have no experience with GHDL and its port to DragonFlyBSD is not expected in the near future:)

pepijndevos commented 1 year ago

Have you tried a post synthesis simulation? That could help identify if it's a synthesis bug or an Apicula bug.

Other possible things to try:

try the program on an ecp5 or ice40 FPGA
sythesize the code with yosys and use gowin ide for PnR
try the -nodffe option in addition to the -noalu yrabbit suggested

It might also help if you could provide a dump of the code after running ghdl for those that don't have it installed

vvuk commented 1 year ago

All great suggestions, thank you! I'll try to get to them in the next few days and need to dig out my ice40 :)

vvuk commented 1 year ago

Have you tried a post synthesis simulation

Tried to do this quick with:

yosys -q -p 'plugin -i ghdl; ghdl --std=08 screen.vhd -e Screen; sim -vcd foo.vcd -clock clk -n 2000 Screen'

but the vcd I get out has a clock, and all other signals going undefined after 30ns (no idea where this number comes from either; my testbench isn't included because it can't be synthesized):

is there something I'm doing wrong?

vvuk commented 1 year ago

Looks like -nodffe -noalu to synth_gowin fixes the issue. Either one by itself does not.

Attached is a json netlist file post-ghdl (via yosys write_json) and verilog (via write_verilog). I've verified that doing

yosys -q -p 'read_json post_ghdl.json; synth_gowin -top Screen -json out/screen.json'

reproduces the bug, and adding -nodffe -noalu fixes it.

post_ghdl.zip

(Interestingly -- using the written verilog & read_verilog instead, the bug is reproduced; but if I add nodffe/noalu, something else happens: I see the correct number (22) of values written via SPI, but the values themselves are garbage. Without nodffe/noalu, the values are correct (+ extra byte bug).)

To be clear, "the bug": sclk/sdin are a SPI connection. There should be 22 8-bit values being written, starting with 0xae and ending with 0xaf. Instead, when the bug is present, 23 values are written; the correct 22, and then a final incorrect 0xae.

yrabbit commented 1 year ago

Great:( That's the <, right? Can we have same files with /= ?

vvuk commented 1 year ago

Yep, attached (and confirmed working with the json; didn't try the verilog).

post_ghdl_good.zip

yrabbit commented 1 year ago

Thank you.

Number of cells: 853

The error cured with -noalu and especially -nodffe is an insidious thing that occurs unfortunately on constructions like this with thousands of cells. So you can write these flags permanently because in the near future, unless someone has an out-of-the-box epiphany, the chances of fixing it are slim. But we'll do our best.

vvuk commented 1 year ago

Alas. Usually this is the kind of bug/problem I love diving into to help, but in this particular instance I'm missing so much knowledge, esp on the EE basics :/

pepijndevos commented 1 year ago

No EE required on this beyond HDL experience, and we could use a fresh perspective I think. What would really help to start with is if we can somehow find a small reproducer for this issue.

For example, you could try reducing the width, number of bytes, and number of states. If it still happens when sending 2 single bits rather than 20 bytes, that's a lot less stuff to look at.

Our best theory is that when a design gets crowded, and the routing for the enable line gets too long, timing issues occur. But it's not 100% certain that's what's actually happen since it only happens so far on really big designs. Maybe it's not timing, maybe there is just some type of routing that doesn't work right, or some dff configuration that isn't set up right. If it's timing you'd expect there be cases where it's unstable, sometimes works sometimes doesn't, which we have not seen so far.

A really unorthodox test would be if you find that for say 5 bits it always breaks and 4 bits it seems to work... put a hair drier on the FPGA to warm it up and see if you can break it again.

Once we have the smallest possible design that breaks, then the part comes where you look really close at all the bits in the bitstream and see if they make sense and tweak them a bit to see if you can change the problem. Like, you could force the enable to 1 by tweaking the bitstream and see if that changes the expected behavior.

vvuk commented 1 year ago

Yeah, I spent an hour or two yesterday trying to reduce the testcase, reducing bit count, removing a lot of the other signals etc., but I kept running into other strange issues. For example, at one point, my cs output somehow got tied to my sclk output on device. cs pin was mirroring sclk. The HDL was correct, and it simulated correctly, but behaved incorrectly on device. I still haven't dug out my ice40; I'd like to have some other hardware to compare against. I'll keep poking at it; going to try to remove all the SPI signals and just see if I can get it to reproduce as an async serial stream.

Though one thing surprises me -- I wouldn't have thought my sample was a "really big design", but I also don't have a great mental model of what HDL maps to what in the hardware. I guess ~1k cells doesn't seem out of line for this?

pepijndevos commented 1 year ago

Yea it's not "really big" but bigger than a simple binky and bigger than you can realistically dissect by hand. Though it's probably possible to optimize the design a bit. One thing I've found tends to help quite a bit is not to nest conditional statements, but assign their results to signals and only branch on single bits.

Hmmm, well if you find small reproducers for different issues this is of course also welcome. Or are they different manifestations of the same issue, which would go away with the -nodffe flag?

One thing that caused bugs in the past was that DFFs come in pairs that share some inputs and properties, and we had cases where DFFs of different types would end up in the same slice which would create wrong behavior. But hopefully that's been fixed...

Another fun thing we could implement is a debug flag for nextpnr that would only allow one DFF per slice. That could be used to rule out bugs from packing.

FYI we have a Marix/IRC channel that could be a good place to discuss debugging tactics and ask questions and such.

vvuk commented 1 year ago

Simplified test; removed the ROM, values are just generated. The test is parameterized (bit number etc), and behaviour is different with different numbers of bits. It fails with 8 bits and 4 bits, succeeds with 2 bits. Uploading here to preserve the test. (VHDL, apologies; at some point I can try to convert to verilog)

looptest.vhd.txt

yrabbit commented 1 year ago

Can you please check with the latest nextpnr and apicula?

rfuest commented 1 year ago

I did also encounter issues with comparisons and started to investigate this problem. It seems to be caused by the synthesis step and is reproducible in post synthesis simulation. I've created a simple design to show the problem in this Gist: https://gist.github.com/rfuest/b7bdeb130141af310edf771204365f93

Running the simulation shows that the equal comparisons are broken in the design with ALU primitives: grafik

rfuest commented 1 year ago

Changing arith_map.v to be more like the ice40 version seems to fix the issue: https://gist.github.com/rfuest/c9fc736d73177f19ec576f40d2a0a6d9

I believe the problem is that the BI parameter is supposed to calculate the ones' complement of B, but setting the ALU to subtract uses two's complement. But I'm not very familiar with the code and this is only speculation.

yrabbit commented 1 year ago

interesting ...

--- a/techlibs/gowin/arith_map.v
+++ b/techlibs/gowin/arith_map.v
@@ -47,16 +47,15 @@ module _80_gw1n_alu(A, B, CI, BI, X, Y, CO);
    (* force_downto *)
    wire [Y_WIDTH-1:0]  AA = A_buf;
    (* force_downto *)
-   wire [Y_WIDTH-1:0]  BB = B_buf;
+   wire [Y_WIDTH-1:0]  BB = BI ? ~B_buf : B_buf;
    (* force_downto *)
    wire [Y_WIDTH-1:0]  C = {CO, CI};

    genvar              i;
    generate for (i = 0; i < Y_WIDTH; i = i + 1) begin:slice
-      ALU #(.ALU_MODE(2)) // ADDSUB I3 ? add : sub
+      ALU #(.ALU_MODE(0))
       alu(.I0(AA[i]),
          .I1(BB[i]),
-         .I3(~BI),
          .CIN(C[i]),
          .COUT(CO[i]),
          .SUM(Y[i])

pepijndevos commented 1 year ago

Oof that's quite interesting. You can see our reverse-engineered lookup table of what each ALU mode does:

https://github.com/YosysHQ/apicula/blob/master/doc/alu.md

In particular: ADDSUB(2) A XOR B XOR !D

So I would expect this to do exactly the same: invert B, so we could debug why this happens: instantiate an ALU in ADDSUB mode and map out its truth table, compared to the patched solution.

It might save a little logic to do the ADDSUB thing, but I'd be fine with this change if nobody figures out how to fix ADDSUB. I might have time for it but who knows.

rfuest commented 1 year ago

I did take another look and the problem actually seems to be much simpler than expected and has nothing to do with the ALU primitive. The X output depends on the value of BB and in the current version of arith_map.v BB isn't inverted if BI is 1, which also makes the X output invalid.

pepijndevos commented 1 year ago

Ah phew, so the ALU is fine, but we basically need assign X = AA ^ BB ^ BI;? Will you make a Yosys PR for this?

rfuest commented 1 year ago

Will you make a Yosys PR for this?

Done.

Ah phew, so the ALU is fine,...

Sorry for the scare :wink:

..., but we basically need assign X = AA ^ BB ^ BI;?

Yes, basically. The only change I've made is to extend BI to Y_WIDTH, because AA and BB are vectors.

yrabbit commented 1 year ago

The fixes are merged into yosys

YosysHQ / apicula

Possible comparator bug in simple screen module, Tang Nano 9K #140

Number of cells: 853