Closed vvuk closed 1 year ago
The first thing I would do is disable ALU generation in yosys by adding the "-noalu" key to the synth_gowin command. Sort of:
yosys Screen -p 'ghdl screen.vhd; synth_gowin -noalu -json screen.json'
!!! I apologise for any possible inaccuracies in the above commands. I have no experience with GHDL and its port to DragonFlyBSD is not expected in the near future:)
Have you tried a post synthesis simulation? That could help identify if it's a synthesis bug or an Apicula bug.
Other possible things to try:
It might also help if you could provide a dump of the code after running ghdl for those that don't have it installed
All great suggestions, thank you! I'll try to get to them in the next few days and need to dig out my ice40 :)
Have you tried a post synthesis simulation
Tried to do this quick with:
yosys -q -p 'plugin -i ghdl; ghdl --std=08 screen.vhd -e Screen; sim -vcd foo.vcd -clock clk -n 2000 Screen'
but the vcd I get out has a clock, and all other signals going undefined after 30ns (no idea where this number comes from either; my testbench isn't included because it can't be synthesized):
is there something I'm doing wrong?
Looks like -nodffe -noalu
to synth_gowin
fixes the issue. Either one by itself does not.
Attached is a json netlist file post-ghdl (via yosys write_json
) and verilog (via write_verilog
). I've verified that doing
yosys -q -p 'read_json post_ghdl.json; synth_gowin -top Screen -json out/screen.json'
reproduces the bug, and adding -nodffe -noalu
fixes it.
(Interestingly -- using the written verilog & read_verilog
instead, the bug is reproduced; but if I add nodffe/noalu, something else happens: I see the correct number (22) of values written via SPI, but the values themselves are garbage. Without nodffe/noalu, the values are correct (+ extra byte bug).)
To be clear, "the bug":
sclk
/sdin
are a SPI connection. There should be 22 8-bit values being written, starting with 0xae
and ending with 0xaf
. Instead, when the bug is present, 23 values are written; the correct 22, and then a final incorrect 0xae
.
Great:( That's the <, right? Can we have same files with /= ?
Yep, attached (and confirmed working with the json; didn't try the verilog).
Thank you.
The error cured with -noalu and especially -nodffe is an insidious thing that occurs unfortunately on constructions like this with thousands of cells. So you can write these flags permanently because in the near future, unless someone has an out-of-the-box epiphany, the chances of fixing it are slim. But we'll do our best.
Alas. Usually this is the kind of bug/problem I love diving into to help, but in this particular instance I'm missing so much knowledge, esp on the EE basics :/
No EE required on this beyond HDL experience, and we could use a fresh perspective I think. What would really help to start with is if we can somehow find a small reproducer for this issue.
For example, you could try reducing the width, number of bytes, and number of states. If it still happens when sending 2 single bits rather than 20 bytes, that's a lot less stuff to look at.
Our best theory is that when a design gets crowded, and the routing for the enable line gets too long, timing issues occur. But it's not 100% certain that's what's actually happen since it only happens so far on really big designs. Maybe it's not timing, maybe there is just some type of routing that doesn't work right, or some dff configuration that isn't set up right. If it's timing you'd expect there be cases where it's unstable, sometimes works sometimes doesn't, which we have not seen so far.
A really unorthodox test would be if you find that for say 5 bits it always breaks and 4 bits it seems to work... put a hair drier on the FPGA to warm it up and see if you can break it again.
Once we have the smallest possible design that breaks, then the part comes where you look really close at all the bits in the bitstream and see if they make sense and tweak them a bit to see if you can change the problem. Like, you could force the enable to 1 by tweaking the bitstream and see if that changes the expected behavior.
Yeah, I spent an hour or two yesterday trying to reduce the testcase, reducing bit count, removing a lot of the other signals etc., but I kept running into other strange issues. For example, at one point, my cs
output somehow got tied to my sclk
output on device. cs pin was mirroring sclk. The HDL was correct, and it simulated correctly, but behaved incorrectly on device. I still haven't dug out my ice40; I'd like to have some other hardware to compare against. I'll keep poking at it; going to try to remove all the SPI signals and just see if I can get it to reproduce as an async serial stream.
Though one thing surprises me -- I wouldn't have thought my sample was a "really big design", but I also don't have a great mental model of what HDL maps to what in the hardware. I guess ~1k cells doesn't seem out of line for this?
Yea it's not "really big" but bigger than a simple binky and bigger than you can realistically dissect by hand. Though it's probably possible to optimize the design a bit. One thing I've found tends to help quite a bit is not to nest conditional statements, but assign their results to signals and only branch on single bits.
Hmmm, well if you find small reproducers for different issues this is of course also welcome. Or are they different manifestations of the same issue, which would go away with the -nodffe flag?
One thing that caused bugs in the past was that DFFs come in pairs that share some inputs and properties, and we had cases where DFFs of different types would end up in the same slice which would create wrong behavior. But hopefully that's been fixed...
Another fun thing we could implement is a debug flag for nextpnr that would only allow one DFF per slice. That could be used to rule out bugs from packing.
FYI we have a Marix/IRC channel that could be a good place to discuss debugging tactics and ask questions and such.
Simplified test; removed the ROM, values are just generated. The test is parameterized (bit number etc), and behaviour is different with different numbers of bits. It fails with 8 bits and 4 bits, succeeds with 2 bits. Uploading here to preserve the test. (VHDL, apologies; at some point I can try to convert to verilog)
Can you please check with the latest nextpnr and apicula?
I did also encounter issues with comparisons and started to investigate this problem. It seems to be caused by the synthesis step and is reproducible in post synthesis simulation. I've created a simple design to show the problem in this Gist: https://gist.github.com/rfuest/b7bdeb130141af310edf771204365f93
Running the simulation shows that the equal comparisons are broken in the design with ALU primitives:
Changing arith_map.v
to be more like the ice40
version seems to fix the issue: https://gist.github.com/rfuest/c9fc736d73177f19ec576f40d2a0a6d9
I believe the problem is that the BI
parameter is supposed to calculate the ones' complement of B
, but setting the ALU to subtract uses two's complement. But I'm not very familiar with the code and this is only speculation.
interesting ...
--- a/techlibs/gowin/arith_map.v
+++ b/techlibs/gowin/arith_map.v
@@ -47,16 +47,15 @@ module _80_gw1n_alu(A, B, CI, BI, X, Y, CO);
(* force_downto *)
wire [Y_WIDTH-1:0] AA = A_buf;
(* force_downto *)
- wire [Y_WIDTH-1:0] BB = B_buf;
+ wire [Y_WIDTH-1:0] BB = BI ? ~B_buf : B_buf;
(* force_downto *)
wire [Y_WIDTH-1:0] C = {CO, CI};
genvar i;
generate for (i = 0; i < Y_WIDTH; i = i + 1) begin:slice
- ALU #(.ALU_MODE(2)) // ADDSUB I3 ? add : sub
+ ALU #(.ALU_MODE(0))
alu(.I0(AA[i]),
.I1(BB[i]),
- .I3(~BI),
.CIN(C[i]),
.COUT(CO[i]),
.SUM(Y[i])
Oof that's quite interesting. You can see our reverse-engineered lookup table of what each ALU mode does:
https://github.com/YosysHQ/apicula/blob/master/doc/alu.md
In particular: ADDSUB(2) A XOR B XOR !D
So I would expect this to do exactly the same: invert B, so we could debug why this happens: instantiate an ALU in ADDSUB mode and map out its truth table, compared to the patched solution.
It might save a little logic to do the ADDSUB thing, but I'd be fine with this change if nobody figures out how to fix ADDSUB. I might have time for it but who knows.
I did take another look and the problem actually seems to be much simpler than expected and has nothing to do with the ALU
primitive. The X
output depends on the value of BB
and in the current version of arith_map.v
BB
isn't inverted if BI
is 1
, which also makes the X
output invalid.
Ah phew, so the ALU is fine, but we basically need assign X = AA ^ BB ^ BI;
? Will you make a Yosys PR for this?
Will you make a Yosys PR for this?
Done.
Ah phew, so the ALU is fine,...
Sorry for the scare :wink:
..., but we basically need
assign X = AA ^ BB ^ BI;
?
Yes, basically. The only change I've made is to extend BI
to Y_WIDTH
, because AA
and BB
are vectors.
The fixes are merged into yosys
Using OSS CAD Suite, with:
the attached simple SPI display module ends up sending one additional byte during initialization (the first byte of the initialization sequence is sent one more time at the end) if the condition on line 142 is:
if the
<
is changed to/=
there is no issue. This does not reproduce in simulation (simple testbench also attached). It also does not reproduce with Gown's tools (same source, with the less-than comparison<
). Observed on device using a logic analyzer.I'm not sure if this bug report belongs here,
yosys
,nextpnr
, or elsewhere; but given the behaviour difference compared to Gowin's tools I figured it's worth filing.screen_tb.vhd.txt screen.vhd.txt tangnano9k.cst.txt
I've attached the cst file in use as well to reproduce; the pins in question are also listed below, and no display should be needed to observe:
IO_LOC "io_cs" 36; IO_LOC "io_dc" 39; IO_LOC "io_rst" 25; IO_LOC "io_sdin" 26; IO_LOC "io_sclk" 27;
which are 2 down from the top left at: