Please clarify ALU doc - GitHub issues

stacksmith commented 1 month ago

The ALU writeup in the doc folder is very interesting but does not seem to match reality... In ALU mode, there seem to be only 3 inputs (I1, I2 and I3), unlike the documented A,B,C and D...

The lack of the 4th input and the severely limited ability to configure the LUT makes me wonder if the Gowin FPGAs really have a hardware carry chain, or is it faked in the LUT?

Which way does the carry chain run, topologically (and what is the best placement of registers to propagate the carry?)

Any information or pointers in the right direction are appreciated.

yrabbit commented 1 month ago

I can answer only for the wires. They run as drawn in the first picture: from left to right along each row of the chip, from the left edge to the right, and inside each cell, roughly speaking, from LUT0 to LUT5.

The CIN and COUT wires are not switchable: they are not part of the routing fabric, so you can't connect to them directly, only through an ALU.

stacksmith commented 1 month ago

Thanks... I'm trying to figure out how to get inside this.

To a suspicious outside observer it certainly looks as if one of the LUT inputs is used as CIN, and there is some partial output from the LUT (normally hidden) used as COUT... This could actually be useful, if there was a way to fully configure the damn LUT. I'd love to use the carry for incrementing, and the lut as a mux, for instance. Xilinx is pretty good that way.

But it does look like some kind of trickery, and I cannot yet see what is gained by this obfuscation. Perhaps the ability to pretend to have carry for marketing purposes?

yrabbit commented 1 month ago

Well, if you feel the craving for adventures, you can replace the lines https://github.com/YosysHQ/apicula/blob/4f87247fca9d4e50412b6e0fa9a0ffca66046813/apycula/gowin_pack.py#L2256-L2261

with a place_lut call, naturally taking care that you have an INIT parameter holding the LUT contents. This makes gowin_pack switch the CFU to ALU mode but use your LUT contents.

pepijndevos commented 1 month ago

Yeah you totally can program the LUT in any way you like when using it in ALU mode, just not using the vendor primitive. I demonstrate this in the last paragraph of the docs.

The ALU documentation is based on reverse engineering what is actually going on and not on any official documentation, so it is possible we've missed something, but what is described in the docs is very much how we observe it to work. It also closely matches how the Lattice ECP5 ALU works, which is known to have a very similar internal architecture.

I think it is actually possible to use the fourth LUT input in ALU mode, but you have to consider that the lower 4 bits are shared with the second LUT, which is conveniently circumvented by not using the inputs that would use those bits.

I would be open to supporting unofficial ALU modes that have practical applications.

stacksmith commented 1 month ago

yrabbit: Thanks, that is worth considering. Is mode 2 literally ALU mode 2 or does it mean something different?

pepijndevos: Thank you, that's what I want to hear! How would you go about configuring the LUT and activating ALU? Is there a way without patching gowin_pack, or is yrabbit's patch the only way to go?

With all due respect, you do not demonstrate configuring the ALU in the last paragraph of the doc, only state that you have done so. I would love to see a demonstration!

And thank you for your great work!

yrabbit commented 1 month ago

> yrabbit: Thanks, that is worth considering. Is mode 2 literally ALU mode 2 or does it mean something different?

As far as I remember, yes. You can see what LUT contents we use for which mode:

https://github.com/YosysHQ/apicula/blob/4f87247fca9d4e50412b6e0fa9a0ffca66046813/apycula/chipdb.py#L347-L380

stacksmith commented 1 month ago

Oh snap! So the modes are not hardwired, and you actually stuff the LUTs! I was assuming that the FPGA was somehow dereferencing a hidden ROM... That is actually good news!

I am particularly interested in using all 4 inputs of the LUT as follows: Two inputs into the adder, and a third input which can be muxed (with the fourth input), so I can get either adder result or the other input. All while using the carry to increment (and preferably suppressing the carry when muxing the non-adder input). I think it's doable, although I haven't constructed the bitmap yet.

This would make a perfect Program Counter, for instance, capable of running sequentially, adding an offset, or loading an address (returning from a subroutine, for instance).
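As a behavioural reference for the program counter described above, here is a minimal Python sketch. It only models the intended function (sequential step via carry-in, offset add, or load). The function and parameter names are hypothetical, and it says nothing about whether a single Gowin ALU cell can actually implement this.

```python
def next_pc(pc: int, offset: int, load_addr: int, load: bool,
            carry_in: int = 0, width: int = 16) -> int:
    """Next value of a hypothetical `width`-bit program counter.

    load=True  -> take load_addr, ignoring the adder and the carry
    load=False -> pc + offset + carry_in (offset=0, carry_in=1 steps by one)
    """
    mask = (1 << width) - 1
    if load:
        return load_addr & mask
    return (pc + offset + carry_in) & mask


# Sequential step, relative jump, and absolute load (e.g. subroutine return).
print(hex(next_pc(0x10, 0, 0, load=False, carry_in=1)))
print(hex(next_pc(0x10, 0x20, 0, load=False)))
print(hex(next_pc(0x10, 0, 0x1234, load=True, carry_in=1)))
```

Note that the load case deliberately ignores carry_in, which is the "suppress the carry when muxing" wish from the post.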

Is there no way to use the carry chain with regular LUT configurations? Is the ALU mode a simplification for 'regular people' who feel figuring out a LUT is too hard?

yrabbit commented 1 month ago

> Is there no way to use the carry chain with regular LUT configurations? Is the ALU mode a simplification for 'regular people' who feel figuring out a LUT is too hard?

Well, I'm one of those people. But there are gurus who, if you wake them up in the middle of the night, will draw any LUT with their eyes closed.

Note that it was not by chance that I pointed to a specific place where you can substitute your INIT in gowin_pack. The fuses that switch two adjacent LUTs to ALU mode are set a little higher up in that code. Without them you won't get carry at all, but once they are set you have to take into account all the logic that is connected as in the picture in the documentation (the one with many AND, XOR, etc.). So these are two different configurations: LUT vs ALU.

stacksmith commented 1 month ago

yrabbit: Thanks. I just need a couple of clarifications (forgive my ignorance):

are you suggesting that I add_alu_mode.. my own mode with the LUT the way I like?

what about the code immediately below, which seems to limit access to only 3 inputs? Can I just add the missing 'I2':f"C{alu_idx}", right in there?

        bel.portmap = {
            'COUT': f"COUT{alu_idx}",
            'CIN': f"CIN{alu_idx}",
            'SUM': f"F{alu_idx}",
            'I0': f"A{alu_idx}",
            'I1': f"B{alu_idx}",
            'I3': f"D{alu_idx}",
        }

I'm not familiar with the sequence of operation of the toolchain. After a modification here in chipdb.py, does anything need to be recompiled in the toolchain itself, or do I just run the usual makefile on my verilog?

Or should I change gowin_pack.py, replacing the lines indicated with place_lut(..)... Would that allow me to use LUT input names or the 3 ALU inputs from chipdb.py? And how would I get to this from verilog then?

Again, apologies for what may be dumb questions. I really appreciate your help.

yrabbit commented 1 month ago

The point is that it's relatively easy to experiment with the contents of the LUT, and that's what the mechanism I suggested does. Adding an input is a completely different story: we would have to change the routing mechanism, which lives in nextpnr, and even before that we would have to change the script that translates 'our' chip database into one that nextpnr understands.

I don't know your level of familiarity with the nextpnr sources, but, for example, individual ALUs are connected in clusters. If you have no problems with this piece of code, then of course you can add an input:

https://github.com/YosysHQ/nextpnr/blob/master/himbaechel/uarch/gowin/pack.cc#L1033-L1217

stacksmith commented 1 month ago

Just to be clear, you are saying that I can change the packer to modify LUT contents, but then I still have only 3 inputs to work with?

If I start that way and replace the lines indicated with a place_lut call, do I just create an ALU instance, but add an .INIT?

yrabbit commented 1 month ago

> Just to be clear, you are saying that I can change the packer to modify LUT contents, but then I still have only 3 inputs to work with?

yes

> If I start that way and replace the lines indicated with a place_lut call, do I just create an ALU instance, but add an .INIT?

yes, yosys may screw up, but you can always put this parameter directly in JSON after nextpnr.

To a first approximation, if you're serious about ALU, the steps are roughly as follows:

change the file <path to yosys>/gowin/cells_sim.v so that the ALU has another input;
compile yosys, check for errors on a test example;
modify apycula/chipdb.py;
generate a new apicula chip database;
in nextpnr sources change file https://github.com/YosysHQ/nextpnr/blob/master/himbaechel/uarch/gowin/gowin_arch_gen.py so that your new input will be included in nextpnr chip database;
modify nextpnr's pack.cc;
... if you got here and nextpnr generates the JSON you need, then fixing gowin_pack if necessary is a small thing ;)

stacksmith commented 1 month ago

Thank you so much! I have enough to work with now.

I am kind of serious -- here is a rare opportunity of a small change which is compatible, yet makes a hidden part of the circuit available to those who are up to the challenge.

With a full LUT and carry, this FPGA is almost as good as Xilinx... Well, they have a separate carry function generator but with a bit of ingenuity, you can make very interesting counters and ALUs.

I will work on the easy way and the JSON, and get a handle on the codebase, and see how feasible this is -- and maybe bug you some more later!

stacksmith commented 1 month ago

@pepijndevos -- do you by any chance have a trick that allows you to use the LUT in ALU mode, with all 4 inputs? How did you do what you described in the last paragraph?

pepijndevos commented 1 month ago

You can only use the full lut if the bottom 4 bits happen to align with what you need in the second lut since they share those bits. So outside of super crafty hacks and lucky chances you can really only use 3 inputs.
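The shared-bits constraint can be checked mechanically. The sketch below builds a 16-bit INIT string from a 4-input function and tests whether its lowest nibble equals what the second LUT needs. It assumes the index order {D, C, B, A} with D as the most significant bit, and it assumes the second LUT must hold 1100 ("select B"), as in the regular add mode discussed in this thread. The helper names are made up for illustration.

```python
def build_init(func) -> str:
    """Return the 16-bit INIT string (bit 15 first) for func(a, b, c, d)."""
    bits = 0
    for index in range(16):
        a = index & 1
        b = (index >> 1) & 1
        c = (index >> 2) & 1
        d = (index >> 3) & 1
        bits |= (func(a, b, c, d) & 1) << index
    return format(bits, "016b")


def low_nibble_matches(init_bits: str, lut2_bits: str = "1100") -> bool:
    """True if bits 3:0 of INIT equal the contents the second LUT needs."""
    return init_bits[-4:] == lut2_bits


# "C selects A, otherwise B XOR D": the C=0, D=0 slice is just B, so it fits.
mux_init = build_init(lambda a, b, c, d: a if c else b ^ d)
print(mux_init, low_nibble_matches(mux_init))

# A 4-input XOR needs 0110 in bits 3:0, which clashes with "select B".
xor_init = build_init(lambda a, b, c, d: a ^ b ^ c ^ d)
print(xor_init, low_nibble_matches(xor_init))
```

A function passes the check only if, with C=0 and D=0, it reduces to exactly what the second LUT has to compute.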

What I did can be achieved by simply adding a new ALU mode with the contents as described.

It could be worthwhile trying to understand the packer. It takes the yosys alu primitive and iirc breaks it down into the flag and lut primitives that gowin_pack actually deals with. So you might bypass the alu primitive and directly create the constituent Luts with all four inputs.

Fwiw I have played around a bit with trying to coax the alu into new and useful tricks but wasn't able to come up with something substantially more useful than the Gowin modes.

stacksmith commented 1 month ago

@pepijndevos: What do you mean by 'the second lut'? Can you give me a few verilog lines that would instantiate what you are talking about? Like, what do I connect the 4th input to?

You mention above that you can program it any way, just not using the vendor's primitive. Which primitive can I use?

@yrabbit -- the simple way above did not work, as there seem to be other ALU cells being placed in my simple verilog test. I had to identify the specific ALU (by using a fake ALU_MODE "SPECIAL", which I replace with 0 and pass to place_lut). Seems to do something different from addition, now I have to figure out exactly what!

yrabbit commented 1 month ago

Well, it is somewhat pointless to describe all the features at once, until you have gotten your feet wet, figuratively speaking :) nextpnr analyzes how you use the carry and adds ALUs that set the initial CIN value for the entire chain and/or bring COUT out into the normal routing fabric. Besides, do not forget that ALUs always work in pairs: the even and the odd one switch to either LUT or ALU mode at the same time, since physically there is only one fuse responsible for this. nextpnr has to take this into account and pads your ALUs to an even number if you have managed to use an odd number of them.

stacksmith commented 1 month ago

At this point I've isolated the test to the _pnr.json file containing a normal ALU test case. I manually change ALU_MODE to "SPECIAL" and add an "INIT", and let the doctored gowin_pack set ALU_MODE back to 0 and call place_lut(..). I am able to insert the init 0011000011001100 and simulate an adder. I am trying to figure out how it works, and why those 4 zeros are there in bits 11:8.
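The manual JSON edit described here could be scripted roughly as follows. This is a sketch only: it relies on the general Yosys JSON layout (modules, cells, type, parameters), the cell name is made up, and the "SPECIAL" marker means something only to a gowin_pack patched as described in this thread. Stock gowin_pack does not know it.

```python
import json


def doctor_alu(netlist: dict, cell_name: str, init_bits: str) -> None:
    """Tag one ALU cell so that a patched gowin_pack uses custom LUT contents."""
    if len(init_bits) != 16 or set(init_bits) - {"0", "1"}:
        raise ValueError("INIT must be a 16-character string of 0s and 1s")
    for module in netlist["modules"].values():
        cell = module.get("cells", {}).get(cell_name)
        if cell is not None and cell.get("type") == "ALU":
            params = cell.setdefault("parameters", {})
            params["ALU_MODE"] = "SPECIAL"  # marker from the thread, not a stock mode
            params["INIT"] = init_bits
            return
    raise KeyError(f"no ALU cell named {cell_name!r}")


# Tiny stand-in for a *_pnr.json file; "alu0" is a hypothetical cell name.
netlist = {"modules": {"top": {"cells": {
    "alu0": {"type": "ALU", "parameters": {"ALU_MODE": "0"}},
}}}}
doctor_alu(netlist, "alu0", "0011000011001100")
print(json.dumps(netlist["modules"]["top"]["cells"]["alu0"], indent=2))
```

For a real file, load it with json.load, call doctor_alu on the result, and write it back with json.dump before running the patched gowin_pack.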

I'm not even dealing with nextpnr for now.

The terminology is very confusing: Gowin's own ALU module in prim_syn.v shows (SUM, COUT, I0, I1, I3, and CIN). How does that map onto the LUT inputs? Where does I2 go? And what about the A, B, C and, confusingly, D inputs in the doc? If I0 is A, my counter should not work (I feed the flop output into I0 and use CI to increment -- according to the doc, A doesn't even count). It seems there is some remapping in various places which I can't find.

Also, nextpnr-gowin does not work with normal location constraints such as "INS_LOC "lu" R14C4[0][A]", so I can't fix my alu to work with it. I've been using nextpnr-himbaechel, which seems to do things differently.

yrabbit commented 1 month ago

I2 = 1 and there is some renaming for mode 0:

https://github.com/YosysHQ/nextpnr/blob/b3b239289332395d4ea0a687b14faf841a499415/himbaechel/uarch/gowin/pack.cc#L1154-L1165

LUT inputs <=> alu inputs: https://github.com/YosysHQ/apicula/blob/4f87247fca9d4e50412b6e0fa9a0ffca66046813/apycula/chipdb.py#L381-L387

and do not use nextpnr-gowin, please.

stacksmith commented 1 month ago

Thank you! I was using mode 0, and the remapping was driving me crazy!

pepijndevos commented 1 month ago

Okay before you go overboard let's work out your program counter.

[image: alu_logic]

Starting from regular add

add(0)    0011000011001100  A:-  B:I0 C:1 D:I1 CIN:0

You can see this is adding B to D with C tied high and A unused. High C selects the first and third nibble, leaving the second and fourth nibble unused, with the fourth nibble being used for the lut2 in the picture, selecting B always.

So we can tell D selects the high or low byte and B selects the first or second pair, forming B XOR D. We want to maintain this, but can look into using A and C.

We can use C to select A. Step 1 is put 0011 in the second nibble, so now C is completely ignored. So now for C=0 we still have xor with the lut2 nibble serving double duty. Then we want to change the C=1 nibbles to just select A so 1010. For a total of

1010001110101100
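The two bit patterns can be checked by treating INIT as a truth table. The sketch below assumes the string is written with bit 15 on the left and that the LUT index is {D, C, B, A} with D as the most significant bit, which is the reading that makes the description above work out. The helper name is made up.

```python
def lut4(init_bits: str, a: int, b: int, c: int, d: int) -> int:
    """Look up one output bit of a 4-input LUT from its 16-bit INIT string."""
    index = (d << 3) | (c << 2) | (b << 1) | a
    return (int(init_bits, 2) >> index) & 1


ADD = "0011000011001100"  # regular add from the thread
MUX = "1010001110101100"  # variant where C selects A

for d in (0, 1):
    for b in (0, 1):
        for a in (0, 1):
            # Regular add with C tied high: B XOR D, A is ignored.
            assert lut4(ADD, a, b, 1, d) == b ^ d
            # Variant with C=0: still B XOR D.
            assert lut4(MUX, a, b, 0, d) == b ^ d
            # Variant with C=1: the output simply follows A.
            assert lut4(MUX, a, b, 1, d) == a

print("both INIT patterns behave as described")
```

This only covers the LUT output itself, before the carry logic that follows it.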

Your question of can you ignore the carry is simple: No, you can see in the figure it's hardwired and bypasses the Lut. Having a fast carry chain without passing through the lut is the entire point.

So that poses a problem for our sum output which will always have cin xored into it. What's more, our current mux implementation still drives cout from A+D which will then get xored into the next alu. Can we at least stop outputting cin? It's complicated...

pepijndevos commented 1 month ago

The carry section is a mindfuck because it's asymmetric but somehow works out. The way to think about it is that the lut4 output into the and gates is a mux. It either selects the carry in or the lut2, which just selects B. Since the lut4 is xor it selects the lut2 when both inputs are zero or one, in which case we want cout=A=D (the sum is 0 or 3) and in case the lut4 is 1, we select cin and xor that into the sum for a sum of 1 or 2.

So we control the lut2 but it's no good since if our lut4 outputs a 1, it is ignored completely.
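The carry behaviour described above can be modelled in a few lines. This is a sketch of the thread's description, not a verified hardware model: sum is the lut4 output XOR cin, and cout is cin when the lut4 output is 1 and the lut2 output otherwise. It assumes lut2 reads the lowest INIT nibble indexed by B and A, and that INIT is indexed by {D, C, B, A}. All function names are made up.

```python
def lut4(init_bits, a, b, c, d):
    """One output bit of a 4-input LUT; init_bits has bit 15 on the left."""
    index = (d << 3) | (c << 2) | (b << 1) | a
    return (int(init_bits, 2) >> index) & 1


def alu_slice(init_bits, a, b, c, d, cin):
    """One ALU bit as described in the thread: returns (sum, cout)."""
    f = lut4(init_bits, a, b, c, d)
    lut2 = lut4(init_bits, a, b, 0, 0)  # assumption: lut2 shares bits 3:0
    total = f ^ cin                     # cin is always XORed into the sum
    cout = cin if f else lut2           # lut4 output acts as the mux select
    return total, cout


def ripple(init_bits, x, y, width, cin=0):
    """Chain `width` slices with B=x bit, D=y bit, C tied high, A unused."""
    result, carry = 0, cin
    for i in range(width):
        bit, carry = alu_slice(init_bits, 0, (x >> i) & 1, 1, (y >> i) & 1, carry)
        result |= bit << i
    return result, carry


ADD = "0011000011001100"
MUX = "1010001110101100"

print(ripple(ADD, 5, 9, 4))         # plain 4-bit addition
print(ripple(ADD, 7, 0, 4, cin=1))  # increment via carry-in
# With the mux variant, C=1 selects A=1, but cin=1 still flips the sum bit.
print(alu_slice(MUX, 1, 0, 1, 0, 1))
```

The last call illustrates the problem raised above: whatever the LUT selects, cin is still XORed into the sum output.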

You could try to get really crafty but in the end it seems to always come down to this: if you want to use a hardwired carry chain for things other than adding don't use a hardwired carry chain.

For example you could try to fish out the Lut output before the xor, but at that point you'd need another mux to select the right one, so you gain nothing over the straightforward solution.

Or you could break your brain over a mode that uses A to select B or C so your lut2 sees the mux select. Say for example if A=1 lut4=0 so sum=cin. But now there isn't anywhere for C to go so that's no good. We could make the lut2 A AND B so then we can mux B to cout. But now we broke the full adder because we're adding C+D and now the carry is wrong. And okay we can mux B to the cout and then what, we've just performed a bit shift I guess??

So idk feel free to play around with it, it's fun to think through, but I haven't found anything more interesting than the one at the bottom of the docs.

pepijndevos commented 1 month ago

So I guess maybe some things you could do are

make an alu with a reset that outputs all zero
make like an add/shift where sum=cin, cout=A

stacksmith commented 1 month ago

@pepijndevos: appreciate your writeup, and will work through it. Immediately have a question: you say "step 1 is put 0011 into the second nybble", then show the number as 1010_0011_1010_1100... so by second you mean bits 11:8?

What is your tooling to test these things? Do you just modify the json?

I'm coming from Xilinx, where I used the carry chain for all kinds of things. For instance, I would make an SRL16 shift register from a LUT, load a single bit into it, and push it out of COUT to create a pulse generator. I would stack two of them with mutually prime loop size, and use the carry from both to detect when both hit a 1, to replace really long counters in a single slice. With 5 Spartan3 slices I could generate a VGA signal from 100MHz -- hsync, vsync, front/back porch, start-scanline pulse. It seemed like it was impossible to get carry to do the right thing, but somehow there always was a way!
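The trick of stacking two loops of mutually prime length can be illustrated with a small behavioural simulation. This sketch models each shift register as a rotating list holding a single 1 and records the cycles on which both outputs are high. It is plain Python, not a model of SRL16 or of the carry chain, and the lengths 15 and 16 are picked only as an example of coprime sizes.

```python
def coincidences(len_a, len_b, cycles):
    """Cycles on which two one-hot ring shift registers both output a 1."""
    ring_a = [0] * (len_a - 1) + [1]
    ring_b = [0] * (len_b - 1) + [1]
    hits = []
    for cycle in range(cycles):
        if ring_a[-1] and ring_b[-1]:  # AND of the two outputs
            hits.append(cycle)
        # Rotate each ring by one position per clock.
        ring_a = ring_a[-1:] + ring_a[:-1]
        ring_b = ring_b[-1:] + ring_b[:-1]
    return hits


print(coincidences(15, 16, 500))  # [0, 240, 480]
```

Because 15 and 16 share no common factor, the two outputs coincide once every 15 * 16 = 240 cycles, so two short loops stand in for a much longer counter.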

pepijndevos commented 1 month ago

Yeah I filled the zeros. In this case I was counting the human way, left to right starting at 1.

I used to make ipython notebooks to try things out or yea just whatever it takes. Modify the json, modify the packer, modify nextpnr.

I think it might indeed be possible to build a shift register, which I kinda hinted at at the end of my ramblings.

stacksmith commented 1 month ago

@pepijndevos, so the SUM output is post the XOR? Is there a way to get at the LUT output?

The shift register you mention -- it would not cycle the contents of the LUT INIT, like Xilinx does, at least I don't see how. Without a register, feeding back in would be more like a ring oscillator going as fast as it can. With a register, you are back at one bit per register. Am I missing something? Oh, I think I see, you can do a multi-bit 'combinatorial' shift via carry without registers, but it doesn't really buy much...I don't think you can use the registers for anything else, can you?

I've read all I could about the so-called 'ram-based shift registers' in Gowin docs, but can't really make sense of them. They create encrypted IP, and seem to imply that you can use LUTs (as in memory?), registers, or blockram, but I don't quite see how it's anything other than what you can do with a few lines of obvious verilog.

Maybe there is some way to shift an SSRAM cell and I would love to see how. Xilinx was very generous exposing the shift machinery (which is probably used to configure the LUTs).

YosysHQ / apicula

Please clarify ALU doc #281