enjoy-digital / litex

Custom Function Unit (CFU) integration #883

Closed tcal-x closed 3 years ago

tcal-x commented 3 years ago

I have been experimenting with custom function units (CFUs) added to VexRiscv (using VexRiscv since it provides a CfuPlugin option, and it's also awesome, although eventually we hope more CPUs support the interface). The CFU connects ONLY to the CPU, not to the system bus, and it has no CSRs of its own. I've been using CPU+CFU in LiteX systems for quite a while now, but I've been hiding the CFU from LiteX --- I make a wrapper containing the CPU+CFU, and the wrapper exports just the normal CPU signals, so LiteX doesn't even know the CFU is there.

But now I think it makes sense to hook up the CFU in LiteX, so I'd like to ask your advice. I've mocked up a couple of ways of doing it. I have the code for connecting the CFU to the VexRiscv in cores/cpu/vexriscv/core.py --- it creates and hooks the signals, creates the CFU instance, and optionally adds the CFU source. But it might make more sense to have the CPU just create its interface when needed, and move the CFU instantiation and CFU<->CPU hookup out of the CPU core.py.

I've had a suggestion to create a Record for the CFU<->CPU connections.

The interface will be fairly stable. It is basically one cmd stream CPU-->CFU with a payload of an opcode and two data arguments, and one rsp stream CFU-->CPU with a payload of one data result and an ok status bit. There will likely also be a context_id field in the cmd payload, although that is still being finalized.
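
A minimal software sketch of the two payloads described above may help make this concrete. It is only a behavioural model, not the actual signal layout: the class and field names (`CfuCmd`, `CfuRsp`, `arg0`, `arg1`), the 32-bit data width, and the toy "opcode 0 adds" behaviour are all assumptions made for illustration.

```python
from dataclasses import dataclass

MASK32 = 0xFFFFFFFF  # assumption: 32-bit data path (RV32 VexRiscv)


@dataclass(frozen=True)
class CfuCmd:
    """CPU --> CFU command payload (hypothetical field names)."""
    opcode: int  # selects the custom operation
    arg0: int    # first data argument
    arg1: int    # second data argument


@dataclass(frozen=True)
class CfuRsp:
    """CFU --> CPU response payload (hypothetical field names)."""
    result: int  # single data result
    ok: bool     # status bit


def cfu_model(cmd: CfuCmd) -> CfuRsp:
    """Toy CFU: opcode 0 adds the arguments, any other opcode reports not-ok."""
    if cmd.opcode == 0:
        return CfuRsp(result=(cmd.arg0 + cmd.arg1) & MASK32, ok=True)
    return CfuRsp(result=0, ok=False)
```

Each command yields exactly one response, which mirrors the one-cmd, one-rsp pairing of the two streams; the valid/ready handshaking of the real streams is deliberately left out of this model.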

enjoy-digital commented 3 years ago

Thanks @tcal-x, sorry I've not yet been able to look at it, but just saw https://github.com/google/CFU-Playground and now have a better understanding of the need.

tcal-x commented 3 years ago

Thank you @enjoy-digital -- I can see how many issues you're handling. In fact, the liteSPI work is also something I'm happy to see getting attention.

rdolbeau commented 3 years ago

@tcal-x Out of curiosity, as you seem interested in performance, have you evaluated upcoming RISC-V extensions as a preliminary step before going to a full-custom functional unit?

Your example includes byte-reverse & bit-reverse, both of which are part of bitmanip 'B' as specific sub-cases of grevi (rev8 and rev), which can be added to VexRiscv. But the choice of those might be for the sake of simplicity.
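
To illustrate how both operations fall out of a single generalized-reverse primitive, here is a small reference model in Python. It follows the butterfly structure used in the draft bitmanip specification as I understand it; the shift amounts 24 (rev8) and 31 (rev) are for RV32 and should be treated as assumptions to check against the spec revision you target.

```python
MASK32 = 0xFFFFFFFF

# Each stage swaps adjacent groups of `shift` bits; `low` selects the
# lower group of every pair.
_STAGES = [
    (1, 0x55555555),
    (2, 0x33333333),
    (4, 0x0F0F0F0F),
    (8, 0x00FF00FF),
    (16, 0x0000FFFF),
]


def grev32(x, shamt):
    """Generalized reverse: apply the swap stages selected by the bits of shamt."""
    x &= MASK32
    for shift, low in _STAGES:
        if shamt & shift:
            high = ~low & MASK32
            x = ((x & low) << shift) | ((x & high) >> shift)
    return x & MASK32


def rev8(x):
    """Byte-reverse: swap bytes within halfwords, then swap halfwords."""
    return grev32(x, 24)


def rev(x):
    """Bit-reverse: all five stages enabled."""
    return grev32(x, 31)
```

For example, `rev8(0x12345678)` returns `0x78563412`, and `rev(1)` returns `0x80000000`.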

Another candidate might be packed-simd 'P', although that's a lot more complex to implement and I don't think VexRiscv has full support (I only have a small set of P instructions as proof-of-concept, missing the saturation csr and the 4i2o support but with 2/3i2o support for e.g. UMUL8).

tcal-x commented 3 years ago

@rdolbeau great question (and one I'd been anticipating) --

Designing an ISA extension is very difficult -- that's one reason the extensions you mention are still "upcoming" -- there are many stakeholders and many constraints. Can the compiler use this instruction? Is it difficult to implement? Do too many instructions overlap with functionality in another extension? Is the naming consistent?

The approach here is at the other extreme -- you could consider these 'disposable' extensions that will be used one time only. You are the only stakeholder, and you're designing these custom instructions just for this application, right now. You have no obligation of future support. You can go in a completely different direction for the next update (if you envision an embedded application) -- you'd never ship a new firmware update to run on the old gateware -- you'd always ship a new CPU/CFU gateware with the new firmware.

I expect users will definitely cherry-pick (i.e. blatantly steal) from one or multiple official extensions to curate their custom ISA. And likely over time they will develop a library of building blocks that they can adapt for a new CFU. It might even be that down the road, experience with CFUs in this context can inform future (official) ISA extensions.

rdolbeau commented 3 years ago

@tcal-x Thanks, seems we had similar goals :-)

My plugin generator for VexRiscv (https://github.com/rdolbeau/VexRiscvBPluginGenerator/) was made to be able to add an arbitrary subset of instructions from a small database of opcode patterns and easily-written semantics (in SpinalHDL) into the pipeline of VexRiscv. Although most of the instructions are from upcoming extensions (whole of B and K, bits of P), I've also tried Chacha20-specific instructions - and you can get a nice speed-up from a set of 3i2o operations, reusing P's 64-bits operation type (and opcodes at the moment; see https://github.com/rdolbeau/VexRiscvBPluginGenerator/blob/master/data_Chacha64.txt ).

I have a background in compilers, so I'm a bit more fond of standard instructions. And the prototype GCC does make good use of them - I've recompiled quite a few of the basic Linux packages for rv32imafdcbk_zbr_zbt in LiteX, and there's plenty of sh[123]add for address computations, more rotations and grevi (bit-reverse, byte-reverse, ...) than I expected, some logicals (mostly andn with some orn and next-to-none xnor), even some cmix (ternary bit-selector from Zbt) in crypto libraries, etc. Some instructions are not yet generated, so they don't appear of course (e.g. pack is plentiful but only used for zero-extension).
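
For readers less familiar with the instructions named above, a rough Python model of their RV32 semantics follows. The operand order is written from my reading of the draft bitmanip text (in particular, `cmix` takes its selector as the second operand), so treat it as an assumption and check it against the spec version in use.

```python
MASK32 = 0xFFFFFFFF


def sh_add(n, rs1, rs2):
    """sh1add/sh2add/sh3add: rs2 + (rs1 << n), handy for array indexing."""
    assert n in (1, 2, 3)
    return (rs2 + (rs1 << n)) & MASK32


def andn(rs1, rs2):
    """AND with the second operand inverted."""
    return rs1 & ~rs2 & MASK32


def orn(rs1, rs2):
    """OR with the second operand inverted."""
    return (rs1 | ~rs2) & MASK32


def xnor(rs1, rs2):
    """Inverted XOR."""
    return ~(rs1 ^ rs2) & MASK32


def cmix(rs1, rs2, rs3):
    """Bitwise select: take rs1 where rs2 has a 1 bit, rs3 where it has a 0 bit."""
    return ((rs1 & rs2) | (rs3 & ~rs2)) & MASK32
```

As an example of the address-computation use, `sh_add(2, 3, 0x1000)` gives `0x100C`, the address of element 3 in an array of 4-byte words based at `0x1000`.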

Depending on your algorithms, it might be worth trying using the B-enabled GCC to see if the upcoming extensions' instructions are used, and if so - maybe even try for performance in LiteX with my plugins :-)

enjoy-digital commented 3 years ago

@tcal-x, @rdolbeau: Interesting, your use cases seem like very good candidates for partial reconfiguration :)

Even without partial reconfiguration support in SymbiFlow, I'm sure interesting things could be done there: the CFUs would just be re-integrated into an SoC bitstream that has empty CFU zones, which would give faster compile times.

But I'm just sharing some unrealistic ideas/goals :)

tcal-x commented 3 years ago

@enjoy-digital , thank you for this interesting line of thinking!

In fact there has been some work in partial reconfiguration with SymbiFlow, by @andrewb1999 and his advisor @dehon -- here's an abstract from the GSoC project: https://summerofcode.withgoogle.com/archive/2020/projects/5766524633088000/

Separately, the RISC-V Soft CPU Task Group is looking at how to make CFUs exchangeable and composable, so that you could put together a system of CFUs from different sources, using metadata packaged with each, and get a composed system that just works. This link may be out of date but it gives a reasonable overview: https://cfu.readthedocs.io/en/latest/ . @Dolu1990 is involved in this effort as well.

But I hadn't actively considered using partial reconfiguration to speed up iteration in the CFU-Playground...that sounds like an interesting project for someone to take on! (any takers?! contact me!)

Currently our CFUs rely on being flattened and intermingled with the CPU for synthesis and placement to meet timing. Enforcing the CPU:CFU boundary in physical implementation would probably force us to do some minor rewriting such as inserting registers. But it's a trade-off worth considering.

mithro commented 3 years ago

While these are all exciting ideas, I don't think we should block landing some initial CFU support.

tcal-x commented 3 years ago

Thanks @mithro , agree 100%.

I should mention that for the initial landing, we want exactly one CFU module. It is given 10 bits of function_id (i.e. opcode) so it can implement many different custom instructions. But it is just one module.
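
As a rough illustration of how one module can serve many custom instructions, the sketch below derives a 10-bit function_id from an R-type instruction word and dispatches on it. The mapping used here (funct7 concatenated with funct3, on the custom-0 major opcode) is an assumption about the encoding, and the operation table and helper names are purely hypothetical.

```python
MASK32 = 0xFFFFFFFF
CUSTOM0 = 0x0B  # RISC-V custom-0 major opcode


def encode_custom0(funct7, funct3, rd, rs1, rs2):
    """Build an R-type instruction word on the custom-0 opcode."""
    return (((funct7 & 0x7F) << 25) | ((rs2 & 0x1F) << 20) |
            ((rs1 & 0x1F) << 15) | ((funct3 & 0x7) << 12) |
            ((rd & 0x1F) << 7) | CUSTOM0)


def function_id(instr):
    """Assumed mapping: function_id = {funct7, funct3}, 10 bits in total."""
    funct7 = (instr >> 25) & 0x7F
    funct3 = (instr >> 12) & 0x7
    return (funct7 << 3) | funct3


# Hypothetical operation table for a single CFU module.
OPS = {
    0: lambda a, b: (a + b) & MASK32,
    1: lambda a, b: (a ^ b) & MASK32,
}


def cfu_execute(fid, in0, in1):
    """Return (result, ok); unknown function_ids report ok=False."""
    op = OPS.get(fid)
    if op is None:
        return 0, False
    return op(in0, in1), True
```

With 10 bits there is room for up to 1024 distinct operations behind the single cmd/rsp interface, and the CFU itself decides which of them it implements.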

In the future, we could consider composing multiple CFUs in LiteX, essentially routing commands and responses via generated interconnect.

Finally, up until now I have only considered the CFU being provided as a Verilog module and instantiated as a special, but perhaps it would be easy to alternatively allow the user to specify the CFU in Migen in their Python script. If it's not easy, then we can consider it later.

enjoy-digital commented 3 years ago

@mithro, @tcal-x: Sure, I was digressing a bit :)

For the CFU integration in VexRiscv, I would just add a method similar to this to VexRiscv:

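    def add_cfu(self, cfu_filename):
        # CFU Bus (layout elided; X is driven by the CPU, Y by the CFU).
        cfu_bus = Record(...)

        # Add CFU: X is an input of the CFU, Y is its output.
        self.specials += Instance("cfu",
            i_X = cfu_bus.x,
            o_Y = cfu_bus.y,
        )
        self.platform.add_source(cfu_filename)

        # Connect CFU to CPU: directions are mirrored on the CPU side.
        self.cpu_params.update(
            o_X = cfu_bus.x,
            i_Y = cfu_bus.y,
        )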

We'll improve it as CFU matures and could also eventually extend it to other CPUs.

tcal-x commented 3 years ago

Thanks @enjoy-digital ! I'm sorry I've been very busy today and haven't had a chance to follow up on this. Soon!

enjoy-digital commented 3 years ago

Closing since CFU has been merged with https://github.com/enjoy-digital/litex/pull/908.

tcal-x commented 3 years ago

Thanks @enjoy-digital , apologies for forgetting to close this.