Closed mithro closed 4 years ago
@shenki
I now have MW working locally with LiteDRAM using a custom wishbone => LiteDRAM native bus adapter which also does the bus upsizing.
At the moment, LiteDRAM is built with a built-in riscv for the memory inits. I'll be looking at hooking up the CSRs to the wishbone and porting the LiteX BIOS code in the next few days as travel & time permits.
That way I can take out the riscv, uart and other gunk in the LiteDRAM core and save space on the Arty.
I was going to look at proper LiteX integration next (I'm learning LiteX as I go). I've written my own wishbone adapter in VHDL, mostly because the current LiteX ones have broken bus UpConverters, but that should be fixed eventually.
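To make the "bus upsizing" idea concrete, here is a minimal behavioural sketch in Python. It is not the VHDL adapter mentioned above; the 64-bit wishbone width, the 128-bit native port width, word addressing, and the function names are all assumptions chosen for illustration.

```python
def upsize_write(wb_adr, wb_dat, wb_sel, wb_width=64, native_width=128):
    """Map one narrow wishbone write onto a wider native port.

    wb_adr is a word address in units of wb_width bits (assumption).
    Returns (native_adr, native_data, native_byte_enables).
    """
    ratio = native_width // wb_width            # narrow words per wide word
    lane = wb_adr % ratio                       # slice of the wide word to use
    native_adr = wb_adr // ratio                # address on the wide port
    native_dat = wb_dat << (lane * wb_width)    # place data in its lane
    native_we = wb_sel << (lane * (wb_width // 8))  # enable only that lane's bytes
    return native_adr, native_dat, native_we


def upsize_read(wb_adr, native_dat, wb_width=64, native_width=128):
    """Pick the narrow word out of a wide read result."""
    ratio = native_width // wb_width
    lane = wb_adr % ratio
    return (native_dat >> (lane * wb_width)) & ((1 << wb_width) - 1)
```

With these assumed widths, wishbone word 3 lands in the upper half of native word 1, and only the upper eight byte enables are set.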
Great thanks, your approach is probably the best to get into things progressively. Happy to help for creating the CPU wrapper.
Can you give more information about the broken bus UpConverters? I'd like to have a look (or you can create another issue for this if you want).
Mostly one of them won't do "both" (it's in the code), only one direction, and the other one uses FlipFlop() which seems to be deprecated..
(sorry in an airline lounge about to board my flight to europe so a bit terse :-)
Microwatt has been integrated as a submodule, wrapped with a VHDL/Migen wrapper, and the gateware has been integrated in LiteX. Minimal software support has also been added. The software and gateware compile fine. We now need to simulate a SoC with the Microwatt CPU (we can't use the LiteX simulator since Microwatt is in VHDL, unless we have a Verilog model of it) and finish the software support. Any help is welcome :)
GHDL now seems to be able to synthesize Microwatt: https://twitter.com/antonblanchard/status/1219448773333487616
This indeed seems to be working with the attached script/procedure:
Install GHDL
$ git clone https://github.com/ghdl/ghdl
$ cd ghdl
$ ./configure --enable-libghdl --enable-synth
$ make
$ make install
Get Microwatt:
git clone https://github.com/antonblanchard/microwatt
cd microwatt
git checkout ghdl-synthesis
Synthesize Microwatt sources:
./microwatt_ghdl_synth.py > microwatt.vhd
microwatt_ghdl_synth.py:
#!/usr/bin/env python3
import os
files = [
# Common / Types / Helpers
"decode_types.vhdl",
"wishbone_types.vhdl",
"utils.vhdl",
"common.vhdl",
"helpers.vhdl",
# Fetch
"fetch1.vhdl",
"fetch2.vhdl",
# Instruction/Data Cache
"cache_ram.vhdl",
"plru.vhdl",
"dcache.vhdl",
"icache.vhdl",
# Decode
"insn_helpers.vhdl",
"decode1.vhdl",
"gpr_hazard.vhdl",
"cr_hazard.vhdl",
"control.vhdl",
"decode2.vhdl",
# Register/CR File
"register_file.vhdl",
"crhelpers.vhdl",
"cr_file.vhdl",
# Execute
"ppc_fx_insns.vhdl",
"logical.vhdl",
"rotator.vhdl",
"countzero.vhdl",
"execute1.vhdl",
# Load/Store
"loadstore1.vhdl",
# Multiply/Divide
"multiply.vhdl",
"divider.vhdl",
# Writeback
"writeback.vhdl",
# Core
"core_debug.vhdl",
"core.vhdl",
]
# Analyze each source file in dependency order, then synthesize the "core" entity.
for f in files:
    os.system("ghdl -a --std=08 ../{}".format(f))
os.system("ghdl --synth --std=08 core")
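One caveat with the script above is that os.system() ignores failures, so a broken analysis step goes unnoticed until synthesis. A hedged variant using subprocess could look like the sketch below. The ghdl flags are the ones from the script; the function names, the `src_dir` parameter and its default are hypothetical.

```python
import subprocess


def ghdl_commands(files, top="core", std="08", src_dir=".."):
    """Build the ghdl command lines: one analysis per file, then synthesis."""
    cmds = [["ghdl", "-a", f"--std={std}", f"{src_dir}/{f}"] for f in files]
    cmds.append(["ghdl", "--synth", f"--std={std}", top])
    return cmds


def run_all(cmds):
    """Run each command and stop at the first one that fails."""
    for cmd in cmds:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(cmd, check=True)
```

Calling `run_all(ghdl_commands(files))` keeps the same stdout behaviour, so redirecting the script's output to microwatt.vhd still works.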
With https://github.com/enjoy-digital/litex/commit/9bef218ad6616d4d8b958e34de1f6e87b7cbdd99, Microwatt is now running on hardware. It will still be useful to support the GHDL-synth flow to ease simulations and use the FOSS toolchains.
Install ghdl-yosys-plugin:
git clone https://github.com/ghdl/ghdl-yosys-plugin
cd ghdl-yosys-plugin
make
sudo cp ghdl.so /usr/local/share/yosys/plugins/ghdl.so
Generate the Verilog (from the ghdl-synthesis-test branch):
microwatt.ys:
ghdl --ieee=synopsys -fexplicit -frelaxed-rules --std=08 \
decode_types.vhdl \
wishbone_types.vhdl \
utils.vhdl \
common.vhdl \
helpers.vhdl \
fetch1.vhdl \
fetch2.vhdl \
cache_ram.vhdl \
plru.vhdl \
dcache.vhdl \
icache.vhdl \
insn_helpers.vhdl \
decode1.vhdl \
gpr_hazard.vhdl \
cr_hazard.vhdl \
control.vhdl \
decode2.vhdl \
register_file.vhdl \
crhelpers.vhdl \
cr_file.vhdl \
ppc_fx_insns.vhdl \
logical.vhdl \
rotator.vhdl \
countzero.vhdl \
execute1.vhdl \
loadstore1.vhdl \
multiply.vhdl \
divider.vhdl \
writeback.vhdl \
core_debug.vhdl \
core.vhdl \
microwatt_wrapper.vhdl \
-e microwatt_wrapper
write_verilog microwatt.v
yosys -q -m ghdl microwatt.ys
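The file list in microwatt.ys is the same as in microwatt_ghdl_synth.py plus microwatt_wrapper.vhdl, so the Yosys script could be generated rather than maintained by hand. A minimal sketch, assuming a hypothetical helper named `make_ys`; the ghdl flags and the write_verilog command are copied from the script above.

```python
def make_ys(files, top="microwatt_wrapper", out="microwatt.v"):
    """Return the text of a Yosys script that elaborates `top` via the ghdl plugin."""
    flags = ["--ieee=synopsys", "-fexplicit", "-frelaxed-rules", "--std=08"]
    lines = ["ghdl " + " ".join(flags) + " \\"]
    lines += [f"{f} \\" for f in files]      # one source file per continued line
    lines.append(f"-e {top}")                # elaborate the top-level entity
    lines.append(f"write_verilog {out}")
    return "\n".join(lines) + "\n"
```

Writing the result to microwatt.ys and running the yosys command above should then be equivalent to the hand-written script.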
Looks great, I'll play with this and maybe integrate some of that into Microwatt's own makefiles; it will definitely be useful for simulating with litedram.
BTW. What do you use on the DDR side for simulating litedram ? A micron model ? Or do you have your own ?
@ozbenh: just for info, with this, GHDL/Yosys were able to convert Microwatt to Verilog using the ghdl-synthesis-test branch of Microwatt. I tried litex_sim: Verilator was able to compile and run it, but the BIOS was not showing up and I haven't investigated. If you want to run the simulation, you can follow the previous steps to generate microwatt.v, then replace this: https://github.com/enjoy-digital/litex/blob/master/litex/soc/cores/cpu/microwatt/core.py#L105-L158 with platform.add_source("microwatt.v") and do: litex_sim --cpu-type=microwatt (you can add --trace to generate the simulation waveform and see what is going on).
For the simulation, we have a DRAM model that we use with litex_sim: https://github.com/enjoy-digital/litedram/blob/master/litedram/phy/model.py.
Thanks. Is there a way for LiteX to generate a verilog version of the DRAM model ? For the "standalone microwatt" case, I want to toy around with the user port interface to wishbone to do things like pipelining etc... and the easiest seems to be to do it in verilog using a little test bench, and throw the whole lot at verilator. I can then use that verilog in microwatt directly or convert it back to vhdl.
As for running the converted microwatt, I'll give that a try asap.
Hrn... thinking twice, that means I probably also need sim models of all the xilinx PLL etc... that won't be as easy as I initially thought...
All right, I had to hack/tweak a few things, but I now have the sim running. I'll try to get to the bottom of it, but it might take a while. I assume there's no way to get the report() statements out of ghdl.... Also note that --trace-fst and --trace-end xxx both generate errors when building the sim.
@ozbenh: good, have you also been able to get the CPU/BIOS working in simulation? I could work on finishing the integration with litex_sim next week. I'll also look at --trace-fst/--trace-end.
No I haven't yet. I can see the CPU fetching some instructions and I see them out of the icache but it stops doing that sanely pretty quickly. I haven't figured out why yet. Note: It's a very painful process, because microwatt stores everything in records and the ghdl-synth+yosys process turns all these into giant vectors :-( Also the vcd files coming out of litex are humongous :-)
I wish instead the records would be broken in separate wire/vectors with something like recordname_wirename instead...
Anyway, I'll continue digging as time permits.
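Until the tools split records themselves, one workaround when post-processing waveform values is to slice the flattened vector back into named fields. The sketch below shows the idea; the field names, widths, and the assumption that the first field sits in the least significant bits are hypothetical and would have to be checked against what ghdl actually emits.

```python
def split_record(value, layout, prefix):
    """Slice a flattened record value into recordname_fieldname entries.

    layout is a list of (field_name, width) pairs, with the first field
    assumed to occupy the least significant bits.
    """
    fields = {}
    offset = 0
    for name, width in layout:
        fields[f"{prefix}_{name}"] = (value >> offset) & ((1 << width) - 1)
        offset += width
    return fields
```

For example, with a hypothetical layout `[("valid", 1), ("nia", 64)]`, a 65-bit vector decodes into `prefix_valid` and `prefix_nia`.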
I also noticed a whole pile of warnings out of yosys (or maybe verilator?) about "Case values overlap (example pattern 0x3)". These seem to come from a whole bunch of constructs like this in the generated Verilog, which do look bogus:
input [2:0] a;
input [23:0] b;
input [7:0] s;
(* parallel_case *)
casez (s)
8'b???????1:
\8878 = b[2:0];
8'b??????1?:
\8878 = b[5:3];
8'b?????1??:
\8878 = b[8:6];
8'b????1???:
\8878 = b[11:9];
8'b???1????:
\8878 = b[14:12];
8'b??1?????:
\8878 = b[17:15];
8'b?1??????:
\8878 = b[20:18];
8'b1???????:
\8878 = b[23:21];
default:
\8878 = a;
endcase
endfunction
I'm pretty sure the "simplified vhdl" that ghdl spits out has all those "?" as "0"
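The overlap the tools complain about can be reproduced with a small Python model of casez matching. The (* parallel_case *) attribute asserts that the case items are mutually exclusive, which holds for one-hot patterns padded with "0" but not for patterns padded with "?". The helper names below are hypothetical.

```python
def casez_matches(pattern, value, width=8):
    """True if value matches a casez pattern string such as '??????1?'."""
    bits = format(value, f"0{width}b")
    return all(p == "?" or p == b for p, b in zip(pattern, bits))


def overlapping(patterns, width=8):
    """Return the pairs of patterns that some value matches simultaneously."""
    pairs = []
    for i, p in enumerate(patterns):
        for q in patterns[i + 1:]:
            if any(casez_matches(p, v, width) and casez_matches(q, v, width)
                   for v in range(1 << width)):
                pairs.append((p, q))
    return pairs
```

The value 0x3 from the warning matches both `???????1` and `??????1?`, while the "0"-padded versions `00000001` and `00000010` never overlap.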
So there was a ghdl synth bug. I made a test case and Tristan fixed it (https://github.com/ghdl/ghdl/issues/1319). It works in sim with the latest microwatt, though you probably want the patch below applied to microwatt (at least until Anton merges it) and then tie the core's interrupt input to '0'.
Note about interrupts: If we're ever going to run Linux on microwatt with LiteX we'll want the xics interrupt controller model, not the traditional LiteX one. Which probably means adding SW support for it as well to the LiteX BIOS.
[PATCH] irq: Simplify xics->core irq input
Use a simple wire. common.vhdl types are better kept for things
local to the core. We can add more wires later if we need to for
HV irqs etc...
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
common.vhdl | 5 -----
core.vhdl | 4 ++--
execute1.vhdl | 4 ++--
soc.vhdl | 6 +++---
xics.vhdl | 4 ++--
5 files changed, 9 insertions(+), 14 deletions(-)
diff --git a/common.vhdl b/common.vhdl
index ed97e0c..61252bd 100644
--- a/common.vhdl
+++ b/common.vhdl
@@ -316,11 +316,6 @@ package common is
constant WritebackToCrFileInit : WritebackToCrFileType := (write_cr_enable => '0', write_xerc_enable => '0',
write_xerc_data => xerc_init,
others => (others => '0'));
-
- type XicsToExecute1Type is record
- irq : std_ulogic;
- end record;
-
end common;
package body common is
diff --git a/core.vhdl b/core.vhdl
index 0664c73..f3806a3 100644
--- a/core.vhdl
+++ b/core.vhdl
@@ -34,7 +34,7 @@ entity core is
dmi_wr : in std_ulogic;
dmi_ack : out std_ulogic;
- xics_in : in XicsToExecute1Type;
+ ext_irq : in std_ulogic;
terminated_out : out std_logic
);
@@ -272,7 +272,7 @@ begin
flush_out => flush,
stall_out => ex1_stall_out,
e_in => decode2_to_execute1,
- i_in => xics_in,
+ ext_irq_in => ext_irq,
l_out => execute1_to_loadstore1,
f_out => execute1_to_fetch1,
e_out => execute1_to_writeback,
diff --git a/execute1.vhdl b/execute1.vhdl
index 8286d30..fccba5e 100644
--- a/execute1.vhdl
+++ b/execute1.vhdl
@@ -24,7 +24,7 @@ entity execute1 is
e_in : in Decode2ToExecute1Type;
- i_in : in XicsToExecute1Type;
+ ext_irq_in : std_ulogic;
-- asynchronous
l_out : out Execute1ToLoadstore1Type;
@@ -410,7 +410,7 @@ begin
ctrl_tmp.irq_nia <= std_logic_vector(to_unsigned(16#900#, 64));
report "IRQ valid: DEC";
irq_valid := '1';
- elsif i_in.irq = '1' then
+ elsif ext_irq_in = '1' then
ctrl_tmp.irq_nia <= std_logic_vector(to_unsigned(16#500#, 64));
report "IRQ valid: External";
irq_valid := '1';
diff --git a/soc.vhdl b/soc.vhdl
index 841d72f..400b230 100644
--- a/soc.vhdl
+++ b/soc.vhdl
@@ -100,7 +100,7 @@ architecture behaviour of soc is
signal wb_xics0_out : wb_io_slave_out;
signal int_level_in : std_ulogic_vector(15 downto 0);
- signal xics_to_execute1 : XicsToExecute1Type;
+ signal core_ext_irq : std_ulogic;
-- Main memory signals:
signal wb_bram_in : wishbone_master_out;
@@ -170,7 +170,7 @@ begin
dmi_wr => dmi_wr,
dmi_ack => dmi_core_ack,
dmi_req => dmi_core_req,
- xics_in => xics_to_execute1
+ ext_irq => core_ext_irq
);
-- Wishbone bus master arbiter & mux
@@ -512,7 +512,7 @@ begin
wb_in => wb_xics0_in,
wb_out => wb_xics0_out,
int_level_in => int_level_in,
- e_out => xics_to_execute1
+ core_irq_out => core_ext_irq
);
-- BRAM Memory slave
diff --git a/xics.vhdl b/xics.vhdl
index 421513a..4d3e9e5 100644
--- a/xics.vhdl
+++ b/xics.vhdl
@@ -35,7 +35,7 @@ entity xics is
int_level_in : in std_ulogic_vector(LEVEL_NUM - 1 downto 0);
- e_out : out XicsToExecute1Type
+ core_irq_out : out std_ulogic
);
end xics;
@@ -80,7 +80,7 @@ begin
wb_out.dat <= r.wb_rd_data;
wb_out.ack <= r.wb_ack;
wb_out.stall <= '0'; -- never stall wishbone
- e_out.irq <= r.irq;
+ core_irq_out <= r.irq;
comb : process(all)
variable v : reg_internal_t;
Great! Thanks for looking at this. I'll reproduce your results and do the LiteX integration to automate this when running litex_sim --cpu=microwatt.
@ozbenh: with https://github.com/enjoy-digital/litex/commit/a02077d547d603d3cbf9bcbcd365efcf084969e3, you now just have to set use_ghdl_yosys_synth to True to convert the Microwatt sources from VHDL to Verilog automatically during the build. So if you want to use it in simulation, just do litex_sim --cpu-type=microwatt, or with a target: target.py --cpu-type=microwatt --build (I haven't tested on hardware yet since it seems the caches are not inferred correctly and the resource usage explodes).
Great, thanks. Yes there are problems with how memories are inferred with Yosys still.
By reducing the number of ICache/DCache lines to 2 to avoid the resource usage explosion, the generated Verilog works fine on hardware and is built with FOSS tools :) : https://twitter.com/enjoy_digital/status/1262701132012490754
The GHDL-Yosys-plugin path can now be selected with --cpu-variant=standard+ghdl. We can now simulate and build Microwatt with vendors' or FOSS toolchains:
Simulation with the verilog generated from GHDL-Yosys-plugin and Verilator:
lxsim --cpu-type=microwatt --cpu-variant=standard+ghdl
Build on Arty with the VHDL files:
./arty.py --cpu-type=microwatt
Build on Arty with the verilog generated from GHDL-Yosys-plugin:
./arty.py --cpu-type=microwatt --cpu-variant=standard+ghdl
Some improvements can still be done on the integration (add burst/irq support) but this could be discussed in more specific issues/PRs.
@ozbenh Was this fixed already? I've been digging around the LiteX source trying to find out, but am not sure. Quoting from earlier in the thread:
"At the moment, LiteDRAM is built with a built-in riscv for the memory inits. I'll be looking at hooking up the CSRs to the wishbone and porting the LiteX BIOS code in the next few days as travel & time permits."
What specifically ? Microwatt "standalone" works with LiteDRAM and LiteX can use Microwatt as core both :-) There's still work to do and Linux doesn't boot yet in the LiteX version (it does with hacks in standalone Microwatt) but yes, whatever you're talking about is probably "fixed" :-)
@ozbenh Yeah, was referring specifically to LiteDRAM and whether we needed the little RISC-V core or whether Microwatt can now handle everything. Sounds like we can do a pure PPC design at this point without RISC-V embedded somewhere in it?
Yes, I've even removed the remaining riscv bits from the generator script
@ozbenh Tried to do a quick build for the Versa board, but it's running out of resources. Is the cache line hack still needed to get it to fit?
Not sure, maybe. It might also depend on the version of ghdl and yosys no ? There's a bug somewhere in how they interpret RAMs, that said recent ghdl (from git) will properly pass the attributes we set down to yosys. Not sure it uses them properly.
You'll need to dig in and look and maybe tune memory/cache sizes.
Note: For Linux you probably need DRAM, for which you'll need a memory controller. I currently generate litedram for Arty and Nexys_video; I could generate it for other boards if you give me details about them. That said, the DRAM wrapper comes with an L2 cache that's also a heavy user of block RAMs and could be an issue as well.
Finally, after the huge shrinking phase a few months ago, the core has gotten bigger again lately, especially with the addition of the MMU.
@ozbenh Yeah, running GIT master of each due to the known issues. Been using Yosys for a while now (Verilog) and never could get its inference to work for anything more complex than a single port RAM, even then it seemed twitchy. Our policy for a while now has been to manually instantiate RAM and IO primitives exactly because of those issues; I may need to see if I can hack up the Microwatt sources enough to pull in a straight *16K RAM (I'm using the ECP5 Versa at the moment, final target will be a custom device with a larger ECP5 and DDR3 RAM).
Even after reducing the caches to 2 ICACHE / 2 DCACHE lines and removing the DDR controller (65k internal RAM allocated for testing), the resource use is over 75%. Does that match what you are seeing with the new "bloated" Microwatt, or is this so excessive that I should really focus on fixing that RAM issue first?
I don't know if 75% is big or not for that FPGA, sadly I don't have any ECP5 hardware to play with, but you might be bitten by the size of the TLBs as well, we could look into reducing them or at least bringing the parameters up.
As for the cache RAMs, I purposefully made them a separate module, cache_ram.vhdl, so it can be easily replaced with some kind of manual instantiation. The L1 caches are always 64 bits wide and as tall as needed, calculated from the cache sizes.
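The "as tall as needed" sizing can be sketched as a small calculation. This is an illustration rather than the actual cache_ram.vhdl generics: it assumes the line size is given in bytes and that the result describes the RAM for a single way, and the function name is hypothetical.

```python
def cache_ram_geometry(line_size_bytes, num_lines, row_bits=64):
    """Return (rows, address_bits) for a cache data RAM with fixed-width rows."""
    row_bytes = row_bits // 8
    rows = (line_size_bytes * num_lines) // row_bytes
    # Bits needed to address `rows` entries, with a minimum of one bit.
    addr_bits = max(1, (rows - 1).bit_length())
    return rows, addr_bits
```

Under these assumptions, 32 lines of 64 bytes give a 256 x 64-bit RAM, while cutting back to 2 lines leaves only 16 rows.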
Note about memories: There are actually 3 kinds and I don't know how well/badly Yosys handles them even with the above issue solved.
Block RAMs (big and slow) which we use for the cache data (L1 and L2)
Distributed "LUT" RAM, which is better than registers and which we use for most other things (register file, cache tags, TLBs, ...).
Registers (flops). The most wasteful
From reading the issue above, it's hard to tell whether Yosys plans to handle all 3, but if it does turn everything into flops then yes, it will be horrible.
Note: We should look into making the TLBs block RAMs or at least make the ones inside the I and D cache small and feed off a larger one in block RAM.
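The three categories above can be summarised as a toy decision rule. This only mirrors how the memories are described in this thread (synchronous reads go to block RAM, asynchronous reads need LUT RAM, and everything else falls back to flops); real synthesis tools apply more elaborate cost rules, and both function names are hypothetical.

```python
def memory_kind(async_read, tool_infers_lutram=True):
    """Guess which resource a memory array ends up in."""
    if not async_read:
        return "block RAM"      # synchronous read ports can map to block RAM
    if tool_infers_lutram:
        return "LUT RAM"        # async read needs distributed RAM
    return "flops"              # worst case: one register per bit


def flop_cost(depth, width):
    """Number of registers needed if an array is built purely from flops."""
    return depth * width
```

For instance, a hypothetical 64-entry, 48-bit async-read array would cost 3072 flops (plus read muxes) if the tool cannot infer LUT RAM.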
@ozbenh "LUT RAM" is not well handled by Yosys at the moment.
LUT RAM can do many port configurations which can't be represented in Yosys at the moment.
Ok, that's going to be a problem for Microwatt. We rely heavily on it. Without LUT RAM, things like cache tags, TLBs and the register file will be orders of magnitude larger in the generated FPGA (and timing will go down the drain). Basically anything large that needs async read is a LUT RAM for us, and doing sync reads would require adding even more pipeline stages/latency.
LUT RAM for ECP5 is fully supported in Yosys, there is only one configuration.
Thanks Dave. As long as it infers a 2D array with synchronous writes and async reads as a LUT RAM, we should be ok with microwatt. If I had an ECP5 board at hand I could give it more love (and generate litedram for it), but I don't at the moment and can't quite spare the funds right now.
As for block RAMs, we have wrappers for all of our uses of them, which could easily be replaced by explicit primitives if necessary. I did that to make it easier to either replace them or tweak them to match tool inference limitations. Note that our dcache does use the "output register" option of Xilinx block RAMs to help timing.
@ozbenh Still having a real hard time getting Microwatt to actually fit on an ECP5-45 with any room left over for significant (>5% die area) peripherals, even with caches cut back to one line each. Any other thoughts on trimming Microwatt back somewhat and making it fit better?
Not really. Someone who understands the toolchain should look into where most of the area is going; it's strange that it seems to be using 2 to 3 times more LUTs than on the Artix...
@madscientist159: last time I tested on ECP5, Yosys still had issues with the caches and I had to reduce NUM_LINES of the icache and dcache (tested with 2 instead of 64, as was done in the initial GHDL Synth tests).
@enjoy-digital I'm already doing that, it's still sitting at a ridiculously high resource usage. Switching to the default RISC-V CPU (which we can't really use for other reasons -- not even set up to test / debug with it beyond synthesis) yields a drop from 95% usage to 50% usage on the ECP5, with all other peripherals etc. unchanged.
1959
That's it exactly, I did some digging and apparently the required TDP RAMs are not supported by Yosys, causing a ridiculous explosion in resources (something over 11k cells just for a two-line I cache and D cache). Relevant failure:
Checking rule #4 for bram type $__ECP5_DP16KD (variant 1):
Bram geometry: abits=10 dbits=18 wports=0 rports=0
Estimated number of duplicates for more read ports: dups=1
Metrics for $__ECP5_DP16KD: awaste=960 dwaste=8 bwaste=17792 waste=17792 efficiency=5
Rule #4 for bram type $__ECP5_DP16KD (variant 1) accepted.
Mapping to bram type $__ECP5_DP16KD (variant 1):
Shuffle bit order to accommodate enable buckets of size 9..
Results of bit order shuffling: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 4$
Write port #0 is in clock domain \clk.
Mapped to bram port A1.
Read port #0 is in clock domain !~async~.
Bram port B1.1 has incompatible clock type.
Failed to map read port #0.
Mapping to bram type $__ECP5_DP16KD failed.
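When the synthesis log is long, failures like the one above are easy to miss. A small hedged helper can pull them out; it relies only on the message wording visible in the excerpt above, which may differ between Yosys versions, and the function name is hypothetical.

```python
import re

FAILED_RE = re.compile(r"Mapping to bram type (\S+) failed\.")


def failed_bram_mappings(log_text):
    """Return (bram_type, first_incompatibility_message) for each failed mapping."""
    failures = []
    reason = None
    for raw in log_text.splitlines():
        line = raw.strip()
        failed = FAILED_RE.match(line)
        if failed:
            failures.append((failed.group(1), reason))
            reason = None
        elif line.startswith("Mapping to bram type"):
            reason = None               # a new mapping attempt starts here
        elif "incompatible" in line and reason is None:
            reason = line               # remember why the port was rejected
    return failures
```

Run on the log above, this reports the $__ECP5_DP16KD failure together with the "incompatible clock type" line, which points at the asynchronous read port.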
@ozbenh I wonder if we could add a mode to Microwatt that just disables the caches for now since a two-line cache isn't going to be useful in the first place, and the amount of resources sucked down are causing the design to be near worthless on real world FPGAs?
What RAM object is this ... the main cache rams don't have async reads, they are SDP with one sync read and one sync write port... the tags however have async read, are they the problem here ? For 2 lines there should be only 2 tags and they should fit in LUT RAMs. The above log doesn't say which array/entity it is.
We don't really have a design that runs with caches off at this point. We would potentially have to change things quite a bit, and I'd rather we fixed the above.
BTW. How did you change the cache sizes ? Where did you edit the generic ? You need to change the values in core.vhdl not the defaults in icache.vhdl or dcache.vhdl
Also.. the TLBs have async reads, so they would fit in LUT RAM. We might be able to make things smaller by having both tags and TLBs in block RAM but at the cost of some extra latency
@ozbenh RAM object is "icache_32_2_2_64_12_56_5ba93c9db0cff93f52b521d7420e43f6eda2784f.\897:", there are a bunch of them that are similar. The entire I cache reports no BRAM usage and a ton of cells used:
=== icache_32_2_2_64_12_56_5ba93c9db0cff93f52b521d7420e43f6eda2784f ===
Number of wires: 5777
Number of wire bits: 9501
Number of public wires: 5777
Number of public wire bits: 9501
Number of memories: 0
Number of memory bits: 0
Number of processes: 0
Number of cells: 6802
L6MUX21 1018
LUT4 3750
PFUMX 1468
TRELLIS_DPR16X4 112
TRELLIS_FF 450
cache_ram_3_64_1489f923c4dca729178b3e3233458550d8dddf29 2
plru_1 2
Also, curiously, the rotator is using a ridiculous amount of resources:
=== rotator ===
Number of wires: 6345
Number of wire bits: 9665
Number of public wires: 6345
Number of public wire bits: 9665
Number of memories: 0
Number of memory bits: 0
Number of processes: 0
Number of cells: 7242
CCU2C 326
L6MUX21 1011
LUT4 4050
PFUMX 1855
Those two blocks alone account for 20% of the entire resource usage of the LiteX/Microwatt design, so something seems off. :wink:
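To compare modules without reading the whole report by eye, the per-module cell counts can be extracted from the statistics text. This is a sketch that assumes the report layout shown in the excerpts above ("=== name ===" headers followed by "Number of cells:" lines); the function name is hypothetical.

```python
import re


def cells_per_module(stat_text):
    """Map each module name in a stat report to its 'Number of cells' value."""
    totals = {}
    module = None
    for line in stat_text.splitlines():
        header = re.match(r"\s*=== (\S+) ===", line)
        if header:
            module = header.group(1)
            continue
        cells = re.match(r"\s*Number of cells:\s+(\d+)", line)
        if cells and module is not None:
            totals[module] = int(cells.group(1))
    return totals
```

For the two excerpts above, this yields 6802 cells for the icache and 7242 for the rotator, 14044 in total.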
EDIT: Also, yes, defaults changed in core.vhdl. It literally won't fit at all even with a bare bones design if the caches aren't reduced significantly (I reduced them to two lines each as that seems to be as small as they will go).
@ozbenh I suppose one approach could be to move the inferred block RAMs into their own module, so that those of us with toolchains that don't actually infer BRAMs (like the Yosys one) could manually insert a device-specific instantiation...
https://github.com/antonblanchard/microwatt