SpinalHDL / NaxRiscv


Mark framebuffer as non-cacheable #74

Open egorxe opened 10 months ago

egorxe commented 10 months ago

First of all, thanks for the great work on NaxRiscv! With your howto I was able to launch Debian on the latest LiteX with a dual-core 175 MHz NaxRiscv and 1 MB of L2 cache on an Alinx AXKU040 board (Kintex UltraScale). I can confirm that everything including X is stable and works just fine: Xfce with mouse & keyboard over usbip (my board has no USB) is entirely usable, OpenTTD is playable at 800x600 and mplayer can play 240p H264 videos without slowdown. This is quite an achievement for a soft CPU!

I observe only one small problem. On small framebuffer updates, glitches appear around the updated shapes for several seconds, as the shapes are not fully redrawn. This is most noticeable with single-character output in the console framebuffer or mouse pointer moves in X. I think this problem comes from framebuffer updates staying in the NaxRiscv L2 cache, as the glitches become much more prominent with larger caches (I've increased L2 to 1 MB) and during system idle. If the CPU is working hard, the mouse pointer moves without glitches, probably because the L2 is getting rewritten fast, but when the CPU is idle, glitches can stay for up to ~10 seconds with larger caches.

So my question is: is there a way to mark the framebuffer memory as non-cacheable? DMA buffers, including the framebuffer, should either bypass the L2 or flush it after each write. I've tried adding the "no-map" attribute to framebuffer@40c00000 in the dts (found it in some Xilinx doc), but it seems to change nothing. Maybe there is some dts or driver patch that could fix this issue?

This is the command I've used to build bitstream: python3 -m litex_boards.targets.alinx_axku040.py --cpu-type=naxriscv --bus-standard axi-lite --with-video-framebuffer --with-coherent-dma --with-sdcard --with-ethernet --l2-bytes 1048576 --xlen=64 --scala-args='rvc=true,rvf=true,rvd=true,alu-count=2,decode-count=2' --with-jtag-tap --sys-clk-freq 175e6 --cpu-count 2 --l2-size 0 --build

alinx_axku040.py is written by me, but it's fully based on xilinx_kcu105.py with the addition of a framebuffer, like in digilent_nexys4.py for example.
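
For context, a minimal sketch of what such a framebuffer addition looks like in a LiteX target, modeled on digilent_nexys4.py; the "vga" pad name, PHY class and timings below are assumptions and may differ from the actual board file:

```python
# Minimal sketch of adding a video framebuffer to a LiteX target, modeled on
# digilent_nexys4.py; the "vga" pads, PHY class and timings are assumptions.
from litex.soc.cores.video import VideoVGAPHY

# Inside the target's BaseSoC.__init__(), after the SDRAM has been added:
if with_video_framebuffer:
    # PHY driving the video pins from a dedicated pixel clock domain.
    self.videophy = VideoVGAPHY(platform.request("vga"), clock_domain="vga")
    # DMA that reads the framebuffer from main RAM and streams it to the PHY.
    self.add_video_framebuffer(phy=self.videophy, timings="800x600@60Hz", clock_domain="vga")
```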

Dolu1990 commented 10 months ago

usbip

Ahhh nice, I didn't know about it

stable and works just fine

:D

mplayer can play 240p H264 videos without slowdown

O.o Can you share the video file?

--l2-bytes 1048576

XD that's a lot XD Nice ^^

I've tried adding the "no-map" attribute to framebuffer@40c00000

Won't help. The caching behaviour is purely in hardware.

So my question is: is there a way to mark the framebuffer memory as non-cacheable?

For the L1, we really need it to be cacheable, else it would be very very slow. For the L2, unfortunately, the cache is inclusive, so anything the CPU has in L1 also needs to be in L2.

The main issue is that the LiteX video DMA is directly connected to the LiteDRAM controller, instead of going through the L2 cache for snoops.

So 2 solutions (which would need dev):

  1. Either adding flush capabilities to the L2 cache and periodically flushing it, so having it periodically clean its dirty cache lines.
  2. Or moving the video DMA so it passes through the L2 cache / system bus and gets snooped.

I would say the second option is the one which would work for sure, and would probably not be too hard to implement. I will give it a try.

egorxe commented 10 months ago

Can you share the video file?

I've actually tried several and all of them played relatively fine. But I have no sound device, so I'm not sure if it will be fast enough to play videos with sound. Also it plays at full speed only at the native video size of 320x240; upscaling it to the screen size kills the performance. I'm attaching a small cut from the Blender open movie Sintel as an example.

https://github.com/SpinalHDL/NaxSoftware/assets/13577050/243a7338-c803-4f85-ae1b-2b3bfc316553

Either adding flush capabilities to the L2 cache and periodically flushing it, so having it periodically clean its dirty cache lines.

So there is no capability to flush the L2 now? If such a capability is introduced, it will be quite easy to patch the fb driver to flush changed cache lines periodically or on each fb write.

Moving the fb DMA from a separate LiteDRAM port to the system bus is not that great I think. Apart from the need to ignore caching for the DMA, it will also put quite a large load (115 MB/s for 800x600) on the system bus and L2 without real benefit.
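
For reference, that 115 MB/s figure is consistent with a 32-bit pixel format scanned out at 60 Hz (both are assumptions for the arithmetic, not stated above):

```python
# Quick sanity check of the framebuffer scan-out bandwidth;
# 32 bpp and a 60 Hz refresh rate are assumptions.
width, height, bytes_per_pixel, refresh_hz = 800, 600, 4, 60
bandwidth = width * height * bytes_per_pixel * refresh_hz  # bytes per second
print(bandwidth / 1e6)  # ~115.2 MB/s
```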

Also I would like to note a couple more things:

  1. 2 GB of DRAM works fine; the only required change was setting io_regions in core.py to {0xC000_0000: 0x4000_0000} (see the sketch after this list). Is NaxRiscv expected to work if I extend the address space to 64 bits in core.py and move the io and csr regions to try to use the full 4 GB available on my board? It seems LiteX introduced 64-bit buses recently https://github.com/enjoy-digital/litex/issues/1844 .
  2. It looks like OpenSBI reserves memory in the device tree for itself, so there is no need to have it reserved in the dts. I even had to remove it because it conflicted with the OpenSBI reserved space after extending main memory to 2 GB.
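
A minimal sketch of that io_regions change in the LiteX NaxRiscv CPU wrapper (core.py); the default value shown is an assumption, so verify it against your LiteX checkout before editing:

```python
# Fragment of the LiteX NaxRiscv CPU wrapper (core.py); the default value
# below is an assumption, check your LiteX checkout before editing.
# io_regions maps {origin: size}: only this window is treated as uncached
# IO by the SoC, everything below it stays cacheable RAM.

# io_regions = {0x8000_0000: 0x8000_0000}  # assumed default: IO covers the top 2 GB
io_regions = {0xC000_0000: 0x4000_0000}    # shrink the IO window to the top 1 GB

# Main RAM at 0x4000_0000 can now grow to 2 GB (0x4000_0000 .. 0xBFFF_FFFF)
# without overlapping the IO window.
```
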
egorxe commented 10 months ago

BTW 4 cores also work without issues, but such a config rarely achieves decent timing on my FPGA with a 175 MHz system clock. With 4 cores @ 175 MHz and 2 GB of RAM even Firefox becomes usable enough for me to write this comment directly from it, although it's definitely not a very pleasant user experience :).

I've used 7-Zip and Linpack (from the hpcc Debian package) to benchmark multi-core integer & float performance. With 620 7z decompression MIPS and 130 Linpack MFLOPS, this system (LiteX with 4 NaxRiscv cores @ 175 MHz) is roughly on par with the single-core A9 @ 800 MHz used in Zynq 7000S chips. Great result!

Dolu1990 commented 10 months ago

So there is no capability to flush the L2 now?

Right.

If such a capability is introduced, it will be quite easy to patch the fb driver to flush changed cache lines periodically or on each fb write.

I was thinking more about a fully hardware solution, where the cache periodically scrubs itself of dirty cache lines.

Is NaxRiscv expected to work if I extend the address space to 64 bits in core.py and move the io and csr regions to try to use the full 4 GB available on my board?

I really have no idea, I never tested more than a 32-bit physical address space. Likely something will break somewhere XD

It looks like OpenSBI reserves memory in the device tree for itself, so there is no need to have it reserved in the dts. I even had to remove it because it conflicted with the OpenSBI reserved space after extending main memory to 2 GB.

Ahh, I didn't know that. But how does OpenSBI know where the device tree is? Is that something happening while Linux boots via the software SBI interface, or during OpenSBI boot?

BTW 4 cores also work without issues,

Nice :D I never tested it, I always ran with 2 cores. Also, I noticed that Vivado timing seems to break down with larger designs, even when there is no timing coupling in the hardware :(

even Firefox becomes usable enough for me to write this comment directly from it, although it's definitely not a very pleasant user experience :).

It seems that the 1 MB of L2 cache really helps; when I tested it was painful (128 KB L2, dual core, 100 MHz).

egorxe commented 10 months ago

I was thinking more about a fully hardware solution, where the cache periodically scrubs itself of dirty cache lines.

Is there a reason to do it besides the framebuffer? For the fb to look nice you'd need to write back the cache every 20 ms or so, which will add some memory load, especially with a larger L2. On the other hand, if you implement some way to trigger a forced L2 cache line writeback from software (the cbo.clean instruction from Zicbom, for example), it won't hurt anything and will be generally more useful I think. Is it hard to do?

I really have no idea, I never tested more than a 32-bit physical address space. Likely something will break somewhere XD

I'll try it and report back if I have any success :).

Ahh, I didn't know that. But how does OpenSBI know where the device tree is?

Probably from here? https://github.com/Dolu1990/opensbi/blob/034d47a8299316481d007c502fdff54ba8eb3226/platform/litex/naxriscv/objects.mk#L30C1-L30C1

It seems that the 1 MB of L2 cache really helps; when I tested it was painful (128 KB L2, dual core, 100 MHz)

Not only the 1 MB L2, but also the higher frequency and 2 more cores, as Firefox seems to be parallel enough to fully utilize all of them. And don't get me wrong, it's still very slow :). It takes a couple of minutes to start, ~15-20 seconds to render a typical static Wikipedia page and ~40 seconds to render this one (and scrolling is still not smooth). But it works, and I was not expecting a modern full-featured browser to be even remotely usable on a 175 MHz CPU without DRI.