Store vga code and font data in RAM, as flash is too slow

jannic commented 1 year ago

It looks like the pixel generating loop is nearly exhausting the time budget, and there is no margin left for flash cache misses.

I had to increase the RAM reserved for the BIOS from 16k to 24k. This might be excessive. Perhaps it is possibe to extract a smaller selection of really timing critical code and move only that to RAM?

thejpster commented 1 year ago

Interesting. I've been meaning to capture the timing diagram. I don't suppose you have the numbers, or a nice graphic? e.g.

http://cliffle.com/blog/racing-the-beam/

jannic commented 1 year ago

No, unfortunately not. What I did was to make the effect really obvious: You mentioned that reading from the BMC triggered the glitches, so I ran such a loop on cpu0:

    loop {
        let bmc_ver = critical_section::with(|cs| {
                let mut lock = HARDWARE.borrow_ref_mut(cs);
                let hw = lock.as_mut().unwrap();
                hw.bmc_read_firmware_version();
        });
    }

With that, the display gets really wild. And everything is completely stable again after moving the VGA code to RAM.

Interestingly, another way to fix the display is using the cortex-m feature critical-section-single-core instead of the spinlock based critical-section from rp2040-hal. I'm not sure why that helps so much. It doesn't look like the VGA code uses any critical sections at all, and the interrupts on core1 should not be affected by a critical section on core0. Perhaps it's just because the critical section code happens to alias with some flash contents needed for the VGA output, triggering excessive cache misses?

thejpster commented 1 year ago

It also does this if you set DEFMT_LOG=trace. If you set it to DEFMT_LOG=error, the screen is fine - even through it's doing the same number of BMC reads as usual.

thejpster commented 1 year ago

Hmm, yes it's possible we're triggering flash cache misses.

thejpster commented 1 year ago

What's the maximum stack use on Core 1? I wonder if we can shrink the stack and sneak some more code in there.

thejpster commented 1 year ago

Or, as you observed on the Matrix chat, we could try moving the font into RAM and see if that helps. That's 4096 bytes and we want to support soft-fonts anyway.

jannic commented 1 year ago

It also does this if you set DEFMT_LOG=trace. If you set it to DEFMT_LOG=error, the screen is fine - even through it's doing the same number of BMC reads as usual.

With DEFMT_LOG=trace, there is some defmt output inside the poll loop, and defmt does use critical sections. So in that case it's obvious that critical sections done by the BMC reads will mess up the video timing.

thejpster commented 1 year ago

But the critical section should only stop interrupts on Core 0, and grab a spinlock which should always be free because Core 1 never grabs it?

jannic commented 1 year ago

That's the point, with the trace call in poll, core1 enters a critical section as well.

jannic commented 1 year ago

I did some simple timings by evaluating (*rp_pico::pac::DMA::PTR).ch[PIXEL_DMA_CHAN].ch_trans_count.read().bits() (ie. the number of DMA transfers left) at the end of the poll method. With the VGA code running from RAM, I get a minimum value of 45. (That should be 90 pixels, right? As every transfered word contains two pixels?)

Interestingly, the code seems to get faster (resulting in ~58 transfers left) when compiling with opt-level = "z".

Those are preliminary results, I have to repeat those measurements more carefully. But as I probably won't have time to do that today, I wanted to let you know what I tried.

thejpster commented 1 year ago

Neat trick. I was going to repurpose the I2C pins as extra GPIOs to mark when the render function completes. But your way is easier!

thejpster commented 1 year ago

Smaller code means less trashing the flash cache, I guess. Makes sense.

jannic commented 1 year ago

Both measurements were with code running from ram so it's not the flash cache. Either the speed is limited by ram bandwidth (not impossible when dma and CPU are both moving around data at high speed) or opt level z is actually faster on that CPU. As llvm doesn't know the exact number of cycles an instruction needs on this specific core, it's entirely possible that opt level 3 is worse than z.

thejpster commented 1 year ago

A thought, but maybe not a good one. Rather than copy the whole font from flash into RAM, maybe we could memcpy one row from it into RAM (so 256 bytes) at the start of the line. I think I arranged the font so it was row-wise, not glyph-wise (which would be more normal).

thejpster commented 1 year ago

Does this get any better now the flash is running at full speed?

jonathanpallant commented 1 year ago

I tried this patch set over the top of v0.4.1 and it still glitches the screen if I hold the Enter key down.

thejpster commented 1 year ago

The glitch may been been due to the scrolling (which is a memmove on the character array, completely unsynced with the vsync). When not scrolling the picture is rock solid.

I'll experiment a bit later but I think I want to merge this, even with the extra RAM usage.

Neotron-Compute / Neotron-Pico-BIOS

Store vga code and font data in RAM, as flash is too slow #25