Closed jannic closed 1 year ago
Interesting. I've been meaning to capture the timing diagram. I don't suppose you have the numbers, or a nice graphic?
No, unfortunately not. What I did was to make the effect really obvious: You mentioned that reading from the BMC triggered the glitches, so I ran such a loop on cpu0:
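loop {
    // Read from the BMC inside a critical section, over and over, to provoke the glitches.
    let _bmc_ver = critical_section::with(|cs| {
        let mut lock = HARDWARE.borrow_ref_mut(cs);
        let hw = lock.as_mut().unwrap();
        // No trailing semicolon, so the result is returned from the closure.
        hw.bmc_read_firmware_version()
    });
}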
With that, the display gets really wild. And everything is completely stable again after moving the VGA code to RAM.
Interestingly, another way to fix the display is using the cortex-m feature critical-section-single-core instead of the spinlock-based critical-section from rp2040-hal. I'm not sure why that helps so much. It doesn't look like the VGA code uses any critical sections at all, and the interrupts on core1 should not be affected by a critical section on core0. Perhaps it's just because the critical section code happens to alias with some flash contents needed for the VGA output, triggering excessive cache misses?
It also does this if you set DEFMT_LOG=trace. If you set it to DEFMT_LOG=error, the screen is fine, even though it's doing the same number of BMC reads as usual.
Hmm, yes it's possible we're triggering flash cache misses.
What's the maximum stack use on Core 1? I wonder if we can shrink the stack and sneak some more code in there.
Or, as you observed on the Matrix chat, we could try moving the font into RAM and see if that helps. That's 4096 bytes and we want to support soft-fonts anyway.
It also does this if you set DEFMT_LOG=trace. If you set it to DEFMT_LOG=error, the screen is fine, even though it's doing the same number of BMC reads as usual.
With DEFMT_LOG=trace, there is some defmt output inside the poll loop, and defmt does use critical sections. So in that case it's obvious that critical sections done by the BMC reads will mess up the video timing.
But the critical section should only stop interrupts on Core 0, and grab a spinlock which should always be free because Core 1 never grabs it?
That's the point: with the trace call in poll, core1 enters a critical section as well.
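To make that contention concrete, here is a simplified host-side model of a critical section guarded by a shared spinlock. It is only a sketch, not the rp2040-hal implementation (which, as far as I understand it, also disables interrupts and uses a hardware spinlock); `SpinLock` and its methods are hypothetical names for illustration. The point is that a second core can only enter once the first has released the lock, so time core0 spends inside the section is added to core1's latency.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical stand-in for the single lock shared by both cores.
pub struct SpinLock {
    held: AtomicBool,
}

impl SpinLock {
    pub const fn new() -> Self {
        SpinLock { held: AtomicBool::new(false) }
    }

    /// Try to take the lock once; returns true on success.
    pub fn try_acquire(&self) -> bool {
        self.held
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
    }

    /// Spin until the lock is taken; returns how many attempts failed first.
    /// Every failed attempt is time the caller (e.g. core1) is stalled.
    pub fn acquire(&self) -> u32 {
        let mut failed = 0;
        while !self.try_acquire() {
            failed += 1;
            std::hint::spin_loop();
        }
        failed
    }

    /// Release the lock so the other core can enter.
    pub fn release(&self) {
        self.held.store(false, Ordering::Release);
    }
}
```

If core1 never calls `acquire`, it never waits, which matches the behaviour seen with DEFMT_LOG=error.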
I did some simple timings by evaluating (*rp_pico::pac::DMA::PTR).ch[PIXEL_DMA_CHAN].ch_trans_count.read().bits() (i.e. the number of DMA transfers left) at the end of the poll method.
With the VGA code running from RAM, I get a minimum value of 45. (That should be 90 pixels, right? As every transferred word contains two pixels?)
Interestingly, the code seems to get faster (resulting in ~58 transfers left) when compiling with opt-level = "z".
Those are preliminary results, I have to repeat those measurements more carefully. But as I probably won't have time to do that today, I wanted to let you know what I tried.
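The measurement above can be sketched as a small tracker that keeps the lowest remaining-transfer count seen at the end of each poll call. This is a host-side sketch only: `MarginTracker` and `PIXELS_PER_WORD` are hypothetical names, and the two-pixels-per-word figure is the assumption stated in the question above, not something confirmed. On hardware, the `remaining` argument would be the value read from ch_trans_count.

```rust
/// Assumption from the thread: each transferred DMA word carries two pixels.
pub const PIXELS_PER_WORD: u32 = 2;

/// Records the worst case (smallest) number of DMA transfers still pending
/// when the render code finished.
pub struct MarginTracker {
    min_remaining: Option<u32>,
}

impl MarginTracker {
    pub const fn new() -> Self {
        MarginTracker { min_remaining: None }
    }

    /// Call once per poll with the remaining-transfer count just read.
    pub fn record(&mut self, remaining: u32) {
        self.min_remaining = Some(match self.min_remaining {
            Some(current) => current.min(remaining),
            None => remaining,
        });
    }

    /// Smallest margin seen so far, in DMA transfers.
    pub fn min_remaining(&self) -> Option<u32> {
        self.min_remaining
    }

    /// The same margin expressed in pixels.
    pub fn min_pixels(&self) -> Option<u32> {
        self.min_remaining.map(|words| words * PIXELS_PER_WORD)
    }
}
```

A larger minimum means the renderer finished earlier relative to the DMA, so there is more slack before a line would glitch.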
Neat trick. I was going to repurpose the I2C pins as extra GPIOs to mark when the render function completes. But your way is easier!
Smaller code means less thrashing of the flash cache, I guess. Makes sense.
Both measurements were with code running from RAM, so it's not the flash cache. Either the speed is limited by RAM bandwidth (not impossible when DMA and CPU are both moving data around at high speed) or opt-level z is actually faster on that CPU. As LLVM doesn't know the exact number of cycles an instruction needs on this specific core, it's entirely possible that opt-level 3 is worse than z.
A thought, but maybe not a good one. Rather than copy the whole font from flash into RAM, maybe we could memcpy one row from it into RAM (so 256 bytes) at the start of the line. I think I arranged the font so it was row-wise, not glyph-wise (which would be more normal).
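A rough sketch of that per-line copy, assuming the row-wise layout described above: 16 rows of 256 bytes each (4096 bytes total), with one byte per glyph in each row. The names `GLYPHS`, `ROWS` and `copy_font_row` are hypothetical, and the real font table may be organised differently.

```rust
/// Assumed layout: ROWS blocks of GLYPHS bytes, one byte per glyph per row.
pub const GLYPHS: usize = 256;
pub const ROWS: usize = 16;

/// Copy the 256 bytes of font data needed for one glyph row into a RAM
/// buffer. The row index wraps modulo ROWS, so it cannot index out of bounds.
pub fn copy_font_row(font: &[u8; GLYPHS * ROWS], row: usize, line_buf: &mut [u8; GLYPHS]) {
    let start = (row % ROWS) * GLYPHS;
    line_buf.copy_from_slice(&font[start..start + GLYPHS]);
}
```

Whether this helps depends on whether a 256-byte copy from flash at the start of each line costs less than the cache misses it avoids, which would need measuring.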
Does this get any better now the flash is running at full speed?
I tried this patch set over the top of v0.4.1 and it still glitches the screen if I hold the Enter key down.
The glitch may have been due to the scrolling (which is a memmove on the character array, completely unsynced with the vsync). When not scrolling the picture is rock solid.
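For reference, the scroll amounts to something like the following. This is a minimal host-side sketch with a hypothetical `scroll_up` function over a row-major byte buffer; the real character array has its own cell type and dimensions.

```rust
/// Scroll a row-major character buffer up by one text row and blank the
/// bottom row. `cols` is the number of characters per row.
pub fn scroll_up(chars: &mut [u8], cols: usize) {
    if cols == 0 || chars.len() < cols {
        return;
    }
    let len = chars.len();
    // Overlapping copy (the memmove): everything from row 1 onwards moves to row 0.
    chars.copy_within(cols.., 0);
    // Clear the row that was vacated at the bottom.
    chars[len - cols..].fill(b' ');
}
```

If this runs while the other core is part-way through drawing a frame, some rows may be drawn from the buffer before the move and some after it, which would look like a one-frame glitch.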
I'll experiment a bit later but I think I want to merge this, even with the extra RAM usage.
It looks like the pixel generating loop is nearly exhausting the time budget, and there is no margin left for flash cache misses.
I had to increase the RAM reserved for the BIOS from 16k to 24k. This might be excessive. Perhaps it is possible to extract a smaller selection of really timing-critical code and move only that to RAM?
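One way to move individual functions rather than the whole VGA module is to place them in a `.data` subsection, which the startup code copies into RAM. The sketch below assumes the cortex-m-rt linker script collects `.data.*` input sections into `.data`; `expand_glyph_row` is a made-up example routine, not a function from this code base, and the section name `.data.ram_func` is an assumption. The attribute is gated on bare-metal targets so the same code still builds on a host (newer Rust editions may require writing it as `unsafe(link_section = "...")`).

```rust
/// Hypothetical timing-critical routine: expand one 8-pixel glyph row into
/// foreground/background colours, most significant bit first.
///
/// On a bare-metal target the attribute asks the linker to put the function
/// in a `.data` subsection so it runs from RAM instead of flash. On a host
/// build the attribute is not applied.
#[cfg_attr(target_os = "none", link_section = ".data.ram_func")]
#[inline(never)] // keep it a separate function so the placement takes effect
pub fn expand_glyph_row(bits: u8, fg: u16, bg: u16) -> [u16; 8] {
    let mut out = [bg; 8];
    for (i, px) in out.iter_mut().enumerate() {
        if bits & (0x80 >> i) != 0 {
            *px = fg;
        }
    }
    out
}
```

Only the functions marked this way consume RAM, so the reservation could shrink back towards 16k if the hot path turns out to be small.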