Seneral / VC4CV

VideoCore IV Computer Vision framework and examples for the RaspberryPi Zero - GL- and QPU-based
MIT License

TMU can't reliably read from MMAL camera frames on the RPi Zero #1

Open Seneral opened 4 years ago

Seneral commented 4 years ago

Tiled rendering consists of multiple programs, each accessing the TMU for reading and each using its own dedicated space in the VPM for writing.

qpu_debug_tiled demonstrates the tiling pattern, and VPM writing. All QPUs can simultaneously use their part of the VPM to write values. This works fine, even without mutex synchronization.

qpu_blit_tiled is structurally exactly the same, but adds TMU loads and writes those instead of the debug pattern. However, just adding the TMU loading instructions breaks the program; commenting them out and writing debug values instead makes it work again. From what I gathered, timing is not the issue: replacing the TMU access with some nop operations does not trigger the behaviour. Executing the programs one right after another (sequentially), however, works fine, so the functionality itself is correct.

So I tried adding a mutex to synchronize the QPUs, at several different stages: around the whole program, around each line, and around each VPM access. The whole-program mutex works, but only at low framerates (e.g. 10); without the mutex, that would break, which indicates the mutex does work to a degree. However, when increasing the framerate, the QPUs quickly (after a few frames) start overwriting the whole memory without reason. Mutex synchronization around each line or even each VPM access only seems to worsen this behaviour.

So there are three parts to this problem that I do not understand:

  1. Why does accessing the TMU affect the VPM access? Or is it that, with predictable timing, the QPU programs previously just never interfered, by chance?
  2. Why does the mutex break at higher framerates? Accessing it at high frequencies seems to break it, yet I've seen others (e.g. gpu_fft) take the mutex around every VPM access in a multi-program environment.
  3. And finally, is the mutex required at all when each QPU exclusively uses a fixed, small part of the VPM? In the current qpu_blit_tiled, each QPU only uses 4 vectors (so 12x4 = 48 out of the 64 I reserved for user programs).

Any help is greatly appreciated. The referenced programs can easily be tested with the commands found in commands.txt.

Seneral commented 4 years ago

I made some changes to make debugging easier and did a lot of tests.

With frame size 384x288 and no split-column code, there are now exactly three programs scheduled to run each frame. I added the parameter -q 100010001000, where a 0 writes 0b0001 and a 1 writes 0b0000 to the respective QPU reserve register, effectively letting me choose which QPUs run the program - and, in turn, verify whether a specific QPU is stalled (which, it turns out, happens quite often). Note that I tested the following with 0b1111 and 0b1110 respectively as well, to ensure no other program interferes by using the mutex - no difference. Finally, I split the programs into several code blocks to more easily reference which code gives which results. Note that no mutex was used in these tests unless explicitly stated.

Test command: sudo ./QPUCV -c qpu_blit_tiled.bin -m tiled -d -w 384 -h 288 -f 40 -q 111111111111

Code 1

The reference debug pattern, straight from qpu_debug_tiled; works as expected

-q 100000000000 | 3 sequential | single QPU 1 color, as expected
-q 100010001000 | 3 parallel | 3 distinct QPU colors, as expected
-q 111000000000 | 3 parallel | 3 similar QPU colors, as expected

Code 2

Simple TMU testing code; the value is not used, the output is random VPM content

-q 100000000000 | 3 sequential | 3 times the same random VPM pattern, as expected
-q 100010001000 | 3 parallel | 3 distinct random VPM pattern, as expected (*)
-q 111000000000 | 3 parallel | 3 distinct QPU colors (*)

(*) It reliably stalls after several seconds to several tens of seconds. One time, a single QPU recovered nearly instantly, causing the code to continue running on that single QPU (showing the same random VPM pattern three times), and it ran reliably on that single QPU for several minutes. But usually, none of the three QPUs ever recover, even after minutes, and the QPU keeps stalling. However, it is NOT a hard stall: reenabling the QPUs unstalls all of them.

Code 3

Same as Code 2, but with a mutex around the TMU access - shouldn't be needed, but does change some things

-q 100000000000 | 3 sequential | 3 times the same random VPM pattern, as expected
-q 100010001000 | 3 parallel | 3 distinct random VPM pattern, as expected (**)
-q 111000000000 | 3 parallel | 3 distinct QPU colors (**)

(**) It reliably stalls after only a few frames, so even faster than without the mutex. Sometimes when one stalls, the code breaks in an interesting way: instead of straight blocks, it outputs angled stripes - this happens when the tgtStride is exactly 16 too low. I identified the add tgtPtr, tgtPtr, num16; before the .endr as the cause - the relevant instruction using tgtPtr is only two instructions ahead.

Slight variations to the surroundings of the line add tgtPtr, tgtPtr, num16; change the outcome in interesting ways. Uncommenting read vw_wait; or adding a nop; in its place puts space between the two instructions, and it no longer breaks that way while stalling. num16 by itself seems unaffected; replacing it with a ldi 16 does not change the behaviour, except in the way described below. The most interesting change here is probably adding one nop; before that line and one after: suddenly, it no longer stalls immediately, but the diagonal stripes still appear for a frame and flicker away multiple times, until it finally stalls. Sometimes it produces other random errors as well, like different spacing, rarely starting to overwrite the whole memory, or spamming VPMEWR (VPM Error Write Range). Sometimes one QPU does not stall and keeps the program running alone without problems (since there is no interference anymore).

While I don't know why the mutex causes these things, they do happen consistently, so it's safe to say that my adding the mutex around the TMU is NOT correct - in which way, and why it can apparently change the timing of the code ahead (and as a result make it do unintended things, like writing all kinds of wrong values), I don't know.

Code 4

Same as Code 2 but now the TMU value is read into r0 and then written to the VPM, without unpacking

-q 100000000000 | 3 sequential | Binary camera image, because no unpacking
-q 100010001000 | 3 parallel | Binary camera image, because no unpacking (***)
-q 111000000000 | 3 parallel | Binary camera image, because no unpacking (***)

(***) Just like Code 2, it reliably stalls after several seconds to several tens of seconds.

Code 5

Same as Code 4, but ONE nop; has been added AFTER the read from r4 into r0. Shouldn't make a difference, but does

-q 100000000000 | 3 sequential | Binary camera image, because no unpacking
-q 100010001000 | 3 parallel | Binary camera image, because no unpacking (****)
-q 111000000000 | 3 parallel | Binary camera image, because no unpacking (*****)

(****) It reliably stalls anywhere from before the first frame even finishes (only 100-200 instructions executed on the QPU) up to very few frames (a few 100,000s of instructions). After it stalls, the image might appear fine, sometimes artifacts occur, and sometimes it generates a ton of errors - maybe due to overwriting memory, but somehow different - which spam both the SSH console and the HDMI console output. (*****) Interestingly, this configuration runs just fine for a few seconds, just like Code 2. When it stalls, it does so without errors, but the output shows it stops cleanly on all three QPUs at the same time, mid-frame, as there is a single line across the screen. Other combinations that ran for a couple of frames are -q 101010000000 and -q 100100100000.

Code 6

Same as Code 4, but now the TMU value is unpacked into several ra registers and then written to the VPM. This results in different timing; it behaves exactly like Code 5, except it outputs proper grayscale images

-q 100000000000 | 3 sequential | Grayscale camera image, as expected
-q 100010001000 | 3 parallel | Grayscale camera image, as expected (****)
-q 111000000000 | 3 parallel | Grayscale camera image, as expected (*****)

TL;DR

So this is a lot of information, but here is what I read from it: the VPM is not the problem - the TMU access causes the stalls, until the QPU is reenabled. The timing of the code AROUND the TMU access is incredibly sensitive; small changes cause different stall and error behaviour. It does matter which QPUs (and thus TMUs) are used simultaneously, affecting stall behaviour. Adding mutexes around the TMU code introduces new errors on stall by seemingly messing up code timing, and generally makes it stall even faster, likely due to the timing changes also observed with Codes 5 and 6.

Next tests

Next tests include making the frame size even smaller, using a constant image, and not clearing the cache, so that the TMU always has a cache hit. I also want to test semaphores instead of the mutex, even though the current problems arise from TMU use, not mutex use. Finally, I'll have to test on another board with a fresh SD card to exclude other causes.

Seneral commented 3 years ago

So it turns out the main error seen here wasn't in the QPU program code at all. The camera emulation code proved that the program works just fine even when executing in parallel, so it had to be the camera handling (I executed the tiled blit program at 480p@250fps on emulated camera buffers). This led me to keep the camera frame buffer locked, which effectively eliminated the crashes, although I do not know why, since the buffer should not be used by the camera framework at all. Still, even now, the code stalls after a few seconds when using real camera frames.

To Do:

  1. Test if custom buffers work while the camera is running, to determine whether the error lies with the camera buffers being in use, or with the camera running at all (e.g. due to the ISP issuing QPU user programs itself, which would mess up the direct execution in qpu_program.c).
  2. Test if copying the camera buffer to a custom buffer before execution works, depending on the results above.
  3. Double-check that all buffers from the camera code are handled correctly and none are fed back to the camera pipeline prematurely.
  4. Disable some ISP steps which might issue QPU user programs themselves.
Seneral commented 3 years ago

So I tested the above ideas, and the results are not very helpful. The tiled rendering works absolutely fine on emulated buffers even while the camera is running with all ISP blocks, so it's not some QPU programs or the camera pipeline interfering by itself. I did a low-effort CPU copy from the camera buffer to the emulated buffer each frame and was able to run the tiled rendering on the camera frame at 480p30 (it would have stalled nearly immediately had I used the camera buffer directly, without the intermediary blit). So what's left is that something about the camera buffers themselves is different; perhaps they are still accessed from somewhere else, and that interferes with multiple QPU cores accessing the data, but not with 1. the CPU accessing it, nor 2. a single QPU core accessing it.

I'm thinking about experimenting with the frames a bit further, when I get time, namely:

  1. Return the camera buffer to the camera pipeline BEFORE I execute the QPU program, so that it is unlocked on the CPU side; then I can choose to lock it on the GPU side or not, and hopefully that makes a difference.
  2. Delay each camera buffer for a couple of frames to make sure any potential processes using them are finished.
  3. Replace the emulated camera buffers, which are currently normal allocated buffers that the QPU can access, with empty camera frame buffers that are currently unused by the camera pipeline.
  4. Following 3, copy the current camera frame into that other buffer, just like I did with the emulated buffers.
  5. Try to add some MMAL components after the camera output, e.g. a rescaler, before feeding it to the QPU.
Seneral commented 3 years ago

So far I have tried 1. and 2. - no difference.

Seneral commented 3 years ago

Just tested on a Raspberry Pi 3 B+, and it worked without a problem. For reference, I previously only worked with a Zero W and Zero 1.3s, since they are my target boards. I have no idea why this is the case, but it means the code isn't completely at fault, and it gives me a new lead, at the very least.

Seneral commented 3 years ago

Added qpu_mask_tiled in the meantime; it works fine on the 3 B+, but stalls immediately on the Zero. Added a test branch with minimal code: https://github.com/Seneral/VC4CV/tree/qpudebug

Example from minimal code (can only execute qpu_mask_tiled):

$ sudo ./QPUMin -c qpu_mask_tiled.bin -q 010000000000
SETUP: 10 instances processing 1/2 columns each, covering 80x60 tiles, plus 0 columns dropped
-- QPU Enabled --
QPUs 1-4: 15 | 14 | 15 | 15
QPUs 5-8: 15 | 15 | 15 | 15
QPUs 9-12: 15 | 15 | 15 | 15
-- Camera Stream started --
QPU stalled - waiting to execute 10 / 10!
     Cycles: 510414826 idle | 35394042 vert | 0 frag | 37 instructions
     Clock 250MHz - 300MHz | 6.5% total load | 7.8% load for used QPUs | 37.9°C SoC temp
     37 instructions | 35394042 load cycles | 0.0% stalls
     TMU: 0.0% realistic bandwidth usage | 4.8 average stall cycles | 40.0% cached
     VPM: 0.0% VDW stall | 0.0% VCD stall (average of program cycles)
     L2C: 100.00% miss (0 hits | 8 misses)
Encountered an error after 0 frames!
-- Camera Stream stopped --
-- QPU Disabled --
Seneral commented 3 years ago

Reduced the test case even further, to a program that works on the 3 B+ but stalls on the Zero (although a lot of information is lost compared to the above branch, so I've separated it into qpuminimal). It also includes a switch back to emulated buffers, so that it can be verified that the program works fine on emulated VCSM buffers but stalls on camera buffers, even with the camera running alongside it. Minimal main.cpp. Minimal qpu_tmu_read.asm that only reads a block of memory using the TMU.

Seneral commented 3 years ago

For completeness' sake: I currently circumvent this bug in my use case by first copying the camera buffer to a custom VCSM buffer before processing with the TMU. The copy is done using VPM DMA Write and Read, which has no problem accessing the camera frame buffers. Unfortunately, this nearly doubles the frametimes on the QPU. For the actual algorithm, IO (TMU+VDW) and computation were well balanced - the QPU was never starved of input (no stalls), but the IO capacity of the TMUs was pretty much exhausted. Redesigning it with VDR as input would reduce the total frametime by making the workaround unnecessary, but it would be slower than TMU+VDW alone, so I still have an interest in fixing this issue.

Seneral commented 2 years ago

So I have finally circumvented this bug within a couple of hours, using an idea I'd had for months but only now got around to implementing. I switched from MMAL to a V4L2 backend due to the OV9281 drivers, so I only implemented this for V4L2: I supply the VCSM buffers myself by allocating them with vcsm_malloc, getting a DMABUF FD using vcsm_export_dmabuf, and supplying that to the V4L2_MEMORY_DMABUF buffers during queueing. I will update the repo after I'm done with the main project.