Seneral / VC4CV

VideoCore IV Computer Vision framework and examples for the RaspberryPi Zero - GL- and QPU-based
MIT License

TMU can't reliably read from MMAL camera frames on the RPi Zero #1

Open Seneral opened 4 years ago

Seneral commented 4 years ago

Tiled rendering consists of multiple programs, each accessing the TMU for reading and each using its own dedicated space in the VPM for writing.

qpu_debug_tiled demonstrates the tiling pattern, and VPM writing. All QPUs can simultaneously use their part of the VPM to write values. This works fine, even without mutex synchronization.

qpu_blit_tiled is structurally exactly the same, but adds TMU loads and writes those instead of the debug pattern. However, just adding the TMU loading instructions breaks the program; commenting them out and writing debug values instead makes it work again. From what I gathered, timing is not the issue: replacing the TMU access with some nop operations does not trigger the behaviour. Executing the programs one right after another (sequentially), however, works fine, so the functionality itself is correct.

So I tried adding a mutex to synchronize the QPUs, at several different stages: around the whole program, around each line, and around each VPM access. The whole-program mutex works, but only at low framerates (e.g. 10); without the mutex, that would break, which indicates the mutex does work to a degree. However, when increasing the framerate, the QPUs quickly (after a few frames) start overwriting the whole memory without reason. Mutex synchronization around each line or even each VPM access only seems to worsen this behaviour.

So there are three parts to this problem that I do not understand:

  1. Why does accessing the TMU affect the VPM access? Or is it that, with predictable timing, the QPU programs previously just never interfered, by chance?
  2. Why does the mutex break at higher framerates? Accessing it at high frequencies seems to break it, yet I've seen others (e.g. gpu_fft) take the mutex around every VPM access in a multi-program environment.
  3. And finally, is the mutex required at all when each QPU exclusively uses a fixed, small part of the VPM? In the current qpu_blit_tiled, each QPU only uses 4 vectors (so 12x4 = 48 out of the 64 I reserved for user programs).

Any help is greatly appreciated. The referenced programs can easily be tested with the commands found in commands.txt.

Seneral commented 4 years ago

I made some changes to make debugging easier and did a lot of tests.

With frame size 384x288 and no split-column code, there are now exactly three programs scheduled to run each frame. I added the parameter -q 100010001000, where a 0 writes 0b0001 and a 1 writes 0b0000 to the respective QPU reserve register, effectively letting me choose which QPUs run the program - and, in turn, verify whether a specific QPU is stalled (which, it turns out, happens quite often). Note that I tested the following with 0b1111 and 0b1110 respectively as well, to ensure no other program interferes by using the mutex - no difference. Finally, I split the programs into several code blocks to more easily reference which code gives which results. Note that no mutex was used in these tests unless explicitly stated.

Test command: sudo ./QPUCV -c qpu_blit_tiled.bin -m tiled -d -w 384 -h 288 -f 40 -q 111111111111

Code 1

The reference debug pattern, straight from qpu_debug_tiled; works as expected

-q 100000000000 | 3 sequential | single QPU 1 color, as expected
-q 100010001000 | 3 parallel | 3 distinct QPU colors, as expected
-q 111000000000 | 3 parallel | 3 similar QPU colors, as expected

Code 2

Simple TMU testing code; the value is not used, the output is random VPM content

-q 100000000000 | 3 sequential | 3 times the same random VPM pattern, as expected
-q 100010001000 | 3 parallel | 3 distinct random VPM pattern, as expected (*)
-q 111000000000 | 3 parallel | 3 distinct QPU colors (*)

(*) It reliably stalls after several seconds to several tens of seconds. One time, a single QPU recovered nearly instantly, causing the code to continue running on that single QPU (showing the same random VPM pattern three times), and it ran reliably on that single QPU for several minutes. But usually, none of the three QPUs ever recover, even after minutes, and the QPU keeps stalling. However, it is NOT a hard stall: reenabling the QPUs unstalls all of them.

Code 3

Same as Code 2, but with a mutex around the TMU access - shouldn't be needed, but does change some things

-q 100000000000 | 3 sequential | 3 times the same random VPM pattern, as expected
-q 100010001000 | 3 parallel | 3 distinct random VPM pattern, as expected (**)
-q 111000000000 | 3 parallel | 3 distinct QPU colors (**)

(**) It reliably stalls after only a few frames, so even faster than without the mutex. Sometimes when one stalls, the code breaks in an interesting way: instead of straight blocks, it outputs angled stripes - this happens when the tgtStride is exactly 16 too low. I identified the add tgtPtr, tgtPtr, num16; before the .endr as the cause - the relevant instruction using tgtPtr is only two instructions ahead.

Slight variations to the surroundings of the line add tgtPtr, tgtPtr, num16; change the outcome in interesting ways. Uncommenting read vw_wait; or adding a nop; in its place puts space between the two instructions, and it no longer breaks that way while stalling. num16 by itself seems unaffected; replacing it with a ldi 16 does not change the behaviour, except in the way described below. The most interesting change here is probably adding one nop; before that line and one after: suddenly, it no longer stalls immediately, but the diagonal stripes still appear for a frame and flicker away multiple times, until it finally stalls. Sometimes it produces other random errors as well, like different spacing, rarely starting to overwrite the whole memory, or spamming VPMEWR (VPM Error Write Range). Sometimes one QPU does not stall and keeps the program running alone without problems (since there is no interference anymore).

While I don't know why the mutex causes these things, they do happen consistently, so it's safe to say that my adding the mutex around the TMU is NOT correct - in which way, and why it can apparently change the timing of the code ahead (and as a result make it do unintended things, like writing all kinds of wrong values), I don't know.

Code 4

Same as Code 2 but now the TMU value is read into r0 and then written to the VPM, without unpacking

-q 100000000000 | 3 sequential | Binary camera image, because no unpacking
-q 100010001000 | 3 parallel | Binary camera image, because no unpacking (***)
-q 111000000000 | 3 parallel | Binary camera image, because no unpacking (***)

(***) Just like Code 2, it reliably stalls after several seconds to several tens of seconds.

Code 5

Same as Code 4, but ONE nop; has been added AFTER the read from r4 into r0. Shouldn't make a difference, but does

-q 100000000000 | 3 sequential | Binary camera image, because no unpacking
-q 100010001000 | 3 parallel | Binary camera image, because no unpacking (****)
-q 111000000000 | 3 parallel | Binary camera image, because no unpacking (*****)

(****) It reliably stalls anywhere from before the first frame even finishes (only 100-200 instructions executed on the QPU) up to very few frames (a few 100,000s of instructions). After it stalls, the image might appear fine, sometimes artifacts occur, and sometimes it generates a ton of errors - maybe due to overwriting memory, but somehow different - which spam both the SSH console and the HDMI console output. (*****) Interestingly, this configuration runs just fine for a few seconds, just like Code 2. When it stalls, it does so without errors, but the output shows it stops cleanly on all three QPUs at the same time, mid-frame, as there is a single line across the screen. Other combinations that ran for a couple of frames are -q 101010000000 and -q 100100100000.

Code 6

Same as Code 4, but now the TMU value is unpacked into several ra registers and then written to the VPM. This results in different timing; it behaves exactly like Code 5, except it outputs proper grayscale images

-q 100000000000 | 3 sequential | Grayscale camera image, as expected
-q 100010001000 | 3 parallel | Grayscale camera image, as expected (****)
-q 111000000000 | 3 parallel | Grayscale camera image, as expected (*****)

TL;DR

So this is a lot of information, but here is what I read from it: the VPM is not the problem - the TMU access causes the stalls, until the QPU is reenabled. The timing of the code AROUND the TMU access is incredibly sensitive; small changes cause different stall and error behaviour. It does matter which QPUs (and thus TMUs) are used simultaneously, affecting stall behaviour. Adding mutexes around the TMU code introduces new errors on stall by seemingly messing up code timing, and generally makes it stall even faster, likely due to the timing changes also observed with Codes 5 and 6.

Next tests

Next tests include making the frame size even smaller, using a constant image, and not clearing the cache, so that the TMU always has a cache hit. I also want to test semaphores instead of the mutex, even though the current problems arise from TMU use, not mutex use. Finally, I'll have to test on another board with a fresh SD card to exclude other causes.

Seneral commented 3 years ago

So it turns out the main error seen here wasn't in the QPU program code at all. The camera emulation code proved that the program works just fine even when executing in parallel, so it had to be the camera handling (I executed the tiled blit program at 480p@250fps on emulated camera buffers). This led me to keep the camera frame buffer locked, which effectively eliminated the crashes, although I do not know why, since the buffer should not be used by the camera framework at all. Still, even now, the code stalls after a few seconds when using real camera frames.

To Do:

  1. Test if custom buffers work while the camera is running, to determine whether the error lies with the camera buffers being in use, or with the camera running at all (e.g. due to the ISP issuing QPU user programs itself, which would mess up the direct execution in qpu_program.c).
  2. Test if copying the camera buffer to a custom buffer before execution works, depending on the results above.
  3. Double-check that all buffers from the camera code are handled correctly and none are fed back to the camera pipeline prematurely.
  4. Disable some ISP steps which might issue QPU user programs themselves.
Seneral commented 3 years ago

So I tested the above ideas, and the results are not very helpful. The tiled rendering works absolutely fine on emulated buffers even while the camera is running with all ISP blocks, so it's not some QPU programs or the camera pipeline interfering by itself. I did a low-effort CPU copy from the camera buffer to the emulated buffer each frame and was able to run the tiled rendering on the camera frame at 480p30 (it would have stalled nearly immediately had I used the camera buffer directly, without the intermediary blit). So what's left is that something about the camera buffers themselves is different; perhaps they are still accessed from somewhere else, and that interferes with multiple QPU cores accessing the data, but not with 1. the CPU accessing it, nor 2. a single QPU core accessing it.

I'm thinking about experimenting with the frames a bit further, when I get time, namely:

  1. Return the camera buffer to the camera pipeline BEFORE I execute the QPU program, so that it is unlocked on the CPU side; then I can choose to lock it on the GPU side or not, and hopefully that makes a difference.
  2. Delay each camera buffer for a couple of frames to make sure any potential processes using them are finished.
  3. Replace the emulated camera buffers, which are currently normal allocated buffers that the QPU can access, with empty camera frame buffers that are currently unused by the camera pipeline.
  4. Following 3, copy the current camera frame into that other buffer, just like I did with the emulated buffers.
  5. Try to add some MMAL components after the camera output, e.g. a rescaler, before feeding it to the QPU.
Seneral commented 3 years ago

So far I have tried 1. and 2. - no difference.

Seneral commented 3 years ago

Just tested on a Raspberry Pi 3 B+, and it worked without a problem. For reference, I previously only worked with a Zero W and Zero 1.3s, since they are my target boards. I have no idea why this is the case, but it means the code isn't completely at fault, and it gives me a new lead, at the very least.

Seneral commented 3 years ago

Added qpu_mask_tiled in the meantime; it works fine on the 3 B+, but stalls immediately on the Zero. Added a test branch with minimal code: https://github.com/Seneral/VC4CV/tree/qpudebug

Example from minimal code (can only execute qpu_mask_tiled):

$ sudo ./QPUMin -c qpu_mask_tiled.bin -q 010000000000
SETUP: 10 instances processing 1/2 columns each, covering 80x60 tiles, plus 0 columns dropped
-- QPU Enabled --
QPUs 1-4: 15 | 14 | 15 | 15
QPUs 5-8: 15 | 15 | 15 | 15
QPUs 9-12: 15 | 15 | 15 | 15
-- Camera Stream started --
QPU stalled - waiting to execute 10 / 10!
     Cycles: 510414826 idle | 35394042 vert | 0 frag | 37 instructions
     Clock 250MHz - 300MHz | 6.5% total load | 7.8% load for used QPUs | 37.9°C SoC temp
     37 instructions | 35394042 load cycles | 0.0% stalls
     TMU: 0.0% realistic bandwidth usage | 4.8 average stall cycles | 40.0% cached
     VPM: 0.0% VDW stall | 0.0% VCD stall (average of program cycles)
     L2C: 100.00% miss (0 hits | 8 misses)
Encountered an error after 0 frames!
-- Camera Stream stopped --
-- QPU Disabled --
Seneral commented 3 years ago

Reduced the test case even further, to a program that works on the 3 B+ but stalls on the Zero (although a lot of information is lost compared to the above branch, so I've separated it into qpuminimal). It also includes a switch back to emulated buffers, so that it can be verified that the program works fine on emulated VCSM buffers but stalls on camera buffers, even with the camera running alongside it. Minimal main.cpp. Minimal qpu_tmu_read.asm that only reads a block of memory using the TMU.

Seneral commented 3 years ago

For completeness' sake: I currently circumvent this bug in my use case by first copying the camera buffer to a custom VCSM buffer before processing with the TMU. The copy is done using VPM DMA Write and Read, which has no problem accessing the camera frame buffers. Unfortunately, this nearly doubles the frametimes on the QPU. For the actual algorithm, IO (TMU+VDW) and computation were well balanced - the QPU was never starved of input (no stalls), but the IO capacity of the TMUs was pretty much exhausted. Redesigning it with VDR as input would reduce the total frametime by making the workaround unnecessary, but it would be slower than TMU+VDW alone, so I still have an interest in fixing this issue.

Seneral commented 2 years ago

So I have finally circumvented this bug within a couple of hours, using an idea I'd had for months but only now got around to implementing. I switched from MMAL to a V4L2 backend due to the OV9281 drivers, so I only implemented this for V4L2: I supply the VCSM buffers myself by allocating them with vcsm_malloc, getting a DMABUF FD using vcsm_export_dmabuf, and supplying that to the V4L2_MEMORY_DMABUF buffers during queueing. I will update the repo after I'm done with the main project.