Yona-Appletree / LEDscape

Beagle Bone Black cape and firmware for driving a large number of WS281x LED strips.

Very rare but possible glitching on PRU signal generation that can cause unexpected flashes #49

Open bigjosh opened 7 years ago

bigjosh commented 7 years ago

While the vast, vast majority of 0 bits coming out of the PRU are 300ns-370ns wide, I am seeing a very rare case where a 0 bit can be as wide as 540ns, which is wide enough to be seen as a 1 by some WS2812B chips.

When the problem happens, it seems to stretch all output bits being transmitted at that moment, although there is only material impact on 0 bits since 1 bits just become slightly longer 1 bits.

Outwardly, this appears as a row of pixels flashing for a single frame. It is especially noticeable when running strings in demo mode "black" when all bits should be 0. It is possible this is only visible on WS2812B chips with a shorter-than-spec T1H minimum time.

I verified the problem by attaching a scope to an output and setting to trigger on minimum pulse width of 450ns. Then I ran the "black" demo mode. In this mode, all bits should be 0 so I should never see a pulse wider than 450ns. Yet I was (rarely) able to capture pulses as wide as 540ns.
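The same check can be run offline on exported pulse-width data. This is only a sketch: the `capture` list is made-up sample data and `find_stretched_pulses` is a hypothetical helper, not part of LEDscape. The 450ns threshold is the scope trigger setting described above.

```python
# Threshold matching the scope trigger described above.
TRIGGER_NS = 450

def find_stretched_pulses(high_widths_ns, threshold_ns=TRIGGER_NS):
    """Return (index, width) for every high pulse wider than the threshold."""
    return [(i, w) for i, w in enumerate(high_widths_ns) if w > threshold_ns]

# Hypothetical capture from "black" mode: normal 0 bits plus one stretched bit.
capture = [300, 350, 370, 540, 310]
stretched = find_stretched_pulses(capture)
print(stretched)
```

In "black" mode any non-empty result means a 0 bit was stretched far enough that some chips could read it as a 1.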

The stretched bits seem to happen more frequently when the ARM is under heavy memory stress so I think this might be caused by a worst-case series of cache misses when the PRU accesses the data in ARM RAM.

The current approach of timing the bit phases uses the cycle counter. Is it possible that the cycle counter does not count cycles where the PRU is stalled because it is waiting for a cache miss when reading external RAM? The STALLCOUNT register description possibly indicates this...

STALLCOUNT This value is incremented by 1 for every cycle during which the PRU is enabled and the counter is enabled (both bits ENABLE and COUNTENABLE set in the PRU control register), and the PRU was unable to fetch a new instruction for any reason.

Possible solutions might include...

  1. Rearrange current code so that all of the accesses to external RAM occur between bits rather than during the T0H phase of the bits. This would still add jitter to the time between bits when cache misses occur, but as long as this time is less than RESET, then the only impact should be (very) slightly diminished performance rather than bad data.
  2. Rewrite PRU code to copy pixel data into PRU RAM first and then transmit the bits directly from local PRU RAM during the timing sensitive frame.
  3. Rewrite the PRU code to use the IEP_TIMER to time the signal phases rather than the cycle counter. The IEP_COUNTER seems to be able to run deterministically at 200MHz no matter what is happening with PRU accesses.
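To illustrate why option 1 should help, here is a toy timing model, not a measurement. It assumes a 5ns PRU cycle (200MHz), a nominal 350ns T0H, and a random stall of up to 40 cycles per external read. The stall range and the `simulate` function are invented for illustration, and the model assumes the timing loop does not see the stall.

```python
import random

T0H_NS = 350   # nominal high time of a 0 bit (inside the 300-370 ns range reported above)
CYCLE_NS = 5   # one PRU cycle at 200 MHz

def simulate(read_during_high, n_bits, rng, max_stall_cycles=40):
    """Return the high-pulse width of each 0 bit in a toy model.

    Every bit needs one external-RAM read that stalls for a random number
    of cycles. If the read happens during the high phase, the stall is
    added to the pulse width; otherwise it only delays the next bit.
    """
    widths = []
    for _ in range(n_bits):
        stall_ns = rng.randint(0, max_stall_cycles) * CYCLE_NS
        widths.append(T0H_NS + (stall_ns if read_during_high else 0))
    return widths

rng = random.Random(0)
current = simulate(True, 1000, rng)      # reads overlap T0H
rearranged = simulate(False, 1000, rng)  # reads moved between bits
print(max(current), max(rearranged))
```

In this model the rearranged scheme keeps every 0 bit at the nominal width, and the stall shows up only as extra low time between bits, which is harmless as long as it stays well below RESET.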

I can try to tackle any of these approaches, but just want a sanity check before doing the work. Has anyone else ever seen these wide bits (or the flashes they produce)?

bigjosh commented 7 years ago

Digging in further on this, I think the ultimate source of the jitter is the fact that the PRU code accesses the GPIO pins through the ARM address space rather than directly via r30. This can cause stalls when there is contention, and I think these stalls are the root problem. With this in mind, I think the best solution might be to rewrite the PRU stuff to go direct to the pins and get absolute deterministic timing. This would not be as much work as it might seem now that the PRU C compiler is getting mature, but I am hesitant to do it if I am really the only one who has ever been affected by this issue....

Yona-Appletree commented 7 years ago

Hey, thanks for looking into this, and I'm sorry I didn't get back to you sooner.

I am aware of the issue, and have done several things to mitigate it on various branches. The simplest one, that I think is on master, is simply to check how many cycles have passed since the zero write started, and if it's too long, we abort the entire frame. This has the effect of only showing one white pixel rather than corrupting the rest of the frame.
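The abort check can be sketched roughly like this. It is a Python model of the idea, not the actual PRU assembly on master: the 450ns limit, the `send_frame` name, and the per-bit cycle counts are all hypothetical.

```python
CYCLE_NS = 5       # one PRU cycle at 200 MHz
MAX_T0H_NS = 450   # hypothetical limit for the high phase of a 0 bit

def send_frame(bits, high_cycles):
    """Model a frame write that aborts when a 0 bit's high phase overruns.

    bits        -- the 0/1 values to send
    high_cycles -- measured cycles between the rising and falling edge of each bit
    Returns (bits_sent, aborted). The overrunning bit has already gone out,
    so it is counted as sent.
    """
    limit_cycles = MAX_T0H_NS // CYCLE_NS
    for i, (bit, cycles) in enumerate(zip(bits, high_cycles)):
        if bit == 0 and cycles > limit_cycles:
            return i + 1, True   # stop here; the rest of the frame is dropped
    return len(bits), False

# Third bit is stretched to 108 cycles (540 ns), so the frame stops after it.
result = send_frame([0, 0, 0, 0], [70, 70, 108, 70])
print(result)
```

Only the stretched bit reaches the strip, which matches the "one white pixel rather than a corrupted frame" behaviour described above.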

Secondly, you can rewrite the PRU code to not go back to DRAM for every bit, but rather load data in entire RGB chunks into the registers and write that. I have an experimental branch, spi-cape-support, that supports this along with several other improvements. The main change is that a custom PRU program is built at runtime for the specific number of LEDs and driver type.

The problem with using r30 is that you're quite limited in the number of channels you can output. Something like 12. That's far too few for my use cases, but I can certainly see the value for some projects. Combining that with loading all data into the registers should be pretty foolproof, though you still might have to drop a partial frame if it takes too long to get the next pixel of data. Using PRU RAM might fix this, though I had some issues with it when I tried.

I'd be happy to talk with you via phone about your investigation and my ideas and work. Feel free to email me at lightatplay@gmail.com if you're interested. I have some time tonight.

bigjosh commented 7 years ago

Your traces look very similar to mine (except you have a nicer scope :) ).

The fact that they are quantized to 10ns steps definitely suggests some wait states getting thrown in during T0H.
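As a rough sanity check on that reading: at a 200MHz PRU clock one cycle is 5ns, so a 10ns step would be two cycles. The sketch below assumes that clock rate and treats "one wait state = two cycles" purely as a working hypothesis; `wait_states` is a made-up helper.

```python
PRU_CLOCK_HZ = 200_000_000                  # assumed PRU core clock
CYCLE_NS = 1_000_000_000 // PRU_CLOCK_HZ    # 5 ns per cycle

def wait_states(extra_ns, cycles_per_wait_state=2):
    """Split an observed stretch into whole wait states and a remainder in ns."""
    step_ns = cycles_per_wait_state * CYCLE_NS
    return divmod(extra_ns, step_ns)

on_grid = wait_states(80)    # a stretch that sits on the 10 ns grid
off_grid = wait_states(75)   # a stretch that does not
print(on_grid, off_grid)
```

A non-zero remainder on real captures would argue against the fixed-size wait-state explanation.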

I am like 75% sure that this is happening due to contention on the L3/L4 interconnects when the PRU is accessing the memory for the GPIO pins through the ARM address space. If so, then I see a few solutions:

1) Make a new PRU driver that talks directly to the pins through R30. This is straightforward and guaranteed to work, but has the downside that we would be limited to only 24 pins (and therefore 24 strings) because that is the maximum number of PRU pins that are available on the BBB headers. For many applications (including mine) the limited number of pins would be ok. With this solution it would also be pretty easy to do some cool stuff like moving the temporal dithering into the PRU code which could save some load on the main Linux processor and possibly improve frame rates slightly.

2) Dig in deep on the OMAP L3/L4 interconnects and try to figure out a way to make our PRU to GPIO accesses more deterministic. This would let us continue to use all available GPIO pins, but has the downside that I don’t know anything about this stuff so would have to try and learn it. There appear to be registers that control this stuff, but after much googling I cannot find any good documentation on how it all works.

3) Switch to a DMA-based signal generation scheme instead of the PRU. This again would let us continue to use all available GPIO pins and have low load on the main CPU, but has the downside that (a) the DMA channels might end up suffering from the same non-determinism as the current scheme, and (b) it kind of changes what the whole LEDscape project is about.

I’ll probably attack the solutions in the order listed above when I have time in the next month or so, but could be convinced to start with #2 if anyone has any pointers to better info on the interconnects that could give me some encouragement. There must be some people somewhere who understand this stuff, but I don’t know how to find them.

-josh

From: Mark Renouf, sent Friday, December 23, 2016 8:02 PM:

To add to what bigjosh reported, here's a clear picture of the issue. It's not uncommon at all, it's very easy to see the glitch by running 'black' demo with a length of 1, so each pin outputs a single 24bit sequence, using a pulse trigger to sync on the frame start.

This is a capture of the trailing edge of the first pulse, with persistence set to infinity, which clearly shows a jitter of between 10 and 80ns. It seems fairly random, but across a long string the additive error can become quite large.
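For a sense of scale, here is a quick upper-bound calculation. It assumes 24 bits per pixel and, pessimistically, that every bit is stretched by the same amount. The 300-pixel string length is a hypothetical example, not a figure from the capture.

```python
BITS_PER_PIXEL = 24

def accumulated_jitter_ns(pixels, jitter_per_bit_ns):
    """Total added time over one strip if every bit is stretched equally."""
    return pixels * BITS_PER_PIXEL * jitter_per_bit_ns

pixels = 300  # hypothetical string length
best = accumulated_jitter_ns(pixels, 10)    # every bit at the low end of the jitter range
worst = accumulated_jitter_ns(pixels, 80)   # every bit at the high end
print(best / 1000, "us to", worst / 1000, "us")
```

Under those assumptions the end of the frame could arrive anywhere from 72 to 576 microseconds late.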

It sounds like you have a cause and some ideas of how to fix it... far beyond my capabilities right now but I wanted to chime in and let you know it's very common -- though I'm not sure I saw actual glitching when testing with a 5m string of WS2812B's (maybe mine are better spec'd?).

https://cloud.githubusercontent.com/assets/52987/21464361/e5e75f1a-c948-11e6-8bb2-dbf153ace887.png

Zoom on trailing edge of previous, with timing cursors: https://cloud.githubusercontent.com/assets/52987/21464362/ecaed7b0-c948-11e6-9083-c63417438b88.png


bigjosh commented 7 years ago

I am pretty sure this is the root of the problem...

" The PRU read instruction executes in ~2 cycles, plus additional latencies due to traversing through interconnect layers and variable processing loads. "

http://processors.wiki.ti.com/index.php/AM335x_PRU_Read_Latencies

I think it will take some deep digging in the bowels of the on-chip LAN to reduce these jitters.

Yona-Appletree commented 7 years ago

It's worth noting, however, that it's the writes that directly affect the GPIO jitter:

The PRU write instruction is a fire-and-forget command that executes in ~1 cycle.

The problem with this is that we can't even tell how long it took for the write to get to the GPIO register, so we can't account for the jitter.

With the version of the code on master, we are reading all the data for every bit, which could cause issues due to long reads (and there are checks to abort the strip write in this case).

I suspect the only real solution is to use r30. I have a prototype of an r30-based ws281x driver working on the spi-cape-support branch. Going forward, I'd like to merge all that into master and call out in the docs that there is the 48-port-capable-but-slightly-janky version and the 22-port-but-stable version. More testing is required before we're at that point, though.

bigjosh commented 7 years ago

Ah, yes. Humbly corrected: the jitter on the write side is totally different (and invisible!), but I still think it ultimately depends on the interconnect fabric priorities.

R30 is a great solution for me since I never need that many pins. Any feel on how important more pins are in general?

Yona-Appletree commented 7 years ago

Good question! I honestly don't have a good idea of who is using LEDscape right now (other than you!), and what their needs are :) That would be nice info to have, though.

Serisium commented 7 years ago

Personally, I use LEDScape to drive 23 separate strips from the BBB, have disabled HDMI, and don’t use any other GPIO lines.

Yona-Appletree commented 7 years ago

Well, if you could bring that down to 22, you could use all r30 pins. Technically you can also disable the eMMC, but that's a little harder to deal with.

orangemelon69 commented 6 years ago

Hi Yona!

Big fan of your work!

I went from one LED fun several months ago to custom produced rigid board matrixes for displaying lots of dynamic data on industrial machines. I went through all the initial things of playing with arduino simple stuff not knowing how to even solder and now playing with oscilloscopes:) Big Josh and the great work of his brought me to your fork here actually:)

This flicker issue is the last piece in my puzzle. And that finally brought me here to this thread.

My LED type is SK6812, which has slightly different timing: the 1’s in question are shorter, so I believe this flicker problem is a lot more pronounced in my case.

I don’t use any cape or level shifter, I brought the voltage down to 4.3 (saving power and reducing brightness which is a plus in my case) and the signals register just fine altogether.

I will take a look at the prototype driver in the branch you mentioned tonight. Would love to contribute further somehow in case you wanted to merge that into master. Or if there are any news regarding this, I would be glad if you let us know.

Would shortening the time help here (according to the SK6812 specs and their timing thresholds)? I tried to look at the templates but the machine code is too low level and therefore at this stage below my comprehension:)

Cheers

Yona-Appletree commented 6 years ago

Thanks for the kind words, and I’m glad things are (mostly) working for you.

How many strips are you driving? If you can get by with 22 outputs, you can easily use my rewrite to use the direct PRU GPIO access. This helps substantially with the flicker.

At this point, the easiest thing to do is just give you one of my pre-built linux images that has everything set up correctly.

~ Yona


orangemelon69 commented 6 years ago

Talk about lightning fast replies:)

I am driving 7 strips (up to approximately 600px each but normally around 300-500)

7 outs is the max I will need for this use.

If you’d be so kind that would be just great!

Cheers

Yona-Appletree commented 6 years ago

Sorry this one isn’t so fast! Send me an email at lightatplay@gmail.com so we can arrange that.
