adafruit / Adafruit_DotStar_Pi

DotStar module for Python on Raspberry Pi
GNU General Public License v3.0

Incorrect bitbang code in dotstar.c #17

Closed RTimothyEdwards closed 6 years ago

RTimothyEdwards commented 6 years ago

(Edit: Note that the following analysis is basically all wrong; just to let anybody reading this understand that the issue was a jump-start to an improved driver, but nobody should actually attempt to apply or use the code below, or learn anything about the workings of the Broadcom ARM chip GPIO therefrom. The driver has since been updated, and solves the problem.)

I have identified a problem with dotstar.c that prevents the bit-bang mode from ever working properly. The error is that gpioSet and gpioClr are pointers to 32-bit unsigned integers, but the pointers are set to the gpio memory map at gpio[7] and gpio[10], respectively. This produces a 1-byte overlap between the set and clear spaces.

In practice, this is (was) showing up for me as "glitches" in the light strip. On occasion, some portion of the light strip would spontaneously light up the red pixels dimly, regardless of what was written. The random nature of the error suggested either a noise or timing issue. I tried adjusting the bit-bang clock pulse timing but could not make it go away. I tried filtering the data signal lines, but that didn't cause the behavior to change at all. I finally concluded that the error was in the driver code.

After finding the set and clear bit overlap, I decided that what was happening was that the set and clear bits were overwriting each other and/or another byte of the gpio where they were not supposed to.

To fix the problem, I replaced each instance of

    *gpioSet = mask;
    *gpioClr = mask;

with

    *gpioSet &= 0xff000000; *gpioSet |= mask;
    *gpioClr &= 0xff000000; *gpioClr |= mask;

I did this in clockPulse(), rawWrite(), and show(). The glitching has not occurred since.

By the way, the ultimate problem at the hardware level is caused by the data and clock rising edges happening at the same time. The data are best changed on the falling edge of the clock to achieve the highest possible data rate in bit bang mode. When this is done, there is no need to "throttle" the bit bang clock rate; it can keep up with whatever rate the clock can produce. This reduces the routine clockPulse() to the following:

static void clockPulse(uint32_t mask) {
    *gpioSet &= 0xff000000;
    *gpioSet |= mask;
    *gpioClr &= 0xff000000;
    *gpioClr |= mask;
}

With these corrections, the bit-bang mode can go full-bore. Some code optimizations could make it go even faster, but it's sort of pointless as it is already updating faster than the eye can follow.

I have noticed that all examples of Adafruit_DotStar_Pi use the SPI gpio lines, which is probably because everybody sees the bit-bang mode glitching and avoids it. That's unfortunate, because bit-bang mode can be very fast on the Pi-3 and allows for many light strips to be run simultaneously in parallel (I am working on a project with eight strips in parallel). The set and clear masks can accommodate as many parallel LED strips as there are available GPIO signals to drive, although that requires further extensions to the dotstar.c driver (which is a topic for another post).

---Tim

PaintYourDragon commented 6 years ago

Although some of your assumptions are a bit off, this gave me an "ah-ha" idea for fixing the clockPulse() function with its goofy and not-always-accurate ARM cycle counts. Try out the latest code; you can now optionally pass a bitrate argument when using bitbang mode (the default is 8 MHz; the reliable maximum varies, but a Pi 3 can do 16 MHz OK).

> the pointers are set to the gpio memory map at gpio[7] and gpio[10], respectively. This produces a 1-byte overlap

gpio[] is of type unsigned, which is the system's native integer type (32 bits in the Pi's case). Since it's subscripting an int array, gpio[7] and gpio[10] are 12 bytes apart (8 bytes between them), so there is no overlap.

> gpioSet &= 0xff000000; gpioSet |= mask;

The GPIO set and clear registers are interesting things: they allow 'atomic' bit set and clear operations without a separate read-modify-write occurring. As such, values read from them may be gibberish, and any modify operation (e.g. AND or OR) is needlessly doing a read-modify-write on something that is inherently not read-modify-writable; the read returns gibberish, the modify modifies gibberish (though it WILL set the requested bits in the interim product), and it's really only the write that does anything (setting the requested bits; the gibberish bits will most likely just be masked out since the associated pins aren't configured as GPIO outputs).

What you're seeing, and the reason the code likely works, is that reads and writes to the GPIO registers are wait-stated to a much slower clock. All those sets & clears are essentially NOPs, adding up to a more favorable delay than the idle loop that was in clockPulse() (which was empirical, and kind of crappy, and will be thrown off by different overclock settings). The masking operations aren't really doing anything constructive other than NOPing a bit slower and more reliably than the old code was NOPing.

This was the "ah-ha" and the basis of the fix: use the inherently slower GPIO writes themselves as the NOP throttle, rather than trying to find just the right ARM cycle delays, which as we've seen is inherently flawed. (Also: I think DotStars read the data line on the rising clock, and in the old code the rising clock was a separate GPIO write and thus slightly delayed. I'm not 100% certain the read is on the rise, nor certain it's on the fall either, but I've aligned the timing so the rising/falling clock edges are "centered" on the data bit while keeping a 50% duty cycle, so thanks for that idea too.)

Getting the bitrate right(ish) was problematic, as the GPIO source clock (not to be confused with GPCLK, a different animal) seems to be tied to different system clocks with different divisors depending on the model of Raspberry Pi (e.g. arm_freq / 10 on Pi 3, core_freq / 3 on Pi 2, arm_freq / 15 on Pi 1, or so it seemed).

Rather than try to detect all these settings (which could change in future OS updates anyway), what I do on the first bitbang initialization is benchmark the GPIO write speed for a known interval (1/4 second) and use this to determine the loop timing in clockPulse(). The benchmark writes 0 to the gpioSet register, which does nothing but kill time, but precisely one GPIO-write's worth of time. This also requires forcing the 'turbo' setting during bitbang transfers, so the timing matches in both the initial benchmark and the later running cases. It also makes me wonder if the reason you were seeing glitches might be that my system was configured with the scaling governor set to 'ondemand' while yours happens to be set to 'performance'? If so, it makes sense in hindsight that the timing loops in the old clockPulse() would be all wrong.

RTimothyEdwards commented 6 years ago

My Pi 3 tells me it's using 'ondemand', so there goes that theory.

RTimothyEdwards commented 6 years ago

Your new code works very well compared to the original!

There is one error that I had to correct; I am calling the strip initialization eight times, once for each strip I have connected to GPIO. Only the first one works. The reason is that the settings for self->t0, self->t1, and self->t2 are inside the conditional block that only runs once to initialize GPIO. If "count" is changed to a global variable and the assignments for t0, t1, and t2 are moved outside of the conditional block, then everything works perfectly (as long as I don't overdrive it; it cannot keep up with the default SPI hardware speed, which is not especially surprising).

PaintYourDragon commented 6 years ago

Thanks for spotting that! Fixed now. Additionally, I changed clockPulse() to use do-while loops so it always writes at least one low-high-low per bit (the while loops there previously would miss the 'high' when using very fast speeds on very slow Pis). It might run at less than the requested speed in that situation, but the clock is at least valid, which is more important.

Think it's all OK now, gonna mark this closed. Thank you for helping track this down!

RTimothyEdwards commented 6 years ago

All the thanks goes to you. I was thinking about extending the driver to handle up to at least eight light strips in parallel. However, I just threw together a simple Python script that allocates eight individual structures for the LED strips and drives them in sequence, and I don't think I would be able to notice the difference. Which means I probably could have gotten by with putting all eight strips in a single data chain and driving it from the SPI port and saved myself a lot of trouble. But, hey, it's a proof of concept, it ultimately led to better driver code, and sometime when I feel like writing the parallel driver, I should be able to speed it up by a factor of eight (less overhead, which is probably considerable).

[Image: light column project]