Added integer based `sin()/cos()` functions, changed all trig functions to wled_math

DedeHai commented 1 month ago

I wrote a new integer math based sine/cosine approximation function implementing Bhaskara I's approximation formula. It is blazing fast and quite accurate (almost 3 significant digits when converting to float).

new sin16_t() / cos16_t() are faster and more accurate than fastled versions
sin_approx() / cos_approx() are float wrappers for sin16_t() / cos16_t() and are accurate enough to replace sinf()/cosf() for all current WLED purposes
atan2() is used only in octopus to calculate center offset, new approximated version saves flash
tan(), atan(), asin(), acos(), floor(), fmod() are used only for sunrise/sunset calculation, using existing wled_math version saves flash
beatsinx() replacements are to make use of new sin16_t()/sin8_t() functions to reduce flash size
Extensively tested surnise/sunset calculation: deviation is 1min. max
Tested some of the relevant FX and found no visual difference: Julia, 2D Drift, Drift Rose, Ghost rider, Rotozoomer, Palette, Arc 1D expansion
total flash savings: ~7.4k

softhack007 commented 1 month ago

total flash savings: ~7.4k

Is this with or without audioreactive usermod? Because ArduinoFFT pulls in some trig functions (sin,cos,atan, sqrt) -so possibly this finally increases flash size because two variants of the function need to be linked into the final binary.

Edit: I have a fork of ArduinoFFT where "double" trigonometry functions are replaced with float functions (sinf, cosf, atanf). I thinks it's not possible to replace them with approximations ... FFT could become very noisy if we do, because the fourier transform of the window function must be as exact as possible. Any inaccuracy will add "ghost frequencies".

softhack007 commented 1 month ago

sin_approx() / cos_approx() are float wrappers for sin16_t() / cos16_t() and are accurate enough to replace sinf()/cosf() for all current WLED purposes

One question: is the approximated function also faster than sinf() and cosf()? In earlier tests I was suprised to find that the libm functions are really fast (faster than fastled sin16/cos16), at least on esp32 and ESP32-S3 with FPU.

softhack007 commented 1 month ago

I'll give it a try with pinwheel mapping in the next days.

Pinwheel is very sensitive to inaccuracies in sinf() and cosf(), and possibly leaves holes as a consequence.

DedeHai commented 1 month ago

Is this with or without audioreactive usermod?

Goot point. I checked and it is the same: ArduinoFFT uses cos(x) which is the double float implementation and sqrt() was not touched in this PR. btw: did you ever check if the Espressif DSP library functions can be used for FFT and if there is any speed/memory/precision advantage? According to the doc it has speed optimized FFT for ESP32 and ESP32 S3.

DedeHai commented 1 month ago

One question: is the approximated function also faster than sinf() and cosf()? In earlier tests I was suprised to find that the libm functions are really fast (faster than fastled sin16/cos16), at least on esp32 and ESP32-S3 with FPU.

Another good point. I did not actually measure that. But now I checked with rough test. Here are my findings: numbers are in function calls per second executed:

MCU	`sinf()`	`sin_approx()`
ESP32-C3	9k	75k
ESP32-S3	130k	184k

I am not 100% sure about the method I used so no warranty about correctness of these numbers. Using the same method for sin16() functions gave like a 50x speed increase, which seems impossible without the compiler playing some tricks.

So on an S3 the difference is (as expected) not as large thanks to the FPU but still an improvement. On S2, C3 (and ESP8266) the speed-gain is huge.

DedeHai commented 1 month ago

I'll give it a try with pinwheel mapping in the next days.

Pinwheel is very sensitive to inaccuracies in sinf() and cosf(), and possibly leaves holes as a consequence.

I am looking at pinwheel RN. The issue is not the precision of sin() cos() but the number of steps: I just tried with 128x16 and even with floats there are missing pixels. If I increase steps for XL, they are gone, even when using the integer version of sin() cos() (more testing required though)

edit: the missing pixels are also related to the Fixed_Scale value. any reason it is 512? it could be as large as 31bits / max where max is maxwidth or maxheight (which is a uint16). So it could be 2^15 or if we take into account that any matrix dimension beyond 4096 is nonsensical for WLED, it could be even larger but not future proof.

softhack007 commented 1 month ago

@dedehai pinwheel has a general problem with holes - we tested it until 56x56, so below 56x56 there should be no holes (using sinf()/cosf(), not the _t approximations from AC).

softhack007 commented 1 month ago

any reason it is 512? it could be [larger]

512 was just a "shot in the dark". You're right we could increase the value to 2^15 or higher.

DedeHai commented 1 month ago

@DedeHai pinwheel has a generic problem with holes - we tested it until 56x56, so below 56x56 there should be no holes (using sinf()/cosf(), not the _t approximations from AC).

I saw that and I am trying to work out the math why that is. Here is what I have so far:

(maxdim/2)*sin(2pi/steps) <= 1 -> requirement for no holes, i.e. maximum angle such that at the furthest distance its less than 1 pixel therefore: steps=(2*pi)/asin(1/(maxdim/2)) since asin(x) = x for x << 1 this resolves into: steps >= pi * maxdim now what comes into play is the integer math applied and the rounding errors adding up as it is done incrementally and not by multiplication. So steps needs to be increased to overcome that. That is what I am trying to figure out.

DedeHai commented 1 month ago

@softhack007 looking at the pinwheel some more. I have some fixes but also: why is the center calculated in such a complicated way (with shifting odd rays)? I removed all that code and to me it still looks the same. Can you explain the intentions?

softhack007 commented 1 month ago

@softhack007 looking at the pinwheel some more. I have some fixes but also: why is the center calculated in such a complicated way (with shifting odd rays)? I removed all that code and to me it still looks the same. Can you explain the intentions?

@dedehai the "odd rays" part was an invention from @Brandon502, to reduce the number of "full rays" by only drawing outer parts of odd-numbered rays. That was around 30% faster in our tests.

Its correct that omitting this optimization should not influence the look. But the odd-rays optimization should be kept as-is for performance reasons.

softhack007 commented 1 month ago

@softhack007 looking at the pinwheel some more. I have some fixes

@DedeHai would be nice if we find an improvement, because each time I get a larger setup, there are new holes in pinwheel. Looks like a kind of Moire effect.

I've found the "magic parameters" up to 128x128: see https://github.com/MoonModules/WLED/blob/a84216947bfb28488e58837d5a6732836be93d6c/wled00/FX_fcn.cpp#L833-L854

I'm really curious to know if there is a better way to avoid these inaccuracies, while preserving speed.

Brandon502 commented 1 month ago

@softhack007 looking at the pinwheel some more. I have some fixes

@DedeHai would be nice if we find an improvement, because each time I get a larger setup, there are new holes in pinwheel. Looks like a kind of Moire effect.

I've found the "magic parameters" up to 128x128: see https://github.com/MoonModules/WLED/blob/a84216947bfb28488e58837d5a6732836be93d6c/wled00/FX_fcn.cpp#L833-L854

I'm really curious to know if there is a better way to avoid these inaccuracies, while preserving speed.

How much memory can a expand1D type use? If it could work similar to jMap you could create a lookup map on the first call then never have to do the math again. Would never produce holes and likely be much faster, but the memory footprint might be too big for 128x128+

DedeHai commented 1 month ago

@Brandon502 I have sort of working code improvements: it may take a slight hit on loop speed but it work for larger sizes and reduces steps for smaller ones by dynamically calculating the required steps, resulting in an overall speed improvement. Also "pixel holes" are controlled better by reducing the step size, running more 'blind' loops (but setting less pixels): https://github.com/DedeHai/WLED/tree/0_15_pinwheel_improvement I have a basic concept on how to improve speed by also dynamically calculating required steps vs. current "radius" being drawn. While it is not perfect and a compromise between code-complexity and calculation speed (it needs to run on ESP32 C3 and ESP8266 as well). For the ESP32 and ESP32 S3 more calculations could be done in float to squeeze out more speed (by more accurately calculating if a pixel needs to be drawn or not). I will leave those improvements to the advanced devs at MM ;) I will update the branch later today and make a PR, we can continue the discussion there as this PR has a different purpose.

softhack007 commented 2 weeks ago

I was wondering if this should be included in the upcoming 0.15.0 release, maybe together with the new WLEDgetColorFromPalette function extracted from #4138 ?

We have "feature freeze" for 0.15.0 already, but I think this PR might be worth making an exception to the rule

it's well tested and well analysed already
it really improves visual quality of some effects, like lissajous
the new WLEDgetColorFromPalette function makes some effects (like octopus) look much smoother on larger setup.

@DedeHai @willmmiles @netmindz @blazoncek @Aircoookie what do you think?

DedeHai commented 2 weeks ago

I am not sure about this. These changes were not in any beta release and while I tested the changes and found no problem so far, I think this should first be run through beta testing, there may be hidden surprises on ESP8266 or large 1D setups.

blazoncek commented 2 weeks ago

IMO, no but there is no reason to take my answer into account.

softhack007 commented 2 weeks ago

I am not sure about this. These changes were not in any beta release

@DedeHai thanks I think your judgement as the original author is very important. I agree with with, better to get some confidence (beta testing) before releasing this (and #4138) to all users. 👍

So we'll merge all the PRs tagged as "optimization" after 0.15.0 is released, right?

DedeHai commented 2 weeks ago

I think the decision is up to @Aircoookie whether to push 0.15 release ASAP or add in these improvements.

If it were up to me I would add all pending improvements I and @blazoncek made as they improve FX speed and FPS and FX quality quite a bit but only if these are extensively tested as there are chances of introducing new bugs. By all I mean this PR as well as https://github.com/Aircoookie/WLED/pull/4138 https://github.com/Aircoookie/WLED/pull/4145 https://github.com/Aircoookie/WLED/pull/4245

Aircoookie / WLED

Added integer based `sin()/cos()` functions, changed all trig functions to wled_math #4181