Open DedeHai opened 1 month ago
- total flash savings: ~7.4k
Is this with or without audioreactive usermod? Because ArduinoFFT pulls in some trig functions (sin,cos,atan, sqrt) -so possibly this finally increases flash size because two variants of the function need to be linked into the final binary.
Edit: I have a fork of ArduinoFFT where "double" trigonometry functions are replaced with float functions (sinf, cosf, atanf). I thinks it's not possible to replace them with approximations ... FFT could become very noisy if we do, because the fourier transform of the window function must be as exact as possible. Any inaccuracy will add "ghost frequencies".
sin_approx() / cos_approx()
are float wrappers forsin16_t() / cos16_t()
and are accurate enough to replacesinf()/cosf()
for all current WLED purposes
One question: is the approximated function also faster than sinf() and cosf()? In earlier tests I was suprised to find that the libm functions are really fast (faster than fastled sin16/cos16), at least on esp32 and ESP32-S3 with FPU.
I'll give it a try with pinwheel mapping in the next days.
Pinwheel is very sensitive to inaccuracies in sinf() and cosf(), and possibly leaves holes as a consequence.
Is this with or without audioreactive usermod?
Goot point. I checked and it is the same: ArduinoFFT uses cos(x) which is the double float implementation and sqrt() was not touched in this PR. btw: did you ever check if the Espressif DSP library functions can be used for FFT and if there is any speed/memory/precision advantage? According to the doc it has speed optimized FFT for ESP32 and ESP32 S3.
One question: is the approximated function also faster than sinf() and cosf()? In earlier tests I was suprised to find that the libm functions are really fast (faster than fastled sin16/cos16), at least on esp32 and ESP32-S3 with FPU.
Another good point. I did not actually measure that. But now I checked with rough test. Here are my findings: numbers are in function calls per second executed:
MCU | sinf() |
sin_approx() |
---|---|---|
ESP32-C3 | 9k | 75k |
ESP32-S3 | 130k | 184k |
I am not 100% sure about the method I used so no warranty about correctness of these numbers. Using the same method for sin16() functions gave like a 50x speed increase, which seems impossible without the compiler playing some tricks.
So on an S3 the difference is (as expected) not as large thanks to the FPU but still an improvement. On S2, C3 (and ESP8266) the speed-gain is huge.
I'll give it a try with pinwheel mapping in the next days.
Pinwheel is very sensitive to inaccuracies in sinf() and cosf(), and possibly leaves holes as a consequence.
I am looking at pinwheel RN. The issue is not the precision of sin() cos() but the number of steps: I just tried with 128x16 and even with floats there are missing pixels. If I increase steps for XL, they are gone, even when using the integer version of sin() cos() (more testing required though)
edit:
the missing pixels are also related to the Fixed_Scale
value. any reason it is 512? it could be as large as 31bits / max where max is maxwidth or maxheight (which is a uint16). So it could be 2^15 or if we take into account that any matrix dimension beyond 4096 is nonsensical for WLED, it could be even larger but not future proof.
@dedehai pinwheel has a general problem with holes - we tested it until 56x56, so below 56x56 there should be no holes (using sinf()/cosf(), not the _t approximations from AC).
any reason it is 512? it could be [larger]
512 was just a "shot in the dark". You're right we could increase the value to 2^15 or higher.
@DedeHai pinwheel has a generic problem with holes - we tested it until 56x56, so below 56x56 there should be no holes (using sinf()/cosf(), not the _t approximations from AC).
I saw that and I am trying to work out the math why that is. Here is what I have so far:
(maxdim/2)*sin(2pi/steps) <= 1
-> requirement for no holes, i.e. maximum angle such that at the furthest distance its less than 1 pixel
therefore:
steps=(2*pi)/asin(1/(maxdim/2))
since asin(x) = x for x << 1 this resolves into:
steps >= pi * maxdim
now what comes into play is the integer math applied and the rounding errors adding up as it is done incrementally and not by multiplication. So steps needs to be increased to overcome that. That is what I am trying to figure out.
@softhack007 looking at the pinwheel some more. I have some fixes but also: why is the center calculated in such a complicated way (with shifting odd rays)? I removed all that code and to me it still looks the same. Can you explain the intentions?
@softhack007 looking at the pinwheel some more. I have some fixes but also: why is the center calculated in such a complicated way (with shifting odd rays)? I removed all that code and to me it still looks the same. Can you explain the intentions?
@dedehai the "odd rays" part was an invention from @Brandon502, to reduce the number of "full rays" by only drawing outer parts of odd-numbered rays. That was around 30% faster in our tests.
Its correct that omitting this optimization should not influence the look. But the odd-rays optimization should be kept as-is for performance reasons.
@softhack007 looking at the pinwheel some more. I have some fixes
@DedeHai would be nice if we find an improvement, because each time I get a larger setup, there are new holes in pinwheel. Looks like a kind of Moire effect.
I've found the "magic parameters" up to 128x128: see https://github.com/MoonModules/WLED/blob/a84216947bfb28488e58837d5a6732836be93d6c/wled00/FX_fcn.cpp#L833-L854
I'm really curious to know if there is a better way to avoid these inaccuracies, while preserving speed.
@softhack007 looking at the pinwheel some more. I have some fixes
@DedeHai would be nice if we find an improvement, because each time I get a larger setup, there are new holes in pinwheel. Looks like a kind of Moire effect.
I've found the "magic parameters" up to 128x128: see https://github.com/MoonModules/WLED/blob/a84216947bfb28488e58837d5a6732836be93d6c/wled00/FX_fcn.cpp#L833-L854
I'm really curious to know if there is a better way to avoid these inaccuracies, while preserving speed.
How much memory can a expand1D type use? If it could work similar to jMap you could create a lookup map on the first call then never have to do the math again. Would never produce holes and likely be much faster, but the memory footprint might be too big for 128x128+
@Brandon502 I have sort of working code improvements: it may take a slight hit on loop speed but it work for larger sizes and reduces steps for smaller ones by dynamically calculating the required steps, resulting in an overall speed improvement. Also "pixel holes" are controlled better by reducing the step size, running more 'blind' loops (but setting less pixels): https://github.com/DedeHai/WLED/tree/0_15_pinwheel_improvement I have a basic concept on how to improve speed by also dynamically calculating required steps vs. current "radius" being drawn. While it is not perfect and a compromise between code-complexity and calculation speed (it needs to run on ESP32 C3 and ESP8266 as well). For the ESP32 and ESP32 S3 more calculations could be done in float to squeeze out more speed (by more accurately calculating if a pixel needs to be drawn or not). I will leave those improvements to the advanced devs at MM ;) I will update the branch later today and make a PR, we can continue the discussion there as this PR has a different purpose.
I was wondering if this should be included in the upcoming 0.15.0 release, maybe together with the new WLEDgetColorFromPalette function extracted from #4138 ?
We have "feature freeze" for 0.15.0 already, but I think this PR might be worth making an exception to the rule
@DedeHai @willmmiles @netmindz @blazoncek @Aircoookie what do you think?
I am not sure about this. These changes were not in any beta release and while I tested the changes and found no problem so far, I think this should first be run through beta testing, there may be hidden surprises on ESP8266 or large 1D setups.
IMO, no but there is no reason to take my answer into account.
I am not sure about this. These changes were not in any beta release
@DedeHai thanks I think your judgement as the original author is very important. I agree with with, better to get some confidence (beta testing) before releasing this (and #4138) to all users. 👍
So we'll merge all the PRs tagged as "optimization" after 0.15.0 is released, right?
I think the decision is up to @Aircoookie whether to push 0.15 release ASAP or add in these improvements.
If it were up to me I would add all pending improvements I and @blazoncek made as they improve FX speed and FPS and FX quality quite a bit but only if these are extensively tested as there are chances of introducing new bugs. By all I mean this PR as well as https://github.com/Aircoookie/WLED/pull/4138 https://github.com/Aircoookie/WLED/pull/4145 https://github.com/Aircoookie/WLED/pull/4245
I wrote a new integer math based sine/cosine approximation function implementing Bhaskara I's approximation formula. It is blazing fast and quite accurate (almost 3 significant digits when converting to float).
sin16_t() / cos16_t()
are faster and more accurate than fastled versionssin_approx() / cos_approx()
are float wrappers forsin16_t() / cos16_t()
and are accurate enough to replacesinf()/cosf()
for all current WLED purposesatan2()
is used only in octopus to calculate center offset, new approximated version saves flashtan(), atan(), asin(), acos(), floor(), fmod()
are used only for sunrise/sunset calculation, using existing wled_math version saves flashbeatsinx()
replacements are to make use of newsin16_t()/sin8_t()
functions to reduce flash size