Check if sin/cos approximation via integers has sufficient accuracy and speed

Kuratius commented 9 months ago

It's possibly to compute a good approximation to the cosine of a number using a single float multiplication and 1 integer multiplication.

The implementation from this video for example https://www.youtube.com/watch?v=hffgNRfL1XY (see e.g. 8:02 )

can be modified to use only integers except for the very last step. Example:

#define SECOND_ORDER_COEFFICIENT 0.0000000010911122665310369f
uint32 a=3667592205

float b=1.f/a ( can be evaluated at compile time)
sint32 c= (sint32) (SECOND_ORDER_COEFFICIENT*a) (also compile time)

uint32 x=16382 (0 is 0 angle and 16382 is pi/4 ), values lower than -pi/4 or greater than pi/4 need to be done with symmetry transforms or the squareroot operation.

 float result=b * (a-c*x*x)

the compiler optimizes this a bit to turn 2 integer multiplies +1 float into 1 integer multiplications and 1 float multiplication. There are other possible values for a that can be chosen, potentially with less error than the one I used for this example, especially if the range of x is changed. The float multiply could also be replaced with a hardware divide if the angle doesnt have to be a float.

The current implementation seems to use a 8th order polynomial with doubles. https://github.com/Hydr8gon/sm64/blob/a4b5e942d2242603580687394e128989dab39e68/lib/src/math/cosf.c#L36 I'm also a bit confused as to why p[0] seems to be unused here.

Hydr8gon commented 9 months ago

This is interesting, but I'm wary because any variation in the math could slightly change the physics of the game. Maybe it can be added as a compile-time option if it provides a meaningful performance boost.

Kuratius commented 9 months ago

From what I understand sin/cos are used in other functions besides physics, so that could be looked at as well. Though I think the current implementation is already different from what the N64 does, since that supposedly uses a lookup table.

as long as cos^2 +sin^2=1 is kept it can probably be used for graphical functions and small things like rotations. It's also helpful that you could compute sin/cos at the same time and swap them as needed, with the other being done using a square root.

That being said, doing it via the preprocessor as you suggested seems like the best of both worlds.

It's technically also possible to increase the order of the polynomial, but I think it'll hit the limits of what integers can hold very fast, and I think in that case using a sine may be better as it reduces the degree of the polynomial approximation (4th degree vs third degree).

Basically as long as you can find a good rational approximation for the coefficients of the polynomial, you can then multiply by the denominator.

Kuratius commented 9 months ago

I wrote some code for finding the best rational approximation for the second order cosine polynomial with the condition that the best possible accuracy should be at pi/4 (you can get something like an order of magnitude better overall accuracy if you allow k=16384 is just what integer range the input angle should be, i.e. k should be equivalent to pi/4 in the angle units used.


from fractions import Fraction
import numpy as np
k=16384

a=(1-np.cos(np.pi/4))/(np.pi/4)**2 /((k/(np.pi/4))**2) 
x=np.linspace(0, k, k+1)
for i in range(1, 32):
    F1=Fraction.from_float(a).limit_denominator(2**i)
    if F1.denominator<=2**32 and abs((F1-a)/a)<0.01:
        d=F1.denominator
        print("Rational Approx: ", F1,"error: ", (F1-a)/a   , " a=", a)
        print("max err: ",max(1/d*d*a*(round(1/a)- x**2)-np.cos(x/k*np.pi/4) ), "endpoint err: ",  1/d*d*a*(round(1/a)- k**2)-np.cos(k/k*np.pi/4))

output:

Rational Approx:  1/916495974 error:  5.422899801529334e-10  a= 1.0911122665310367e-09
max err:  0.0038430014381050093 endpoint err:  -5.422898796680897e-10
Rational Approx:  2/1832991949 error:  -3.2661175053614083e-12  a= 1.0911122665310367e-09
max err:  0.0038430014381050093 endpoint err:  -5.422898796680897e-10

Also played around with doing a curve fit using chebyshev nodes, you can get about half the worst case error everywhere at the cost of pi/4 being the worst case.

I used the form (1/d*d*a)*(int(1/a)- x**2) because it has less multiplications. I think

1/d*(d- int(a*d)*x**2) is technically more faithful to what I had in mind originally (and it's what I used to find the constant in the first comment).

The price you pay for 1 less multiply is that the first endpoint at x=0 is not an exact match anymore and is instead 1/d*d*a*(int(1/a))=0.9999999994577101 (still 1 if you use 32 bit floats)

Edit: I just realized, but in this form the value of the denominator actually doesnt matter at all. I can probably go 1 order higher and see if I can fit an additional term in there.

Kuratius commented 9 months ago

https://github.com/Hydr8gon/sm64/blob/a4b5e942d2242603580687394e128989dab39e68/include/trig_tables.inc.c#L261

Found the table, I'll test against this for accuracy.

As for speed it probably depends on how fast loading a value from the table is vs computing the value, and that speed probably depends on whether the table is in the cache or in the main ram.

Probably cant test that without compiling the project on an actual machine. I have a dsi with custom firmware I can test with, I probably just need to figure out how to compile the project.

Can you tell me if the table is in the main ram or in the smaller cache? It could be that what I'm trying to do is a waste of time if the table is usually going to be in the cache & there is no benefit to be gained from freeing up space there.

Is the cosine function using floats that I linked in my first post used for the actual game or do you mostly use the interpolation table? I'm not 100 % sure why there are two implementations. I also still dont fully get why the cosine using floats doesnt seem to use one of the coefficients in the array P at all. If it's not used then why is it stored?

Kuratius commented 9 months ago

Ok so the table size is 4096 times 4 bytes for a float, so like 16 KB , and the cache is like 32 KB. The table is not going to be cached unless you dont need cache for anything else.

It could be checked if doing the hardware squareroot instruction using an already read cosine (or sine) to get the counterpart sine/cosine via cos^2+sin^=1 is faster than loading them from both tables.

Hydr8gon commented 9 months ago

I didn't write the game code, so I don't really know what's going on with it. I did the DS-specific stuff, and everything else was reverse-engineered from the N64 version to produce a 1:1 matching ROM. Sometimes the code ends up not using things or being weird in order to achieve that perfect match. As for the cache, if you're referring to TCM, I don't think it's used for anything at the moment. There's still the regular cache, which applies to everything but has the potential for cache misses.

Kuratius commented 9 months ago

Seems like it might be useful

https://www.nullhardware.com/blog/fixed-point-sine-and-cosine-for-embedded-systems/

Kuratius commented 9 months ago

http://www.coranac.com/2009/07/sines/

This has a section with assembly for nds/arm9

and includes benchmarks run on the nds.

Screenshot_20240101-051832_Firefox

Hydr8gon / sm64

Check if sin/cos approximation via integers has sufficient accuracy and speed #31