Open Chiptun3r opened 6 months ago
Thanks so much for your PR! It's quite late right now, so I can't fully review it yet, but I do have some bits to highlight first.
I implemented it with a bit of assembly, so we can take advantage of the gba cpu without any more hacks like before.
We tried using long multiplication before and found it unacceptable (#443). At the time:
* LLVM never inlines across different instruction sets, which means that multiplication of constants by constant propagation could not be performed at compile time.
* It benchmarked slower anyway.
Now you can make fixnums with i8 as the concrete type too (and i8 and u8 now use only an i16 and u16 for the upcast multiplication).
Given the word size is 32 bits, is this an improvement?
I wrote the following "benchmark". Note that the 1,000 repetitions aren't there for measurement accuracy; the GBA doesn't need that (no cache, no complex pipeline, no branch prediction, etc.). They're just there to make any fixed overhead of calling the test function negligible.
use core::hint::black_box;
use agb_fixnum::{num, Num};

#[test_case]
fn bench_fixed_num_multiplication_i32(_: &mut crate::Gba) {
    let a: Num<i32, 8> = black_box(num!(1_000.235));
    let b: Num<i32, 8> = black_box(num!(1_000.235));

    for _ in 0..1000 {
        let a = black_box(a);
        let b = black_box(b);
        black_box(a * b);
    }
}

#[test_case]
fn bench_fixed_num_multiplication_u8(_: &mut crate::Gba) {
    let a: Num<u8, 4> = black_box(num!(1.2));
    let b: Num<u8, 4> = black_box(num!(2.7));

    for _ in 0..1000 {
        let a = black_box(a);
        let b = black_box(b);
        black_box(a * b);
    }
}
The results are:
# master, debug
agb::benches::bench_fixed_num_multiplication_i32...[ok: 48038 c ≈ 0 s]
agb::benches::bench_fixed_num_multiplication_u8...[ok: 30027 c ≈ 0 s]
# master, release
agb::benches::bench_fixed_num_multiplication_i32...[ok: 48038 c ≈ 0 s]
agb::benches::bench_fixed_num_multiplication_u8...[ok: 30027 c ≈ 0 s]
# this pr, debug
agb::benches::bench_fixed_num_multiplication_i32...[ok: 111036 c ≈ 0.01 s]
agb::benches::bench_fixed_num_multiplication_u8...[ok: 34027 c ≈ 0 s]
# this pr, release
agb::benches::bench_fixed_num_multiplication_i32...[ok: 103036 c ≈ 0.01 s]
agb::benches::bench_fixed_num_multiplication_u8...[ok: 30027 c ≈ 0 s]
Note that these benchmarks may still be flawed; they are micro-benchmarks after all. They show the existing i32 multiplication to be significantly faster in both debug and release. While I'm not certain, my explanation is the overhead of calling an ARM function from Thumb, plus the more expensive shifting. I also see that the existing u8 multiplication is faster in debug and equivalent to this PR's in release mode.
From your description, the rest all sounds great. I'll have a look at the code for it in more detail when I get the chance.
Thanks again!
I'll take a better look in the following days, but for now:
I implemented it with a bit of assembly, so we can take advantage of the gba cpu without any more hacks like before.
We tried using long multiplication before and found it unacceptable (#443). At the time:
* LLVM never inlines across different instruction sets, which means that multiplication of constants by constant propagation could not be performed at compile time.
* It benchmarked slower anyway.
That's unfortunate. I thought it would add a bit of overhead for the jump from Thumb to ARM and vice versa, but still be faster than three MULs (+ stuff). In this case I'd propose keeping the old implementation for fixnums with 15 or fewer bits of precision, and mine for the others, since the old one is not correct for those cases anyway.
Now you can make fixnums with i8 as the concrete type too (and i8 and u8 now use only an i16 and u16 for the upcast multiplication).
Given the word size is 32 bits, is this an improvement?

I thought it might have given the compiler a bit more slack for some magic optimization I cannot even think of. Anyway, I think the difference between debug and release just comes from the new code that checks for overflow and panics in that case, and has nothing to do with the upcast type; but I don't have a strong opinion if you want it back to i32/u32.
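Concretely, the upcast I mean is just this shape (an illustrative sketch of the idea, not the exact code in the PR):

// For an 8-bit fixnum with `frac_bits` fractional bits, widening to u16 is
// already enough to hold the full 8x8-bit product before shifting back down.
fn mul_u8_raw(a: u8, b: u8, frac_bits: u32) -> u8 {
    ((u16::from(a) * u16::from(b)) >> frac_bits) as u8
}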
In any case, thanks a lot for the benchmarks, they are very useful.
@Chiptun3r I've cherry picked the changes from this PR into #711 if you'd like to check that it does what you're looking for :)
I had a quick look at it and noticed that you're relying on an upcast to 64 bits for multiplications when you need to check whether they overflow, and that is quite suboptimal. From my tests (same setup as @corwinkuiper's, in release mode):
agb_template::bench_fixed_num_multiplication_i32_fast...[ok: 48038 c ≈ 0 s]
agb_template::bench_fixed_num_multiplication_i32_smull...[ok: 103036 c ≈ 0.01 s]
agb_template::bench_fixed_num_multiplication_i32_upcast...[ok: 169036 c ≈ 0.01 s]
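For reference, the upcast-based check is roughly this shape (an illustrative sketch with a made-up helper name, not the actual code in #711):

// Do the multiply in 64 bits, shift back down by the N fractional bits, and
// the overflow check is just "does the result still fit in the raw i32?".
fn checked_mul_raw<const N: u32>(a: i32, b: i32) -> Option<i32> {
    let wide = (i64::from(a) * i64::from(b)) >> N;
    i32::try_from(wide).ok()
}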
Also, the fast solution is wrong for high precision numbers: if the multiplication of the fractional parts overflows, the result is simply incorrect rather than the wrapping behaviour you would expect. Right now I have implemented a double solution (fast for fewer than 16 bits of precision, umull/smull otherwise); I'm just trying to figure out how to check whether an overflow happened with the fast implementation (and then add some other tests for good measure) before pushing the changes.
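For comparison, the fast path boils down to something like this (an illustrative sketch, not the exact agb_fixnum code): split each operand into integer and fractional parts so that no partial product needs more than 32 bits while the precision stays low.

// Raw two's-complement fixed point with N fractional bits stored in an i32.
// frac_a * frac_b is less than 2^(2N), so for small N it cannot overflow and
// the overall result keeps the expected wrapping behaviour.
fn mul_fast_raw<const N: u32>(a: i32, b: i32) -> i32 {
    let mask = (1i32 << N) - 1;
    let (int_a, frac_a) = (a >> N, a & mask);
    let (int_b, frac_b) = (b >> N, b & mask);

    (int_a.wrapping_mul(int_b) << N)
        .wrapping_add(int_a.wrapping_mul(frac_b))
        .wrapping_add(frac_a.wrapping_mul(int_b))
        .wrapping_add(frac_a.wrapping_mul(frac_b) >> N)
}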
Now it uses the fast path if the precision is <= 16 (in that case the multiplication between the fractional parts can't overflow, so the result is always the expected, wrapping, one). Tests:
#[test_case]
fn bench_fixed_num_multiplication_i32(_: &mut agb::Gba) {
    let a: Num<i32, 8> = black_box(num!(100.235));
    let b: Num<i32, 8> = black_box(num!(10.235));

    for _ in 0..1000 {
        let a = black_box(a);
        let b = black_box(b);
        black_box(a * b);
    }
}

#[test_case]
fn bench_fixed_num_multiplication_u32_high_precision(_: &mut agb::Gba) {
    let a: Num<u32, 18> = black_box(num!(100.235));
    let b: Num<u32, 18> = black_box(num!(10.235));

    for _ in 0..1000 {
        let a = black_box(a);
        let b = black_box(b);
        black_box(a * b);
    }
}
Results (debug):
agb_template::bench_fixed_num_multiplication_i32...[ok: 279113 c ≈ 0.02 s]
agb_template::bench_fixed_num_multiplication_i32_high_precision...[ok: 111065 c ≈ 0.01 s]
Results (release):
agb_template::bench_fixed_num_multiplication_i32...[ok: 48049 c ≈ 0 s]
agb_template::bench_fixed_num_multiplication_i32_high_precision...[ok: 103047 c ≈ 0.01 s]
Sadly the debug build is much slower now (even though the dev profile has opt-level = 3 as default in the template repo); if someone wants to try to speed it up, they're welcome.
I also noticed that division is broken for high precision numbers too (I guess it needs upcasts as well), but that's for another time.
@gwilymk I haven't pushed any of your latest commits from #711, you can do it after this one or I can add them here if you prefer.
Thanks for taking this further.
I'm happy to do the later commits and PR them separately once this is merged.
My current worry is that the asm block will stop constant multiplication optimisations, e.g. multiplying by 2 gets turned into a shift. At least in our games, we rely on that a lot.
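To make the worry concrete on plain integers (opaque_mul is a hypothetical stand-in for an out-of-line ARM-state multiply, not anything in agb):

fn times_two_inline(x: i32) -> i32 {
    x * 2 // the optimiser sees the constant and emits a shift/add, no multiply
}

#[inline(never)] // stand-in for a multiply the optimiser cannot see through
fn opaque_mul(a: i32, b: i32) -> i32 {
    a * b
}

fn times_two_opaque(x: i32) -> i32 {
    opaque_mul(x, 2) // the constant 2 can no longer be folded into a shift
}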
Ideally we'd want to use something like https://doc.rust-lang.org/nightly/std/intrinsics/fn.is_val_statically_known.html but I don't think that'll ever get stabilised.
I've experimented in the past with defining the ABI method __aeabi_lmul (and similar) in agb itself but didn't have masses of success. But maybe with your changes here it'll work a bit better? That way (in theory) Rust will keep its optimisations and change things to a shift etc., but call the lmul and ulmul routines when it actually needs to do a 32x32 -> 64 bit multiply. But we'd need to confirm it actually makes the call when it does so.
I checked on godbolt and it doesn't do any kind of optimization even with O3 enabled. The problem is not only the asm! macro, but also the switch to arm32 instructions, which disrupts any optimization (at least currently on LLVM). So, even with a nice wide_mul intrinsic, we couldn't get any optimization out of it (maybe if the intrinsic were universal for both ARM and Thumb and the compiler were smart enough). Right now I would suggest using the shift operator instead of a multiplication for your case (if you use more than 16 bits of precision), but that indeed doesn't optimize away the multiplication of two constants either, and that's not good.
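To be explicit about the workaround (a sketch on the raw representation, not the Num API):

// Multiplying a fixed-point value by 2^k is just a shift of its raw bits,
// so it never needs the widened multiply in the first place.
fn times_power_of_two(raw: i32, k: u32) -> i32 {
    raw << k
}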
My original plan was just to implement wrapping_add, but the more I worked on it, the more I found things to fix, so here we are. In order:
* The num! macro didn't work with high precision numbers, so I fixed that too.
* FixedWidthUnsignedInteger and FixedWidthSignedInteger were quite a mess (they appeared to be mutually exclusive, but the signed one actually derived from the unsigned one) and did not take advantage of the num_traits crate as much as they could, so I removed FixedWidthSignedInteger and renamed FixedWidthUnsignedInteger into FixedWidthInteger, since that is what it was.
* … num_traits supports, because, for example, checked_div is the only division it supports).

All the changes are backed by some tests, which I also ran on mGBA to be sure the i32/u32 multiplications work there too, since the code is different for ARM (by the way, is there a better way than just copying the tests as functions and putting them in a ROM as I did?). Hopefully this pull request isn't too big; I tried to split all the changes into their own commits to make it easier to review.
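As a usage-level sketch of the num! fix mentioned above (same macro and types as the benchmarks earlier in the thread; the 20 bits of precision are just an example):

use agb_fixnum::{num, Num};

// With the fix, a decimal literal can be turned directly into a
// high precision fixnum, here one with 20 fractional bits.
fn high_precision_example() -> Num<i32, 20> {
    num!(1.234567)
}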