ramtej closed this issue 1 year ago
I can reproduce this; it looks like a misoptimization with hardware floats. Switching to soft floats (disabling floating-point hardware acceleration) seems to produce the correct result regardless of the opt-level:

```
rustflags = ["-C", "target-feature=-fp", ...]
```
Do you think you could try and reduce the test case and try and pinpoint which float operations are going wrong? I can then take a further look with that knowledge in hand.
Thank you! Alright, I'll try to identify the float operation or the combination of operations.
Ok, I was able to isolate the case a bit more, however, I guess more questions than answers remain.
```rust
pub fn test_compute_butterflies_narrow() {
    // Some arbitrary complex f32 numbers
    let mut x: Vec<Complex32> = vec![Complex32::new(-0.000000023849761, -0.9238794)];
    // Same effect as 'println!("arch[{}]:x = {:?}", ARCH, x);'
    std::hint::black_box(&x); // <- without this, the below computation is valid?!
    let y1 = x[0] * Complex32::new(0., -1.); // <- ISSUE HERE!
    // Note(jj): y1 has the wrong value, sign "flipped" in the imaginary part:
    // arch[x86_64]:y = -0.9238794+0.000000023849761i versus
    // arch[xtensa]:y = -0.9238794-0.000000023849761i
    // .. but only when the above println! or compiler hint (black_box) is commented in!
    println!("arch[{}]:y.1 = {}", ARCH, y1);
    let y2 = Complex32::new(-0.000000023849761, -0.9238794) * Complex32::new(0., -1.);
    println!("arch[{}]:y.2 = {}", ARCH, y2);
    // Note(jj): Fails on esp32 but works on x86
    // assert_eq!(y1, y2);
}
```
Output:

```
arch[x86_64]:y.1 = -0.9238794+0.000000023849761i
arch[x86_64]:y.2 = -0.9238794+0.000000023849761i
..
arch[xtensa]:y.1 = -0.9238794-0.000000023849761i <- WRONG
arch[xtensa]:y.2 = -0.9238794+0.000000023849761i
```
See also here: https://github.com/ramtej/esp32-compiler-marvels-std-rs/blob/1032788fbcbf3733c832a0e3af039c00ba79f802/src/main.rs#L73
The problem occurs when two complex f32 numbers are multiplied, but only if one of them is stored in an array (or vector?), as seen in the example above. Another condition is that this vector must be referenced beforehand, e.g. with a println! or a compiler hint; without that reference, the correct value is calculated. The delta is in the imaginary part of the complex number and differs only in sign.
Thank you, Jiri
I was able to narrow down the topic further:
```rust
#[derive(Clone, Debug)]
struct MyComplex<T> {
    re: T,
    im: T,
}

impl MyComplex<f32> {
    fn new(re: f32, im: f32) -> Self {
        Self { re, im }
    }
}

pub fn test_compute_butterflies_narrow_v2() {
    let c1 = MyComplex::new(-1., -1.);
    let x: Vec<MyComplex<f32>> = vec![c1.clone()];
    std::hint::black_box(&x);
    //println!("arch[{}]:x = {:?}", ARCH, x);
    let c2 = MyComplex::new(0., -1.);
    let re = x[0].re * c2.re - x[0].im * c2.im;
    let im = x[0].re * c2.im + x[0].im * c2.re;
    println!("arch[{}]:re = {}", ARCH, re);
    println!("arch[{}]:im = {}", ARCH, im);
    assert_eq!(re, -1.0);
    assert_eq!(im, 1.0); // <- fails on esp32
}
```
Output:

```
arch[x86_64]:re = -1
arch[x86_64]:im = 1
..
arch[xtensa]:re = -1
arch[xtensa]:im = -1 <- WRONG
```
On the other hand, if I perform the same calculation without a struct, the correct values are calculated.
```rust
pub fn test_compute_butterflies_narrow_v3() {
    let re1 = -1.;
    let im1 = -1.;
    let x = vec![re1, im1];
    std::hint::black_box(&x);
    let re2 = 0.0;
    let im2 = -1.0;
    let re = x[0] * re2 - x[1] * im2;
    let im = x[0] * im2 + x[1] * re2;
    println!("arch[{}]:re = {}", ARCH, re);
    println!("arch[{}]:im = {}", ARCH, im);
    assert_eq!(re, -1.0);
    assert_eq!(im, 1.0); // <- works on esp32
}
```
The combination of struct + Vec plus the println! or compiler hint leads to the issue.

Summary

im = -1 * -1 + -1 * 0 = 1, but on the xtensa arch it is -1.
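The expected arithmetic can be double-checked on a host target with plain f32 (a minimal standalone sketch, no FFT or complex-number crate involved; `complex_mul` is just the textbook formula written out):

```rust
// Complex multiplication (re1 + im1*i) * (re2 + im2*i) written out in plain f32,
// mirroring the expressions in test_compute_butterflies_narrow_v2.
fn complex_mul(re1: f32, im1: f32, re2: f32, im2: f32) -> (f32, f32) {
    let re = re1 * re2 - im1 * im2;
    let im = re1 * im2 + im1 * re2;
    (re, im)
}

fn main() {
    // (-1 - 1i) * (0 - 1i) should give -1 + 1i
    let (re, im) = complex_mul(-1.0, -1.0, 0.0, -1.0);
    assert_eq!(re, -1.0);
    assert_eq!(im, 1.0); // im = -1 * -1 + -1 * 0 = 1
    println!("re = {re}, im = {im}");
}
```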
Is there anything else I can do? From my point of view the issue is quite fundamental, and I'm honestly surprised it hasn't been discovered before; a regression test, a fuzzer, and/or quickcheck should have found it. What can I do?
Thank you.
I was going to submit an issue, but this seems like the same problem as the one I've got, so I'll give my example to reproduce the error:
```rust
// PI as f32 (the printed values match f32 precision)
use std::f32::consts::PI;

fn main() {
    let a = 4.215796;
    println!("{a}"); // print_1
    let pi_3 = PI / 3.0;
    println!("{pi_3}"); // print_2
    let c = 5.0 * pi_3;
    println!("{c}"); // print_3
    println!("equal and positive {} and {}", c - a, 5.0 * pi_3 - a);
}
```
Output: "should be the same and positive 1.0201917 and -1.0201919". Let's say this is +/- (there is only a problem with the sign), where it should be +/+.

- With print_1 or print_2 commented out: +/+
- With print_3 commented out: -/-
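For reference, the expected result of that arithmetic, sketched on a host target (assuming the f32 PI constant, matching the printed precision):

```rust
use std::f32::consts::PI;

fn main() {
    let a = 4.215796_f32;
    let pi_3 = PI / 3.0;
    let c = 5.0 * pi_3;
    // Both expressions compute the same value, and it is positive.
    assert_eq!(c - a, 5.0 * pi_3 - a);
    assert!(c - a > 0.0);
    println!("{} {}", c - a, 5.0 * pi_3 - a);
}
```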
rustc 1.70.0, esp-idf v5.0, 24b9d38, using an esp32s3
I can reproduce it. It's even more narrowed down, thx!
https://rust.godbolt.org/ -> can we get further there, e.g. see what the compiler actually generates? Any other ideas?
It is not my specialization, so I won't be able to help further. Here is a minimal example that I've found:
```rust
fn main() -> ! {
    let a = 0.0;
    let p: f32 = 1.0;
    println!("{p}");
    println!("{a}");
    let v = 0.99 * p - a;
    println!("{}", v); // "-0.99"!
    loop {}
}
```
The neat thing about it is that we can "force" the correct sign by changing

```rust
let v = 0.99 * p - a;
// into
let v = 0.99 * p + a;
```
which gives a very small assembly difference:

```
2933c2933
< 42000751: 4aa980 madd.s f10, f9, f8
---
> 42000751: 5aa980 msub.s f10, f9, f8
```
Hmm, are we about to invent new arithmetic? The problem is transitive, meaning it can occur in dependent crates.
Can confirm I have this issue, and have had it multiple times in the past. Soft FP always fixes it, but I would like the performance, heh.

In my (current) case, it shows up when doing a final twiddle step in an RDFT:
```rust
fn twiddle<const N: usize>(twiddles: &[C; N], left: &mut [C; N], right: &mut [C; N]) {
    for ((&twiddle, left), right) in twiddles
        .iter()
        .zip(left.iter_mut())
        .zip(right.iter_mut().rev())
    {
        let sum = *left + *right;
        let diff = *left - *right;
        let twiddled_re_sum = sum * twiddle.re;
        let twiddled_im_sum = sum * twiddle.im;
        let twiddled_re_diff = diff * twiddle.re;
        let twiddled_im_diff = diff * twiddle.im;
        let half_sum_re = 0.5 * sum.re;
        let half_diff_im = 0.5 * diff.im;
        let output_twiddled_real = twiddled_re_sum.im + twiddled_im_diff.re;
        let output_twiddled_im = twiddled_im_sum.im - twiddled_re_diff.re;
        *left = C {
            re: half_sum_re + output_twiddled_real,
            im: half_diff_im + output_twiddled_im,
        };
        *right = C {
            re: half_sum_re - output_twiddled_real,
            im: output_twiddled_im - half_diff_im,
        };
    }
}
```
If I don't put logging in the middle, it will sometimes decide to send the values to the wrong places (swap the im values of left and right, for instance). This makes it apparent that this is a problem in basic arithmetic (addition -> subtraction), since in this case the calculations are anti-symmetric.
@Kurielu @ramtej can you try to reproduce on https://github.com/esp-rs/rust-build/releases/tag/v1.71.0.0 ? I don't have any reason to suspect it got fixed by itself, but who knows.
I can't check on a real esp32s3 right now, but from reading the compiled assembly for

```rust
#[inline(never)]
fn math(x: f32, y: f32) -> f32 {
    0.99 * y - x
}
```
From what I can see it translates to

```
42007488 <_ZN17fp_miscompilation4math17h29f7850c19794fe1E.llvm.13748823926208495174>:
42007488: 004136 entry a1, 32
4200748b: e49281 l32r a8, 420006d4 <__default_double_exception+0x1c875e0>
4200748e: fa8850 wfr f8, a8
42007491: fa9350 wfr f9, a3
42007494: faa250 wfr f10, a2
42007497: 5aa980 msub.s f10, f9, f8
4200749a: fa2a40 rfr a2, f10
4200749d: f01d retw.n
```
Step by step:

- Read 0.99 into a8
- Load a8 into f8, an fp register
- Load a3 (y) into f9
- Load a2 (x) into f10
- Write f10 - (f9 * f8) to f10
- Write f10 to a2 and return

So it returns x - y * 0.99 instead of 0.99 * y - x.
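The sign flip can be modeled on the host (`msub_s` here is a hypothetical helper mimicking the instruction's fr - (fs * ft) semantics, not a real API):

```rust
// Model of the Xtensa msub.s semantics: fr - (fs * ft).
fn msub_s(fr: f32, fs: f32, ft: f32) -> f32 {
    fr - fs * ft
}

// The source expression the compiler was lowering.
fn math(x: f32, y: f32) -> f32 {
    0.99 * y - x
}

fn main() {
    let (x, y) = (0.0_f32, 1.0_f32);
    let expected = math(x, y);        // 0.99
    let emitted = msub_s(x, y, 0.99); // x - 0.99 * y = -0.99
    // The selected instruction yields exactly the negated result.
    assert_eq!(emitted, -expected);
    println!("expected {expected}, emitted {emitted}");
}
```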
I'll check that the madd.s/msub.s instruction encoding of the registers isn't flipped in XtensaInstrInfo.td
Still the same. Maybe this will be a clue for you.

```rust
#[inline(never)]
fn math(x: f32, y: f32) -> f32 {
    0.99 * y - x
}

#[entry]
fn main() -> ! {
    let a = 0.0;
    let p = 0.99f32;
    let v = math(a, p);
    println!("{}", v);
    loop {}
}
```
#[inline(never)] breaks math right away, without the need to print any variable. With #[inline(always)] we get a positive number.
In my case it breaks regardless. The instruction definition seems correct, but something else bothers me. The instruction works like fr - (fs * ft); it doesn't support (fx * fy) - fz, so it makes sense that it's used this way, but it seems like a negation step is missing:

```
neg.s f10, f10
```
```rust
#[inline(never)]
fn math(x: f32, y: f32) -> f32 {
    unsafe { core::intrinsics::fmaf32(0.99, y, -x) }
}
```
Generates

```
42007488 <_ZN17fp_miscompilation4math17h29f7850c19794fe1E.llvm.6984577647175520593>:
42007488: 004136 entry a1, 32
4200748b: e33b81 l32r a8, 42000178 <__default_double_exception+0x1c87084>
4200748e: 308281 xor a8, a2, a8
42007491: fa8850 wfr f8, a8
42007494: e49081 l32r a8, 420006d4 <__default_double_exception+0x1c875e0>
42007497: fa9850 wfr f9, a8
4200749a: faa350 wfr f10, a3
4200749d: 4a8a90 madd.s f8, f10, f9
420074a0: fa2840 rfr a2, f8
420074a3: f01d retw.n
```
Step by step:

- Xor the top bit of x to negate it, and write it to f8
- Calculate f8 + (y * 0.99)

Which is correct.
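That xor is the standard sign-bit negation for IEEE-754 floats; the same trick can be sketched in Rust:

```rust
// Negate an f32 by flipping its top (sign) bit, mirroring the generated
// `xor a8, a2, a8` before the value is moved into an fp register.
fn neg_via_sign_bit(x: f32) -> f32 {
    f32::from_bits(x.to_bits() ^ 0x8000_0000)
}

fn main() {
    assert_eq!(neg_via_sign_bit(1.5), -1.5);
    assert_eq!(neg_via_sign_bit(-0.99), 0.99);
    // Unlike `0.0 - x`, this also flips the sign of zero.
    assert!(neg_via_sign_bit(0.0).is_sign_negative());
    println!("ok");
}
```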
Checking the LLVM DAG lowering code, specifically performMADD_MSUBCombine.
@ramtej @MabezDev @Kurielu
Just an update: I've determined that the problem is indeed in performMADD_MSUBCombine. It treats x * y ± z and z ± x * y symmetrically, basically as equivalent, which is obviously a problem when the operator is minus. This code seems to have been copied with minor modifications from the MIPS backend, which doesn't have a fused multiply-sub instruction, meaning it only ever had to handle the FMA case.
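The asymmetry is plain arithmetic (a sketch, not the actual LLVM code): the addend of a fused multiply-add commutes with the product, but the operands of a subtraction do not:

```rust
fn main() {
    let (x, y, z) = (2.0_f32, 3.0_f32, 1.0_f32);
    // Addition commutes, so one madd pattern covers both operand orders.
    assert_eq!(x * y + z, z + x * y);
    // Subtraction does not: swapping the operand order negates the result.
    assert_eq!(x * y - z, -(z - x * y));
    assert_ne!(x * y - z, z - x * y);
    println!("ok");
}
```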
Working on a fix that I will submit as a PR to espressif/llvm-project shortly.
Thanks a lot - you are a hero :-)
Sorry folks, I've been on vacation. Thanks for looking into this, and thank you @zRedShift for figuring out a solution! I'll make sure we get that LLVM PR reviewed asap and cut a patch release of the rust compiler.
I tried my FFT twiddling code from earlier with a newly built version of 1.71.0.1 with my branch of llvm (rust-esp), and it gives the correct results, so all good on that point.
And I finally understand why the floating point values sometimes became correct when printing them (a heisenbug that eluded me for a while, since this miscompilation isn't new). The function performMADD_MSUBCombine checks whether the multiplication result has exactly one use (that is, no one cares about the result of the multiplication by itself), and only then does the fusing. I guess that when printing the intermediate values, the printer became a second use, and thus the optimization pass was aborted.
The esp 1.72 release is now available as a pre-release, and includes @zRedShift's LLVM patches: https://github.com/esp-rs/rust-build/releases/tag/v1.72.0.0
Hello, I can confirm this and my application now also works with hardware fp - I can measure a performance increase of ~3.6x. Thanks again for the support. I will close this issue now. See you all at the next ticket/issue, lol.
Regards, Jiri
Description
I am comparing various Rust FFT implementations on both native (x86) and ESP32 (xtensa/esp32s3) platforms. The goal is to observe the performance and accuracy of the FFT libraries. I utilize 'std' Rust code for the embedded ESP platform via Rust bindings for ESP-IDF (Espressif's IoT Development Framework).
The following FFT libraries are currently used for testing:
Input and Expected Result
I generate a simple sinusoidal time series that's small enough to fit into the RAM of the ESP32 platform. This is then processed with a forward FFT operation, and a normalized amplitude of the spectrum is calculated. For x86 and ESP32, and for the different FFT libraries, the expected results should be as follows:
Actual Result
However, when the compiler optimization level is changed, I notice some inconsistencies in the results for the ESP32 platform (xtensa). This problem does not occur on the x86 platform.
For example, with opt-level = 0, MicroFFT produces an unexpected value at microfft.amplitudes[5] = 1, while with opt-level = 1, unexpected values are produced at various frequencies, as well as a frequency shift in RealFFT's peak. The same issue appears with opt-level = 3.
This issue needs to be addressed to ensure the FFT results are consistent across all platforms and compiler optimization levels.
Steps to Reproduce
I have already implemented the above steps in the following repository:
https://github.com/ramtej/esp32-compiler-marvels-std-rs
Thank you, Jiri