esp-rs / rust

Rust for the xtensa architecture. Built-in targets for the ESP32 and ESP8266

Inconsistent FFT Results Across Different Optimization Levels on ESP32 Platform #180

Closed ramtej closed 1 year ago

ramtej commented 1 year ago

Description

I am comparing various Rust FFT implementations on both native (x86) and ESP32 (xtensa/esp32s3) platforms. The goal is to observe the performance and accuracy of the FFT libraries. I utilize 'std' Rust code for the embedded ESP platform via Rust bindings for ESP-IDF (Espressif's IoT Development Framework).

The FFT libraries currently used for testing include fftw3 (as a reference on x86), microfft, and realfft, as seen in the output below.

Input and Expected Result

I generate a simple sinusoidal time series that's small enough to fit into the RAM of the ESP32 platform. It is then processed with a forward FFT, and the normalized amplitude spectrum is calculated. For x86 and ESP32, and for the different FFT libraries, the expected results should be as follows:

arch[x86_64]:ffw3.amplitudes[3] = 8
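
For reference, a minimal sketch of what such a test could look like with microfft; the signal length, target bin, and normalization here are my assumptions, not necessarily the repo's exact parameters:

// Minimal sketch: a unit-amplitude sine at bin 3 of a 16-point real FFT
// should yield |X[3]| = N/2 = 8, matching the raw amplitude shown above.
use std::f32::consts::PI;

fn main() {
    const N: usize = 16;
    let mut samples = [0.0f32; N];
    for (i, s) in samples.iter_mut().enumerate() {
        *s = (2.0 * PI * 3.0 * i as f32 / N as f32).sin();
    }
    // microfft packs the Nyquist value into spectrum[0].im.
    let spectrum = microfft::real::rfft_16(&mut samples);
    for (k, c) in spectrum.iter().enumerate() {
        let amplitude = (c.re * c.re + c.im * c.im).sqrt();
        println!("microfft.amplitudes[{k}] = {amplitude}");
    }
}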

Actual Result

However, when the compiler optimization level is changed, I notice some inconsistencies in the results for the ESP32 platform (xtensa). This problem does not occur on the x86 platform.

For example, with opt-level = 0, MicroFFT produces an unexpected value at microfft.amplitudes[5] = 1, while with opt-level = 1, unexpected values show up at various frequencies and RealFFT's peak is shifted in frequency. The same issue appears with opt-level = 3.

arch[xtensa]:microfft.amplitudes[0] = 0
arch[xtensa]:microfft.amplitudes[1] = 2 <- ?
arch[xtensa]:microfft.amplitudes[2] = 0
arch[xtensa]:microfft.amplitudes[3] = 3 <- ?
arch[xtensa]:microfft.amplitudes[4] = 0
arch[xtensa]:microfft.amplitudes[5] = 5 <- ?
arch[xtensa]:microfft.amplitudes[6] = 0
arch[xtensa]:microfft.amplitudes[7] = 4 <- ?
..
..
arch[xtensa]:realfft.amplitudes[0] = 0
arch[xtensa]:realfft.amplitudes[1] = 0
arch[xtensa]:realfft.amplitudes[2] = 0
arch[xtensa]:realfft.amplitudes[3] = 0
arch[xtensa]:realfft.amplitudes[4] = 0
arch[xtensa]:realfft.amplitudes[5] = 8 <- Valid value, but shifted in frequency!?
arch[xtensa]:realfft.amplitudes[6] = 0
arch[xtensa]:realfft.amplitudes[7] = 0
arch[xtensa]:realfft.amplitudes[8] = 0

This issue needs to be addressed to ensure the FFT results are consistent across all platforms and compiler optimization levels.

Steps to Reproduce

I have already implemented the above steps in the following repository:

https://github.com/ramtej/esp32-compiler-marvels-std-rs

Thank you, Jiri

MabezDev commented 1 year ago

I can reproduce this; it looks like a misoptimization with hardware floats. It seems like switching to soft floats (disabling floating-point hardware acceleration) produces the correct results regardless of the opt-level: rustflags = ["-C", "target-feature=-fp", ...].

Do you think you could reduce the test case and pinpoint which float operations are going wrong? I can then take a further look with that knowledge in hand.

ramtej commented 1 year ago

Thank you! Alright, I'll try to identify the float operation or the combination of operations.

ramtej commented 1 year ago

Ok, I was able to isolate the case a bit more; however, I guess more questions than answers remain.

// Assumes `use num_complex::Complex32;` (the num-complex crate); ARCH is a
// string constant defined elsewhere in the author's test crate.
pub fn test_compute_butterflies_narrow() {
    // Some arbitrary complex f32 numbers
    let x: Vec<Complex32> = vec![Complex32::new(-0.000000023849761, -0.9238794)];
    // Same effect as 'println!("arch[{}]:x = {:?}", ARCH, x);'
    std::hint::black_box(&x); // <- without this, the below computation is valid?!
    let y1 = x[0] * Complex32::new(0., -1.); // <- ISSUE HERE!
                                             // Note(jj): y1 has the wrong value, sign "flipped" in the imaginary part:
                                             // arch[x86_64]:y = -0.9238794+0.000000023849761i versus
                                             // arch[xtensa]:y = -0.9238794-0.000000023849761i
                                             // .. but only when the above println! or compiler hint (black_box) is commented in!
    println!("arch[{}]:y.1 = {}", ARCH, y1);

    let y2 = Complex32::new(-0.000000023849761, -0.9238794) * Complex32::new(0., -1.);
    println!("arch[{}]:y.2 = {}", ARCH, y2);

    //  Note(jj): Fails on esp32 but works on x86
    // assert_eq!(y1, y2);
}

Output:

arch[x86_64]:y.1 = -0.9238794+0.000000023849761i
arch[x86_64]:y.2 = -0.9238794+0.000000023849761i
..
arch[xtensa]:y.1 = -0.9238794-0.000000023849761i  <- WRONG
arch[xtensa]:y.2 = -0.9238794+0.000000023849761i

See also here: https://github.com/ramtej/esp32-compiler-marvels-std-rs/blob/1032788fbcbf3733c832a0e3af039c00ba79f802/src/main.rs#L73

The problem occurs when two complex f32 numbers are multiplied, but only if one of these numbers is in an array (or vector?), as seen in the example above.

Another condition is that this vector must be referenced beforehand, e.g. with a println! or a compiler hint; without that, the correct value is calculated. The two results differ only in the sign of the imaginary part of the complex number.
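
(For reference: multiplying by Complex32::new(0., -1.) is a rotation by -90 degrees, since (a + bi) * (-i) = b - ai. With a = -0.000000023849761 and b = -0.9238794 the product is -0.9238794 + 0.000000023849761i, exactly the x86 result; the xtensa result flips the sign of the imaginary part.)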

Thank you, Jiri

ramtej commented 1 year ago

I was able to narrow the issue down further:

#[derive(Clone, Debug)]
struct MyComplex<T> {
    re: T,
    im: T,
}

impl MyComplex<f32> {
    fn new(re: f32, im: f32) -> Self {
        Self { re, im }
    }
}

pub fn test_compute_butterflies_narrow_v2() {
    let c1 = MyComplex::new(-1., -1.);
    let x: Vec<MyComplex<f32>> = vec![c1.clone()];

    std::hint::black_box(&x);
    //println!("arch[{}]:x = {:?}", ARCH, x);

    let c2 = MyComplex::new(0., -1.);
    let re = x[0].re * c2.re - x[0].im * c2.im;
    let im = x[0].re * c2.im + x[0].im * c2.re;
    println!("arch[{}]:re = {}", ARCH, re);
    println!("arch[{}]:im = {}", ARCH, im);
    assert_eq!(re, -1.0);
    assert_eq!(im, 1.0); // <- fails on esp32
}
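
By hand: re = (-1) * 0 - (-1) * (-1) = -1 and im = (-1) * (-1) + (-1) * 0 = 1, so the asserted values are the mathematically correct ones.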

Output

arch[x86_64]:re = -1
arch[x86_64]:im = 1
..
arch[xtensa]:re = -1
arch[xtensa]:im = -1 <- WRONG

On the other hand, if I perform the same calculation without a struct, the correct values are calculated.

pub fn test_compute_butterflies_narrow_v3() {
    let re1 = -1.;
    let im1 = -1.;
    let x = vec![re1, im1];

    std::hint::black_box(&x);

    let re2 = 0.0;
    let im2 = -1.0;

    let re = x[0] * re2 - x[1] * im2;
    let im = x[0] * im2 + x[1] * re2;
    println!("arch[{}]:re = {}", ARCH, re);
    println!("arch[{}]:im = {}", ARCH, im);

    assert_eq!(re, -1.0);
    assert_eq!(im, 1.0); // <- works on esp32
}

The combination of a struct inside a Vec, together with the println! or the compiler hint, triggers the issue.

Summary

im = (-1) * (-1) + (-1) * 0 = 1, but on the xtensa arch it is -1.

ramtej commented 1 year ago

Is there anything else I can do? From my point of view, the issue is quite fundamental, and I'm honestly surprised that it hasn't been discovered before; a regression test, a fuzzer, and/or quickcheck should have found it. What can I do?

Thank you.

Kurielu commented 1 year ago

I was going to submit an issue, but this seems to be the same problem I've got, so I will give my example to reproduce the error:

use std::f32::consts::PI;

fn main() {
    let a = 4.215796;
    println!("{a}"); // print_1
    let pi_3 = PI / 3.0;
    println!("{pi_3}"); // print_2
    let c = 5.0 * pi_3;
    println!("{c}"); // print_3
    println!("equal and positive {} and {}", c - a, 5.0 * pi_3 - a);
}

Output: "should be the same and positive 1.0201917 and -1.0201919"

Let's say this is +/- (there is only a problem with the sign), where it should be +/+.
With print_1 or print_2 commented out: +/+.
With print_3 commented out: -/-.
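
(By hand: pi_3 = PI / 3 ≈ 1.0471976, c = 5 * pi_3 ≈ 5.2359877, and 5.2359877 - 4.215796 ≈ 1.0201917, so both printed differences should be +1.0201917; only the sign of the second is wrong.)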

rustc 1.70.0, esp-idf v5.0, 24b9d38, using an esp32s3

ramtej commented 1 year ago

I can reproduce it. That narrows it down even further, thx!

https://rust.godbolt.org/ -> can we get any further there, e.g. by looking at what the compiler actually generates? Any other ideas?

Kurielu commented 1 year ago

It is not my specialization, so I won't be able to help further. Here is a minimal example that I've found:

fn main() -> ! {
    let a = 0.0;
    let p: f32 = 1.0;
    println!("{p}");
    println!("{a}");
    let v = 0.99 * p - a;
    println!("{}", v); // "-0.99"!
    loop {}
}
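
(Expected: v = 0.99 * 1.0 - 0.0 = 0.99, yet -0.99 is printed; only the sign is wrong.)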

The neat thing about it is that we can "force" the correct sign by changing

let v = 0.99 * p - a;
// into
let v = 0.99 * p + a;

which gives a very small assembly difference:

2933c2933
< 42000751: 4aa980          madd.s  f10, f9, f8
---
> 42000751: 5aa980          msub.s  f10, f9, f8
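
(For context: per the Xtensa ISA, madd.s fr, fs, ft computes fr = fr + fs * ft, while msub.s computes fr = fr - fs * ft, so this one-instruction difference flips the sign of the accumulated product.)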

ramtej commented 1 year ago

Hmm, are we about to invent new arithmetic? The problem is transitive, meaning it can occur in dependent crates.

zRedShift commented 1 year ago

Can confirm I have this issue, and have had it multiple times in the past. Soft FP always fixes it, but I would like the performance, heh.

In my (current) case, it shows up when doing a final twiddle step in an RDFT:

// C here is the author's Complex<f32>-style type (re/im fields, scalar Mul).
fn twiddle<const N: usize>(twiddles: &[C; N], left: &mut [C; N], right: &mut [C; N]) {
    for ((&twiddle, left), right) in twiddles
        .iter()
        .zip(left.iter_mut())
        .zip(right.iter_mut().rev())
    {
        let sum = *left + *right;
        let diff = *left - *right;
        let twiddled_re_sum = sum * twiddle.re;
        let twiddled_im_sum = sum * twiddle.im;
        let twiddled_re_diff = diff * twiddle.re;
        let twiddled_im_diff = diff * twiddle.im;
        let half_sum_re = 0.5 * sum.re;
        let half_diff_im = 0.5 * diff.im;
        let output_twiddled_real = twiddled_re_sum.im + twiddled_im_diff.re;
        let output_twiddled_im = twiddled_im_sum.im - twiddled_re_diff.re;
        *left = C {
            re: half_sum_re + output_twiddled_real,
            im: half_diff_im + output_twiddled_im,
        };
        *right = C {
            re: half_sum_re - output_twiddled_real,
            im: output_twiddled_im - half_diff_im,
        };
    }
}

If I don't put logging in the middle, it will sometimes decide to send the values to the wrong places (swap the im values of left and right, for instance). This makes it apparent that the problem is in basic arithmetic (an addition turned into a subtraction), since in this case the calculations are anti-symmetric.

zRedShift commented 1 year ago

@Kurielu @ramtej can you try to reproduce on https://github.com/esp-rs/rust-build/releases/tag/v1.71.0.0 ? I don't have any reason to suspect it got fixed by itself, but who knows.

zRedShift commented 1 year ago

I can't check on a real esp32s3 right now, but from reading the compiled assembly for

#[inline(never)]
fn math(x: f32, y: f32) -> f32 {
    0.99 * y - x
}

From what I can see, it translates to

42007488 <_ZN17fp_miscompilation4math17h29f7850c19794fe1E.llvm.13748823926208495174>:
42007488:       004136          entry   a1, 32
4200748b:       e49281          l32r    a8, 420006d4 <__default_double_exception+0x1c875e0>
4200748e:       fa8850          wfr     f8, a8
42007491:       fa9350          wfr     f9, a3
42007494:       faa250          wfr     f10, a2
42007497:       5aa980          msub.s  f10, f9, f8
4200749a:       fa2a40          rfr     a2, f10
4200749d:       f01d            retw.n

Read 0.99 into a8.
Load a8 into f8, an FP register.
Load a3 (y) into f9.
Load a2 (x) into f10.
msub.s writes f10 - (f9 * f8) to f10.
Write f10 to a2 and return.

So it returns x - y * 0.99 instead of 0.99 * y - x.

I'll check that the madd.s/msub.s instruction encoding of the registers isn't flipped in XtensaInstrInfo.td

Kurielu commented 1 year ago

Still the same. Maybe this will be a clue for you.

#[inline(never)]
fn math(x: f32, y: f32) -> f32 {
    0.99 * y - x
}

#[entry]
fn main() -> ! {
    let a = 0.0;
    let p = 0.99f32;
    let v = math(a, p);
    println!("{}", v);
    loop {}
}

#[inline(never)] breaks math right away, without the need to print any variable. With #[inline(always)] we get a positive number.

zRedShift commented 1 year ago

In my case it breaks regardless. The instruction definition seems correct, but something else bothers me.

The instruction works like fr - (fs * ft); it doesn't support (fs * ft) - fr, so it makes sense that it's used this way, but it seems like a negation step is missing afterwards: neg.s f10, f10
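
To make the sign error concrete, here is a plain-Rust model of the msub.s semantics described above (the helper name and values are illustrative, not compiler code):

// Model of Xtensa msub.s: fr = fr - fs * ft.
fn msub_s(fr: f32, fs: f32, ft: f32) -> f32 {
    fr - fs * ft
}

fn main() {
    let (x, y) = (0.0_f32, 1.0_f32);
    let expected = 0.99 * y - x;          // what the source expresses: 0.99
    let miscompiled = msub_s(x, y, 0.99); // what the emitted code computes: x - y * 0.99 = -0.99
    let with_neg = -msub_s(x, y, 0.99);   // with the missing neg.s applied: 0.99
    println!("{expected} {miscompiled} {with_neg}");
}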

zRedShift commented 1 year ago

#[inline(never)]
fn math(x: f32, y: f32) -> f32 {
    unsafe { core::intrinsics::fmaf32(0.99, y, -x) }
}

Generates

42007488 <_ZN17fp_miscompilation4math17h29f7850c19794fe1E.llvm.6984577647175520593>:
42007488:       004136          entry   a1, 32
4200748b:       e33b81          l32r    a8, 42000178 <__default_double_exception+0x1c87084>
4200748e:       308280          xor     a8, a2, a8
42007491:       fa8850          wfr     f8, a8
42007494:       e49081          l32r    a8, 420006d4 <__default_double_exception+0x1c875e0>
42007497:       fa9850          wfr     f9, a8
4200749a:       faa350          wfr     f10, a3
4200749d:       4a8a90          madd.s  f8, f10, f9
420074a0:       fa2840          rfr     a2, f8
420074a3:       f01d            retw.n

XOR the top bit of x to negate it and write it to f8; load 0.99 into f9 and y into f10; madd.s then computes f8 + (f10 * f9), i.e. -x + y * 0.99, which is correct. Checking the LLVM DAG lowering code, specifically performMADD_MSUBCombine.

zRedShift commented 1 year ago

@ramtej @MabezDev @Kurielu Just an update: I've determined that the problem is indeed in performMADD_MSUBCombine. It treats x * y ± z and z ± x * y symmetrically, basically as equivalent, which is obviously a problem when the operator is minus: x * y - z then gets lowered to the msub.s that actually computes z - x * y.

This code seems to have been copied, with minor modifications, from the MIPS backend, which doesn't have a fused multiply-subtract instruction and therefore only has to handle the FMA case.

Working on a fix that I will submit as a PR to espressif/llvm-project shortly.

ramtej commented 1 year ago

Thanks a lot - you are a hero :-)

MabezDev commented 1 year ago

Sorry folks, I've been on vacation. Thanks for looking into this, and thank you @zRedShift for figuring out a solution! I'll make sure we get that LLVM PR reviewed asap and cut a patch release of the Rust compiler.

zRedShift commented 1 year ago

I tried my FFT twiddling code from earlier with a newly built version of 1.71.0.1 using my branch of LLVM (rust-esp), and it gives the correct results, so all good on that point.

And I finally understand why the floating-point values sometimes became correct when printing them (a heisenbug that eluded me for a while, since this miscompilation isn't new). performMADD_MSUBCombine checks whether the multiplication result has exactly one use (that is, no one cares about the product by itself), and only then does the fusing. When printing the intermediate values, the printer became a second use, and the optimization pass was aborted.
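
A sketch of what that "one use" condition means at the source level (illustrative, not the actual compiler test case):

#[inline(never)]
fn fused(x: f32, y: f32) -> f32 {
    // The product 0.99 * y has a single use (the subtraction),
    // so the combine fires and emits the (formerly buggy) msub.s.
    0.99 * y - x
}

#[inline(never)]
fn not_fused(x: f32, y: f32) -> f32 {
    let p = 0.99 * y;
    println!("{p}"); // a second use of the product, so the combine bails out
    p - x
}

fn main() {
    println!("{} {}", fused(0.0, 1.0), not_fused(0.0, 1.0));
}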

MabezDev commented 1 year ago

The esp 1.72 release is now available as a pre-release, and includes @zRedShift's LLVM patches: https://github.com/esp-rs/rust-build/releases/tag/v1.72.0.0

ramtej commented 1 year ago

Hello, I can confirm this, and my application now also works with hardware FP; I can measure a performance increase of ~3.6x. Thanks again for the support. I will close this issue now. See you all at the next ticket/issue, lol.

Regards, Jiri