`"fat"` LTO Miscompilation

zRedShift commented 2 years ago

I tried this code (repo: esp-spinlock-repro):

fn main() -> anyhow::Result<()> {
    esp_idf_sys::link_patches();
    esp_idf_hal::task::critical_section::link();
    esp_idf_svc::timer::embassy_time::driver::link();
    esp_idf_svc::timer::embassy_time::queue::link();
    esp_idf_svc::log::EspLogger::initialize_default();

    let executor = esp_idf_hal::task::executor::EspExecutor::<4, _>::new();
    let task = executor.spawn_local(async {
        let mut counter = 0.0f64;
        loop {
            log::info!("counter at {counter}");
            embassy_time::Timer::after(embassy_time::Duration::from_millis(10)).await;
            counter += 1.0;
        }
    })?;

    executor.run_tasks(|| true, [task]);

    Ok(())
}

When I'm running code I've compiled with "fat" LTO, on 1.64.0.0 or 1.65.0.0, with the xtensa-esp32s3-espidf target on my ESP32S3, regardless of the esp-idf branch, my code hangs indefinitely after 256 iteratiots:

I (314) esp_spinlock_repro: counter at 0
I (324) esp_spinlock_repro: counter at 1
I (334) esp_spinlock_repro: counter at 2
I (344) esp_spinlock_repro: counter at 3
...

I (2914) esp_spinlock_repro: counter at 252
I (2924) esp_spinlock_repro: counter at 253
I (2934) esp_spinlock_repro: counter at 254
I (2944) esp_spinlock_repro: counter at 255

If I disable LTO or move to "thin" LTO, the code keeps chugging along with no end in sight (what I expect to see).

Meta

rustc --version --verbose:

rustc 1.65.0-nightly (bf1b78e4a 2022-11-02)
binary: rustc
commit-hash: bf1b78e4a60871d6fed85f288274db7cdbd82912
commit-date: 2022-11-02
host: x86_64-unknown-linux-gnu
release: 1.65.0-nightly
LLVM version: 15.0.0

N3xed commented 2 years ago

Likely related to a scheduling issue I've had a while back.

Basically, the MPMC queue used for the scheduler is implemented as a ringbuffer using atomic operations, and that ringbuffer has 8 bit (i.e. 256) indices. Now presumably through some atomics bug, wrapping arithmetic bug or a bug in the implementation, the ringbuffer never wraps around, leaving the queue in the full state indefinitely. So no more tasks can be enqueued and so the timer stops working.

Initially, @ivmarkov solved it by using the crossbeam_queue::ArrayQueue as the mpmc queue instead. But I think, more likely is that the issue just got delayed since crossbeams's ArrayQueue uses 32 bit indices and its implementation is based on the same source (i.e. Dmitry Vyukov's bounded MPMC queue). With 32 bit indices the mpmc queue's index would need 2^24 times longer to reach u32::MAX, though I've not tested it if the same wrap-around issue also happens with bigger indices. (Note also that the issue also "goes away" if heapless's feature mpmc_large is used).

Since heapless::mpmc::MpMcQueue is the mpmc queue used by default in edge-executor as its crossbeam-queue feature is disabled by default with default-features = false in esp-idf-hal, I think this is indeed the same issue.

My issue occurred regardless of LTO mode, and someone (I can't remember who) tested it on the esp32c3 where it worked fine. Also note, that this issue even occurred in wokwi for me, so unlikely to be a hardware bug.

MabezDev commented 2 years ago

@N3xed Do you know where the original discussion for the MPMC issue exists? I can't seem to find it across the various esp-rs repos. Might be helpful in producing a smaller example, especially if it was happening without LTO.

N3xed commented 2 years ago

@N3xed Do you know where the original discussion for the MPMC issue exists? I can't seem to find it across the various esp-rs repos. Might be helpful in producing a smaller example, especially if it was happening without LTO.

Unfortunately, I've never created an issue for this, so this is the only place as far as I know. When I discovered this issue I worked around it, went on to other stuff, and never got around to it until @zRedShift brought it up in the matrix chat.

There is my discussion about it with @ivmarkov in the matrix chat though.

zRedShift commented 2 years ago

@MabezDev @N3xed @ivmarkov I significantly reduced the minimal reproduction and moved it to no_std. I also updated the README.md and added the disassembly diff (sort of).

#![no_std]
#![no_main]

use esp_backtrace as _;

#[xtensa_lx_rt::entry]
fn main() -> ! {
    let queue = heapless::mpmc::MpMcQueue::<(), 8>::new();
    let mut counter = 0;
    loop {
        esp_println::println!("counter at {counter}");
        queue.enqueue(()).unwrap();
        queue.dequeue().unwrap();
        counter += 1;
    }
}

Please look for more at: https://github.com/zRedShift/esp-spinlock-repro

zRedShift commented 2 years ago

@MabezDev this is getting to the limit of what I can do, but compiling the following branch: https://github.com/zRedShift/esp-spinlock-repro/blob/without-heapless/src/main.rs (without heapless) with thin and fat LTO and comparing the disassembly (the decompiled C code seemingly is the exact same), we have only a minimal set of changes:

thin LTO

```asm ************************************************************** * FUNCTION * ************************************************************** uint __stdcall inner(uint param_1) uint o2:4 uint i2:4 param_1 _ZN18esp_spinlock_repro5inner17h372c0f243cfa6d XREF[3]: 420003b4(*), esp_spinlock_repro::esp_spinlock_repro::inner __xtensa_lx_rt_main_trampoline:4 .debug_frame::00000d8c(*) 420036f4 36 81 00 entry a1,0x40 420036f7 81 1f f3 l32r a8,DAT_42000374 = 07060504h 420036fa 89 31 s32i.n a8,a1,0xc 420036fc 81 1f f3 l32r a8,DAT_42000378 = 03020100h 420036ff 89 21 s32i.n a8,a1,0x8 42003701 0c 08 movi.n a8,0x0 42003703 82 51 08 s16i a8,a1,0x10 42003706 0c 78 movi.n a8,0x7 42003708 89 11 s32i.n a8,a1,0x4 4200370a 0c 39 movi.n a9,0x3 4200370c a2 a0 ff movi a10,0xff 4200370f 7c fb movi.n a11,0xff 42003711 46 03 00 j LAB_42003722 LAB_42003714 XREF[1]: 420037ea(j) 42003714 82 c1 08 addi a8,a1,0x8 42003717 d0 88 20 or a8,a8,a13 4200371a c0 20 00 memw 4200371d 8b cc addi.n a12,a12,0x8 4200371f c2 48 00 s8i a12,a8,0x0 LAB_42003722 XREF[2]: 42003711(j), 42003805(j) 42003722 d2 01 11 l8ui a13,a1,0x11 42003725 88 11 l32i.n a8,a1,0x4 42003727 80 ed 10 and a14,a13,a8 4200372a 1b cd addi.n a12,a13,0x1 LAB_4200372c XREF[1]: 4200377d(j) 4200372c f2 c1 08 addi a15,a1,0x8 4200372f e0 ff 20 or a15,a15,a14 42003732 f2 0f 00 l8ui a15,a15,0x0 42003735 c0 20 00 memw 42003738 d7 1f 02 beq a15,a13,LAB_4200373e 4200373b 46 2f 00 j LAB_420037fc LAB_4200373e XREF[1]: 42003738(j) 4200373e f2 c1 08 addi a15,a1,0x8 42003741 9b ff addi.n a15,a15,0x9 42003743 f0 69 10 and a6,a9,a15 42003746 60 7f c0 sub a7,a15,a6 42003749 d0 f6 11 slli a15,a6,0x3 4200374c 00 1f 40 ssl a15 4200374f 00 6a a1 sll a6,a10 42003752 b0 66 30 xor a6,a6,a11 42003755 58 07 l32i.n a5,a7,0x0 42003757 60 25 10 and a2,a5,a6 4200375a 00 5d a1 sll a5,a13 4200375d 00 4c a1 sll a4,a12 LAB_42003760 XREF[1]: 42003771(j) 42003760 8d 02 mov.n a8,a2 42003762 80 25 20 or a2,a5,a8 42003765 80 34 20 or a3,a4,a8 42003768 20 0c 13 wsr a2,SCOMPARE1 4200376b 32 e7 00 s32c1i a3,a7,0x0 4200376e 60 23 10 and a2,a3,a6 42003771 87 92 eb bne a2,a8,LAB_42003760 42003774 00 0f 40 ssr a15 42003777 30 80 91 srl a8,a3 4200377a a0 88 10 and a8,a8,a10 4200377d d7 98 ab bne a8,a13,LAB_4200372c 42003780 82 c1 08 addi a8,a1,0x8 42003783 e0 88 20 or a8,a8,a14 42003786 c0 20 00 memw 42003789 c2 48 00 s8i a12,a8,0x0 LAB_4200378c XREF[1]: 420037f6(j) 4200378c c2 01 10 l8ui a12,a1,0x10 4200378f 88 11 l32i.n a8,a1,0x4 42003791 80 dc 10 and a13,a12,a8 42003794 1b ec addi.n a14,a12,0x1 LAB_42003796 XREF[1]: 420037e7(j) 42003796 a0 8e 10 and a8,a14,a10 42003799 f2 c1 08 addi a15,a1,0x8 4200379c d0 ff 20 or a15,a15,a13 4200379f 62 0f 00 l8ui a6,a15,0x0 420037a2 c0 20 00 memw 420037a5 87 96 44 bne a6,a8,LAB_420037ed 420037a8 82 c1 08 addi a8,a1,0x8 420037ab 8b 88 addi.n a8,a8,0x8 420037ad 80 f9 10 and a15,a9,a8 420037b0 f0 78 c0 sub a7,a8,a15 420037b3 d0 ff 11 slli a15,a15,0x3 420037b6 00 1f 40 ssl a15 420037b9 00 8a a1 sll a8,a10 420037bc b0 58 30 xor a5,a8,a11 420037bf 88 07 l32i.n a8,a7,0x0 420037c1 50 28 10 and a2,a8,a5 420037c4 00 4c a1 sll a4,a12 420037c7 00 66 a1 sll a6,a6 LAB_420037ca XREF[1]: 420037db(j) 420037ca 8d 02 mov.n a8,a2 420037cc 80 24 20 or a2,a4,a8 420037cf 80 36 20 or a3,a6,a8 420037d2 20 0c 13 wsr a2,SCOMPARE1 420037d5 32 e7 00 s32c1i a3,a7,0x0 420037d8 50 23 10 and a2,a3,a5 420037db 87 92 eb bne a2,a8,LAB_420037ca 420037de 00 0f 40 ssr a15 420037e1 30 80 91 srl a8,a3 420037e4 a0 88 10 and a8,a8,a10 420037e7 c7 98 ab bne a8,a12,LAB_42003796 420037ea 86 c9 ff j LAB_42003714 LAB_420037ed XREF[1]: 420037a5(j) 420037ed e0 86 c0 sub a8,a6,a14 420037f0 80 88 01 slli a8,a8,0x18 420037f3 80 88 31 srai a8,a8,0x18 420037f6 d6 28 f9 bgez a8,LAB_4200378c 420037f9 c6 02 00 j LAB_42003808 LAB_420037fc XREF[1]: 4200373b(j) 420037fc d0 8f c0 sub a8,a15,a13 420037ff 80 88 01 slli a8,a8,0x18 42003802 80 88 31 srai a8,a8,0x18 42003805 d6 98 f1 bgez a8,LAB_42003722 LAB_42003808 XREF[1]: 420037f9(j) 42003808 1d f0 retw.n 4200380a 00 ?? 00h 4200380b 00 ?? 00h ```

fat LTO

```asm ************************************************************** * esp_spinlock_repro::inner * ************************************************************** uint __stdcall inner(uint param_1) uint o2:4 uint i2:4 param_1 _ZN18esp_spinlock_repro5inner17h4a0b3ef61100c1 XREF[3]: 42000068(*), main:42000681(c), esp_spinlock_repro::inner .debug_frame::0000001c(*) 4200045c 36 81 00 entry a1,0x40 4200045f 81 f0 fe l32r a8,_text_start = 07060504h 42000462 89 31 s32i.n a8,a1,0xc 42000464 81 f0 fe l32r a8,DAT_42000024 = 03020100h 42000467 89 21 s32i.n a8,a1,0x8 42000469 0c 08 movi.n a8,0x0 4200046b 82 51 08 s16i a8,a1,0x10 4200046e 0c 78 movi.n a8,0x7 42000470 89 11 s32i.n a8,a1,0x4 42000472 0c 39 movi.n a9,0x3 42000474 a2 a0 ff movi a10,0xff 42000477 7c fb movi.n a11,0xff 42000479 46 03 00 j LAB_4200048a LAB_4200047c XREF[1]: 42000552(j) 4200047c 82 c1 08 addi a8,a1,0x8 4200047f d0 88 20 or a8,a8,a13 42000482 c0 20 00 memw 42000485 8b cc addi.n a12,a12,0x8 42000487 c2 48 00 s8i a12,a8,0x0 LAB_4200048a XREF[2]: 42000479(j), 4200056d(j) 4200048a d2 01 11 l8ui a13,a1,0x11 4200048d 88 11 l32i.n a8,a1,0x4 4200048f 80 ed 10 and a14,a13,a8 42000492 1b cd addi.n a12,a13,0x1 LAB_42000494 XREF[1]: 420004e5(j) 42000494 f2 c1 08 addi a15,a1,0x8 42000497 e0 ff 20 or a15,a15,a14 4200049a f2 0f 00 l8ui a15,a15,0x0 4200049d c0 20 00 memw 420004a0 d7 1f 02 beq a15,a13,LAB_420004a6 420004a3 46 2f 00 j LAB_42000564 LAB_420004a6 XREF[1]: 420004a0(j) 420004a6 f2 c1 08 addi a15,a1,0x8 420004a9 9b ff addi.n a15,a15,0x9 420004ab f0 69 10 and a6,a9,a15 420004ae 60 7f c0 sub a7,a15,a6 420004b1 d0 f6 11 slli a15,a6,0x3 420004b4 00 1f 40 ssl a15 420004b7 00 6a a1 sll a6,a10 420004ba b0 66 30 xor a6,a6,a11 420004bd 58 07 l32i.n a5,a7,0x0 420004bf 60 25 10 and a2,a5,a6 420004c2 00 5d a1 sll a5,a13 420004c5 00 4c a1 sll a4,a12 LAB_420004c8 XREF[1]: 420004d9(j) 420004c8 8d 02 mov.n a8,a2 420004ca 80 25 20 or a2,a5,a8 420004cd 80 34 20 or a3,a4,a8 420004d0 20 0c 13 wsr a2,SCOMPARE1 420004d3 32 e7 00 s32c1i a3,a7,0x0 420004d6 60 23 10 and a2,a3,a6 420004d9 87 92 eb bne a2,a8,LAB_420004c8 420004dc 00 0f 40 ssr a15 420004df 30 80 91 srl a8,a3 420004e2 a0 88 10 and a8,a8,a10 420004e5 d7 98 ab bne a8,a13,LAB_42000494 420004e8 82 c1 08 addi a8,a1,0x8 420004eb e0 88 20 or a8,a8,a14 420004ee c0 20 00 memw 420004f1 c2 48 00 s8i a12,a8,0x0 LAB_420004f4 XREF[1]: 4200055e(j) 420004f4 c2 01 10 l8ui a12,a1,0x10 420004f7 88 11 l32i.n a8,a1,0x4 420004f9 80 dc 10 and a13,a12,a8 420004fc 1b ec addi.n a14,a12,0x1 LAB_420004fe XREF[1]: 4200054f(j) 420004fe a0 8e 10 and a8,a14,a10 42000501 f2 c1 08 addi a15,a1,0x8 42000504 d0 ff 20 or a15,a15,a13 42000507 f2 0f 00 l8ui a15,a15,0x0 4200050a c0 20 00 memw 4200050d 87 9f 44 bne a15,a8,LAB_42000555 42000510 82 c1 08 addi a8,a1,0x8 42000513 8b 88 addi.n a8,a8,0x8 42000515 80 f9 10 and a15,a9,a8 42000518 f0 78 c0 sub a7,a8,a15 4200051b d0 ff 11 slli a15,a15,0x3 4200051e 00 1f 40 ssl a15 42000521 00 8a a1 sll a8,a10 42000524 b0 68 30 xor a6,a8,a11 42000527 88 07 l32i.n a8,a7,0x0 42000529 60 28 10 and a2,a8,a6 4200052c 00 5c a1 sll a5,a12 4200052f 00 4e a1 sll a4,a14 LAB_42000532 XREF[1]: 42000543(j) 42000532 8d 02 mov.n a8,a2 42000534 80 25 20 or a2,a5,a8 42000537 80 34 20 or a3,a4,a8 4200053a 20 0c 13 wsr a2,SCOMPARE1 4200053d 32 e7 00 s32c1i a3,a7,0x0 42000540 60 23 10 and a2,a3,a6 42000543 87 92 eb bne a2,a8,LAB_42000532 42000546 00 0f 40 ssr a15 42000549 30 80 91 srl a8,a3 4200054c a0 88 10 and a8,a8,a10 4200054f c7 98 ab bne a8,a12,LAB_420004fe 42000552 86 c9 ff j LAB_4200047c LAB_42000555 XREF[1]: 4200050d(j) 42000555 e0 8f c0 sub a8,a15,a14 42000558 80 88 01 slli a8,a8,0x18 4200055b 80 88 31 srai a8,a8,0x18 4200055e d6 28 f9 bgez a8,LAB_420004f4 42000561 c6 02 00 j LAB_42000570 LAB_42000564 XREF[1]: 420004a3(j) 42000564 d0 8f c0 sub a8,a15,a13 42000567 80 88 01 slli a8,a8,0x18 4200056a 80 88 31 srai a8,a8,0x18 4200056d d6 98 f1 bgez a8,LAB_4200048a LAB_42000570 XREF[1]: 42000561(j) 42000570 1d f0 retw.n 42000572 00 ?? 00h 42000573 00 ?? 00h ```

Besides label, cross-references and debug-info, the only difference is the register allocation, which after equalizing the labels comes down to this diff:

79c79
<         4200379f 62 0f 00        l8ui       a6,a15,0x0
---
>         4200379f f2 0f 00        l8ui       a15,a15,0x0
81c81
<         420037a5 87 96 44        bne        a6,a8,LAB_420037ed
---
>         420037a5 87 9f 44        bne        a15,a8,LAB_420037ed
89c89
<         420037bc b0 58 30        xor        a5,a8,a11
---
>         420037bc b0 68 30        xor        a6,a8,a11
91,93c91,93
<         420037c1 50 28 10        and        a2,a8,a5
<         420037c4 00 4c a1        sll        a4,a12
<         420037c7 00 66 a1        sll        a6,a6
---
>         420037c1 60 28 10        and        a2,a8,a6
>         420037c4 00 5c a1        sll        a5,a12
>         420037c7 00 4e a1        sll        a4,a14
96,97c96,97
<         420037cc 80 24 20        or         a2,a4,a8
<         420037cf 80 36 20        or         a3,a6,a8
---
>         420037cc 80 25 20        or         a2,a5,a8
>         420037cf 80 34 20        or         a3,a4,a8
100c100
<         420037d8 50 23 10        and        a2,a3,a5
---
>         420037d8 60 23 10        and        a2,a3,a6
108c108
<         420037ed e0 86 c0        sub        a8,a6,a14
---
>         420037ed e0 8f c0        sub        a8,a15,a14

zRedShift commented 2 years ago

Narrowed it down to 4200052f 00 4e a1 sll a4,a14 in the fat lto file it should be 4200052f 00 4f a1 sll a4,a15, right before the CAS loop.

MabezDev commented 4 weeks ago

I ported your code to https://github.com/esp-rs/esp-hal/compare/main...MabezDev:esp-hal:esp-spinlock-repro and retried (all examples are compiled with lto='fat'). This bug seems to be resolved.

esp-rs / rust

`"fat"` LTO Miscompilation #146

Meta