embassy-rs / nrf-softdevice

Apache License 2.0
269 stars 80 forks source link

Custom GATT service with u64 characteristic causes the application to stuck/crash in build-release mode #253

Closed igiona closed 6 months ago

igiona commented 6 months ago

I have a pretty basic application (peripheral only) which has a GATT server containing a BAS, DeviceInfo and a custom service. The custom service is defined as follows:

#[nrf_softdevice::gatt_service(uuid = "my-service-uuid")]
pub struct CustomService {
    #[characteristic(
        uuid = "my-char1-uuid",
        security = "justworks",
        read,
        write
    )]
    u64_value: u64,

    #[characteristic(
        uuid = "my-char2-uuid",
        security = "justworks",
        read,
        notify
    )]
    data: heapless::Vec<u8, 17>,

    #[characteristic(
        uuid = "my-char3-uuid",
        security = "justworks",
        write
    )]
    u8_value: u8,
}

My GATT server runs in a dedicated task as follow:

#[embassy_executor::task]
async fn gatt_server_task(connection: Connection, server: &'static Server) {
    let handle = connection.handle();
    let e = gatt_server::run(&connection, server, |evt| match evt {
        ServerEvent::DeviceInfo(_) => { /* Read-only service */ }
        ServerEvent::Bas(bas_evt) => { /* do something */,
        ServerEvent::CustomService(custom_service_event) => {
            match custom_service_event {
                CustomServiceEvent::U64ValueWrite(val) => {
                    // WARNING! Accessing val here stalls the application in release mode!
                    info!("Got new u64 value {}", val);
                },
                _ => { /* ... */ }
            }
        }
    })
    .await;

In debug-mode, the code performs as expected. When I build and run my code in release mode though, everything runs as expected until the point in which I access the GATT value in the OnWrite event: info!("Got new u64 value {}", val); As soon as the BLE client writes the u64_value characteristic, the whole MCU stalls and nothing runs anymore. A reset is required to recover from that state. If I do not access the val at all (log message commented out) in the event (just a debugging step, not really a valuable workaround), the MCU doesn't stuck anymore.

My build-release configuration is as follow:

[profile.release]
codegen-units = 1
debug = 2
debug-assertions = false
incremental = false
lto = "fat"
opt-level = 's'
overflow-checks = false

I have been able to work around this issue with the usage of a heapless::Vec<u8, 8> instead of a u64, and a manual (de-)serialization of the byte array into/from a u64 value.

Can someone explain this behaviour or has ideas on how further debug this?

Dirbaio commented 6 months ago

As soon as the BLE client writes the u64_value characteristic, the whole MCU stalls and nothing runs anymore.

What are you using for logging? If you're using something other than defmt-rttt + a debug probe, it might be a panic that you don't get to see.

igiona commented 6 months ago

As soon as the BLE client writes the u64_value characteristic, the whole MCU stalls and nothing runs anymore.

What are you using for logging? If you're using something other than defmt-rttt + a debug probe, it might be a panic that you don't get to see.

I'm using defmt-rtt = "0.4.0" and probe-rs with a non-JLINK probe (=> AK-CMSIS-DAP JTAG/SWD). If I let the code panic! I can see the panic message from the probe, in this case though there is no message pushed out and actually even some of the older log messages that should be there are not pushed out. I just added a couple of info!("test"); before the access of the incriminated variable, and non of them is arriving on the terminal. I guess that these string are still in some cache of the RTT module...

Dirbaio commented 6 months ago

Then it's not a panic indeed. Strange.

Can you run probe-rs with --always-print-stacktrace, then kill it with control+c when it hangs, and paste the stacktrace? ideally without --release so the stacktrace is better, hopefully it still reprodues in that case.

If the stacktrace points to WFE in the executor it's an async hang, I guess you'd need to add prints inside the nrf-softdevice code to figure out where the hang is...

igiona commented 6 months ago

Then it's not a panic indeed. Strange.

Can you run probe-rs with --always-print-stacktrace, then kill it with control+c when it hangs, and paste the stacktrace? ideally without --release so the stacktrace is better, hopefully it still reprodues in that case.

If the stacktrace points to WFE in the executor it's an async hang, I guess you'd need to add prints inside the nrf-softdevice code to figure out where the hang is...

Thanks for your suggestion, I'll give it a try and come back with the stack trace. Unfortunately, the issue happens only in release mode :'(

alexmoon commented 6 months ago

The implementation of FixedGattValue::from_gatt for Primitive looks sus to me. If data is unaligned, I think you could have UB. It should probably be using core::ptr::read_unaligned instead.

igiona commented 6 months ago

I added --always-print-stacktrace as suggested. The stack trace shows that @alexmoon you are on the right spot!

Frame 0: __cortex_m_rt_HardFault @ 0x0002b692 inline
       src/main.rs:60:5
Frame 1: __cortex_m_rt_HardFault_trampoline @ 0x000000000002b692
       src/main.rs:58:1
Frame 2: "HardFault handler. Cause: Escalated UsageFault (Unaligned access)." @ 0x0002b692
Frame 3: from_gatt<u64> @ 0x00025b44 inline
       .cargo\registry\src\index.crates.io-6f17d22bba15001f\nrf-softdevice-0.1.0\src\ble/gatt_traits.rs:69:18
Frame 4: from_gatt<u64> @ 0x0000000000025b40
       .cargo\registry\src\index.crates.io-6f17d22bba15001f\nrf-softdevice-0.1.0\src\ble/gatt_traits.rs:42:9
Frame 5: on_write @ 0x0001ce5a inline

So I compared the --release build with a normal build with the following code in place:

    fn from_gatt(data: &[u8]) -> Self {
        if data.len() != Self::SIZE {
            panic!("Bad len")
        }
        let data_raw : *const u8 = data.as_ptr();
        if ((data_raw as usize) % 4) != 0 {
            panic!("NOT ALIGNED");
        }      
        unsafe { *(data.as_ptr() as *const Self) }
    }

To my somewhat surprise, in both cases the code panic :/ Maybe my check doesn't do what I assumes it does after all!?

Never the less, following @alexmoon 's suggestion, I modified the to_gatt implementation as follow:

    fn from_gatt(data: &[u8]) -> Self {
        if data.len() != Self::SIZE {
            panic!("Bad len")
        }
        let ptr = data.as_ptr() as *const Self;
        unsafe { ptr.read_unaligned() }
    }

aaaand the code didn't stuck anymore 🥳 !

Many thanks for you support 👍