chrissnow commented 3 years ago

Description of defect

We are using an STM32L151CC on a custom PCB with an SX1262 radio; it's roughly based on a Nucleo-L152RE board + SX1262MB2xAS design.

Everything works reliably until we enable tickless, at which point joins and downlinks become unreliable (probably <50% success rate). The network is receiving and replying, so it must be a timing problem; most likely the RX1 and RX2 slots are not timed accurately enough.

I appreciate it's a bit custom but the underlying fault is within Mbed somewhere, and likely affects other targets.

I will try and reproduce it on a "normal" target.

We're already well behind schedule on the project, so any help would be greatly appreciated!

 Sending confirmed message.
[INFO][LMAC]: RTS = 6 bytes, PEND = 0, Port: 15
[DBG ][LMAC]: Frame prepared to send at port 15
[DBG ][LMAC]: TX: Channel=65, TX DR=4, RX1 DR=13

 6 bytes scheduled for transmission
[DBG ][LSTK]: Transmission completed
[DBG ][LSTK]: Awaiting ACK
[DBG ][LMAC]: RX1 slot open, Freq = 923900000
[DBG ][LMAC]: RX2 slot open, Freq = 923300000
[DBG ][LMAC]: ACK_TIMEOUT Elapses, Retrying ...
[DBG ][LMAC]: Trading datarate for range
[DBG ][LMAC]: TX: Channel=13, TX DR=3, RX1 DR=13
[DBG ][LSTK]: Transmission completed
[DBG ][LMAC]: RX1 slot open, Freq = 926300000
[DBG ][LMAC]: RX2 slot open, Freq = 923300000
[DBG ][LMAC]: ACK_TIMEOUT Elapses, Retrying ...
[DBG ][LMAC]: TX: Channel=11, TX DR=3, RX1 DR=13
[DBG ][LSTK]: Transmission completed
[DBG ][LMAC]: RX1 slot open, Freq = 925100000
[DBG ][LMAC]: RX2 slot open, Freq = 923300000
[DBG ][LMAC]: ACK_TIMEOUT Elapses, Retrying ...
[DBG ][LMAC]: Trading datarate for range
[DBG ][LMAC]: TX: Channel=8, TX DR=2, RX1 DR=12
[DBG ][LSTK]: Transmission completed
[DBG ][LMAC]: RX1 slot open, Freq = 923300000
[DBG ][LMAC]: RX2 slot open, Freq = 923300000
[ERR ][LSTK]: Retries exhausted for Class A device

Target(s) affected by this defect ?

STM32L151CC

Toolchain(s) (name and version) displaying this defect ?

ARMC6

What version of Mbed-os are you using (tag or sha) ?

mbed-os-6.9.0, though we had the same issue with 6.5.0 too.

What version(s) of tools are you using. List all that apply (E.g. mbed-cli)

mbed-cli 1.10.5

How is this defect reproduced ?

We are working out the easiest way for someone else to reproduce it, probably a Nucleo-L152RE board + SX1262MB2xAS shield.

An xDot might have the same problem but has a different radio.

We have this in our mbed_app

            "target.macros_add": ["MBED_TICKLESS=1"],
            "events.use-lowpower-timer-ticker": true,

chrissnow commented 3 years ago

We have confirmed that a Nucleo-L152RE board + SX1262MB2xAS shield has the same problem, using mbed-os-example-lorawan with:

"target.macros_add": ["MBED_TICKLESS=1"],
"events.use-lowpower-timer-ticker": true,

jeromecoutant commented 3 years ago

We agree that enabling TICKLESS with STM32L1 is not recommended.

chrissnow commented 3 years ago

> We agree that enabling TICKLESS with STM32L1 is not recommended.

That's rather bad news for us. Is there any particular reason, and is there anything we can do to make it work better? It seems to nearly work.

chrissnow commented 3 years ago

@jeromecoutant I've done a bit more testing with a WL55JC1, and enabling tickless on that also breaks things: joins and downlinks become very unreliable.

I'd hope that tickless should work on a WL55? The fault seems common across multiple families.

jeromecoutant commented 3 years ago

STM32WL is tickless by default https://github.com/ARMmbed/mbed-os/blob/master/targets/targets.json#L4236

chrissnow commented 3 years ago

Interesting, perhaps it's "events.use-lowpower-timer-ticker": true causing the trouble then.

I will try without it.

ciarmcom commented 3 years ago

Thank you for raising this detailed GitHub issue. I am now notifying our internal issue triagers. Internal Jira reference: https://jira.arm.com/browse/IOTOSM-3696

chrissnow commented 3 years ago

Something very odd is going on here. I thought I had the WL55 working well, but I'm not convinced anymore.

At the moment, tickless or not, I can't get the WL55 reliable. My only change to the LoRaWAN example is to add some keys and make each message confirmed (every 10 seconds).

Our custom target seems more reliable without tickless, but I'm not certain of it.

I'm still testing and will report back what I can. It may not be STM related in the end, but these are the only targets I can easily run LoRaWAN on.

chrissnow commented 3 years ago

It seems that #11502 is the true cause: tickless or not, the WL55 is unusable without "lora.max-sys-rx-error": 200, which is really rather wasteful energy-wise. I can't believe that we need to allow 200ms either side of the expected RX window to be able to reliably get a downlink.

@jeromecoutant @0xc0170 I can spend some time on this early next week but I'm not really sure how best to debug it, or exactly how it's meant to work in the first place...

I'm not sure if this is STM32 specific at the moment.

Without a fix for this, LoRaWAN support in Mbed is pretty much unusable. The workaround isn't really suitable for production.

jeromecoutant commented 3 years ago

the WL55 is not unusable without "lora.max-sys-rx-error": 200

@ludoch-stm Maybe you have some ideas? Thanks.

chrissnow commented 3 years ago

Having thought about this a bit more: if #11502 is correct about it being SF dependent, then perhaps the timing of the RX window opening is not the problem, given that it's the same for all SFs. What does differ is how long the radio is left in RX (or how long the stack waits for reception to complete). Could the stack be giving up midway through successfully receiving the data?

I will try and get some timing data off a logic analyser.

chrissnow commented 3 years ago

@ludoch-stm Apologies for chasing. Any help would be greatly appreciated; after 2 days of debugging I have not made any progress :-(

chrissnow commented 3 years ago

I have made some progress in debugging the problem.

Build configurations, all with tickless + use-lowpower-timer-ticker (leaving them out makes no difference):

xDot_L151CC, internal SX1272: works perfectly.
NUCLEO_L152RE + SX1272MB2xAS: works perfectly.
NUCLEO_F446RE + SX1272MB2xAS: works perfectly.

NUCLEO_L152RE + SX126xMB2xAS: SF7 rarely works, SF8 and SF9 work sometimes, SF10 is reliable; increasing max-sys-rx-error makes things worse.
NUCLEO_WL55JC1: similar to the L152, but max-sys-rx-error 200 makes it reliable.
NUCLEO_F446RE + SX126xMB2xAS: works perfectly.

Based on this, I think there are multiple issues: something differs in how the two radios work, and something is wrong with the WL55.

chrissnow commented 3 years ago

mbed_app attached to show how I tested it. You will need to add keys; the only change to the example is to send confirmed messages.

    retcode = lorawan.send(MBED_CONF_LORA_APP_PORT, tx_buffer, packet_len,
                           MSG_CONFIRMED_FLAG);

mbed_app.txt

chrissnow commented 3 years ago

@0xc0170 are you able to get any support from whoever is responsible for the LoRa drivers?

ludoch-stm commented 3 years ago

Hi Chris,

This rx-error parameter is very sensitive: it depends on the radio shield and affects the timing of the RX window opening, as you said previously. To understand its effect, see the attached drawing of the Window Timeout and Window Offset definitions. If it is set too high, the RX window could overlap the TX and/or RX2 windows, leading to unexpected behavior.

I see you are using an STM32WL with US regional parameters and SF7 to SF10 configs. Could you describe your setup in case I have missed some other configs?

On Mbed OS, STM32WL has been validated with max-sys-rx-error = 5, and in the STM32CubeWL package it is validated with a value of 10 in the LoRa stack. So setting this value to 200 seems really high.

The issue could come from several causes:

another task could be stalling TX in Mbed OS, which shifts its timing
the opening window of your gateway isn't synchronized with the device

Did you have the chance to check the TX and RX opening windows on a logic analyzer? Can you share them?

SystemRxErrorParam.pptx
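
To put the rx-error values discussed above in perspective, the following sketch expresses a margin in LoRa symbol periods. It is a simplified model and not the computation Mbed OS actually performs: it assumes the receive window is simply widened by the margin on each side ("200ms either side"), and `symbol_time_ms` and `extra_symbols_for_rx_error` are hypothetical helper names.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// LoRa symbol duration in milliseconds: T_sym = 2^SF / BW (BW in kHz).
double symbol_time_ms(unsigned sf, double bandwidth_khz)
{
    assert(sf >= 5 && sf <= 12 && bandwidth_khz > 0.0);
    return static_cast<double>(1u << sf) / bandwidth_khz;
}

// Hypothetical helper (assumption, not an Mbed OS API): number of extra
// symbol periods needed to cover a +/- rx_error_ms margin around the
// expected start of the downlink.
std::uint32_t extra_symbols_for_rx_error(unsigned sf, double bandwidth_khz,
                                         double rx_error_ms)
{
    const double t_sym = symbol_time_ms(sf, bandwidth_khz);
    return static_cast<std::uint32_t>(std::ceil(2.0 * rx_error_ms / t_sym));
}
```

Under this model, a 10 ms margin at SF7/125 kHz adds about 20 symbol periods to the window, while a 200 ms margin adds about 391.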

chrissnow commented 3 years ago

Hi,

My early findings might be a bit confusing, so I will try to clear a few things up, as we have been testing multiple regions and multiple targets and radios.

Let's simplify it a bit!

WL55JC, mbed-os-example-lorawan, EU868, TTN as network

If I build it and change the messages to always be confirmed, then once the SF lowers the downlinks are no longer received. However, max-sys-rx-error = 10 is enough to make the WL55 reliable.

So that is an easy fix for the WL55.

STM32L1

We have a custom board that is in production but waiting on a firmware release (the WL55JC didn't exist at the time; we will move to it later in the year). I will try finer increments of max-sys-rx-error and see if I can get it to work. The odd thing is that the SX1272 works perfectly, which really only leaves the SX126X driver, since the timing is done outside it.

We see a timeout IRQ even when we have a large max-sys-rx-error, which I think is because, despite the timeout in the RX command being set to forever, it will still time out after a number of symbol times?

I will get some logic traces and report back.

Thanks for that doc; it explains things much better than the other docs I have.

chrissnow commented 3 years ago

@ludoch-stm I have now narrowed the problem down further. max-sys-rx-error = 10 is enough to make the WL55 reliable on EU868. However, it is not reliable on US915. I'm just building it with 20 to see if it helps.

Have you validated US915 or just EU868?

chrissnow commented 3 years ago

More progress... It seems to be related to MBED_CONF_LORA_DOWNLINK_PREAMBLE_LENGTH, which defaults to 5. However, "lora.downlink-preamble-length": 9 fixes US915; 8 also seems fine. I'm not entirely sure of the correct number, though.

Things I have found suspicious

https://github.com/ARMmbed/mbed-os/blob/4c581120c559129c8b6aa834ee1d35020a0dd988/connectivity/lorawan/mbed_lib.json#L80-L82

https://github.com/ARMmbed/mbed-os/blob/96e19afdd196c6c99edd58fddd44e2c691cdca2f/connectivity/drivers/lora/COMPONENT_SX126X/SX126X_LoRaRadio.h#L110-L112

I wonder if that comment is true for the SX127X but not the SX126X? I haven't seen reference to this in the datasheet.

This change hasn't broken EU868 either.
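
For a sense of scale, the sketch below converts a preamble length in symbols into time. It assumes the setting is counted in LoRa symbols and that US915 downlinks use 500 kHz bandwidth (RX1 DR=13 being SF7 at 500 kHz) while EU868 downlinks typically use 125 kHz; `symbol_time_ms` and `preamble_time_ms` are hypothetical helpers, not Mbed OS functions.

```cpp
#include <cassert>

// LoRa symbol duration in milliseconds: T_sym = 2^SF / BW (BW in kHz).
double symbol_time_ms(unsigned sf, double bandwidth_khz)
{
    assert(sf >= 5 && sf <= 12 && bandwidth_khz > 0.0);
    return static_cast<double>(1u << sf) / bandwidth_khz;
}

// Hypothetical helper (assumption): on-air time of `symbols` preamble
// symbols, ignoring the sync word that follows the preamble.
double preamble_time_ms(unsigned symbols, unsigned sf, double bandwidth_khz)
{
    return symbols * symbol_time_ms(sf, bandwidth_khz);
}
```

At SF7/500 kHz, 5 symbols last about 1.28 ms and 8 symbols about 2.05 ms, whereas 5 symbols at SF7/125 kHz last about 5.12 ms. That shorter absolute budget may be one reason US915 is more sensitive to this setting than EU868, though nothing in this thread confirms it.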

ludoch-stm commented 3 years ago

As the issue is present in US915 band, did you check that your configuration is in Hybrid mode?

To do so, you should configure in mbed_config.h:

#define MBED_CONF_LORA_FSB_MASK {0x00FF, 0x0000, 0x0000, 0x0000, 0x0001}

Also, how many channels does your gateway have: 8 or 64?

chrissnow commented 3 years ago

We're using FSB2 hybrid, so channels 8-15 + 65, with an 8-channel gateway. We have an FSB mask to match that, but we use OTAA, so the NS dictates the channels after the join. The frequencies all look correct during operation with tracing enabled.
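
An FSB mask like the one quoted above can be read as five 16-bit words with one bit per channel. The sketch below decodes such a mask into channel numbers, assuming the usual US915 layout (bit n of word w is channel 16*w + n, with 72 channels in total); `enabled_channels` is a hypothetical helper, not part of Mbed OS.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (assumption): list the channel indices enabled by a
// five-word US915-style channel mask.
std::vector<unsigned> enabled_channels(const std::array<std::uint16_t, 5> &mask)
{
    // US915 defines 72 channels, so the upper byte of the fifth word is unused.
    assert((mask[4] & 0xFF00u) == 0);

    std::vector<unsigned> channels;
    for (std::size_t word = 0; word < mask.size(); ++word) {
        for (unsigned bit = 0; bit < 16; ++bit) {
            if (mask[word] & (1u << bit)) {
                channels.push_back(static_cast<unsigned>(word * 16 + bit));
            }
        }
    }
    return channels;
}
```

With {0x00FF, 0x0000, 0x0000, 0x0000, 0x0001} this yields channels 0-7 plus 64. A mask of {0xFF00, 0x0000, 0x0000, 0x0000, 0x0002}, which is an assumed FSB2 equivalent and not a value taken from this thread, yields channels 8-15 plus 65.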

This seems to work well for us, on the WL55 or NUCLEO_L152RE + SX126xMB2xAS:

"lora.max-sys-rx-error": 10,
"events.use-lowpower-timer-ticker": true,
"target.macros_add": ["MBED_TICKLESS=1"],
"lora.downlink-preamble-length": 9

ludoch-stm commented 3 years ago

OK, good news if it works now in your environment! Perhaps the topic of the conversation can be changed then :-)

chrissnow commented 3 years ago

Done.

Are you going to handle the WL55 max-sys-rx-error needing to be 10? Any thoughts on the correct preamble length?

adbridge commented 3 years ago

@chrissnow @jeromecoutant can this be closed after the merging of #14481?

jeromecoutant commented 3 years ago

I would say yes...

chrissnow commented 3 years ago

The WL55 behaves with #14481, but other targets do not. I'm pretty sure the default preamble is wrong for all SX126X targets in US915; it really needs confirmation from someone who knows more about LoRa than me.

hallard commented 9 months ago

@chrissnow Thanks, I'm not alone with this issue. Can you confirm that on STM32WL only "lora.downlink-preamble-length": 9 is needed, now that #14481 fixes rx-error?

chrissnow commented 9 months ago

@hallard It's been a few years but yes I think so.

hallard commented 9 months ago

Thanks, will report back. We have deployed a lot in the EU, but it's our first try in the US, and these downlink issues on STM32WL drove us mad.

hallard commented 9 months ago

I've just done some tests and looked into the code. There are 2 preamble lengths in Mbed: one for uplink, the other for downlink.

Looking at CMakeLists.txt, the values for all frequency plans are as follows:

MBED_CONF_LORA_DOWNLINK_PREAMBLE_LENGTH=5
MBED_CONF_LORA_UPLINK_PREAMBLE_LENGTH=8

So uplink follows the specification with 8, but downlink does not, with 5.

The default compiled program for STM32WL has:

max-sys-rx-error = 10 (because of fix in #14481)
uplink preamble = 8
downlink preamble = 5

So my guess is that, as @chrissnow stated, lora.downlink-preamble-length should be aligned with uplink and set to 8.

We flashed 5 devices in the US with this new setting and they got their downlink first time, like a charm; before, some never got it.

I can do a PR, @jeromecoutant let me know if I'm doing it on all devices or just on STM32WL in /connectivity/lorawan/mbed_lib.json

jeromecoutant commented 9 months ago

> I can do a PR, @jeromecoutant let me know if I'm doing it on all devices or just on STM32WL in /connectivity/lorawan/mbed_lib.json

If you have verified the new DL value only with STM32WL, maybe it is safer to update only STM32WL?

hallard commented 9 months ago

Agreed, that's safer, even if I'm pretty sure it would improve downlink for other targets. I would really like to understand why this value was set to 5 instead of 8 as in the specification; there is surely a reason that we are not aware of.

hallard commented 9 months ago

and here we go #15459

ARMmbed / mbed-os

SX126X: Preamble Length #14449
