[LoRaWAN] Transmission stops after about a day

SloMusti commented 6 years ago

I have been testing the robustness of the Murata module using the B-L072Z-LRWAN1 with this core, version 0.0.7, and have run into an issue where transmissions to the gateway stop after about a day. This has been confirmed on all boards, with multiple gateways in two different cities, to exclude other factors.

The code running on the device is attached below; it is a simple transmission every 10 s. I have yet to capture the serial log up to the crash, but it does not appear to be an issue with the main code loop, which keeps executing.

Any suggestions or ideas for debugging this are welcome, and it would help if someone else could test this independently. loradiscoveryttnworking.txt https://github.com/GrumpyOldPizza/ArduinoCore-stm32l0/files/2238116/loradiscoveryttnworking.txt

GrumpyOldPizza commented 6 years ago

Thanx. I'll give it a try.

SloMusti commented 6 years ago

Furthermore, this issue has also been reported by @s54mtb, using firmware unrelated to this repository, so it may well be something STM/Murata related: https://github.com/s54mtb/LoRaDunchy/tree/master/sw

SloMusti commented 6 years ago

Tracing possible causes now with serial logging and a power analyzer. One thing is apparent now: the data rate changes due to ADR. I will try to correlate whether that is an issue.

TRANSMIT( TimeOnAir: 74311, NextTxTime: 0, MaxPayloadSize: 51, DR: 0, TxPower: 16.0dbm, UpLinkCounter: 63, DownLinkCounter: 0 )
TRANSMIT( TimeOnAir: 75467, NextTxTime: 0, MaxPayloadSize: 51, DR: 0, TxPower: 16.0dbm, UpLinkCounter: 64, DownLinkCounter: 0 )
TRANSMIT( TimeOnAir: 76623, NextTxTime: 0, MaxPayloadSize: 242, DR: 5, TxPower: 12.0dbm, UpLinkCounter: 65, DownLinkCounter: 0 )
TRANSMIT( TimeOnAir: 76675, NextTxTime: 0, MaxPayloadSize: 242, DR: 5, TxPower: 12.0dbm, UpLinkCounter: 66, DownLinkCounter: 0 )

Messages are received by two gateways:
{
  "time": "2018-07-28T18:13:20.698336253Z",
  "frequency": 867.9,
  "modulation": "LORA",
  "data_rate": "SF7BW125",
  "coding_rate": "4/5",
  "gateways": [
    {
      "gtw_id": "XXXX",
      "gtw_trusted": true,
      "timestamp": 929034364,
      "time": "2018-07-28T18:13:20Z",
      "channel": 7,
      "rssi": -109,
      "snr": 6.25,
      "latitude": 46.554905,
      "longitude": 15.635378
    },
    {
      "gtw_id": "YYYY",
      "gtw_trusted": true,
      "timestamp": 4006964724,
      "time": "2018-07-28T18:13:20Z",
      "channel": 7,
      "rssi": -73,
      "snr": 9.75
    }
  ]
}
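
For readers following the log above, the jump from DR 0 to DR 5, together with MaxPayloadSize going from 51 to 242 and the gateway reporting SF7BW125, matches the EU868 data-rate table in the LoRaWAN 1.0.2 regional parameters (payload limits for the case without repeater support). A small lookup, sketched here in C, makes the mapping explicit. The names `eu868_dr_entry`, `eu868_dr_table` and `eu868_lookup` are illustrative assumptions and are not part of the core.

```c
#include <stddef.h>
#include <stdint.h>

/* EU868 data rates DR0..DR5 (LoRa, 125 kHz), per the LoRaWAN 1.0.2
 * regional parameters. max_payload is the application payload limit
 * when no repeater support is required. Names are illustrative only. */
typedef struct {
    uint8_t dr;
    uint8_t spreading_factor;
    uint16_t bandwidth_khz;
    uint8_t max_payload;
} eu868_dr_entry;

static const eu868_dr_entry eu868_dr_table[] = {
    {0, 12, 125, 51},
    {1, 11, 125, 51},
    {2, 10, 125, 51},
    {3, 9, 125, 115},
    {4, 8, 125, 242},
    {5, 7, 125, 242},
};

/* Returns the table entry for a data rate, or NULL if dr is not DR0..DR5. */
static const eu868_dr_entry *eu868_lookup(uint8_t dr)
{
    size_t n = sizeof(eu868_dr_table) / sizeof(eu868_dr_table[0]);
    for (size_t i = 0; i < n; i++) {
        if (eu868_dr_table[i].dr == dr) {
            return &eu868_dr_table[i];
        }
    }
    return NULL;
}
```

With this table, `eu868_lookup(0)` describes the SF12 lines in the log and `eu868_lookup(5)` the SF7 lines that follow the ADR change.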

SloMusti commented 6 years ago

Observed the hang now with serial attached. This time the transmissions stopped when ADR was supposed to change to DR5:

TRANSMIT( TimeOnAir: 70843, NextTxTime: 0, MaxPayloadSize: 51, DR: 0, TxPower: 16.0dbm, UpLinkCounter: 60, DownLinkCounter: 0 )
TRANSMIT( TimeOnAir: 71999, NextTxTime: 0, MaxPayloadSize: 51, DR: 0, TxPower: 16.0dbm, UpLinkCounter: 61, DownLinkCounter: 0 )
TRANSMIT( TimeOnAir: 73155, NextTxTime: 0, MaxPayloadSize: 51, DR: 0, TxPower: 16.0dbm, UpLinkCounter: 62, DownLinkCounter: 0 )
TRANSMIT( TimeOnAir: 74311, NextTxTime: 0, MaxPayloadSize: 51, DR: 0, TxPower: 16.0dbm, UpLinkCounter: 63, DownLinkCounter: 0 )
TRANSMIT( TimeOnAir: 75467, NextTxTime: 0, MaxPayloadSize: 51, DR: 0, TxPower: 16.0dbm, UpLinkCounter: 64, DownLinkCounter: 0 )
TRANSMIT( TimeOnAir: 76623, NextTxTime: 0, MaxPayloadSize: 51, DR: 0, TxPower: 16.0dbm, UpLinkCounter: 65, DownLinkCounter: 0 )

GrumpyOldPizza commented 6 years ago

Is there a way for you to redirect the output to a UART instead of USB? I'd like to isolate whether it is perhaps a USB issue. Looks like you see this after 65 uplinks. Does this always happen at that point?

SloMusti commented 6 years ago

I can do that. However, it does not always appear at this point. I have also disabled ADR and the problem remains, so it may not be directly correlated.

SloMusti commented 6 years ago

The logging has been via serial (UART) and the fault persists, so USB is definitely not related to the issue.

We have now tested on 4 devices, all behaving exactly the same. @GrumpyOldPizza can you please let me know if you replicate the issue. Note we are using 868MHz EU band.

GrumpyOldPizza commented 6 years ago

I have not been able to reproduce the issue.

Is it possible that it is gateway related ?

SloMusti commented 6 years ago

@GrumpyOldPizza this was tested on 5+ gateways in different cities, running on Raspberry Pi + RAK831, iC880A, or Laird indoor. The common factor is that they all use TheThingsNetwork servers. Are you using those, Loriot, or something else?

GrumpyOldPizza commented 6 years ago

I am using Multitech gateways.

SloMusti commented 6 years ago

@GrumpyOldPizza ok, but with what backend?

s54mtb commented 6 years ago

Hi! I had similar issues with Murata modules and the ST LoRaWAN stack.

I was running 5 different sensors using the muRata Type ABZ module and the LoRaWAN stack from STMicro.

The application hangs after a random time, from a few hours to several days (not more than 3 days). A module may hang after only 50 packets sent, but then on another run send more than 1k packets.

The hardware used for testing: http://e.pavlin.si/2018/05/07/lora-module-in-dil-form/

Complete sensor used for the testing: http://e.pavlin.si/2018/07/03/particle-sensor-with-lora/

The latest software was committed here: https://github.com/s54mtb/LoRaDunchy/tree/master/sw/Projects/PM-Sensor

My changes compared to the demo application:

power down is not being used, since the PM sensor consumes quite some power and everything is powered constantly.
the duty cycle is 30 seconds (APP_TX_DUTYCYCLE 30000)
VCOM is not being used
I2C and UART communication for sensors has been added (no dynamic memory/heap is being used)
a counter has been added which triggers a re-join after half an hour. Without that, none of the modules worked longer than a few hours. Rejoining didn't resolve the issue; it just prolonged the time until data stopped being sent.

When a module hangs, LoraSend() is still being executed, but no signal gets through (TTN receives no data). The MCU is alive, timers are OK, sensor readings are OK.

I also tested sending without any sensor interaction (just sending constant numbers instead of actual sensor readout) and it had no influence on the occurrence of the issue.

Gateways and backend are the same as @SloMusti reported above.

GrumpyOldPizza commented 6 years ago

Let me recheck this on my local gateways. My last tests were about a week long with testing recovery from power outages. But I did not see anything like this. However this was US915.

s54mtb commented 6 years ago

After a long field testing period I got some results:

The LoraSend() function included in the STM examples is called from the RTC ISR, which causes trouble with pending IRQs if interrupt-driven UART I/O is called from this function. Solved this by using the UART between LoRa sends (while waiting for the next RTC alarm timeout).
Added external pull-up resistors for I2C: the internal pull-ups are not OK.
Upgraded the ST LoRaWAN stack to version 1.2.0: running the provided examples on the demo board with the STM sensor "shield" worked for 10+ days without an issue.
Removed all UART code for VCOM/diagnostic output and now use the UART for the HPM (particle) sensor only. The HPM sensor seems to freeze from time to time.
Added an external transistor for switching power to the HPM sensor. The main purpose of this is to reset the HPM sensor when an error is detected during readout, because the HPM sensor has no "reset" command. This improved the reliability of the operation.

It seems the major issues were in the peripherals and not in the stack, and mostly related to proper configuration of the MCU/NVIC. That was not documented properly in the first versions of the STM stack. The latest updated documentation provided by STM is much more detailed, and it helped in solving the issues with the NVIC.
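
The first finding above, doing interrupt-driven I/O from inside an ISR, is commonly avoided by letting the ISR only set a flag and doing the actual work from the main loop. The following is a minimal sketch of that pattern in C, not the code from the LoRaDunchy repository. The names `rtc_alarm_isr`, `main_loop_poll` and `app_send` are hypothetical; `app_send` stands in for whatever routine the application uses to read sensors and send (for example LoraSend()).

```c
#include <stdbool.h>
#include <stdint.h>

/* Set by the RTC alarm ISR, consumed by the main loop. */
static volatile bool tx_due = false;

/* Counts sends so the sketch can be exercised on a host machine. */
static uint32_t sends_done = 0;

/* Hypothetical stand-in for the application's sensor read and send. */
static void app_send(void)
{
    sends_done++;
}

/* ISR body: only record that a transmission is due, do no I/O here. */
static void rtc_alarm_isr(void)
{
    tx_due = true;
}

/* Called repeatedly from the main loop, in thread context, where
 * interrupt-driven UART and I2C transfers can complete normally. */
static void main_loop_poll(void)
{
    if (tx_due) {
        tx_due = false;
        app_send();
    }
}
```

Because the send now runs outside the ISR, UART and I2C interrupts are no longer held pending behind the RTC interrupt while it executes.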

GrumpyOldPizza commented 6 years ago

So this is really not related to ArduinoCore-stm32l0. Again, I have not seen those problems here at all.

SloMusti commented 6 years ago

@GrumpyOldPizza I was able to observe such a problem with ArduinoCore-stm32l0: the device stops transmitting after a while. Can you please point me to which version of the STM LoRa stack this core is running, and where it would be best to evaluate interrupt priorities, should this really be the cause of the hangups after a while?

GrumpyOldPizza commented 6 years ago

The stack is derived from LoRaMac-node 4.4.1. I doubt that it's the interrupt priorities. RTC based timeouts and DIO IRQ handling, which drive the stack, are escalated to a PENDSV callback. So are common peripheral callbacks, like "Serial.onReceive()" (which you are unlikely to use).

There is of course always the chance of another bug somewhere. But what strikes me as curious is that you are pretty much the only one seeing this issue.

SloMusti commented 6 years ago

@GrumpyOldPizza Just checking, did you test most of the nodes in the US or EU bands, should there be anything related to that, which I doubt.

GrumpyOldPizza commented 6 years ago

Obviously, yes.

SloMusti commented 6 years ago

I have performed the experiment in the following configuration:

B-L072Z-LRWAN1 board 1: LoraWAN-TTN-OTAA example code
B-L072Z-LRWAN1 board 2: LoraWAN-TTN-OTAA example code with NO Serial

Both crashed after about 4000 messages, almost simultaneously. Repeating the experiment now to validate.

GrumpyOldPizza commented 6 years ago

Did you use "setDutyCycle(false)" or the default ?

GrumpyOldPizza commented 6 years ago

Ok, used the LoRaWAN_OTA.ino example with "setDutyCycle(false)". After almost 24 hours and 8500 transmissions, it's still alive on B-L072Z-LRWAN1. This is on EU868.

GrumpyOldPizza commented 6 years ago

Tried a 2nd board with ADR off, hence always DR_0. That one also survived a day without a crash. The first board is now on day 2 1/2 with a message every 10 seconds. Also no crash or anything.

Unless there is a good reason to keep this open, I am gonna close the issue.

SloMusti commented 6 years ago

I am repeating the same test as you have defined, will need to wait a day or so to see if a crash occurs and then report back.

SloMusti commented 6 years ago

Actually, I have just now observed a crash on both devices with the LoRaWAN_OTA.ino example with "setDutyCycle(false)". One had sent 584 messages, the other 364. The next thing to try is ADR off, to see whether that has any effect.

GW config: RAK831 on RPi Lorix One

GrumpyOldPizza commented 6 years ago

I am not sure what to do. It works fine here with 2 B-L072Z-LRWAN1 boards, as well as all the others. I have no other reports from anybody else about sudden crashes after a short period of time.

Obviously I am using a different gateway (and am on Linux).

What are the last 50 messages printed out via serial console ?

Otherwise I'd suggest you contact me via grumpyoldpizza@gmail.com so that you can arrange to send me your hardware (RAK831 gateway and one of the failing B-L072Z-LRWAN1 boards).

GrumpyOldPizza commented 6 years ago

Ok, got a repro after 3 days. I am not positive it's the same issue as you got, but it's possible. Essentially a corrupted frame on RX1 will keep LoRaWAN.busy() set to true (triggered for me by ADR). I tend to believe that a multicast frame not addressed to this node may cause this as well.

In general it may be possible that the gateway sends some invalid packet (or LoRaWAN 1.1 extension to a LoRaWAN 1.0.2 node), which might trip up the LoRaWAN class as well.

That will take a few days to sort out.

SloMusti commented 6 years ago

Well spotted, thank you for the effort.

I believe it would also be good to figure out a watchdog, so that the device can recover on its own if such a problem appears after it has been deployed somewhere inconvenient. Did you happen to look into this yet with this core?

GrumpyOldPizza commented 6 years ago

A watchdog will not help there. It's an internal bug where the code waits for a McpsIndication that either never arrives, or arrives with an error that was not documented originally (multicast).

Should be half way simple to fix. But I need to crosscheck all code paths in LoRaMac-node to see whether other errors can pop up (that are not handled properly). My bigger problem is how to test this. Where I am located physically there are no other gateways close by, only some faint US915 ones ... So checking out those boundary conditions is tricky.
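
Since the MCU and the main loop stay alive in this failure mode, a hardware watchdog fed from the loop would indeed never fire. What an application could do in principle is watch the busy state itself. The following is a minimal sketch of that idea in C, assuming a hypothetical `stall_detector` that is fed the current busy flag (for example the result of LoRaWAN.busy()) and a millisecond tick. The type, the function names and the timeout value are illustrative assumptions and are not part of the core.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t busy_since_ms; /* tick at which busy was first seen */
    uint32_t timeout_ms;    /* how long busy may last before it counts as a stall */
    bool was_busy;
} stall_detector;

static void stall_init(stall_detector *d, uint32_t timeout_ms)
{
    d->busy_since_ms = 0;
    d->timeout_ms = timeout_ms;
    d->was_busy = false;
}

/* Feed the current busy flag and tick. Returns true once busy has been
 * continuously set for longer than timeout_ms. Unsigned subtraction keeps
 * the comparison correct across wraparound of the tick counter. */
static bool stall_update(stall_detector *d, bool busy, uint32_t now_ms)
{
    if (!busy) {
        d->was_busy = false;
        return false;
    }
    if (!d->was_busy) {
        d->was_busy = true;
        d->busy_since_ms = now_ms;
        return false;
    }
    return (uint32_t)(now_ms - d->busy_since_ms) > d->timeout_ms;
}
```

What to do once a stall is reported (log state, rejoin, or reset) is left to the application; the sketch only detects the condition and does not fix the underlying bug.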

SloMusti commented 6 years ago

So far I have observed regular crashes at my location, so I am happy to run tests when necessary. Alternatively I can provision an RPi so that you can upload remotely and test. Would that work?

GrumpyOldPizza commented 6 years ago

Since the issue has to do with other LoRaWAN traffic ... doesn't make sense to send me anything. I had assumed a Gateway issue, or a simple hardware issue with B-L072Z-LRWAN1 before.

I'll test locally on US915 and see whether the fix I have survives a good chunk of packets (switched to 5 second intervals).

The github will be updated in a few hours after the first shakedown.

GrumpyOldPizza commented 6 years ago

I have updated the repository with the proper fix. Will test over night (and the next few days) whether it does not introduce another issue. So no updated json file yet.

Mind either installing via github, or simply copying the updated LoRaWAN.cpp into the proper place?

SloMusti commented 6 years ago

@GrumpyOldPizza I have been testing your code for 2 days now and it still works on two devices.

SloMusti commented 6 years ago

@s54mtb reports another problem with frame counters, seen not with this core but with the STM stack directly, where the LoRaMac hangs upon reaching the maximal frame counter value 0xffff. This has been reproduced using LoRaMacSetFCntUp() and including "LoRaMacFCntHandler.h". It would be good to test whether the same thing happens with this core.

Workaround at the moment is:

    uplinkcounter = GetUplinkCounter();
    if (uplinkcounter >= 0x0000ffff) {
        NVIC_SystemReset();   // Reset everything
    }

GrumpyOldPizza commented 6 years ago

I have here 4 boards (1x B-L072Z-LRWAN1 and 3x Grasshopper) doing various different things, pinging on the same Gateway. No failure so far.

GrumpyOldPizza commented 6 years ago

I think 1.0 and 1.0.1 allow for 16 bit counters. 1.0.2 is 32 bit clean per the standard.

However, in the packet itself only the lower 16 bits get transmitted.

The code in LoRaMac-node 4.4.1 is ok. The one in 4.4.2 is busted:

    // Add difference, consider roll-over
    fCntDiff = ( int32_t )macMsg->FHDR.FCnt - ( int32_t )( previousDown & 0x0000FFFF );

You cannot compute the difference in int32_t and expect the 16 bit rollover to come out right through wraparound.
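
To illustrate the point about wraparound: only the low 16 bits of the frame counter travel in the frame, so the difference to the previously stored counter has to be taken in 16-bit arithmetic for a rollover from 0xFFFF to 0x0000 to come out as +1. The sketch below contrasts the two approaches. `fcnt_diff32`, `fcnt_diff16` and `fcnt_extend` are hypothetical helper names, not functions from LoRaMac-node.

```c
#include <stdint.h>

/* Difference computed in 32-bit arithmetic, as in the line quoted above:
 * a rollover from 0xFFFF to 0x0000 yields -65535 instead of +1. */
static int32_t fcnt_diff32(uint16_t rx_fcnt, uint32_t previous)
{
    return (int32_t)rx_fcnt - (int32_t)(previous & 0x0000FFFF);
}

/* Difference reduced modulo 2^16 and then reinterpreted as signed, so a
 * rollover yields +1. The final conversion to int16_t relies on the usual
 * two's complement behaviour of common compilers. */
static int16_t fcnt_diff16(uint16_t rx_fcnt, uint32_t previous)
{
    return (int16_t)(uint16_t)(rx_fcnt - (uint16_t)(previous & 0x0000FFFF));
}

/* Rebuild the full 32-bit counter from the stored value and the received
 * low 16 bits. */
static uint32_t fcnt_extend(uint16_t rx_fcnt, uint32_t previous)
{
    return previous + (uint32_t)(int32_t)fcnt_diff16(rx_fcnt, previous);
}
```

With a stored counter of 0x0000FFFF and a received value of 0x0000, the 32-bit difference is -65535 while the 16-bit difference is 1, and `fcnt_extend` returns 0x00010000.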

SloMusti commented 6 years ago

No failures on my side either, currently at 80000+ frames on two devices.

GrumpyOldPizza commented 6 years ago

Ok, closing out. Here it's been alive for a week or so, every 5 seconds ...
