Lora-net / LoRaMac-node

Reference implementation and documentation of a LoRa network node.

LoRaWAN gets stuck in LORAMAC_STATUS_BUSY and consumes a lot of energy #1048

Closed CA-s1 closed 1 year ago

CA-s1 commented 3 years ago

Hi,

We are using the SX1272 (Region EU868, Class A) and have been successful with join/uplink/downlink; it works with two different gateways (Kerlink and Objenious).

I am facing a problem where LoRaWAN gets stuck in LORAMAC_STATUS_BUSY. Every time this issue occurs I reset the LoRaMAC manually (I couldn't find a better way):


void LORAMAC_reset()
{
    // Clear the TX-running flag so the stack stops reporting busy
    MacCtx.MacState &= ~LORAMAC_TX_RUNNING;
    // Record the retries performed so far in the pending confirm
    MacCtx.McpsConfirm.NbRetries = MacCtx.AckTimeoutRetriesCounter;
    MacCtx.McpsConfirm.AckReceived = false;
    MacCtx.McpsConfirm.TxTimeOnAir = 0;
    // Flag the MAC as done so the pending request gets finalized
    MacCtx.MacFlags.Bits.MacDone = 1;
}
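
For context, here is a minimal sketch of how such a workaround might be called from the application side, assuming the function above lives in LoRaMac.c (MacCtx is file-static there) and is exported; the retry counter and threshold below are placeholders, not stack parameters:

#include <stdint.h>
#include "LoRaMac.h"

extern void LORAMAC_reset( void );   // the workaround above, assumed exported

// Hypothetical call site: if send attempts keep being rejected with
// LORAMAC_STATUS_BUSY, eventually fall back to the manual reset.
#define MAX_CONSECUTIVE_BUSY    10   // assumed threshold

void AppTrySend( McpsReq_t *mcpsReq )
{
    static uint32_t busyCount = 0;

    if( LoRaMacMcpsRequest( mcpsReq ) == LORAMAC_STATUS_BUSY )
    {
        if( ++busyCount > MAX_CONSECUTIVE_BUSY )
        {
            LORAMAC_reset( );       // last-resort recovery
            busyCount = 0;
        }
    }
    else
    {
        busyCount = 0;              // request accepted, clear the counter
    }
}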

Could you help me identify the source of this issue? Is there a timer or an interrupt in charge of resetting MacCtx.MacState? As an additional detail, I've noticed that battery consumption is dramatically higher when this issue occurs.

Thanks for your help.

dmeehan1968 commented 3 years ago

A brief note to say that I think I'm experiencing something similar, in a situation where the gateway is lost. In my use case, I am normally sending uplinks unconfirmed, but with a periodic LinkCheckReq to validate connectivity.

There is an edge case, depending on device configuration, where this causes confirmed uplinks to be sent to log errors, and the confirmed uplinks cause device-side ADR to degrade the data rate over time to DR1. Ultimately the device gives up and attempts a rejoin, but LoRaMacIsBusy() then reports the stack as permanently busy (as if the last message were stuck and being processed indefinitely), and as a result the app code doesn't attempt the rejoin, assuming it would fail. I can change this logic to work around it.

It's a rare event that takes some hours of gateway outage to manifest, and it currently does so in only 1 device out of 25 under test. I have some serial logging that has led me to this conclusion, but I don't have the device under a debugger to break in and check what it's doing under the hood.

dmeehan1968 commented 3 years ago

Further to my last comment, I wanted to outline my application's behaviour and the results of the latest testing.

We have a separate operation to validate the connection to the network server, to facilitate rejoin when needed. In our above test, this occurs AFTER we have already seen the degradation of data rate.

To better understand the issue, we made changes such that:

This seems to demonstrate that there is an issue whereby a confirmed message can remain stuck in the stack when there is no response from the NS (due to gateway outage) and end-device ADR has degraded the data rate to DR0. One would assume that after the maximum retries (NbTrials=8 on all MCPS requests) the uplink would be failed and the stack made available again. This does not appear to be the case.
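
For reference, a confirmed uplink with NbTrials on the 4.4.x API looks roughly like the sketch below; the port, payload size and data rate are illustrative values, not taken from the report:

#include <stdint.h>
#include "LoRaMac.h"

static uint8_t errorMsg[40];   // illustrative ~40-byte error payload

static LoRaMacStatus_t SendConfirmedError( void )
{
    McpsReq_t mcpsReq;

    mcpsReq.Type = MCPS_CONFIRMED;
    mcpsReq.Req.Confirmed.fPort = 2;                          // example port
    mcpsReq.Req.Confirmed.fBuffer = errorMsg;
    mcpsReq.Req.Confirmed.fBufferSize = sizeof( errorMsg );
    mcpsReq.Req.Confirmed.NbTrials = 8;                       // the retry limit discussed above
    mcpsReq.Req.Confirmed.Datarate = DR_0;                    // ignored when ADR is enabled

    // Expectation: after NbTrials unanswered attempts the request is failed
    // via McpsConfirm and the stack becomes available again; the report
    // above suggests it can instead remain busy.
    return LoRaMacMcpsRequest( &mcpsReq );
}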

It is unclear why the data rate for unconfirmed uplinks degrades from the point of gateway outage, or if this is relevant to the stack going permanently busy.

djaeckle commented 3 years ago

Hi all,

Thanks for the report. May I ask which version of the stack you are using?

@CA-s1 could you provide some more information about when this situation happens and what leads to it? Is it a similar procedure to the one @dmeehan1968 describes?

@dmeehan1968 have you made any changes to the default parameters of the region? I'm trying to identify a way to reproduce it. Were you able to find a way to reproduce it deterministically?

dmeehan1968 commented 3 years ago

May I ask which version of the stack you are using?

4.4.5 - Git Commit 118771fe

have you made any changes to the default parameters of the region?

No

I'm trying to identify a way to reproduce it. Were you able to find a way to reproduce it deterministically?

No. The problem only manifested itself when we had sensor debug uplinks enabled, which vastly increases the number of uplinks (1 every 10 seconds, as opposed to the likely uplink rate of 1 per hour in production).

I provided a detailed description above of what I think is going on, and a test app could be constructed to do something similar.

Essentially (a rough sketch of such a test loop follows this list):

  1. End device ADR enabled, EU868
  2. Uplinks at DR0 at a rate of 1 per 10 seconds; the payload length should exceed the max DR0-2 payload of 64 bytes (I think we use 67).
  3. Switch off gateway, or otherwise prevent downlinks arriving at the end device
  4. Observe the data rate stepping down (DR5, DR4, DR3, ...) every 32 uplinks (I think this is triggered when no downlink is received).
  5. When DR2 is reached, continue sending packets, which will be rejected by the stack due to payload length. In response to each rejection, send an uplink with a zero-length payload. This causes the data rate to continue stepping down over time.
  6. Once DR0 is achieved, send a confirmed uplink (length may not be significant, I think our error message is ~40 bytes)
  7. The confirmed uplink will take a number of retries before failure, but I believe this is the point at which the stack reports busy continually (not just during the retries).
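
Roughly, such a test loop could look like the following; this is a hedged sketch modeled on the classA example of the 4.4.x API (the payload size and port are illustrative, and the 10-second timer, join handling and callbacks are not shown):

#include <stddef.h>
#include <stdint.h>
#include "LoRaMac.h"

#define APP_PORT          2     // illustrative application port
#define APP_PAYLOAD_SIZE  67    // exceeds the DR0-2 max payload (step 2)

static uint8_t appData[APP_PAYLOAD_SIZE];

// Called every 10 seconds by an application timer (not shown).
static void SendDebugUplink( void )
{
    McpsReq_t mcpsReq;
    LoRaMacTxInfo_t txInfo;

    if( LoRaMacQueryTxPossible( APP_PAYLOAD_SIZE, &txInfo ) != LORAMAC_STATUS_OK )
    {
        // Payload no longer fits at the current data rate (step 5): send an
        // empty frame to flush MAC commands, as the classA example does.
        mcpsReq.Type = MCPS_UNCONFIRMED;
        mcpsReq.Req.Unconfirmed.fBuffer = NULL;
        mcpsReq.Req.Unconfirmed.fBufferSize = 0;
        mcpsReq.Req.Unconfirmed.Datarate = DR_0;    // ignored when ADR is enabled
    }
    else
    {
        mcpsReq.Type = MCPS_UNCONFIRMED;
        mcpsReq.Req.Unconfirmed.fPort = APP_PORT;
        mcpsReq.Req.Unconfirmed.fBuffer = appData;
        mcpsReq.Req.Unconfirmed.fBufferSize = APP_PAYLOAD_SIZE;
        mcpsReq.Req.Unconfirmed.Datarate = DR_0;    // ignored when ADR is enabled
    }

    LoRaMacMcpsRequest( &mcpsReq );
}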

As described above, at some point in time our application performs a link check process with randomised backoff that ultimately fails due to the lack of downlinks/LinkCheckAns (because of gateway outage) and then goes into a rejoin process with randomised backoff. Even once the gateway connectivity is restored, these join calls fail with stack busy, as a result of state induced by the process described here. As a result, the device does not rejoin in a reasonable time period.
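
For reference, the application-level link check described here is presumably issued via an MLME request along these lines (a minimal sketch; the randomised backoff and the failure counting live in application code and are not shown):

#include "LoRaMac.h"

// Ask for a LinkCheckReq to be piggybacked on the next uplink; the answer,
// or its absence, is then observed in the MlmeConfirm callback.
static void RequestLinkCheck( void )
{
    MlmeReq_t mlmeReq;

    mlmeReq.Type = MLME_LINK_CHECK;
    LoRaMacMlmeRequest( &mlmeReq );
}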

One of the last things we tried was to leave a device in this state to see what would happen (rather than manually resetting it), and on at least one occasion the device rejoined the network after a delay of several days (approx. 5-7, I don't recall exactly). The device would have been attempting a rejoin roughly once per hour (the actual interval was randomised), likely at DR2 on each retry. I believe the stack was reporting busy during this time, rather than duty-cycle restricted.

Our device is now in production and we've not seen further problems, although that is mostly because in production the devices do not use the sensor debug facility that appears to lead to this condition (due to the uplink rate and lack of downlinks). Also, we removed the sending of zero-length payload messages, as we couldn't see the benefit of them.

It should be noted that in our app, when sensor debug uplinks are enabled and a viable gateway is present, we don't see the data rate degradation that occurs when the gateway is not present. This seems odd: in both cases the end device is 'blind' to the gateway state, yet when the GW is off, the end device seems to 'know' this and steps the data rate down. I can only assume that there is some periodic back channel between end device and GW that allows the end device to sense that the gateway is in range/operational.

If that actually is the case, then it seems odd that this is not exposed in the API, because the alternative is an elaborate link check/rejoin scheme at application level. It would seem super easy for many developers not to have identified the need, leading to the possibility of end devices permanently going dark (without some form of intervention). Indeed, it's only the aggressive uplink rate of our debug facility that brought the possibility of this failure to light; I suspect many apps/developers would not be aware of rare events (depending on uplink rate) that might occur over multiple days and could lead to end devices going permanently busy/dark.
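
For what it's worth, the behaviour described looks consistent with LoRaWAN's device-side ADR backoff (after a number of uplinks without any downlink, the device sets ADRACKReq and then steps the data rate down) rather than a dedicated back channel. Either way, an application can observe the degradation directly by polling the current data rate via the MIB; a minimal sketch, assuming the 4.4.x MIB API and an arbitrary DR_2 trigger level:

#include <stdbool.h>
#include "LoRaMac.h"

// A steadily falling uplink data rate while no downlinks arrive is a hint
// that connectivity may have been lost. The DR_2 threshold is an assumption.
static bool ConnectivityLooksDegraded( void )
{
    MibRequestConfirm_t mibReq;

    mibReq.Type = MIB_CHANNELS_DATARATE;
    if( LoRaMacMibGetRequestConfirm( &mibReq ) != LORAMAC_STATUS_OK )
    {
        return false;
    }
    return ( mibReq.Param.ChannelsDatarate <= DR_2 );
}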

truongn1 commented 3 years ago

Hi @djaeckle,

We are also using 4.4.5 and are facing the same issue with confirmed messages. To reproduce the issue, simply do the following:

You can eventually catch this issue.

InnotasElektronik commented 2 years ago

Hi there, we have the same behavior. It's easy to reproduce: all uplinks confirmed (ACK requested) and a transmission interval of 2 minutes. First the device joins, then the connection to the GW is interrupted. The device starts retransmitting, 8 repetitions per uplink, and the SF moves from 7 to 12. After some time, the stack reports busy after the 4th attempt. In the debugger, the analysis showed that the reason is not LORAMAC_STATUS_DUTYCYCLE_RESTRICTED, but MacCtx.MacState = 2 (LORAMAC_TX_RUNNING). However, the root cause is currently unknown.

MarekNovakACRIOS commented 2 years ago

I also use a similar workaround, checking for the stack being busy for too long as a sanity/safeguard check. IMHO the reason is somehow related to hardware: the stack thinks that a transmission is still running because the IRQ/ISR from the SX1261/SX1276 chip is not triggered for some reason. To make handling of this more resistant to hardware "glitches", I think it is necessary to implement a workaround until some kind of timeout is integrated into the stack.
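
A minimal sketch of such a safeguard, assuming a 1 Hz tick from the application; the threshold is arbitrary and the recovery hook (here the board-level MCU reset from the reference boards) is just one possible reaction:

#include <stdbool.h>
#include <stdint.h>
#include "LoRaMac.h"
#include "board.h"   // BoardResetMcu() on the reference boards

// If LoRaMacIsBusy() stays true for far longer than any legitimate
// transaction could take, force a recovery. MAX_BUSY_SECONDS is an assumption.
#define MAX_BUSY_SECONDS    600

void BusySanityCheck1Hz( void )
{
    static uint32_t busySeconds = 0;

    if( LoRaMacIsBusy( ) == true )
    {
        if( ++busySeconds > MAX_BUSY_SECONDS )
        {
            BoardResetMcu( );   // last-resort recovery
        }
    }
    else
    {
        busySeconds = 0;
    }
}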

As it is likely hardware related, it might be hard to reproduce, so please be as specific as possible when describing the platform you use to reproduce the issue. I encountered this once on a custom, battery-operated board with an SX1261.

MarekNovakACRIOS commented 2 years ago

Related to https://github.com/Lora-net/LoRaMac-node/issues/1270

bldh commented 1 year ago

I have had issues where the stack has seemingly gotten out of sync with the SX12xx radio: the stack waits for the BUSY line to go inactive (I have previously seen it stuck in this busy loop for 10 minutes). The SX12xx seems to be asleep, as asserting the CS appears to clear the issue (wait for a while, and if the operation takes longer than you expect, assert the CS).
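
A bounded variant of the radio busy wait could look roughly like the sketch below, assuming the stock SX126x board glue (SX126x.BUSY, SX126x.Spi.Nss, GpioRead/GpioWrite, DelayMs); the timeout value and the CS-pulse recovery follow the observation above and are assumptions, not a vetted fix:

#include <stdint.h>
#include "gpio.h"
#include "delay.h"
#include "sx126x-board.h"

// Wait for the BUSY line to go low, but not forever: if it stays high too
// long, pulse NSS (chip select) to wake a radio that may have fallen asleep.
static void SX126xWaitOnBusyBounded( void )
{
    uint32_t waitedMs = 0;

    while( GpioRead( &SX126x.BUSY ) == 1 )
    {
        DelayMs( 1 );
        if( ++waitedMs > 100 )                      // assumed upper bound
        {
            GpioWrite( &SX126x.Spi.Nss, 0 );        // assert CS to wake the radio
            DelayMs( 1 );
            GpioWrite( &SX126x.Spi.Nss, 1 );
            waitedMs = 0;
        }
    }
}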

mluis1 commented 1 year ago

We have added an API to reset the stack internal state machine.

The reported issues appear to be related to quite an old version of this project. Therefore our advice is to update to the latest available version. The observed issue may have been corrected, as we are not able to reproduce it.

If you don't mind, we will close this issue, because there is now a way to recover from such situations.

In case a similar issue is observed while using the latest available version, one can create a new issue and provide the following information: