Lora-net / LoRaMac-node

Reference implementation and documentation of a LoRa network node.

LoRaWAN gets stuck in LORAMAC_STATUS_BUSY and consumes a lot of energy #1048

Closed CA-s1 closed 1 year ago

CA-s1 commented 3 years ago

Hi,

We are using the SX1272 (Region EU868, Class A) and have been successful with join/uplink/downlink; it works with two different gateways (Kerlink and Objenious).

I am facing a problem where LoRaWAN gets stuck in LORAMAC_STATUS_BUSY. Every time this issue occurs I reset the LoRaMAC manually (I couldn't find a better way):


void LORAMAC_reset()
{
    // Clear the TX-running flag so the stack stops reporting busy
    MacCtx.MacState &= ~LORAMAC_TX_RUNNING;
    // Record the retries performed so far in the pending confirm
    MacCtx.McpsConfirm.NbRetries = MacCtx.AckTimeoutRetriesCounter;
    MacCtx.McpsConfirm.AckReceived = false;
    MacCtx.McpsConfirm.TxTimeOnAir = 0;
    // Flag the MAC as done so the pending request gets finalized
    MacCtx.MacFlags.Bits.MacDone = 1;
}
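
For context, here is a minimal sketch of how such a workaround might be called from the application side, assuming the function above lives in LoRaMac.c (MacCtx is file-static there) and is exported; the retry counter and threshold below are placeholders, not stack parameters:

#include <stdint.h>
#include "LoRaMac.h"

extern void LORAMAC_reset( void );   // the workaround above, assumed exported

// Hypothetical call site: if send attempts keep being rejected with
// LORAMAC_STATUS_BUSY, eventually fall back to the manual reset.
#define MAX_CONSECUTIVE_BUSY    10   // assumed threshold

void AppTrySend( McpsReq_t *mcpsReq )
{
    static uint32_t busyCount = 0;

    if( LoRaMacMcpsRequest( mcpsReq ) == LORAMAC_STATUS_BUSY )
    {
        if( ++busyCount > MAX_CONSECUTIVE_BUSY )
        {
            LORAMAC_reset( );       // last-resort recovery
            busyCount = 0;
        }
    }
    else
    {
        busyCount = 0;              // request accepted, clear the counter
    }
}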

Could you help me identify the source of this issue? Is there a timer or an interrupt in charge of resetting MacCtx.MacState? As an additional detail, I've noticed that battery consumption is dramatically higher when this issue occurs.

Thanks for your help.

dmeehan1968 commented 3 years ago

A brief note to say that I think I'm experiencing something similar, in a situation where the gateway is lost. In my use case, I am normally sending uplinks unconfirmed, but with a periodic LinkCheckReq to validate connectivity.

There is an edge case, depending on device configuration, where this causes confirmed uplinks to be sent to log errors, and the confirmed uplinks cause device-side ADR to degrade the data rate over time to DR1. Ultimately the device gives up and attempts a rejoin, but LoRaMacIsBusy() then reports the stack as permanently busy (as if the last message were stuck and being processed indefinitely), and as a result the app code doesn't attempt the rejoin, assuming it would fail. I can change this logic to work around it.

It's a rare event that takes some hours of gateway outage to manifest, and it currently does so in only 1 device out of 25 under test. I have some serial logging that has led me to this conclusion, but I don't have the device under a debugger to break in and check what it's doing under the hood.

dmeehan1968 commented 3 years ago

Further to my last comment, I wanted to outline my application's behaviour and the results of the latest testing.

We have a separate operation to validate the connection to the network server, to facilitate rejoin when needed. In our above test, this occurs AFTER we have already seen the degradation of data rate.

To better understand the issue, we made changes such that:

This seems to demonstrate that there is an issue whereby a confirmed message can remain stuck in the stack when there is no response from the NS (due to gateway outage) and end-device ADR has degraded the data rate to DR0. One would assume that after the maximum retries (NbTrials=8 on all MCPS requests) the uplink would be failed and the stack made available again. This does not appear to be the case.
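
For reference, a confirmed uplink with NbTrials on the 4.4.x API looks roughly like the sketch below; the port, payload size and data rate are illustrative values, not taken from the report:

#include <stdint.h>
#include "LoRaMac.h"

static uint8_t errorMsg[40];   // illustrative ~40-byte error payload

static LoRaMacStatus_t SendConfirmedError( void )
{
    McpsReq_t mcpsReq;

    mcpsReq.Type = MCPS_CONFIRMED;
    mcpsReq.Req.Confirmed.fPort = 2;                          // example port
    mcpsReq.Req.Confirmed.fBuffer = errorMsg;
    mcpsReq.Req.Confirmed.fBufferSize = sizeof( errorMsg );
    mcpsReq.Req.Confirmed.NbTrials = 8;                       // the retry limit discussed above
    mcpsReq.Req.Confirmed.Datarate = DR_0;                    // ignored when ADR is enabled

    // Expectation: after NbTrials unanswered attempts the request is failed
    // via McpsConfirm and the stack becomes available again; the report
    // above suggests it can instead remain busy.
    return LoRaMacMcpsRequest( &mcpsReq );
}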

It is unclear why the data rate for unconfirmed uplinks degrades from the point of gateway outage, or if this is relevant to the stack going permanently busy.

djaeckle commented 3 years ago

Hi all,

Thanks for the report. May I ask which version of the stack you are using?

@CA-s1 could you provide some more information about when this situation happens and what leads to it? Is it a similar procedure to the one @dmeehan1968 describes?

@dmeehan1968 have you made any changes to the default parameters of the region? I'm trying to identify a way to reproduce it. Were you able to find a way to reproduce it deterministically?

dmeehan1968 commented 3 years ago

May I ask which version of the stack you are using?

4.4.5 - Git Commit 118771fe

have you made any changes to the default parameters of the region?

No

I'm trying to identify a way to reproduce it. Were you able to find a way to reproduce it deterministically?

No. The problem only manifested itself when we had sensor debug uplinks enabled, which vastly increases the number of uplinks (1 every 10 seconds, as opposed to the likely uplink rate of 1 per hour in production).

I provided a detailed description above of what I think is going on, and a test app could be constructed to do something similar.

Essentially (a rough sketch of such a test loop follows this list):

  1. End device ADR enabled, EU868
  2. Uplinks at DR0 at a rate of 1 per 10 seconds; the payload length should exceed the max DR0-2 payload of 64 bytes (I think we use 67).
  3. Switch off gateway, or otherwise prevent downlinks arriving at the end device
  4. Observe the data rate stepping down (DR5, DR4, DR3, ...) every 32 uplinks (I think this is triggered when no downlink is received).
  5. When DR2 is reached, continue sending packets, which will be rejected by the stack due to payload length. In response to each rejection, send an uplink with a zero-length payload. This causes the data rate to continue stepping down over time.
  6. Once DR0 is achieved, send a confirmed uplink (length may not be significant, I think our error message is ~40 bytes)
  7. The confirmed uplink will take a number of retries before failure, but I believe this is the point at which the stack reports busy continually (not just during the retries).
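
Roughly, such a test loop could look like the following; this is a hedged sketch modeled on the classA example of the 4.4.x API (the payload size and port are illustrative, and the 10-second timer, join handling and callbacks are not shown):

#include <stddef.h>
#include <stdint.h>
#include "LoRaMac.h"

#define APP_PORT          2     // illustrative application port
#define APP_PAYLOAD_SIZE  67    // exceeds the DR0-2 max payload (step 2)

static uint8_t appData[APP_PAYLOAD_SIZE];

// Called every 10 seconds by an application timer (not shown).
static void SendDebugUplink( void )
{
    McpsReq_t mcpsReq;
    LoRaMacTxInfo_t txInfo;

    if( LoRaMacQueryTxPossible( APP_PAYLOAD_SIZE, &txInfo ) != LORAMAC_STATUS_OK )
    {
        // Payload no longer fits at the current data rate (step 5): send an
        // empty frame to flush MAC commands, as the classA example does.
        mcpsReq.Type = MCPS_UNCONFIRMED;
        mcpsReq.Req.Unconfirmed.fBuffer = NULL;
        mcpsReq.Req.Unconfirmed.fBufferSize = 0;
        mcpsReq.Req.Unconfirmed.Datarate = DR_0;    // ignored when ADR is enabled
    }
    else
    {
        mcpsReq.Type = MCPS_UNCONFIRMED;
        mcpsReq.Req.Unconfirmed.fPort = APP_PORT;
        mcpsReq.Req.Unconfirmed.fBuffer = appData;
        mcpsReq.Req.Unconfirmed.fBufferSize = APP_PAYLOAD_SIZE;
        mcpsReq.Req.Unconfirmed.Datarate = DR_0;    // ignored when ADR is enabled
    }

    LoRaMacMcpsRequest( &mcpsReq );
}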

As described above, at some point in time our application performs a link check process with randomised backoff that ultimately fails due to the lack of downlinks/LinkCheckAns (because of gateway outage) and then goes into a rejoin process with randomised backoff. Even once the gateway connectivity is restored, these join calls fail with stack busy, as a result of state induced by the process described here. As a result, the device does not rejoin in a reasonable time period.
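
For reference, the application-level link check described here is presumably issued via an MLME request along these lines (a minimal sketch; the randomised backoff and the failure counting live in application code and are not shown):

#include "LoRaMac.h"

// Ask for a LinkCheckReq to be piggybacked on the next uplink; the answer,
// or its absence, is then observed in the MlmeConfirm callback.
static void RequestLinkCheck( void )
{
    MlmeReq_t mlmeReq;

    mlmeReq.Type = MLME_LINK_CHECK;
    LoRaMacMlmeRequest( &mlmeReq );
}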

One of the last things we tried was to leave a device in this state to see what would happen (rather than manually resetting it), and on at least one occasion the device rejoined the network after a delay of several days (approx. 5-7, I don't recall exactly). The device would have been attempting a rejoin roughly once per hour (the actual interval was randomised), likely at DR2 on each retry. I believe the stack was reporting busy during this time, rather than duty-cycle restricted.

Our device is now in production and we've not seen further problems, although that is mostly because in production the devices do not use the sensor debug facility that appears to lead to this condition (due to the uplink rate and lack of downlinks). Also, we removed the sending of zero-length payload messages, as we couldn't see the benefit of them.

It should be noted that in our app, when sensor debug uplinks are enabled and a viable gateway is present, we don't see the data rate degradation that occurs when the gateway is not present. This seems odd: in both cases the end device is 'blind' to the gateway state, yet when the GW is off, the end device seems to 'know' this and steps the data rate down. I can only assume that there is some periodic back channel between end device and GW that allows the end device to sense that the gateway is in range/operational.

If that actually is the case, then it seems odd that this is not exposed in the API, because the alternative is an elaborate link check/rejoin scheme at application level. It would seem super easy for many developers not to have identified the need, leading to the possibility of end devices permanently going dark (without some form of intervention). Indeed, it's only the aggressive uplink rate of our debug facility that brought the possibility of this failure to light; I suspect many apps/developers would not be aware of rare events (depending on uplink rate) that might occur over multiple days and could lead to end devices going permanently busy/dark.
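
For what it's worth, the behaviour described looks consistent with LoRaWAN's device-side ADR backoff (after a number of uplinks without any downlink, the device sets ADRACKReq and then steps the data rate down) rather than a dedicated back channel. Either way, an application can observe the degradation directly by polling the current data rate via the MIB; a minimal sketch, assuming the 4.4.x MIB API and an arbitrary DR_2 trigger level:

#include <stdbool.h>
#include "LoRaMac.h"

// A steadily falling uplink data rate while no downlinks arrive is a hint
// that connectivity may have been lost. The DR_2 threshold is an assumption.
static bool ConnectivityLooksDegraded( void )
{
    MibRequestConfirm_t mibReq;

    mibReq.Type = MIB_CHANNELS_DATARATE;
    if( LoRaMacMibGetRequestConfirm( &mibReq ) != LORAMAC_STATUS_OK )
    {
        return false;
    }
    return ( mibReq.Param.ChannelsDatarate <= DR_2 );
}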

truongn1 commented 3 years ago

Hi @djaeckle,

We are also using 4.4.5 and are facing the same issue with confirmed messages. To reproduce the issue, simply do the following:

You can eventually catch this issue.

InnotasElektronik commented 2 years ago

Hi there, we have the same behavior. It's easy to reproduce: all uplinks confirmed (ACK requested) and a transmission interval of 2 minutes. First the device joins, then the connection to the GW is interrupted. The device starts retransmitting, 8 repetitions per uplink, and the SF moves from 7 to 12. After some time, the stack reports busy after the 4th attempt. In the debugger, the analysis showed that the reason is not LORAMAC_STATUS_DUTYCYCLE_RESTRICTED, but MacCtx.MacState = 2 (LORAMAC_TX_RUNNING). However, the root cause is currently unknown.

MarekNovakACRIOS commented 2 years ago

I also use a similar workaround, checking for the stack being busy for too long as a sanity/safeguard check. IMHO the reason is somehow related to hardware: the stack thinks that a transmission is still running because the IRQ/ISR from the SX1261/SX1276 chip is not triggered for some reason. To make handling of this more resistant to hardware "glitches", I think it is necessary to implement a workaround until some kind of timeout is integrated into the stack.
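
A minimal sketch of such a safeguard, assuming a 1 Hz tick from the application; the threshold is arbitrary and the recovery hook (here the board-level MCU reset from the reference boards) is just one possible reaction:

#include <stdbool.h>
#include <stdint.h>
#include "LoRaMac.h"
#include "board.h"   // BoardResetMcu() on the reference boards

// If LoRaMacIsBusy() stays true for far longer than any legitimate
// transaction could take, force a recovery. MAX_BUSY_SECONDS is an assumption.
#define MAX_BUSY_SECONDS    600

void BusySanityCheck1Hz( void )
{
    static uint32_t busySeconds = 0;

    if( LoRaMacIsBusy( ) == true )
    {
        if( ++busySeconds > MAX_BUSY_SECONDS )
        {
            BoardResetMcu( );   // last-resort recovery
        }
    }
    else
    {
        busySeconds = 0;
    }
}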

As it is likely hardware related, it might be hard to reproduce, so please be as specific as possible when describing the platform you use to reproduce the issue. I encountered this once on a custom, battery-operated board with an SX1261.

MarekNovakACRIOS commented 2 years ago

Related to https://github.com/Lora-net/LoRaMac-node/issues/1270

bldh commented 1 year ago

I have had issues where the stack has seemingly gotten out of sync with the SX12xx radio: the stack waits for the BUSY line to go inactive (I have previously seen it stuck in this busy loop for 10 minutes). The SX12xx seems to be asleep, as asserting the CS appears to clear the issue (wait for a while, and if the operation takes longer than you expect, assert the CS).
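
A bounded variant of the radio busy wait could look roughly like the sketch below, assuming the stock SX126x board glue (SX126x.BUSY, SX126x.Spi.Nss, GpioRead/GpioWrite, DelayMs); the timeout value and the CS-pulse recovery follow the observation above and are assumptions, not a vetted fix:

#include <stdint.h>
#include "gpio.h"
#include "delay.h"
#include "sx126x-board.h"

// Wait for the BUSY line to go low, but not forever: if it stays high too
// long, pulse NSS (chip select) to wake a radio that may have fallen asleep.
static void SX126xWaitOnBusyBounded( void )
{
    uint32_t waitedMs = 0;

    while( GpioRead( &SX126x.BUSY ) == 1 )
    {
        DelayMs( 1 );
        if( ++waitedMs > 100 )                      // assumed upper bound
        {
            GpioWrite( &SX126x.Spi.Nss, 0 );        // assert CS to wake the radio
            DelayMs( 1 );
            GpioWrite( &SX126x.Spi.Nss, 1 );
            waitedMs = 0;
        }
    }
}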

mluis1 commented 1 year ago

We have added an API to reset the stack internal state machine.

The reported issues appear to be related to quite an old version of this project. Therefore our advice is to update to the latest available version. The observed issue may have been corrected, as we are not able to reproduce it.

If you don't mind, we will close this issue, because there is now a way to recover from such situations.

In case a similar issue is observed while using the latest available version, one can create a new issue and provide the following information: