beegee-tokyo / SX126x-Arduino

Arduino library to use Semtech SX126x LoRa chips and modules to communicate
MIT License
231 stars 64 forks source link

lmh_send() silently discards MAC commands from previous DLs when payload is too large #54

Closed danak6jq closed 2 years ago

danak6jq commented 2 years ago

From LoRaMacHelper.cpp: lmh_send() :

    if (LoRaMacQueryTxPossible(app_data->buffsize, &txInfo) != LORAMAC_STATUS_OK)
    {
        LOG_LIB("LMH", "lmh_send -> LoRaMacQueryTxPossible failed");

        // Send empty frame in order to flush MAC commands
        mcpsReq.Type = MCPS_UNCONFIRMED;
        mcpsReq.Req.Unconfirmed.fBuffer = NULL;
        mcpsReq.Req.Unconfirmed.fBufferSize = 0;
        mcpsReq.Req.Unconfirmed.Datarate = m_param.tx_data_rate;
        // cannot send
        return LMH_ERROR;
    }

This digresses from Semtech reference behavior, which would go ahead and send the empty frame to flush the MAC commands that make the frame too large to send (assuming fOpts isn't too large later on).

I believe I understand the issue here: the reference stack will return OK, so the calling code will believe it has sent the payload, while the stack will send an empty frame. The calling code has no idea.

My question: this change was intentional so calling code could see the error, attempt to send an empty frame, then send the desired payload?

beegee-tokyo commented 2 years ago

These code parts are taken directly from Semtechs LoRaMAC stack V1.0.2, I did not change things there because I do not know much about it and why and how some things are handled there.

danak6jq commented 2 years ago

I had a look at Semtech reference LoRaMac v1.0.2 and v1.0.3 and those releases both go ahead and send an empty frame to flush MAC commands that make fOpts too large - I believe this must have been added by RAK then:

// cannot send return LMH_ERROR;

It has a different behavior, as described above.

danak6jq commented 2 years ago

In fact, it doesn't appear this behavior has changed in the reference since it was committed:

https://github.com/Lora-net/LoRaMac-node/blame/master/src/apps/LoRaMac/common/LmHandler/LmHandler.c

It does present a bit of a challenge for an application that assumes OK means the payload was sent. If an application wants to do re-send of the payload, it needs to know that an empty was sent rather than just "OK", and this is the change apparently made by RAK. It does mean that the app needs to send the empty payload itself, until it can successfully send the desired payload (sending an empty payload can still get MAC commands in return, so it can take more than one try)

danak6jq commented 2 years ago

Additional testing suggests that this LoRaWAN stack doesn't send the MAC commands when sending an empty, but does flush them. Will investigate further; there's clearly something amiss here.

danak6jq commented 2 years ago

The scenario is US915 DR0, with 2x LinkAddrAns MAC commands and 1x DevStatusAns MAC command were received on the last DL. An attempt to send an UL fails because the combined fOpts + payload are too large. This returns an error to the application code, which then sends an empty payload intending to send up the MAC commands.

However, this empty payload UL has no commands in fOpts or in the payload. Apparently the outstanding MAC commands have been discarded by the time I send the empty frame to flush them to the LNS.

Retrying the original UL succeeds, but the MAC commands are never sent.

I suspect this code in LoRaMacQueryTxPossible() is executed on the empty-frame send:

    {
        txInfo->MaxPossiblePayload = txInfo->CurrentPayloadSize;
        // The fOpts don't fit into the maximum payload. Omit the MAC commands to
        // ensure that another uplink is possible.
        fOptLen = 0;
        MacCommandsBufferIndex = 0;
        MacCommandsBufferToRepeatIndex = 0;
    }

which suggests MacCommandsBufferIndex + MacCommandsBufferToRepeatIndex is larger than the max payload size (11).

This code then returns OK:

    // Verify if the fOpts and the payload fit into the maximum payload
    if (ValidatePayloadLength(size, datarate, fOptLen) == false)
    {
        LOG_LIB("LM", "LoRaMacQueryTxPossible -> ValidatePayloadLength failed size = %d DR = %d", size, datarate);

        return LORAMAC_STATUS_LENGTH_ERROR;
    }
    return LORAMAC_STATUS_OK;

because size and fOptlen are both zero. The MAC commands are discarded.

(I don't know how the call to ValidatePayloadLength() is ever valid, because fOptlen is always 0 at this call and neither of txInfo->MaxPossiblePayload or txInfo->CurrentPayloadSize is used, and size is just the application payload size. )

It doesn't seem like fOptLen here should be larger than the max payload size for DR0, but it is a sum of two different MAC command buffers, possibly there's duplication there.

Something is clearly not right here, and this code bears little similarity to the Semtech reference for 1.0.3

I strongly believe at no point should LoRaMacQueryTxPossible() discard MAC commands; subsequent ULs should send as much of the stored MAC commands as possible, which seems to be what the reference code does.

danak6jq commented 2 years ago

In more detail. Here's the first UL from a mote after JOIN:

image

Which the LNS responds to with:

image

Note there are 2 LinkAdrReqs (to set the channel mask and DR) and a DevStatusReq.

Two hours later, the mote wants to do a confirmed UL, but lmh_send() returns an error because payload + fOpts is too large, so the mote sends an empty payload UL (note that it's at DR0 as the earlier AdrReq commanded):

image

However, this UL contains no MAC commands, either in fOpts or in the payload (as expected). We expect this frame to have the DevStatusAns and Ack for LinkAdrReq, but they are missing. So the LNS not surprisingly repeats the MAC commands (DevStatuReq is requested once per hour for testing this scenario):

image

The mote attempts to retry the confirmed UL, but again lmh_send() fails because the LNS sent down the three MAC commands again, the combination of which is again too large, so another empty UL is sent, again without MAC commands:

image

LNS responds with yet another AdrReq commands, because none of them have been Ack'ed yet:

image

Because the DevStatusReq is absent, this time the retry of the Confirmed UL succeeds:

image

Which contains Acks for the LinkAdrReqs. Progress!

Note: I suspect the LoRaWAN stack can send only one LinkAdrAck instead of the two, which might avoid this issue in the first place.

The LoRaWAN stack is discarding MAC responses when there's payload size issue - it appears it may actually happen when the empty payload is sent.

beegee-tokyo commented 2 years ago

So do you have actually a problem or are you just analyzing the LoRaWAN stack?

danak6jq commented 2 years ago

As if I take pleasure in analyzing the LoRaWAN stack.

Yes, I have a problem, though I suppose now that I understand it, I can work-around it for the most part. This probably ought to be documented somewhere.

However, motes discarding MAC commands means that gateways will unnecessarily repeat DLs and increase time spent in DLs (transmitting, not listening, and thus not listening for ULs). Given this is most likely to happen in DR0 just increases the amount of time that GWs will be transmitting and not receiving.

So this is a general issue and appears to be in LoRaMac code provided by RAK. Feel free to prioritize it as you see fit.

beegee-tokyo commented 2 years ago

Not sure where you get all of this. But I think you are wrong in one point. If the payload is too big for the DR, there is no 0 packet sent.

    if (LoRaMacQueryTxPossible(app_data->buffsize, &txInfo) != LORAMAC_STATUS_OK)
    {
        LOG_LIB("LMH", "lmh_send -> LoRaMacQueryTxPossible failed");

        // Send empty frame in order to flush MAC commands
        mcpsReq.Type = MCPS_UNCONFIRMED;
        mcpsReq.Req.Unconfirmed.fBuffer = NULL;
        mcpsReq.Req.Unconfirmed.fBufferSize = 0;
        mcpsReq.Req.Unconfirmed.Datarate = m_param.tx_data_rate;
        // cannot send
        return LMH_ERROR;
    }

returns with an error without initiating the sending. So there is no 0 length packet sent.

danak6jq commented 2 years ago

Not sure where you get all of this. But I think you are wrong in one point. If the payload is too big for the DR, there is no 0 packet sent.

Of course there is a 0 packet sent - see the ChirpStack logs above captured above.

  if (LoRaMacQueryTxPossible(app_data->buffsize, &txInfo) != LORAMAC_STATUS_OK)
  {
      LOG_LIB("LMH", "lmh_send -> LoRaMacQueryTxPossible failed");

      // Send empty frame in order to flush MAC commands
      mcpsReq.Type = MCPS_UNCONFIRMED;
      mcpsReq.Req.Unconfirmed.fBuffer = NULL;
      mcpsReq.Req.Unconfirmed.fBufferSize = 0;
      mcpsReq.Req.Unconfirmed.Datarate = m_param.tx_data_rate;
      // cannot send
      return LMH_ERROR;
  }

returns with an error without initiating the sending. So there is no 0 length packet sent.

Once I realized the too-big frame results in the error return you show here with no frame being sent, I added the attempted recovery to the application (explicitly send a 0 packet in hopes that will flush out the MAC commands).

Explicitly sending a 0 frame does send a 0 frame - but it discards the MAC commands before doing so. That's surely a bug - the reference stack, at least the 1.0.3 version, implicitly sends the 0 frame with the MAC commands intact, but returns no error from the send so the application doesn't appear to have any idea the intended payload was not sent. That bit of code above there that returns LMH_ERROR appears to an RAK change to the Semtech code.

Since I've already deployed a few dozen RAK4631-based motes and have 5 dozen more in the shop, perhaps I should contact someone at RAK - any suggestion who?

danak6jq commented 2 years ago

Any update on what a professional/volume deployment of RAK Wireless hardware should be using? Is RAK3172 the suggested alternative for actual product deployment?

beegee-tokyo commented 2 years ago

inquiry@rakwireless.com brings you in contact with the sales team.