TheThingsNetwork / lorawan-stack

The Things Stack, an Open Source LoRaWAN Network Server
https://www.thethingsindustries.com/stack/
Apache License 2.0

CFList inconsistent with United States 902-928 MHz, FSB 2 #5551

Status: Closed (cprovidenti closed this issue 1 year ago)

cprovidenti commented 2 years ago

Summary

The CFList in the Join-Accept received by the end-device is inconsistent with the frequency plan United States 902-928 MHz, FSB 2. This frequency plan specifies that uplinks can be received on US915 channels 8-15 and 65, but the ChMask received from The Things Stack enables channels 8-15 only. It seems to me that it should enable channel 65 as well.
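To make the mismatch concrete, here is a minimal Python sketch that packs a set of channel indices into 16-bit ChMask words. It assumes the usual layout in which channel n maps to bit (n mod 16) of word (n div 16). The helper name `channel_mask_words` is hypothetical and is not taken from The Things Stack.

```python
def channel_mask_words(enabled_channels, total_channels=72):
    """Pack channel indices into 16-bit mask words (bit 0 = lowest channel of each word)."""
    n_words = (total_channels + 15) // 16  # 72 channels -> 5 words
    words = [0] * n_words
    for ch in enabled_channels:
        if not 0 <= ch < total_channels:
            raise ValueError(f"channel index out of range: {ch}")
        words[ch // 16] |= 1 << (ch % 16)
    return words


# FSB 2 as described in the summary: 125 kHz channels 8-15 plus 500 kHz channel 65.
fsb2_narrow = set(range(8, 16))

observed = channel_mask_words(fsb2_narrow)          # what the Join-Accept carried
expected = channel_mask_words(fsb2_narrow | {65})   # what the frequency plan implies

print([f"0x{w:04X}" for w in observed])
print([f"0x{w:04X}" for w in expected])
```

Under these assumptions the two masks differ only in the last word, where channel 65 corresponds to bit 1.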

Note: The same inconsistency may affect the other US915 channel plans (FSB 1 and FSB 3-8), as well as the corresponding AU915 channel plans, but I have not checked that.

FYI: The symptom that alerted me to this inconsistency is that the end-device can join using DR4 (a.k.a. DR12), but is subsequently "unable" (i.e., not permitted) to transmit (Unconfirmed/Confirmed) uplinks using DR4.

Steps to Reproduce

  1. Configure a LoRaWAN 1.0.4 US915 end-device for OTAA, and to use the channel plan identified in the summary (above).
  2. Configure a US915 gateway accordingly and connect it to The Things Network.
  3. Enable "Verbose stream" in the "Live Data" tab of the console for the aforementioned US915 end-device (see step 1).
  4. Force that end-device to undergo OTAA (i.e., rejoin The Things Network), e.g., by resetting or power-cycling the end-device.
  5. Inspect the "ch_masks" value for the (verbose) entry type "Schedule join-accept for transmission" in the "Live Data" tab.

Note: The attached screenshots may clarify some of the above steps.

What do you see now?

The "ch_masks" value enables US915 channels 8-15 only.

Note: This was confirmed by modifying the end-device software to print out the ChMask it received.

What do you want to see instead?

The "ch_masks" value should enable US915 channel 65 as well.

Alternatively: The frequency plan "United States 902-928 MHz, FSB 2" should be updated to match the "ch_masks" value. I.e., remove the "lora-standard-channel" entry therein.

Environment

TTS v3.20.0; LoRaWAN 1.0.4 end-device using OTAA; US915 as per RP002-1.0.3, using frequency plan "United States 902-928 MHz, FSB 2."

Note: I do not recall observing the same end-device symptom (described in "Summary") before last week, so perhaps the inconsistency was introduced in v3.20.0? (Upgrade to that version occurred on June 13, 2022.)

How do you propose to implement this?

I do not.

How do you propose to test this?

N/A.

Can you do this yourself and submit a Pull Request?

N/A.

Attached screenshots: End-Device-Overview, Join-Accept-ChMask-0-45, Join-Accept-ChMask-45-71

adriansmares commented 2 years ago

The issue at hand is that the LoRa Standard Channel and the FSK Channel operate on data rates that run at 500 kHz and 250 kHz, respectively. This makes those data rates incompatible with our ADR algorithm.

The Network Server has a single way of communicating the requested data rate to the end device: the data rate index in the LinkADRReq MAC command. There is no range, and no way to ask the end device to use two data rates (I'm thinking of two data rates with the same spreading factor but different bandwidth). A corollary is that if we ask the end device to use a 500 kHz data rate, we reduce the number of available channels from 8 to 1, which is a bad move: it heavily increases the risk of collisions for end devices operating at the same data rate.
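The "8 to 1" reduction can be illustrated with a short Python sketch. It assumes the US915 uplink mapping in which channels 0-63 (125 kHz) carry DR0-DR3 and channels 64-71 (500 kHz) carry DR4. The function names are hypothetical and do not come from the Network Server code.

```python
def uplink_data_rates(channel):
    """US915 uplink data rate indices supported by a given channel index."""
    if 0 <= channel <= 63:
        return {0, 1, 2, 3}  # 125 kHz channels
    if 64 <= channel <= 71:
        return {4}           # 500 kHz channels
    raise ValueError(f"not a US915 uplink channel: {channel}")


def usable_channels(enabled_channels, data_rate_index):
    """Channels from the enabled set on which the given data rate can be used."""
    return sorted(ch for ch in enabled_channels
                  if data_rate_index in uplink_data_rates(ch))


fsb2 = set(range(8, 16)) | {65}

narrow = usable_channels(fsb2, 3)  # a 125 kHz data rate
wide = usable_channels(fsb2, 4)    # the 500 kHz data rate

print(len(narrow), narrow)
print(len(wide), wide)
```

Because a single data rate index applies to the whole session, choosing DR4 leaves only the one 500 kHz channel of the sub-band usable.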

That said, one may argue that when ADR is disabled we have no good justification for leaving the channel out, and I tend to agree. When ADR is completely disabled in the end device settings, we probably can and should enable the 500 / 250 kHz channels.

ama9910 commented 1 year ago

I can confirm this issue also occurs on AU915.

Here is some more detail:

The LoRaWAN 1.0.4 spec requires devices to support DR6 (DR4 for US):

AU915-928 devices SHALL support one of the two following data rate options:

  1. [DR0 to DR6] and [DR8 to DR13] (minimum set supported for certification)
  2. [DR0 to DR13] (all data rates implemented)

It also requires the use of the high bandwidth channel when joining (for both US and AU plans):

If using the over-the-air activation procedure, the end-device SHALL broadcast the Join-Request message alternatively on a random 125 kHz channel amongst the 64 channels defined using DR2 and on a 500 kHz channel amongst the 8 channels defined using DR6. The end-device SHOULD change channel for every transmission.

The issue we're seeing on both AU915 and US915 band-plans is that 50% of the time, devices will attempt to join on the high bandwidth channel. The join is successful. Once joined, they receive a CF List from the network which excludes that channel. From that point on, no uplinks occur as the devices report a "No enabled channel" error internally. The devices are now no longer contactable and have to be physically rebooted in order to rejoin the network.

In my case the devices use Multitech xDot, running libxDot 4.1.4. The problem is easily reproducible with Multitech's AT firmware as well.

@adriansmares I see your comment about incompatibility with ADR, but is there a possible work-around for this? Perhaps:

adriansmares commented 1 year ago

This is an interesting issue and I now realize that it is not as clear cut as I hoped. The standard says that the CFList is fundamentally equivalent to a set of NewChannelReq/LinkADRReq commands:

[image: specification excerpt]

What is not written out here, but is implicit, is that the LinkADRReq equivalent has to use data rate index 15 behavior (i.e. do not change, because there is no field for the data rate index).

This means that we have basically the following sequence of events, if we translate to MAC commands:

  1. End device has all 72 channels enabled, and the current data rate is a 500 kHz one.
  2. NS sends a LinkADRReq that disables the 500 kHz channels, but does not change the data rate index (i.e. data rate index 15 behavior).
  3. The end device has no other option but to reject this LinkADRReq, as it would render it muted.
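The sequence above can be modelled in a few lines of Python. This is a simplified sketch of the check an end device might perform, assuming the US915 uplink mapping (channels 0-63 carry DR0-DR3, channels 64-71 carry DR4) and treating data rate index 15 as "keep the current data rate". `would_accept` and `KEEP_CURRENT_DR` are hypothetical names, not part of any real stack.

```python
KEEP_CURRENT_DR = 15  # "do not change" data rate index, as described above


def uplink_data_rates(channel):
    """US915 uplink data rate indices supported by a given channel index."""
    if 0 <= channel <= 63:
        return {0, 1, 2, 3}  # 125 kHz channels
    if 64 <= channel <= 71:
        return {4}           # 500 kHz channels
    raise ValueError(f"not a US915 uplink channel: {channel}")


def would_accept(current_dr, new_channels, requested_dr=KEEP_CURRENT_DR):
    """True if at least one newly enabled channel supports the resulting data rate."""
    dr = current_dr if requested_dr == KEEP_CURRENT_DR else requested_dr
    return any(dr in uplink_data_rates(ch) for ch in new_channels)


fsb2_narrow = set(range(8, 16))  # the channels enabled by the observed CFList

# Joined on the 500 kHz channel (DR4); the mask keeps only 125 kHz channels.
joined_wide = would_accept(current_dr=4, new_channels=fsb2_narrow)
# Joined on a 125 kHz channel (DR0); the same mask is harmless.
joined_narrow = would_accept(current_dr=0, new_channels=fsb2_narrow)
# Same mask, but with an explicit move to a 125 kHz data rate.
steered = would_accept(current_dr=4, new_channels=fsb2_narrow, requested_dr=0)

print(joined_wide, joined_narrow, steered)
```

Only the first case leaves the device without a usable channel, which matches the "No enabled channel" symptom reported earlier in the thread.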

I sort of agree that the NS is in the wrong here. I think what we should do is allow the join request channel to stay enabled in the CFList and, on the next uplink, disable it together with a data rate index change to one of the 125 kHz data rates.

This can go wrong in one specific edge case: the join request was received via roaming, and the roaming agreement allows only join accept downlinks, not MAC downlinks. In such cases we should reject the join request. We could avoid the problem by always rejecting the join request whenever the join accept would lead to this situation, but this is very complex and I am not sure it is worth dealing with at the moment.

cc @johanstokking if you have anything to add.

johanstokking commented 1 year ago

> The issue we're seeing on both AU915 and US915 band-plans is that 50% of the time, devices will attempt to join on the high bandwidth channel. The join is successful. Once joined, they receive a CF List from the network which excludes that channel. From that point on, no uplinks occur as the devices report a "No enabled channel" error internally. The devices are now no longer contactable and have to be physically rebooted in order to rejoin the network.

But that would mean that the end device implicitly disabled the other channels. Why is that the case?

> • End device has all 72 channels enabled, and the current data rate is a 500 kHz one.
> • NS sends a LinkADRReq that disables the 500 kHz channels, but does not change the data rate index (i.e. data rate index 15 behavior).
> • The end device has no other option but to reject this LinkADRReq, as it would render it muted.

Same question here; there are eight 125 kHz channels enabled via the CFList. Why would DR0-DR3 on those channels be disabled?

adriansmares commented 1 year ago

> Same question here; there are eight 125 kHz channels enabled via the CFList. Why would DR0-DR3 on those channels be disabled?

Because we've asked the end device to use a data rate which is not compatible with these channels, which is invalid per the LinkADRReq specification: you are not allowed to mute the end device (by enabling only channels which do not operate with the provided data rate index).

The specification talks about this while discussing the semantics of DataRateACK bit of LinkADRAns:

[image: specification excerpt]

Specifically: "The data rate requested is unknown to the end-device or is not possible, given the channel mask provided (not supported by any of the enabled channels)."

We do not have any data rate index in the join accept, so implicitly the data rate index stays the same. I had hoped that the RP2 document would define some form of default/implicit data rate index, but no such concept is available that would let us say the end device violates this part of the specification:

[image: specification excerpt]

To me at least, it seems that in this case the equivalent LinkADRReq has DataRateIndex 15 behavior, which would make the equivalent LinkADRReq command invalid.

adriansmares commented 1 year ago

Per offline consensus, we will ensure that we do not disable the wide join channel as part of the CFList if the Join Request arrived via the wide channel.

Another complication that we need to take care of: if we do not disable the wide channel in the Join Accept, the end device will use the wide channel for subsequent uplinks. Our current LinkADRReq generator will attempt to disable the wide channels but maintain the current data rate index, which obviously results in an invalid command. In such cases we also need to steer the end device to a 125 kHz data rate.
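The agreed behavior could look roughly like the Python sketch below. It is an illustration only and is not the Network Server implementation. The function names, the US915 channel numbering (64-71 for the 500 kHz channels, DR4 as the 500 kHz uplink data rate), and the choice of DR0 as the steering target are all assumptions made for the example.

```python
KEEP_CURRENT_DR = 15  # "do not change" data rate index
WIDE_CHANNELS = range(64, 72)  # assumed US915 500 kHz uplink channels
WIDE_DR = 4                    # assumed US915 500 kHz uplink data rate


def cflist_channels(plan_narrow_channels, join_channel):
    """Channels to enable in the CFList: keep the wide channel the join arrived on."""
    enabled = set(plan_narrow_channels)
    if join_channel in WIDE_CHANNELS:
        enabled.add(join_channel)
    return enabled


def follow_up_link_adr(plan_narrow_channels, current_dr, target_dr=0):
    """(data rate index, channels) for the first LinkADRReq after the join.

    If the device is still on the wide data rate, the request must move it to a
    125 kHz data rate in the same command that disables the wide channel.
    target_dr=0 is an arbitrary choice made for this sketch.
    """
    if current_dr == WIDE_DR:
        return target_dr, set(plan_narrow_channels)
    return KEEP_CURRENT_DR, set(plan_narrow_channels)


fsb2_narrow = range(8, 16)

after_wide_join = cflist_channels(fsb2_narrow, join_channel=65)
after_narrow_join = cflist_channels(fsb2_narrow, join_channel=10)
steer_dr, steer_channels = follow_up_link_adr(fsb2_narrow, current_dr=4)

print(sorted(after_wide_join))
print(steer_dr, sorted(steer_channels))
```

The key point of the design is that the channel mask change and the data rate change travel in one command, so the device is never asked to keep a data rate that none of its enabled channels support.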