Open conanomori opened 1 month ago
A Class C downlink frame not answering an uplink must have the fields indicated in the example below. The transmission will be delayed until the Station finds a convenient time slot. If there is no such opportunity, the frame will be dropped after some configurable time; the default is one (1) second.
A Class C downlink frame that answers an uplink and is aimed at RX1, looks almost identical to Class A dnmsg messages. The only difference is the field dC, which declares a Class C interaction. If the transmission cannot make the RX1 opportunity, the Station will switch over to RX2 parameters and choose a convenient time to send the frame. There is a maximum time Station will postpone transmission before the frame is dropped. Since RX2 downlink opportunities are arbitrary for class C devices,
So according to the LoRa Basics Station documentation, the gateway by default holds the Downlink for 1 sec (configurable) if there are collisions for RX2, and if the downlink is aimed at RX1, then there is one retry in the next RX2 window.
The problem this causes is that if there is a scheduling conflict on the gateway, it does not reattempt like expected for Class C downlinks, but rather rejects it after the first attempt.
Is this the reattempt that you are referring to?
I believe this is caused by the fact that the Basics Station formatter FromDownlink does not have a Class C implementation, and therefore defaults to Class A downlinks. As you can see in the code, the downlink DeviceClass is formatted as either Class A or Class B only.
This is by design. The Things Stack as an LNS has the best context on current and future downlinks for each gateway and it can calculate collisions and also hold long downlinks in the future. So when the moment to send a Class C downlink arrives, The Things Stack calculates the time when the downlink should arrive (for absolute time downlinks) and subtracts the RX delay required and sends the downlink as a regular Class A, such that all the gateway has to do is to simply transmit. The Things Stack also accounts for RX windows and retries.
Thanks for the response.
Just to provide more clarification on what I am doing, I am trying to send multiple downlinks to multiple end devices in quick succession. I recognize that there may be scheduling conflicts (I can likely live with that), but would expect displaced packets to be reattempted.
Is this the reattempt that you are referring to?
I believe this is the same thing. I assume the 1 second comes from these two parameters in the Basics Station code:
ustime CLASS_C_BACKOFF_BY = "100ms" builtin retry interval for class C TX attempts
u4 CLASS_C_BACKOFF_MAX = 10 builtin max number of class C TX attempts
It will retry 10 times at 100 ms intervals (10 × 100 ms = 1 s). However, it will only retry if the frame is classified as a Class C downlink (i.e., dC = 2).
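The arithmetic above can be sketched as follows. This is only an illustration of the retry budget as described in this comment, not Basics Station source code: the function name `retry_offsets_ms` is hypothetical, and treating the 10 attempts as 10 retries spaced 100 ms apart is an assumption taken from the 10 × 100 ms = 1 s reading above.

```python
# Hypothetical sketch of the Class C retry budget described above.
# Only the two parameter values and the dC = 2 condition come from the thread.
CLASS_C_BACKOFF_BY_MS = 100  # retry interval for class C TX attempts
CLASS_C_BACKOFF_MAX = 10     # max number of class C TX attempts

CLASS_A = 0  # dC value for a Class A downlink
CLASS_C = 2  # dC value for a Class C downlink


def retry_offsets_ms(dc):
    """Offsets (ms after the first attempt) at which a displaced frame could be retried."""
    if dc != CLASS_C:
        # A frame marked as Class A gets no backoff retries at all.
        return []
    return [CLASS_C_BACKOFF_BY_MS * n for n in range(1, CLASS_C_BACKOFF_MAX + 1)]


print(retry_offsets_ms(CLASS_C))  # ten offsets, the last one at 1000 ms
print(retry_offsets_ms(CLASS_A))  # empty list: no retry for dC = 0
```

Under these assumptions, a frame sent with dC = 0 has no retry window, which matches the behaviour reported in this issue.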
This is by design. The Things Stack as an LNS has the best context on current and future downlinks for each gateway and it can calculate collisions and also hold long downlinks in the future. So when the moment to send a Class C downlink arrives, The Things Stack calculates the time when the downlink should arrive (for absolute time downlinks) and subtracts the RX delay required and sends the downlink as a regular Class A, such that all the gateway has to do is to simply transmit. The Things Stack also accounts for RX windows and retries.
In my experience, TTS is not able to reliably schedule the downlinks so that they reach the end nodes (currently I'm lucky to get a 50% receive rate). But is the particular implementation you are referring to not the "Schedule Downlink Late" feature?
My understanding is that this is a legacy feature, and that modern gateways should have the capability to buffer and schedule messages accordingly (i.e., the current gateway I have implementing Basics Station)
For even further context, here are the actual error logs I get from the Basics Station service
[S2E:INFO] ::1 diid=26714 [ant#0] - displaces ::1 diid=26715 [ant#0] due to -20ms283us overlap
[S2E:VERB] ::1 diid=26715 [ant#0] - class A has no more alternate TX time
[S2E:WARN] ::1 diid=26715 [ant#0] - unable to place frame
And having a look through the code, the error message "class A has no more alternate TX time" only prints if it's interpreting the downlink as Class A.
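To make the log excerpt above easier to audit, here is a small sketch that extracts which frame displaced which, and which frames were dropped. The regular expressions are written against the three log lines quoted above only; they are an assumption about the line format, not a documented Basics Station log grammar, and `summarize` is a hypothetical helper name.

```python
import re

LOG = """\
[S2E:INFO] ::1 diid=26714 [ant#0] - displaces ::1 diid=26715 [ant#0] due to -20ms283us overlap
[S2E:VERB] ::1 diid=26715 [ant#0] - class A has no more alternate TX time
[S2E:WARN] ::1 diid=26715 [ant#0] - unable to place frame
"""

# Patterns derived from the sample lines above (assumed format).
DISPLACES = re.compile(
    r"diid=(\d+) \[ant#\d+\] - displaces \S+ diid=(\d+) \[ant#\d+\] "
    r"due to (-?\d+ms\d+us) overlap"
)
DROPPED = re.compile(r"diid=(\d+) \[ant#\d+\] - unable to place frame")


def summarize(log_text):
    """Return ({displaced diid: (winning diid, overlap text)}, {dropped diids})."""
    displaced = {}
    dropped = set()
    for line in log_text.splitlines():
        match = DISPLACES.search(line)
        if match:
            winner, loser, overlap = match.groups()
            displaced[int(loser)] = (int(winner), overlap)
            continue
        match = DROPPED.search(line)
        if match:
            dropped.add(int(match.group(1)))
    return displaced, dropped


displaced, dropped = summarize(LOG)
print(displaced)
print(dropped)
```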
But in any case, sending a Class C downlink as a Class A because it's assumed that TTS can handle the scheduling doesn't make much sense to me. Perhaps I'm not understanding something here? Shouldn't TTS attempt to integrate with the Basics Station as intended? I.e., if Basics Station expects a Class C downlink, then TTS should have the ability to send it Class C downlinks.
My understanding is that this is a legacy feature, and that modern gateways should have the capability to buffer and schedule messages accordingly (i.e., the current gateway I have implementing Basics Station)
The gateway retry behaviour that you are referring to works if you are dealing with only one gateway. If you have multiple gateways that an end device can be served by (also if some of those are via peering with other networks), only the LNS has sufficient context to schedule downlinks effectively.
The LoRa Basics Station has no feedback mechanism to indicate to the LNS what the size of the buffer is and what the current capacity is. Without this feedback, the LNS may simply try to keep scheduling downlinks and many of them may fail because the gateway is trying to back off and retry. But if the LNS controls the scheduling (as is the case with TTS), it has the full context of all the downlinks to be sent per gateway and it can switch to a different gateway (if available) if the current one has a potential collision. Again, this is about multiple paths through multiple gateways. A single gateway will never be aware of this context. The Things Stack is designed to control the downlink paths and hence we force downlinks to look like Class A to any particular gateway.
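The idea of an LNS that tracks every gateway's schedule and switches gateways on a potential collision can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not The Things Stack's scheduler: `overlaps`, `pick_gateway`, the gateway IDs, and the millisecond windows are all hypothetical.

```python
def overlaps(a, b):
    """True if half-open intervals a = (start, end) and b = (start, end) intersect."""
    return a[0] < b[1] and b[0] < a[1]


def pick_gateway(schedules, candidates, window):
    """Reserve `window` on the first candidate gateway with no conflicting transmission.

    `schedules` maps gateway ID -> list of already reserved (start_ms, end_ms) windows.
    Returns the chosen gateway ID, or None if every candidate has a conflict.
    """
    for gateway in candidates:
        busy = schedules.get(gateway, [])
        if not any(overlaps(window, reserved) for reserved in busy):
            schedules.setdefault(gateway, []).append(window)
            return gateway
    return None


# gw-a is already transmitting from 0 ms to 289 ms.
schedules = {"gw-a": [(0, 289)]}
print(pick_gateway(schedules, ["gw-a", "gw-b"], (100, 121)))  # conflict on gw-a, so gw-b
print(pick_gateway(schedules, ["gw-a", "gw-b"], (300, 321)))  # gw-a is free again
print(pick_gateway(schedules, ["gw-a", "gw-b"], (110, 120)))  # both busy: None
```

With a single gateway, as in the scenario reported in this issue, the candidate list has one entry, so a conflict leaves no alternative path.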
And having a look through the code, the error message "class A has no more alternate TX time" only prints if it's interpreting the downlink as Class A.
This simply means that there was a collision. If this was a Class C downlink, you would probably get a different error message after you've exhausted the retries. And during that time when you are still retrying, other downlinks might end up with collisions. As I see it, this is just a symptom of the problem and not the root cause.
In my experience, TTS is not able to reliably schedule the downlinks so that they reach the end nodes (currently I'm lucky to get a 50% receive rate).
Ok that's then the core issue that we need to understand. How many downlinks are you trying concurrently, which region are you in and do you have a separate antenna specific for downlinks?
First I'll respond to your comment then I'll answer your questions.
The gateway retry behaviour that you are referring to works if you are dealing with only one gateway. If you have multiple gateways that an end device can be served by (also if some of those are via peering with other networks), only the LNS has sufficient context to schedule downlinks effectively.
I understand that at a high level, the LNS takes control of this scheduling (and that only it has sufficient information to accomplish this due to multiple gateways etc.), but once it has distributed its downlinks to the gateways, shouldn't you also allow the gateway to make adjustments (i.e., reschedule Class C downlinks) as necessary to make it work? Like, the TTS does its best to schedule across all possible gateways, but once it gets to the intended gateway, locally the gateway can make minor adjustments.
But to your questions:
How many downlinks are you trying concurrently, which region are you in and do you have a separate antenna specific for downlinks?
Scenario: 5 Class C end nodes being serviced by one nearby gateway. I am trying to send a downlink to each one so that they receive the downlinks as simultaneously as possible (I am aware of the limitations of LoRaWAN, so I accept variations in this simultaneity). All are able to reliably send uplinks and receive downlinks independently; however, when I try to send 5 simultaneous downlinks, only about 50% (so 2-3) of the devices will receive them.
How many downlinks concurrently: 1 downlink to each device = 5 total
Region: US915
Separate Antenna: No (gateway is RAK7268)
Below are error logs from the Basics Station when I try to perform this task. As you can see, the gateway is receiving the downlinks, but some of them are dropped because of scheduling conflicts.
Wed Jul 17 20:28:55 2024 user.debug basicstation[3620]: [S2E:DEBU] ::1 diid=50495 [ant#0] - next TX start ahead by 498ms58us (20:28:55.538932)
Wed Jul 17 20:28:55 2024 user.debug basicstation[3620]: [S2E:VERB] ::1 diid=50495 [ant#0] - starting TX in 49ms893us: 923.3MHz 26.0dBm ant#0(0) DR12 SF8/BW500 frame=60C6D4FE2780CA1802725665BF83 (14 bytes)
Wed Jul 17 20:28:55 2024 user.info basicstation[3620]: [RAL:INFO] RAL_LGW: lgw_send done: count_us=768742484.
Wed Jul 17 20:28:55 2024 user.debug basicstation[3620]: [LOG:DEBU] rrd_statistic_down-421 downlinkTqByAirtime: dr=4, timeonair=21, tm=1721248135
Wed Jul 17 20:28:55 2024 user.debug basicstation[3620]: [LOG:DEBU] rrd_statistic_down-423 downlinkTqByPkt: dr=4, tm=1721248135
Wed Jul 17 20:28:55 2024 user.info basicstation[3620]: [S2E:INFO] ::1 diid=50495 [ant#0] - displaces ::1 diid=50496 [ant#0] due to -26ms782us overlap
Wed Jul 17 20:28:55 2024 user.debug basicstation[3620]: [S2E:VERB] ::1 diid=50496 [ant#0] - class A has no more alternate TX time
Wed Jul 17 20:28:55 2024 user.warn basicstation[3620]: [S2E:WARN] ::1 diid=50496 [ant#0] - unable to place frame
Wed Jul 17 20:28:55 2024 user.info basicstation[3620]: [S2E:INFO] TX ::1 diid=50495 [ant#0] - dntxed: 923.3MHz 26.0dBm ant#0(0) DR12 SF8/BW500 frame=60C6D4FE2780CA1802725665BF83 (14 bytes)
Wed Jul 17 20:28:55 2024 user.debug basicstation[3620]: [S2E:DEBU] Tx done diid=50495
Wed Jul 17 20:28:55 2024 user.debug basicstation[3620]: [S2E:DEBU] ::1 diid=50501 [ant#0] - next TX start ahead by 121ms165us (20:28:55.680798)
Wed Jul 17 20:28:55 2024 user.debug basicstation[3620]: [S2E:VERB] ::1 diid=50501 [ant#0] - starting TX in 49ms921us: 923.3MHz 26.0dBm ant#0(0) DR8 SF12/BW500 frame=60C4D4FE27805B0002EE45CDBEB4 (14 bytes)
Wed Jul 17 20:28:55 2024 user.info basicstation[3620]: [RAL:INFO] RAL_LGW: lgw_send done: count_us=768884350.
Wed Jul 17 20:28:55 2024 user.debug basicstation[3620]: [LOG:DEBU] rrd_statistic_down-421 downlinkTqByAirtime: dr=8, timeonair=289, tm=1721248135
Wed Jul 17 20:28:55 2024 user.debug basicstation[3620]: [LOG:DEBU] rrd_statistic_down-423 downlinkTqByPkt: dr=8, tm=1721248135
Wed Jul 17 20:28:55 2024 user.info basicstation[3620]: [S2E:INFO] ::1 diid=50501 [ant#0] - displaces ::1 diid=50504 [ant#0] due to -174ms695us overlap
Wed Jul 17 20:28:55 2024 user.debug basicstation[3620]: [S2E:VERB] ::1 diid=50504 [ant#0] - class A has no more alternate TX time
Wed Jul 17 20:28:55 2024 user.warn basicstation[3620]: [S2E:WARN] ::1 diid=50504 [ant#0] - unable to place frame
Wed Jul 17 20:28:55 2024 user.info basicstation[3620]: [S2E:INFO] TX ::1 diid=50501 [ant#0] - dntxed: 923.3MHz 26.0dBm ant#0(0) DR8 SF12/BW500 frame=60C4D4FE27805B0002EE45CDBEB4 (14 bytes)
Wed Jul 17 20:28:55 2024 user.debug basicstation[3620]: [S2E:DEBU] Tx done diid=50501
Wed Jul 17 20:28:55 2024 user.debug basicstation[3620]: [S2E:DEBU] ::1 diid=50507 [ant#0] - next TX start ahead by 160ms72us (20:28:56.088768)
Wed Jul 17 20:28:56 2024 user.debug basicstation[3620]: [S2E:VERB] ::1 diid=50507 [ant#0] - starting TX in 49ms921us: 923.3MHz 26.0dBm ant#0(0) DR12 SF8/BW500 frame=60C5D4FE2780020202A505521537 (14 bytes)
Wed Jul 17 20:28:56 2024 user.info basicstation[3620]: [RAL:INFO] RAL_LGW: lgw_send done: count_us=769292320.
Wed Jul 17 20:28:56 2024 user.debug basicstation[3620]: [LOG:DEBU] rrd_statistic_down-421 downlinkTqByAirtime: dr=4, timeonair=21, tm=1721248136
Wed Jul 17 20:28:56 2024 user.debug basicstation[3620]: [LOG:DEBU] rrd_statistic_down-423 downlinkTqByPkt: dr=4, tm=1721248136
Wed Jul 17 20:28:56 2024 user.info basicstation[3620]: [S2E:INFO] TX ::1 diid=50507 [ant#0] - dntxed: 923.3MHz 26.0dBm ant#0(0) DR12 SF8/BW500 frame=60C5D4FE2780020202A505521537 (14 bytes)
Wed Jul 17 20:28:56 2024 user.debug basicstation[3620]: [S2E:DEBU] Tx done diid=50507
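The timeonair values in the log above (21 for SF8/BW500 and 289 for SF12/BW500, both 14-byte frames) can be cross-checked with the standard LoRa time-on-air formula from the Semtech SX127x datasheets. This is a sketch under assumptions: the logged values are taken to be milliseconds, and an 8-symbol preamble, coding rate 4/5, explicit header, and CRC enabled are assumed because they reproduce the logged numbers; whether Basics Station computes airtime this way is not confirmed by this thread.

```python
import math


def lora_time_on_air_ms(payload_len, sf, bw_hz, preamble_symbols=8,
                        coding_rate=1, crc=True, explicit_header=True):
    """LoRa packet airtime in ms (coding_rate=1 means 4/5)."""
    t_sym_ms = (2 ** sf) / bw_hz * 1000.0
    # Low data rate optimization applies when a symbol lasts longer than 16 ms.
    low_dr_opt = 1 if t_sym_ms > 16.0 else 0
    numerator = (8 * payload_len - 4 * sf + 28
                 + (16 if crc else 0)
                 - (0 if explicit_header else 20))
    denominator = 4 * (sf - 2 * low_dr_opt)
    payload_symbols = 8 + max(math.ceil(numerator / denominator) * (coding_rate + 4), 0)
    return (preamble_symbols + 4.25 + payload_symbols) * t_sym_ms


print(round(lora_time_on_air_ms(14, 8, 500_000)))   # SF8/BW500, 14 bytes
print(round(lora_time_on_air_ms(14, 12, 500_000)))  # SF12/BW500, 14 bytes
```

Under these assumptions the two results round to 21 ms and 289 ms, matching the log. A single SF12/BW500 frame therefore occupies the antenna for roughly 289 ms, so any other frame scheduled to start on the same antenna inside that window has to be displaced.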
The core of the issue here is that the LBS spec currently does not have an immediate NACK mechanism to indicate to the LNS that a downlink cannot be scheduled so that the LNS can immediately retry other options. This is why the LNS keeps track of the downlinks for any particular gateway and controls the downlinks.
It will retry 10 times at 100 ms intervals (10 × 100 ms = 1 s). However, it will only retry if the frame is classified as a Class C downlink (i.e., dC = 2).
Like, the TTS does its best to schedule across all possible gateways, but once it gets to the intended gateway, locally the gateway can make minor adjustments.
1 second is not really a minor adjustment. TTS can schedule some Class C downlinks with a delay of 530 ms (this is default and configurable) and so when the gateway is retrying the current downlink, TTS can schedule the next one which may increase the chance of collisions.
I think we should focus on the issue of why your downlinks are failing first.
All are able to reliably send uplinks and receive downlinks independently; however, when I try to send 5 simultaneous downlinks, only about 50% (so 2-3) of the devices will receive them.
Ok how are you scheduling them? The specific API calls with the timing would be helpful. When you say simultaneous, what is the gap between downlinks?
Sure! So when I am just doing individual testing:
When I am trying to send simultaneous downlinks to all end devices in the applications:
Currently we do not use any delay between updating each device so it happens 'simultaneously' (or as fast as the loop updates them). We have previously tested adding in a delay in the loop (up to 2 seconds delay per iteration) but did not see any improvement in the reliability.
Ok and what is the exact RPC call?
Here is the Azure function:
//This Azure function has 3 purposes
//1: It processes every message from the devices connected to the IoT hub. If a message has activeMode = true, it updates the properties of all the device twins to have activeMode = true
//2: The property update of the device twins in the IoT hub triggers this function again to relay the new properties to The Things Stack server, which will activate the actual devices
//3: A timer trigger will also activate the devices every night at midnight
[Function("ActivateDevicesFunction")]
public async Task Run([EventHubTrigger("events", Connection = "EVENTHUB_CONNECTION_STRING")] Azure.Messaging.EventHubs.EventData[] events, CToken token)
{
    string apiKey = "";
    //var stackEvents = new StackEvent?[events.Length];
    UriBuilder uriBuilder = new();
    List<string> twinIDs = new List<string>();
    for (var i = 0; i < events.Length; i++)
    {
        bool activateDevice = false;
        var ev = events[i];
        var props = ev.Properties;
        var deviceID = (string)props["deviceId"];
        var stackEvent = new StackEvent?[1];
        JsonElement JSON = JsonDocument.Parse(ev.Body).RootElement;
        _logger.LogInformation(deviceID);
        _logger.LogInformation(JSON.ToString()); //test
        //maybe a better way to get the app id would be to have it included in every message and remove the need for a STACK_APPLICATION_ID env. variable
        //the stack api key might work for every app id under the fpi base url
        if (JSON.TryGetProperty("tags", out JsonElement tags)
            && tags.TryGetProperty("lorawan", out JsonElement lorawan)
            && lorawan.TryGetProperty("app-id", out JsonElement appId))
        {
            //_logger.LogInformation("getting app ID");//test
            if (appId.GetString() == GetVariable("STACK_APPLICATION_ID_2"))//replace with switch case when more apps get added
            {
                AppIdentifiers = new ApplicationIdentifiers(GetVariable("STACK_APPLICATION_ID_2"));
                apiKey = GetVariable("STACK_KEY_2");
            }
        }
        else
        {
            AppIdentifiers = new ApplicationIdentifiers(GetVariable("STACK_APPLICATION_ID"));
            apiKey = GetVariable("STACK_API_KEY");
        }
        uriBuilder = new(GetVariable("STACK_BASE_URL"));
        uriBuilder.Path = Path.Combine(uriBuilder.Path, "as/applications", AppIdentifiers.ApplicationID, "packages/azureiothub/events");
        if (JSON.TryGetProperty("properties", out JsonElement properties))
        {
            if (properties.TryGetProperty("reported", out JsonElement reported))
            {
                if (reported.TryGetProperty("decodedPayload", out JsonElement decodedPayload)
                    && decodedPayload.TryGetProperty("data", out JsonElement data)
                    && data[0].TryGetProperty("channel", out JsonElement channel)
                    && data[0].TryGetProperty("value", out JsonElement value))
                {
                    if (channel.GetInt32() == 75)
                    {
                        await UpdateTwinsAsync(value.GetInt32());
                    }
                    else
                    {
                        _logger.LogInformation("channel: " + channel.ToString());
                        _logger.LogInformation("value: " + value.ToString());
                    }
                }
                else
                {
                    JSONError("properties.reported.decodedPayload.data.channel", _logger);
                }
            }
            else if (properties.TryGetProperty("desired", out JsonElement desired))
            {
                if (desired.TryGetProperty("decodedPayload", out JsonElement decodedPayload)
                    && decodedPayload.TryGetProperty("activeMode", out JsonElement activeMode))
                {
                    if (activeMode.GetBoolean())
                    {
                        _logger.LogInformation("Sending activation message to " + deviceID);
                        activateDevice = true;
                        stackEvent[0] = new StackEvent(
                            new EndDeviceIdentifiers(AppIdentifiers, deviceID),
                            JsonDocument.Parse(ev.Body).RootElement,
                            props
                        );
                    }
                    else
                    {
                        _logger.LogInformation("isActive: " + activeMode.ToString());
                    }
                }
                else
                {
                    JSONError("properties.activeMode.isActive", _logger);
                }
            }
        }
        else
        {
            JSONError("properties", _logger);
        }
        if (activateDevice)
        {
            var request = new HttpRequestMessage
            {
                RequestUri = uriBuilder.Uri,
                Method = HttpMethod.Post,
                Headers = {
                    {
                        "Authorization", $"Bearer {apiKey}"
                    }
                },
                Content = new StringContent(
                    JsonSerializer.Serialize(new StackEvents(stackEvent)),
                    Encoding.UTF8,
                    "application/json"
                ),
            };
            _logger.LogInformation(request.ToString());//test
            _logger.LogInformation(JsonSerializer.Serialize(new StackEvents(stackEvent)));
            _logger.LogInformation(token.ToString());
            var response = await client.SendAsync(request, token);
            response.EnsureSuccessStatusCode();
        }
    }
}
async Task UpdateTwinsAsync(int active)
{
    string connectionString = GetVariable("IOTHUB_CONNECTION_STRING");
    RegistryManager registry = RegistryManager.CreateFromConnectionString(connectionString);
    var decodedPayload = new Dictionary<string, object> { { "activeMode", active == 1 } };
    var query = registry.CreateQuery("SELECT * FROM devices");
    while (query.HasMoreResults)
    {
        var page = await query.GetNextAsTwinAsync();
        foreach (var twin in page)
        {
            string appId = twin.Tags["lorawan"]["app-id"];
            if (appId == null)
            {
                appId = "test-app";
            }
            if (AppIdentifiers.ApplicationID == null || appId == AppIdentifiers.ApplicationID)
            {
                _logger.LogInformation("Updating device twin of " + twin.DeviceId + " in " + appId);
                twin.Properties.Desired["decodedPayload"] = decodedPayload;
                await registry.UpdateTwinAsync(twin.DeviceId, twin, twin.ETag);
            }
        }
    }
}
Every time the IoT hub is updated with a new message (e.g., an uplink), it will check to see if the "reported" value contains an "active" message. If it does then it will propagate that "active" message to all the other devices in the application by updating the "desired" values. The use of the "reported" and "desired" device twin values is part of the Azure IoT Hub integration with TTS.
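The propagation rule described above can be reduced to a few lines of plain logic. This is a language-neutral sketch in Python rather than the Azure function itself: `propagate_active` and the dictionary layout of the twins are hypothetical, while the trigger channel 75 and the `value == 1` check mirror the function shown earlier.

```python
def propagate_active(reported_channel, reported_value, twins, trigger_channel=75):
    """Copy an "active" report from one device into every twin's desired properties.

    Returns the IDs of the twins that were updated (empty if the uplink
    was not on the trigger channel).
    """
    if reported_channel != trigger_channel:
        return []
    updated = []
    for twin in twins:
        desired = twin.setdefault("desired", {})
        desired["decodedPayload"] = {"activeMode": reported_value == 1}
        updated.append(twin["deviceId"])
    return updated


twins = [{"deviceId": "dev-1"}, {"deviceId": "dev-2"}]
print(propagate_active(75, 1, twins))  # both twins are updated in one pass
print(twins[0]["desired"])
```

Because every twin is updated in the same pass, each update results in its own downlink request at nearly the same moment, which is the burst discussed in this thread.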
I'd need to see the contents of the exact JSON this function is sending actually.
Hi,
See below for an example of the JSON sent to TTS from the Azure function to send a downlink. Some identifying info has been removed.
{
  "version": 2309,
  "tags": {
    "lorawan": {
      "joinEui": "0000000000000001",
      "provisioned": true,
      "devEui": "XXXXXXXXXXXXXXXX",
      "app-id": "X"
    }
  },
  "properties": {
    "desired": {
      "decodedPayload": {
        "activeMode": true
      },
      "rawDownlink": {
        "confirmed": false
      },
      "$metadata": {
        "$lastUpdated": "2024-09-03T20:56:54.495869Z",
        "$lastUpdatedVersion": 221,
        "decodedPayload": {
          "$lastUpdated": "2024-09-03T20:56:54.495869Z",
          "$lastUpdatedVersion": 221,
          "activeMode": {
            "$lastUpdated": "2024-09-03T20:56:54.495869Z",
            "$lastUpdatedVersion": 221
          }
        },
        "rawDownlink": {
          "$lastUpdated": "2024-09-03T20:56:54.495869Z",
          "$lastUpdatedVersion": 221,
          "confirmed": {
            "$lastUpdated": "2024-09-03T20:56:54.495869Z",
            "$lastUpdatedVersion": 221
          }
        }
      },
      "$version": 221
    }
  }
}
Hey sorry for the delay in my response. I mean The Things Stack Downlink message (ex: https://www.thethingsindustries.com/docs/integrations/webhooks/scheduling-downlinks/#scheduling-downlinks).
No problem. Just to clarify, we are using the IoT Hub integration (https://www.thethingsindustries.com/docs/integrations/cloud-integrations/azure-iot-hub/device-twin/), so it differs slightly. In any case, I think TTS is failing to properly schedule these packets, and that is leading to collisions.
Summary
I am trying to send downlinks to a Class C device from TTS Cloud to a RAK7268 set up as a Basics Station. While the downlinks are being received, I believe they are being sent as Class A downlinks rather than Class C. The problem this causes is that if there is a scheduling conflict on the gateway, it does not reattempt like expected for Class C downlinks, but rather rejects it after the first attempt.
I believe this is caused by the fact that the Basics Station formatter FromDownlink does not have a Class C implementation, and therefore defaults to Class A downlinks. As you can see in the code, the downlink DeviceClass is formatted as either Class A or Class B only.
Here is a log of a downlink message received on the gateway after I sent a downlink from TTS Cloud to a Class C end device. As you can see, the dC key is 0 (Class A) and should instead be 2 (Class C).
See here for the Basics Station documentation and the expected downlink for a Class C device, where dC is 2.
Is there a potential workaround?
Steps to Reproduce
dC key value
Current Result
Downlinks sent to Class C devices are classified as Class A downlinks
Expected Result
Downlinks sent to Class C devices are classified as Class C downlinks
Deployment
The Things Stack Cloud