TheThingsNetwork / lorawan-stack

The Things Stack, an Open Source LoRaWAN Network Server
https://www.thethingsindustries.com/stack/

No ns.down.transmission.success in case of Downlink transmission over LoRa Basics Station #7182

Closed beitler closed 2 months ago

beitler commented 3 months ago

Summary

When a downlink is successfully transmitted over a LoRa Basics Station gateway (confirmed by dntxed log entries on the gateway), the ns.down.transmission.success event and the down/sent message on MQTT are not emitted.

Steps to Reproduce

  1. Schedule a downlink for a device within reach of a single packetforwarder gateway (e.g. via the MQTT integration; see the sketch after this list)
  2. Observe on the packetforwarder gateway the downlink is transmitted
  3. Schedule a downlink for a device within reach of a single LoRa Basics Station gateway
  4. Observe on the LoRa Basics Station gateway that the downlink is transmitted (by looking for dntxed log entries)
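
For completeness, one way to schedule such a downlink from the application side is via The Things Stack MQTT integration. The sketch below uses the Go paho MQTT client; the broker address, application/tenant/device IDs, API key, and payload are placeholders, not values from this report:

```go
package main

import (
	"fmt"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

func main() {
	// Placeholders: replace with your own cluster, application, device and API key.
	const (
		broker = "tls://example.eu1.cloud.thethings.industries:8883"
		appID  = "my-app@my-tenant"
		devID  = "my-device"
		apiKey = "NNSXS.XXXXXXXX..."
	)

	opts := mqtt.NewClientOptions().
		AddBroker(broker).
		SetUsername(appID).
		SetPassword(apiKey)
	client := mqtt.NewClient(opts)
	if token := client.Connect(); token.Wait() && token.Error() != nil {
		panic(token.Error())
	}
	defer client.Disconnect(250)

	// Push a single unconfirmed downlink to the device's downlink queue.
	topic := fmt.Sprintf("v3/%s/devices/%s/down/push", appID, devID)
	payload := `{"downlinks":[{"f_port":1,"frm_payload":"Ytw=","priority":"NORMAL"}]}`
	if token := client.Publish(topic, 1, false, payload); token.Wait() && token.Error() != nil {
		panic(token.Error())
	}
}
```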

Current Result

When the downlink is successfully transmitted over a packetforwarder gateway (steps 1 and 2), we observe the ns.down.transmission.success event in the console and the down/sent message on MQTT. When the downlink is successfully transmitted over a LoRa Basics Station gateway (steps 3 and 4), we observe neither the ns.down.transmission.success event in the console nor the down/sent message on MQTT.

Expected Result

We expect the ns.down.transmission.success event in the console and the down/sent message on MQTT on a successful transmission irrespective of whether the transmission happened over a LoRa Basics Station or packetforwarder gateway.

Relevant Logs

see below

URL

No response

Deployment

The Things Stack Cloud

The Things Stack Version

3.31.0

Client Name and Version

No response

Other Information

Looking at the gs.txack.receive event in the gateway console, we observe that in the case of LoRa Basics Station, the downlink_message field is missing:

Packet Forwarder:

```json
{
  "name": "gs.txack.receive",
  "time": "2024-07-10T12:54:50.021794265Z",
  "identifiers": [
    {
      "gateway_ids": {
        "gateway_id": "eui-fcc23dfffe0f229f",
        "eui": "FCC23DFFFE0F229F"
      }
    }
  ],
  "data": {
    "@type": "type.googleapis.com/ttn.lorawan.v3.TxAcknowledgment",
    "correlation_ids": [
      "as:downlink:01J2EBBJ1KAWHXNCZ7K3SR6MJ5",
      "gs:tx_ack:01J2EBBYD53JFG8310PXYQHFY7",
      "gs:uplink:01J2EBBT8RV51TX7JXCX432KEX",
      "ns:transmission:01J2EBBTP09C2BXTQVATBBBWNM",
      "rpc:/ttn.lorawan.v3.AppAs/DownlinkQueueReplace:ef1ad643-ec47-40f1-8906-41e26268a015"
    ],
    "downlink_message": {
      "raw_payload": "YGcjASeAFQABlJDo/iu1",
      "scheduled": {
        "data_rate": {
          "lora": {
            "bandwidth": 125000,
            "spreading_factor": 7,
            "coding_rate": "4/5"
          }
        },
        "frequency": "868100000",
        "timestamp": 51946571,
        "downlink": {
          "tx_power": 16.15,
          "invert_polarization": true
        },
        "concentrator_timestamp": "51946571000"
      },
      "correlation_ids": [
        "as:downlink:01J2EBBJ1KAWHXNCZ7K3SR6MJ5",
        "gs:tx_ack:01J2EBBYD53JFG8310PXYQHFY7",
        "gs:uplink:01J2EBBT8RV51TX7JXCX432KEX",
        "ns:transmission:01J2EBBTP09C2BXTQVATBBBWNM",
        "rpc:/ttn.lorawan.v3.AppAs/DownlinkQueueReplace:ef1ad643-ec47-40f1-8906-41e26268a015"
      ]
    }
  },
  "correlation_ids": [
    "as:downlink:01J2EBBJ1KAWHXNCZ7K3SR6MJ5",
    "gs:tx_ack:01J2EBBYD53JFG8310PXYQHFY7",
    "gs:uplink:01J2EBBT8RV51TX7JXCX432KEX",
    "ns:transmission:01J2EBBTP09C2BXTQVATBBBWNM",
    "rpc:/ttn.lorawan.v3.AppAs/DownlinkQueueReplace:ef1ad643-ec47-40f1-8906-41e26268a015"
  ],
  "origin": "ip-10-23-14-96.eu-west-1.compute.internal",
  "context": {
    "tenant-id": "CgdzZW10ZWNo"
  },
  "visibility": {
    "rights": [
      "RIGHT_GATEWAY_TRAFFIC_READ"
    ]
  },
  "unique_id": "01J2EBBYD5ZC0GX55AW1VHGVHK"
}
```

LoRa Basics Station:

```json
{
  "name": "gs.txack.receive",
  "time": "2024-07-10T11:33:33.340691098Z",
  "identifiers": [
    {
      "gateway_ids": {
        "gateway_id": "eui-fcc23dfffe0f229f",
        "eui": "FCC23DFFFE0F229F"
      }
    }
  ],
  "data": {
    "@type": "type.googleapis.com/ttn.lorawan.v3.TxAcknowledgment",
    "correlation_ids": [
      "gs:tx_ack:01J2E6Q40WB1V051XX7130XD1C"
    ]
  },
  "correlation_ids": [
    "gs:tx_ack:01J2E6Q40WB1V051XX7130XD1C"
  ],
  "origin": "ip-10-23-4-158.eu-west-1.compute.internal",
  "context": {
    "tenant-id": "CgdzZW10ZWNo"
  },
  "visibility": {
    "rights": [
      "RIGHT_GATEWAY_TRAFFIC_READ"
    ]
  },
  "unique_id": "01J2E6Q40WNSX1FMKZN1DDWVNQ"
}
```

This suggests that in ToTxAck, where the msgtype:dntxed message is correlated with the downlink, txAck.DownlinkMessage is not set:

https://github.com/TheThingsNetwork/lorawan-stack/blob/5fdde221469e2af60c1513eb927632062207f9df/pkg/gatewayserver/io/semtechws/lbslns/upstream.go#L484-L495

This in turn means that tokens.Get() fails to retrieve the downlink for the diid.

A possible explanation for why the downlink cannot be retrieved from the io.DownlinkTokens cache is that the cache is small relative to how fast the diid advances. We observed that scheduling two subsequent downlinks over the same LoRa Basics Station gateway yields diid values that differ considerably (in our example 40752 -> 40868, i.e. 116 diids over just a few seconds).
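
To put that rate in perspective (a rough estimate, assuming "a few seconds" means on the order of five): roughly 116 downlinks in ~5 s is about 23 downlinks per second across the instance, so a 16-entry cache shared by the whole Gateway Server is completely overwritten in well under a second. A class A downlink, however, is scheduled several seconds before it is transmitted (the RX1 delay is typically 1-5 s), so its token would normally need to survive for that long before the dntxed report arrives.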

The diid (the key into the token cache) is incremented by 1 for each downlink: https://github.com/TheThingsNetwork/lorawan-stack/blob/5fdde221469e2af60c1513eb927632062207f9df/pkg/gatewayserver/io/downlink_token.go#L43-L53 but it is stored in a fixed-size cache of 16 entries, indexed by the diid modulo the cache size: https://github.com/TheThingsNetwork/lorawan-stack/blob/5fdde221469e2af60c1513eb927632062207f9df/pkg/gatewayserver/io/downlink_token.go#L27

Now, when the downlink is looked up by diid, the cache entry at that index no longer belongs to it. Because the diid advances by ~100 within ~5 seconds, the token cache has wrapped around multiple times by the time the dntxed event is received from the gateway, so the check of the diid against the stored key (`item.key != token`) fails: https://github.com/TheThingsNetwork/lorawan-stack/blob/5fdde221469e2af60c1513eb927632062207f9df/pkg/gatewayserver/io/downlink_token.go#L57-L64
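
For illustration, here is a minimal, self-contained sketch (not the actual io.DownlinkTokens code; names are simplified) of how such a fixed-size, modulo-indexed token store behaves once the counter has advanced past the cache size between storing and looking up a token:

```go
package main

import "fmt"

// tokenItem pairs the issued token (diid) with the downlink it refers to.
type tokenItem struct {
	key  uint32
	data string
}

// tokenStore is a simplified stand-in for a fixed-size downlink token cache.
type tokenStore struct {
	last  uint32
	items [16]tokenItem
}

// Next stores the downlink and returns a monotonically increasing token (the diid).
func (t *tokenStore) Next(data string) uint32 {
	t.last++
	t.items[t.last%uint32(len(t.items))] = tokenItem{key: t.last, data: data}
	return t.last
}

// Get returns the downlink only if its slot has not been reused in the meantime.
func (t *tokenStore) Get(token uint32) (string, bool) {
	item := t.items[token%uint32(len(t.items))]
	if item.key != token { // slot was overwritten by a newer downlink
		return "", false
	}
	return item.data, true
}

func main() {
	var ts tokenStore
	diid := ts.Next("our downlink")

	// With one store shared by all gateways, other traffic keeps advancing the counter.
	for i := 0; i < 100; i++ {
		ts.Next(fmt.Sprintf("other downlink %d", i))
	}

	// By the time the dntxed report arrives, the slot has been reused.
	_, ok := ts.Get(diid)
	fmt.Println("correlated:", ok) // prints: correlated: false
}
```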

The choice of the DownlinkTokens cache size suggests that it was intended as a per-gateway cache, as is the case in the packetforwarder backend implementation. In the LoRa Basics Station backend, however, the token cache seems to exist only once per Gateway Server: https://github.com/TheThingsNetwork/lorawan-stack/blob/5fdde221469e2af60c1513eb927632062207f9df/pkg/gatewayserver/gatewayserver.go#L303

This seems to explain the observed behavior. What do you think?

Proposed Fix

Either extend the existing global downlink token cache to a size large enough to hold all downlinks that are pending acknowledgement across the instance, or move the downlink token cache into a per-gateway state object.
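
As a rough sketch of the second option (purely illustrative; the type and field names here are hypothetical and not taken from the code base), the token cache would become part of the per-gateway connection state of the LoRa Basics Station frontend instead of a single instance shared by all gateways:

```go
package main

import (
	"fmt"
	"sync"
)

// downlinkTokens stands in for the fixed-size token cache discussed above.
type downlinkTokens struct{ /* 16-entry ring keyed by diid */ }

// gatewayConnection is a hypothetical per-gateway state object of the frontend.
type gatewayConnection struct {
	gatewayID string
	tokens    downlinkTokens // one token cache per gateway connection
}

// frontend holds one connection, and therefore one token cache, per gateway
// instead of a single downlinkTokens value shared across the Gateway Server.
type frontend struct {
	mu          sync.Mutex
	connections map[string]*gatewayConnection
}

func (f *frontend) connect(gatewayID string) *gatewayConnection {
	f.mu.Lock()
	defer f.mu.Unlock()
	conn := &gatewayConnection{gatewayID: gatewayID}
	f.connections[gatewayID] = conn
	return conn
}

func main() {
	f := &frontend{connections: map[string]*gatewayConnection{}}
	gw1 := f.connect("eui-aaaaaaaaaaaaaaaa")
	gw2 := f.connect("eui-bbbbbbbbbbbbbbbb")
	// Heavy downlink traffic on gw1 now only advances gw1's tokens and can no
	// longer evict gw2's pending downlinks before their TX acks arrive.
	fmt.Println(gw1.gatewayID, gw2.gatewayID)
}
```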


KrishnaIyer commented 3 months ago

Your description of the problem seems valid. I think the difference is that in the legacy UDP packet forwarder we store downlink tokens per gateway: https://github.com/TheThingsNetwork/lorawan-stack/blob/5fdde221469e2af60c1513eb927632062207f9df/pkg/gatewayserver/io/udp/udp.go#L626

And for LBS we do this for the entire instance. https://github.com/TheThingsNetwork/lorawan-stack/blob/5fdde221469e2af60c1513eb927632062207f9df/pkg/gatewayserver/io/semtechws/lbslns/format.go#L33

And 16 may not be sufficient. Thoughts @johanstokking?

johanstokking commented 3 months ago

io.DownlinkTokens is designed for use per gateway. By the way, we also use it in other frontends, per gateway connection.

This would indeed be the obvious cause and would be good to fix.

KrishnaIyer commented 3 months ago

@beitler: you can use this branch to test a fix: https://github.com/TheThingsNetwork/lorawan-stack/tree/fix/7182-lbs-tx-ack

beitler commented 2 months ago

Thanks @KrishnaIyer. Could you please upgrade semtech.eu1.cloud.thethings.industries to that branch, or create a temporary tenant running it? Then we can run some tests.

KrishnaIyer commented 2 months ago

semtech.eu1.cloud.thethings.industries is the production environment and will be updated on the 26th of August. You can either build this from source, or I can give you a Docker image when this PR gets merged.

beitler commented 2 months ago

Hi @KrishnaIyer,

I did a test using a snapshot build of the branch you provided:

(screenshot of the observed events omitted)

Presence of txack event

The analysis confirms that the expected txack events are properly generated in my test environment. ⚠ However, this does not confirm that the bug is fixed: under a low-load scenario (as is the case on my test server), the current implementation would also work, because the global token cache would not wrap around as quickly.

I scheduled a downlink for a device with an active session. The downlink was successfully transmitted by Basics Station, as indicated by the dntxed station log entry:

```
2024-08-06 11:58:06.625 [S2E:VERB] RX 868.5MHz DR5 SF7/BW125 snr=11.2 rssi=-88 xtime=0x4500005B749315 - updf mhdr=40 DevAddr=0000025B FCtrl=82 FCnt=2 FOpts=[0307] 9354127E..6CCD mic=990798392 (26 bytes)
2024-08-06 11:58:07.128 [S2E:DEBU] ::1 diid=4 [ant#0] - next TX start ahead by 4s493ms (11:58:11.622331)
2024-08-06 11:58:11.602 [S2E:VERB] ::1 diid=4 [ant#0] - starting TX in 19ms900us: 868.5MHz 16.0dBm ant#0(0) DR5 SF7/BW125 frame=605B0200008002000162DC8E0660AF (15 bytes)
2024-08-06 11:58:11.627 [S2E:XDEB] DC EU band 100 blocked until 11:58:16.255 (txtime=11:58:11.622 airtime=46ms336us)
2024-08-06 11:58:11.627 [S2E:XDEB] DC dnchnl 0 blocked until 11:58:13.290 (txtime=11:58:11.622 airtime=46ms336us)
2024-08-06 11:58:11.627 [S2E:INFO] TX ::1 diid=4 [ant#0] - dntxed: 868.5MHz 16.0dBm ant#0(0) DR5 SF7/BW125 frame=605B0200008002000162DC8E0660AF (15 bytes)
2024-08-06 11:58:11.668 [S2E:DEBU] Tx done diid=4
```

In the Web Console I see a corresponding `ns.down.transmission.success` event:

```json
{ "name": "ns.down.transmission.success", "time": "2024-08-06T11:58:11.639575842Z", "identifiers": [ { "device_ids": { "device_id": "tabs-pb-1204", "application_ids": { "application_id": "test-app2" }, "dev_eui": "58A0CB0000201204", "join_eui": "58A0CB0001500000", "dev_addr": "0000025B" } } ], "data": { "@type": "type.googleapis.com/ttn.lorawan.v3.DownlinkMessage", "payload": { "m_hdr": { "m_type": "UNCONFIRMED_DOWN" }, "mic": "jgZgrw==", "mac_payload": { "f_hdr": { "dev_addr": "0000025B", "f_ctrl": { "adr": true }, "f_cnt": 2 }, "f_port": 1, "frm_payload": "Ytw=", "full_f_cnt": 2 } }, "end_device_ids": { "device_id": "tabs-pb-1204", "application_ids": { "application_id": "test-app2" }, "dev_eui": "58A0CB0000201204", "join_eui": "58A0CB0001500000", "dev_addr": "0000025B" }, "request": { "downlink_paths": [ { "uplink_token": "ChYKFAoIY29yZWNlbGwSCNymMv/+Xry9EJWm0tsFGgwIzp/ItQYQ4MPXrwIgiJT6+tMs" } ], "rx1_delay": 5, "rx1_data_rate": { "lora": { "bandwidth": 125000, "spreading_factor": 7, "coding_rate": "4/5" } }, "rx1_frequency": "868500000", "rx2_data_rate": { "lora": { "bandwidth": 125000, "spreading_factor": 12, "coding_rate": "4/5" } }, "rx2_frequency": "869525000", "frequency_plan_id": "EU_863_870" }, "correlation_ids": [ "as:downlink:01J4KRV38PSFDKJ894QX7DQKJ9", "gs:tx_ack:01J4KRVMNPH5DFD378FRVJXNTK", "gs:uplink:01J4KRVFSDP9TN4NP8CQJ6SA70", "ns:transmission:01J4KRVG8QM07HSRJRN9K7EEFD", "rpc:/ttn.lorawan.v3.AppAs/DownlinkQueueReplace:01J4KRV38F0H3P4SQ46EJSGV0K", "rpc:/ttn.lorawan.v3.GsNs/HandleUplink:01J4KRVFSD2WE5ZQZQ2PMVS8J1", "rpc:/ttn.lorawan.v3.GsNs/ReportTxAcknowledgment:01J4KRVMNPM9ZGGRZP5H0E63K6", "rpc:/ttn.lorawan.v3.NsGs/ScheduleDownlink:01J4KRVG8QX87X4BBJNPBCHT1A" ], "session_key_id": "AZEnjKMi/xCr50TAsJWeGQ==" }, "correlation_ids": [ "as:downlink:01J4KRV38PSFDKJ894QX7DQKJ9", "gs:tx_ack:01J4KRVMNPH5DFD378FRVJXNTK", "gs:uplink:01J4KRVFSDP9TN4NP8CQJ6SA70", "ns:transmission:01J4KRVG8QM07HSRJRN9K7EEFD", "rpc:/ttn.lorawan.v3.AppAs/DownlinkQueueReplace:01J4KRV38F0H3P4SQ46EJSGV0K", "rpc:/ttn.lorawan.v3.GsNs/HandleUplink:01J4KRVFSD2WE5ZQZQ2PMVS8J1", "rpc:/ttn.lorawan.v3.GsNs/ReportTxAcknowledgment:01J4KRVMNPM9ZGGRZP5H0E63K6", "rpc:/ttn.lorawan.v3.NsGs/ScheduleDownlink:01J4KRVG8QX87X4BBJNPBCHT1A" ], "origin": "a51b879e5025", "context": {}, "visibility": { "rights": [ "RIGHT_APPLICATION_TRAFFIC_READ" ] }, "unique_id": "01J4KRVMNQVRM6QXBXBMC5MYZ8" }
```

Over MQTT I observe a message with the `downlink_sent` field present:

```json
{ "end_device_ids": { "device_id": "tabs-pb-1204", "application_ids": { "application_id": "test-app2" }, "dev_eui": "58A0CB0000201204", "join_eui": "58A0CB0001500000", "dev_addr": "0000025B" }, "correlation_ids": [ "as:downlink:01J4KRV38PSFDKJ894QX7DQKJ9", "gs:tx_ack:01J4KRVMNPH5DFD378FRVJXNTK", "gs:uplink:01J4KRVFSDP9TN4NP8CQJ6SA70", "ns:transmission:01J4KRVG8QM07HSRJRN9K7EEFD", "rpc:/ttn.lorawan.v3.AppAs/DownlinkQueueReplace:01J4KRV38F0H3P4SQ46EJSGV0K", "rpc:/ttn.lorawan.v3.GsNs/HandleUplink:01J4KRVFSD2WE5ZQZQ2PMVS8J1", "rpc:/ttn.lorawan.v3.GsNs/ReportTxAcknowledgment:01J4KRVMNPM9ZGGRZP5H0E63K6", "rpc:/ttn.lorawan.v3.NsAs/HandleUplink:01J4KRVMNQ6Q0WDJM71GAVE5EY", "rpc:/ttn.lorawan.v3.NsGs/ScheduleDownlink:01J4KRVG8QX87X4BBJNPBCHT1A" ], "received_at": "2024-08-06T11:58:11.640005797Z", "downlink_sent": { "session_key_id": "AZEnjKMi/xCr50TAsJWeGQ==", "f_port": 1, "f_cnt": 2, "frm_payload": "r/4=", "correlation_ids": [ "as:downlink:01J4KRV38PSFDKJ894QX7DQKJ9", "gs:tx_ack:01J4KRVMNPH5DFD378FRVJXNTK", "gs:uplink:01J4KRVFSDP9TN4NP8CQJ6SA70", "ns:transmission:01J4KRVG8QM07HSRJRN9K7EEFD", "rpc:/ttn.lorawan.v3.AppAs/DownlinkQueueReplace:01J4KRV38F0H3P4SQ46EJSGV0K", "rpc:/ttn.lorawan.v3.GsNs/HandleUplink:01J4KRVFSD2WE5ZQZQ2PMVS8J1", "rpc:/ttn.lorawan.v3.GsNs/ReportTxAcknowledgment:01J4KRVMNPM9ZGGRZP5H0E63K6", "rpc:/ttn.lorawan.v3.NsGs/ScheduleDownlink:01J4KRVG8QX87X4BBJNPBCHT1A" ] } }
```

Per-Gateway diid count

So, next I connected two gateways and observed the behavior of the diid when scheduling downlinks. My observations are:

| DnFCnt | GW 1 diid | GW 2 diid |
| --- | --- | --- |
| `05` | 1 | - |
| `06` | 2 | - |
| `07` | 3 | - |
| `08` | - | 1 |
| | restart | |
| `09` | 1 | - |
| `0A` | 2 | - |
| `0B` | - | 2 |
| `0C` | - | 3 |
| `0D` | - | 4 |
| | restart | |
| `0E` | 1 | - |
| `0F` | 2 | - |
| `10` | - | 5 |

Here are the corresponding gateway logs:

```
GW 1
----
2024-08-06 12:18:12.352 [S2E:INFO] TX ::1 diid=1 [ant#0] - dntxed: 868.1MHz 16.0dBm ant#0(0) DR5 SF7/BW125 frame=605B0200008505000352FF00..71448A23 (21 bytes)
2024-08-06 12:18:25.890 [S2E:INFO] TX ::1 diid=2 [ant#0] - dntxed: 867.9MHz 16.0dBm ant#0(0) DR5 SF7/BW125 frame=605B0200008506000353FF00..3006D275 (17 bytes)
2024-08-06 12:18:31.008 [S2E:INFO] TX ::1 diid=3 [ant#0] - dntxed: 868.5MHz 16.0dBm ant#0(0) DR5 SF7/BW125 frame=605B020000800700016C845CA7827BC9 (16 bytes)
-> restart
2024-08-06 12:19:11.838 [S2E:INFO] TX ::1 diid=1 [ant#0] - dntxed: 868.1MHz 16.0dBm ant#0(0) DR5 SF7/BW125 frame=605B0200008509000354FF00..CBE8F5D2 (17 bytes)
2024-08-06 12:19:24.757 [S2E:INFO] TX ::1 diid=2 [ant#0] - dntxed: 868.5MHz 16.0dBm ant#0(0) DR5 SF7/BW125 frame=605B020000800A00016E5E12F99F1E84 (16 bytes)
-> restart
2024-08-06 12:24:30.750 [S2E:INFO] TX ::1 diid=1 [ant#0] - dntxed: 868.5MHz 16.0dBm ant#0(0) DR5 SF7/BW125 frame=605B020000800E000131F51DE4762751 (16 bytes)
2024-08-06 12:24:46.179 [S2E:INFO] TX ::1 diid=2 [ant#0] - dntxed: 867.1MHz 16.0dBm ant#0(0) DR5 SF7/BW125 frame=605B020000850F000355FF00..F1B9C780 (21 bytes)

GW 2
----
2024-08-06 12:18:44.564 [S2E:INFO] TX ::1 diid=1 [ant#0] - dntxed: 868.5MHz 16.0dBm ant#0(0) DR5 SF7/BW125 frame=605B02000080080001C12C1468C7210C (16 bytes)
2024-08-06 12:19:41.376 [S2E:INFO] TX ::1 diid=2 [ant#0] - dntxed: 867.9MHz 16.0dBm ant#0(0) DR5 SF7/BW125 frame=605B020000800B000187AB0751ED92A9 (16 bytes)
2024-08-06 12:23:53.488 [S2E:INFO] TX ::1 diid=3 [ant#0] - dntxed: 868.5MHz 16.0dBm ant#0(0) DR5 SF7/BW125 frame=605B020000800C0001771D397DC3F235 (16 bytes)
2024-08-06 12:24:09.933 [S2E:INFO] TX ::1 diid=4 [ant#0] - dntxed: 868.3MHz 16.0dBm ant#0(0) DR5 SF7/BW125 frame=605B020000800D00014ADB5964EB2FEE (16 bytes)
2024-08-06 12:25:06.533 [S2E:INFO] TX ::1 diid=5 [ant#0] - dntxed: 868.5MHz 16.0dBm ant#0(0) DR5 SF7/BW125 frame=605B0200008510000350FF00..86A9B119 (21 bytes)
```

Conclusion

Given the behavior of the diid across two different gateways, I conclude that the issue is likely fixed. The per-gateway token cache will not wrap around for a given gateway unless more than 16 downlinks are in flight for that gateway at a given time.

Thank you @KrishnaIyer. Looking forward to the August 26 release. I suggest we test again in the high-load production environment, as that is where the issue was triggered in the first place.

KrishnaIyer commented 2 months ago

Thanks a lot @beitler and I hope that the issue is fixed. I'll keep this issue open until we test and confirm this works in The Things Stack Cloud.

KrishnaIyer commented 2 months ago

@beitler: This is now deployed to The Things Stack Cloud. Please check and let us know if it works as intended.

beitler commented 2 months ago

From what we have seen so far it works as expected. Thanks!