Your description of the problem seems valid. I think the difference is that in the legacy UDP packet forwarder we store downlink tokens per gateway: https://github.com/TheThingsNetwork/lorawan-stack/blob/5fdde221469e2af60c1513eb927632062207f9df/pkg/gatewayserver/io/udp/udp.go#L626
And for LBS we do this for the entire instance. https://github.com/TheThingsNetwork/lorawan-stack/blob/5fdde221469e2af60c1513eb927632062207f9df/pkg/gatewayserver/io/semtechws/lbslns/format.go#L33
And 16 may not be sufficient. Thoughts @johanstokking?
`io.DownlinkTokens` is designed for use per gateway. By the way, we also use it in other frontends, per gateway connection.
This would indeed be the obvious cause and would be good to fix.
@beitler: https://github.com/TheThingsNetwork/lorawan-stack/tree/fix/7182-lbs-tx-ack You can use this branch to test a fix.
Thanks @KrishnaIyer. Could you please upgrade semtech.eu1.cloud.thethings.industries to that branch or create a temporary tenant with that branch? Then we can run some tests.
`semtech.eu1.cloud.thethings.industries` is the production environment and will be updated on the 26th of August. You can either build this from source, or I can give you a Docker image when this PR gets merged.
Hi @KrishnaIyer,
I did a test using a snapshot build off of the branch you provided:
The analysis confirms that the expected txack events are properly generated in my test environment. ⚠ However, this does not confirm that the bug is fixed. Under a low-load scenario (as is the case with my test server), the current implementation would also work, because the global token cache would not wrap around so fast given the low `diid` count.

So, next I connected two gateways and observed the behavior of the `diid` when scheduling downlinks. My observations are:

- `diid` starts from 1 after a gateway re-connect
- `diid` increments by 1 for a given gateway, even when downlinks take turns in being scheduled across both gateways

Given the behavior of the `diid` across two different gateways, I conclude that the issue is likely fixed. The per-gateway token cache will likely not wrap around for a given gateway, unless more than 16 downlinks are in flight at a given time.
Thank you @KrishnaIyer. Looking forward to the August 26th release. I suggest we test again in the high-load production environment, as that's where the issue was provoked in the first place.
Thanks a lot @beitler and I hope that the issue is fixed. I'll keep this issue open until we test and confirm this works in The Things Stack Cloud.
@beitler: This is now deployed to The Things Stack Cloud. Please check and let us know if it works as intended.
From what we have seen so far it works as expected. Thanks!
Summary
When a downlink gets successfully transmitted over a LoRa Basics Station gateway (confirmed by seeing `dntxed` log entries on the gateway), the `ns.down.transmission.success` event and the `down/sent` message on MQTT are not emitted.

Steps to Reproduce
Current Result
When the downlink is successfully transmitted over a packet forwarder gateway (steps 1 and 2), we observe the `ns.down.transmission.success` event in the console and the `down/sent` message on MQTT. When the downlink is successfully transmitted over a LoRa Basics Station gateway (steps 3 and 4), we observe neither the `ns.down.transmission.success` event in the console nor the `down/sent` message on MQTT.

Expected Result
We expect the `ns.down.transmission.success` event in the console and the `down/sent` message on MQTT on a successful transmission, irrespective of whether the transmission happened over a LoRa Basics Station or a packet forwarder gateway.

Relevant Logs
URL
No response
Deployment
The Things Stack Cloud
The Things Stack Version
3.31.0
Client Name and Version
No response
Other Information
Looking at the `gs.txack.receive` event in the gateway console, we observe that in the case of LoRa Basics Station, the `downlink_message` field is missing.

This likely means that in `ToTxAck`, where the `msgtype:dntxed` message is correlated with the downlink, `txAck.DownlinkMessage` is not set:

https://github.com/TheThingsNetwork/lorawan-stack/blob/5fdde221469e2af60c1513eb927632062207f9df/pkg/gatewayserver/io/semtechws/lbslns/upstream.go#L484-L495
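For illustration, here is a simplified, self-contained model of that correlation step (this is not the actual lorawan-stack code; the map-based cache, `lookup`, and the string placeholder for the downlink are stand-ins): when the lookup by `diid` fails, the acknowledgment is returned without the downlink message, which matches the missing `downlink_message` field we observe.

```go
package main

import "fmt"

// txConfirmation models the dntxed message from LoRa Basics Station: it only
// carries the diid that was handed out when the downlink was scheduled.
type txConfirmation struct {
	diid int64
}

// txAck models the acknowledgment sent upstream; without the correlated
// downlink, the ns.down.transmission.success event and the down/sent MQTT
// message cannot be derived from it.
type txAck struct {
	downlinkMessage string // placeholder for the original downlink
}

// lookup is a stand-in for the token cache lookup (tokens.Get in the real
// code): it fails when the cache no longer holds an entry for this diid.
func lookup(cache map[uint16]string, diid int64) (string, bool) {
	msg, ok := cache[uint16(diid)]
	return msg, ok
}

// toTxAck correlates a TX confirmation with the originally scheduled downlink.
func toTxAck(cache map[uint16]string, conf txConfirmation) txAck {
	var ack txAck
	if msg, ok := lookup(cache, conf.diid); ok {
		ack.downlinkMessage = msg // correlation succeeded
	}
	// If the lookup fails, downlinkMessage stays empty: the acknowledgment is
	// emitted without the downlink, as seen in the gs.txack.receive event.
	return ack
}

func main() {
	cache := map[uint16]string{42: "downlink scheduled with diid 42"}
	fmt.Printf("%+v\n", toTxAck(cache, txConfirmation{diid: 42})) // correlated
	fmt.Printf("%+v\n", toTxAck(cache, txConfirmation{diid: 43})) // downlink_message missing
}
```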
Which means that `tokens.Get()` fails to correlate the downlink from the `diid`.

A possible explanation for why the downlink cannot be retrieved from the `io.DownlinkTokens` cache is its small size relative to how fast the `diid` advances over time. We observed that scheduling two subsequent downlinks over the same LoRa Basics Station gateway yields `diid` values with quite large differences (in our example 40752 -> 40868: 116 `diid`s over just a few seconds).

The `diid` (the key into the token cache) is advanced by 1 for each downlink: https://github.com/TheThingsNetwork/lorawan-stack/blob/5fdde221469e2af60c1513eb927632062207f9df/pkg/gatewayserver/io/downlink_token.go#L43-L53 but clipped down to a number between 0 and 16: https://github.com/TheThingsNetwork/lorawan-stack/blob/5fdde221469e2af60c1513eb927632062207f9df/pkg/gatewayserver/io/downlink_token.go#L27

Now, when the downlink is retrieved from the `diid`, a wrong downlink will be taken from the cache. This is because the `diid` advances by ~100 within ~5 seconds, so by the time the `dntxed` event is received from the gateway, the token cache has wrapped around multiple times and the check of the `diid` against the `key` (`item.key != token`) will fail: https://github.com/TheThingsNetwork/lorawan-stack/blob/5fdde221469e2af60c1513eb927632062207f9df/pkg/gatewayserver/io/downlink_token.go#L57-L64

The choice of the DownlinkToken cache size implies that it was likely intended as a per-gateway cache, as is the case in the packet forwarder backend implementation. However, in the case of the LoRa Basics Station backend, the token cache seems to exist just once per Gateway Server: https://github.com/TheThingsNetwork/lorawan-stack/blob/5fdde221469e2af60c1513eb927632062207f9df/pkg/gatewayserver/gatewayserver.go#L303
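To make the wrap-around argument concrete, here is a small self-contained simulation (a simplified model with assumed semantics, not the real `io.DownlinkTokens`): with a single 16-slot cache shared by all gateways, roughly 116 other downlinks scheduled before the `dntxed` arrives are enough to overwrite the slot, so the key check fails.

```go
package main

import "fmt"

const ringSize = 16 // assumed size of the shared token cache

type item struct {
	key uint16
	msg string // placeholder for the scheduled downlink
}

type tokens struct {
	last  uint16
	items [ringSize]item
}

// next stores a downlink and returns its token (the diid in the LBS case).
func (t *tokens) next(msg string) uint16 {
	t.last++
	token := t.last
	t.items[token%ringSize] = item{key: token, msg: msg}
	return token
}

// get retrieves the downlink for a token; it fails if the slot has since been
// overwritten by a newer token that maps to the same index.
func (t *tokens) get(token uint16) (string, bool) {
	it := t.items[token%ringSize]
	if it.key != token {
		return "", false
	}
	return it.msg, true
}

func main() {
	var t tokens
	diid := t.next("downlink for gateway A")

	// With one Gateway Server wide cache, downlinks for all other gateways
	// keep advancing the token; ~116 of them arrive before gateway A reports
	// dntxed, as observed in the diid gap 40752 -> 40868.
	for i := 0; i < 116; i++ {
		t.next(fmt.Sprintf("downlink for another gateway #%d", i))
	}

	_, ok := t.get(diid)
	fmt.Println("correlation succeeded:", ok) // false: the cache wrapped around
}
```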
This seems to explain the observed behavior. What do you think?
Proposed Fix
Either extend the existing global Downlink Token cache to a size large enough to hold all downlinks that are pending acknowledgement globally, or move the Downlink Token cache into a per-gateway state object.
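A rough sketch of the second option, under assumed names (this is not the actual change in the linked fix branch): keep one token cache per gateway connection, so that a gateway's tokens can only be overwritten by its own in-flight downlinks.

```go
package main

import (
	"fmt"
	"sync"
)

// tokenRing is a minimal stand-in for io.DownlinkTokens.
type tokenRing struct {
	last  uint16
	items [16]struct {
		key uint16
		msg string
	}
}

// perGatewayTokens keeps one small token ring per connected gateway instead of
// a single Gateway Server wide ring (all names here are hypothetical).
type perGatewayTokens struct {
	mu    sync.Mutex
	rings map[string]*tokenRing // keyed by gateway UID
}

// forGateway returns the token ring for a gateway, creating it on first use;
// in practice this state would live on the per-gateway connection object.
func (p *perGatewayTokens) forGateway(uid string) *tokenRing {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.rings == nil {
		p.rings = make(map[string]*tokenRing)
	}
	ring, ok := p.rings[uid]
	if !ok {
		ring = &tokenRing{}
		p.rings[uid] = ring
	}
	return ring
}

func main() {
	var p perGatewayTokens
	a := p.forGateway("eui-aa00000000000000")
	b := p.forGateway("eui-bb00000000000000")
	fmt.Println("gateways get separate rings:", a != b) // true
}
```

This mirrors what the legacy UDP packet forwarder frontend already does with its per-gateway token storage, and it avoids having to guess a sufficient size for a shared global cache.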