Open johanstokking opened 5 years ago
Blocked by #76
@johanstokking @rvolosatovs the blocking issue (#76) was closed recently. How important is this issue? Can we schedule it for Q2 or Q3?
Indeed, this can now be implemented in NS.
Missing details are actual reliable metrics on how many TX failures are reported by gateways compared to successful transmissions. We also need to understand what the TX failures are, and whether we can already mitigate them in the Gateway Server.
In The Things Stack, the gateway is under the control of the Gateway Server. We do not expect gateways to do any scheduling, conflict mitigation or resolution, duty cycle enforcement, etc. There are a few non-deterministic aspects, like timing (`TOO_LATE`), which we may also mitigate by adjusting the schedule-downlink-late setting and the RTT interpretation (take p90 instead of p50).
So, I'll get some data and report back here to see what we do with this issue.
Small update here: since https://github.com/TheThingsNetwork/lorawan-stack/pull/4506, NS will not try downlink to the same cluster through Packet Broker after Gateway Server failed to schedule. Currently, there's a relatively high amount of downlink errors in PB because of this.
So when we have 3.14.2 deployed on TTN and TTSC, we can start looking at PB downlink errors that are being reported.
For now, we can look at the proportions of the following events:
- `ns.down.data.schedule.success` vs `ns.down.data.schedule.fail`
- `gs.tx.success` vs `gs.tx.fail`
Do we have numbers on these events published in TTN and TTSC? @htdvisser ?
> So when we have 3.14.2 deployed on TTN and TTSC, we can start looking at PB downlink errors that are being reported.
Here are some stats from Packet Broker from October:
Any scheduling errors, like too late, duty-cycle limits and conflicts, are currently marked as unknown errors.
What we could do is make these available as `TxAcknowledgment` messages from PBA to NS. That brings us back to this issue: retrying them.
@adriansmares what are your thoughts on this?
This is a multi-faceted issue, and we should first start by observing two types of failures: scheduling failures (`NsGs.ScheduleDownlink`) and TX acknowledgment failures (`HandleAck`).
Here are some stats, related only to failures from the UDP Packet Forwarder.
TTSC:
TTN:
These failures represent about 4% of the downlink traffic. We can break this down by error:
(`tmms`) from the gateway, and that beacon intervals should be skipped for scheduling. This would basically force the NS to skip gateways which are GPS enabled.

What I think we should do:
- NS right now reports `DownlinkSent` only on TxAck success. I think we should also send `DownlinkFailed` upstream on failure, with the reason.
- PBA could subscribe to downlink state changes and report these as `ttnpb.TxAcknowledgement` via `GsNs.ReportTxAck`. This would allow the NS to handle ack/nack via PB, if we decide to do this in the future. Also we can push those upstream, which is always nice.
- GS PB upstream could send these transmission acknowledgements to PBA such that PBA could update the downlink status accurately. This would allow PB to know the state of the downlink not only at scheduling, but also at transmission time.
- I was thinking that perhaps PBA could block for a short period of time, such that the forwarding gateway server could report a downlink scheduling failure. This is a bit of a mismatch between synchronous (GS RPC) and asynchronous (PB in general), so I wouldn't give this too much thought.

Overall, I'd say that we can improve reporting, but transmission failures at gateway level are very hard to debug for us, and unless we get detailed reports of failures we'll have to be reactive.
This issue is also a bit stuck on the evolution of the ecosystem: if Basic Station continues to experience a large growth in adoption, and it doesn't add scheduling failure reporting, we may as well close this issue.
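The `DownlinkSent`/`DownlinkFailed` proposal above could look roughly like this. The type and message names are illustrative stand-ins, not the actual `ttnpb` definitions:

```go
package main

import "fmt"

// TxAckResult loosely mirrors the result carried in a TX acknowledgment.
// These names are illustrative, not the actual ttnpb enum.
type TxAckResult int

const (
	TxAckSuccess TxAckResult = iota
	TxAckTooLate
	TxAckCollision
	TxAckDutyCycle
)

// upstreamMessage sketches the proposed behavior: DownlinkSent only on a
// successful TX acknowledgment, DownlinkFailed with the reason otherwise.
func upstreamMessage(res TxAckResult) string {
	switch res {
	case TxAckSuccess:
		return "DownlinkSent"
	case TxAckTooLate:
		return "DownlinkFailed: too late"
	case TxAckCollision:
		return "DownlinkFailed: collision"
	case TxAckDutyCycle:
		return "DownlinkFailed: duty cycle"
	default:
		return "DownlinkFailed: unknown"
	}
}

func main() {
	for _, res := range []TxAckResult{TxAckSuccess, TxAckTooLate} {
		fmt.Println(upstreamMessage(res))
	}
}
```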
> 2. Collision packet - This signals either a packet forwarder bug or a scheduler bug. But given the design this shouldn't be possible
Or a gateway with multiple network servers.
What I think we should do:
> - NS right now reports `DownlinkSent` only on TxAck success. I think we should also send `DownlinkFailed` upstream on failure, with the reason
Yes
> - PBA could subscribe to downlink state changes and report these as `ttnpb.TxAcknowledgement` via `GsNs.ReportTxAck`. This would allow the NS to handle ack/nack via PB, if we decide to do this in the future. Also we can push those upstream, which is always nice
Yes.
We'd have to marshal the downlink message in the downlink token, or keep a cluster wide cache for the downlink message and only pass a correlation ID.
> - GS PB upstream could send these transmission acknowledgements to PBA such that PBA could update the downlink status accurately. This would allow PB to know the state of the downlink not only at scheduling, but also at transmission time
Yes. The PB API already supports the error codes. PBA would need to hear back from GS whether a TX acknowledgment is expected, so that PBA keeps silent when scheduling succeeds and the downlink message delivery state comes from the GS-reported TX acknowledgment. Or we need additional delivery states for downlink: delivered (at PBA), scheduled (at GS), transmitted or failed (at the gateway).
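The extra delivery states suggested above could be modeled as a small forward-only state machine; the names are illustrative, not the Packet Broker API:

```go
package main

import "fmt"

// DeliveryState sketches the downlink delivery states mentioned above:
// delivered (at PBA), scheduled (at GS), transmitted or failed (at the
// gateway). Illustrative names only.
type DeliveryState int

const (
	Delivered   DeliveryState = iota // accepted by PBA
	Scheduled                        // accepted by the GS scheduler
	Transmitted                      // TX acknowledgment reported success
	Failed                           // TX acknowledgment reported an error
)

// canTransition encodes the forward-only progression of a downlink.
func canTransition(from, to DeliveryState) bool {
	switch from {
	case Delivered:
		return to == Scheduled || to == Failed
	case Scheduled:
		return to == Transmitted || to == Failed
	default:
		return false // Transmitted and Failed are terminal
	}
}

func main() {
	fmt.Println(canTransition(Delivered, Scheduled))   // forward step
	fmt.Println(canTransition(Transmitted, Scheduled)) // terminal state
}
```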
> - I was thinking that perhaps PBA could block for a short period of time, such that the forwarding gateway server could report a downlink scheduling failure. This is a bit of a mismatch between synchronous (GS RPC) and asynchronous (PB in general), so I wouldn't give this too much thought
The scheduling is already awaited and reported back. We shouldn't wait for the TX acknowledgment, we can do that asynchronously.
The blocking would be between NS and PBA, though. Currently, PBA always reports a successful schedule response, whereas PBA should wait until there is a downlink message delivery state. We don't do that because Packet Broker is currently last in the downlink attempts, and because with the current API we can't synchronously respond to an asynchronous message that we can't deliver to a specific PBA instance. We may want to address this when we want to try multiple downlink paths through Packet Broker, a scenario that is quite relevant on TTSCE/TTNV2 at the moment.
> Overall, I'd say that we can improve reporting, but transmission failures at gateway level are very hard to debug for us, and unless we get detail reports of failures we'll have to be reactive.
> This issue is also a bit stuck on the evolution of the ecosystem: if Basic Station continues to experience a large growth in adoption, and it doesn't add scheduling failure reporting, we may as well close this issue.
Yes, on the gateway error side this is definitely the case.
OK, let's revisit this later. Looking at the error rates and the potential reasons, there's no action to be taken here. If anything, let's come up with a retry mechanism for downlink scheduled through PB. Whether that's really pending a gateway roundtrip is to be defined; we can more easily start with scheduling conflicts reported by the fNS.
> We'd have to marshal the downlink message in the downlink token, or keep a cluster-wide cache for the downlink message and only pass a correlation ID.
We do this for cluster-local TxAcks in the NS - the downlink is saved under the correlation ID with a short TTL. When the ack comes back, the downlink is matched on the correlation ID. So as long as the correlation ID can be retrieved back from PB, it can be done.
> The blocking would be between NS and PBA though. Currently, PBA always reports a successful schedule response, whereas PBA should have to wait until there is a downlink message delivery state. We don't do that because Packet Broker is currently last in the downlink attempts and because with current API we can't synchronously respond to an asynchronous message which we can't deliver to a specific PBA instance. We may want to address this when we want to try multiple downlink paths through Packet Broker, a scenario that is quite relevant on TTSCE/TTNV2 at the moment.
:+1:
Summary:
Retry downlink on Tx failure.
Why do we need this?
To perform best-effort delivery of (un)confirmed downlink.
What is already there? What do you see now?
The GS scheduler already detects and avoids downlink conflicts.
We already have a retry on the end device to NS level for confirmed downlink.
What is missing? What do you want to see?
Regardless of our amazing GS downlink scheduler, the gateway may still send a Tx acknowledgment with error code. In that case, even for unconfirmed downlink, the GS should try again.
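The retry described above can be sketched as follows, with hypothetical types standing in for the real MAC state and downlink task queue:

```go
package main

import (
	"fmt"
	"time"
)

// macState and downlinkTask are hypothetical stand-ins, not the actual
// ttnpb.MACState or the NS task queue.
type macState struct {
	PendingApplicationDownlink *string
	PendingRequests            []string
}

type downlinkTask struct {
	deviceID string
	runAt    time.Time
}

// handleTxAck sketches the proposal: on a failed TX acknowledgment, the
// pending MAC state fields are unset and a new downlink attempt is
// queued; on success nothing is retried.
func handleTxAck(st *macState, deviceID string, ackOK bool) *downlinkTask {
	if ackOK {
		return nil // transmission succeeded, nothing to retry
	}
	// Unset all pending MAC state fields...
	st.PendingApplicationDownlink = nil
	st.PendingRequests = nil
	// ...and schedule a fresh downlink attempt.
	return &downlinkTask{deviceID: deviceID, runAt: time.Now()}
}

func main() {
	down := "unconfirmed downlink"
	st := &macState{PendingApplicationDownlink: &down}
	if task := handleTxAck(st, "dev-1", false); task != nil {
		fmt.Println("retrying downlink for", task.deviceID)
	}
}
```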
How do you propose to implement this?
Once a downlink fail acknowledgement is received, unset all pending MAC state fields and:
- update the data downlink task (https://github.com/TheThingsNetwork/lorawan-stack/blob/36d9597c89dfab1ef96d56bdd0e508a518b875c2/pkg/networkserver/downlink.go#L138) for data downlink fail. If possible, recreating `MACState.QueuedResponses` would be best
- unset `QueuedJoinAccept` and add a downlink task (https://github.com/TheThingsNetwork/lorawan-stack/blob/36d9597c89dfab1ef96d56bdd0e508a518b875c2/pkg/networkserver/grpc_gsns.go#L1185-L1193) for join accept fail

Original issue: https://github.com/TheThingsIndustries/lorawan-stack/issues/794 by @egourlao