TheThingsNetwork / lorawan-stack

The Things Stack, an Open Source LoRaWAN Network Server
https://www.thethingsindustries.com/stack/
Apache License 2.0

Retry downlink on Tx failure #18

Open johanstokking opened 5 years ago

johanstokking commented 5 years ago

Summary:

Retry downlink on Tx failure.

Why do we need this?

To perform best-effort delivery of (un)confirmed downlink.

What is already there? What do you see now?

The GS scheduler already detects and avoids downlink conflicts.

We already have retries at the end device to NS level for confirmed downlink.

What is missing? What do you want to see?

Regardless of our amazing GS downlink scheduler, the gateway may still send a Tx acknowledgment with an error code. In that case, even for unconfirmed downlink, the GS should try again.

How do you propose to implement this?

Once a downlink failure acknowledgment is received, unset all pending MAC state fields and:

  1. call updateDataDownlinkTask https://github.com/TheThingsNetwork/lorawan-stack/blob/36d9597c89dfab1ef96d56bdd0e508a518b875c2/pkg/networkserver/downlink.go#L138 for a data downlink failure. If possible, recreating MACState.QueuedResponses would be best
  2. recreate QueuedJoinAccept and add a downlink task https://github.com/TheThingsNetwork/lorawan-stack/blob/36d9597c89dfab1ef96d56bdd0e508a518b875c2/pkg/networkserver/grpc_gsns.go#L1185-L1193 for a join-accept failure
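
A minimal sketch of this handling, assuming placeholder types and a hypothetical task queue (none of these names are the actual lorawan-stack API):

```go
package retry

import (
	"context"
	"time"
)

// TxAckResult is a placeholder for the gateway's TX acknowledgment result.
type TxAckResult int

const (
	TxAckSuccess TxAckResult = iota
	TxAckFailure
)

// Device is a placeholder for the parts of the NS MAC state that matter here.
type Device struct {
	ID                         string
	PendingApplicationDownlink interface{}
	QueuedJoinAccept           interface{}
}

// TaskQueue is a placeholder for the NS downlink task queue.
type TaskQueue interface {
	Add(ctx context.Context, devID string, startAt time.Time, replace bool) error
}

// HandleTxAck unsets pending MAC state and re-adds a downlink task when the
// gateway reports a transmission failure, so that the downlink is retried.
func HandleTxAck(ctx context.Context, q TaskQueue, dev *Device, result TxAckResult) error {
	if result == TxAckSuccess {
		return nil
	}
	// The downlink is no longer in flight; clear pending state so the next
	// scheduling attempt regenerates the data downlink or join-accept.
	dev.PendingApplicationDownlink = nil
	dev.QueuedJoinAccept = nil
	// Add an immediate downlink task, replacing any existing task for this device.
	return q.Add(ctx, dev.ID, time.Now(), true)
}
```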

Original issue: https://github.com/TheThingsIndustries/lorawan-stack/issues/794 by @egourlao

johanstokking commented 5 years ago

Blocked by #76

nejraselimovic commented 3 years ago

@johanstokking @rvolosatovs the blocking issue (#76) was closed recently. How important is this issue? Can we schedule it for Q2 or Q3?

johanstokking commented 3 years ago

Indeed, this can now be implemented in NS.

johanstokking commented 3 years ago

What is missing are reliable metrics on how many TX failures are reported by gateways compared to successful transmissions. We also need to understand what the TX failures are, and whether we can already mitigate them in the Gateway Server.

In The Things Stack, the gateway is under control of the Gateway Server. We do not expect gateways to do any scheduling, conflict mitigation or resolution, duty cycle enforcement, etc. There are a few non-deterministic aspects, like timing (TOO_LATE), which we may also mitigate by adjusting the schedule downlink late setting and the RTT interpretation (take p90 instead of p50).

So, I'll get some data and report back here to see what we do with this issue.

johanstokking commented 3 years ago

Small update here: since https://github.com/TheThingsNetwork/lorawan-stack/pull/4506, NS will not try downlink to the same cluster through Packet Broker after Gateway Server failed to schedule. Currently, there's a relatively high amount of downlink errors in PB because of this.

So when we have 3.14.2 deployed on TTN and TTSC, we can start looking at PB downlink errors that are being reported.


For now, we can look at the proportion of the following events:

Do we have numbers on these events published in TTN and TTSC? @htdvisser ?

johanstokking commented 2 years ago

So when we have 3.14.2 deployed on TTN and TTSC, we can start looking at PB downlink errors that are being reported.

Here are some stats from Packet Broker from October:

[Screenshot: Packet Broker downlink error stats, October 2021]

Any scheduling errors, like too late, duty-cycle limits and conflicts, are currently marked as unknown errors.

What we could do is make these available as TxAcknowledgment messages from PBA to NS. That brings us back to this issue; retrying them.

@adriansmares what are your thoughts on this?

adriansmares commented 2 years ago

This is a multi-faceted issue, and we should first start by observing two types of failures:

  1. Reported by the Gateway Server:
    • These are scheduling conflicts as seen at schedule time (NsGs.ScheduleDownlink)
    • These are the ones that PB is currently aware of
    • These are already mitigated at Network Server level for the GS downlink path
    • Due to the nature of PB, PBA downlink paths basically always succeed
  2. Reported by the Gateway:
    • These are scheduling errors as reported by the gateway
    • We only know about failures from the UDP packet forwarder. MQTT has no TxAck at all, and BasicStation reports only success
    • These are the ones that currently only the GS is aware of - the PBA target in the GS has a no-op implementation of HandleAck
    • The Network Server right now doesn't do anything about these (this is what this issue is about)
    • We can handle these at Gateway Server level most of the time, and handling this better transitively fixes this for PB too
    • In my experience, they mostly look like configuration errors

Here are some stats, which are only related to failures from the UDP Packet Forwarder.

TTSC: [stats chart]

TTN: [stats chart]

These failures represent about 4% of the downlink traffic. We can now break this down by error:

  1. Beacon Collision
    • We can account for this in the Gateway Server. We basically should let the scheduler know that we have received a fine timestamp (tmms) from the gateway, and that beacon intervals should be skipped for scheduling (see the sketch after this list). This would basically force the NS to skip gateways which are GPS-enabled.
    • References https://github.com/TheThingsNetwork/lorawan-stack/issues/4853
  2. Collision Packet
    • This signals either a packet forwarder bug, or a scheduler bug. But given the design this shouldn't be possible
    • Needs reproduction steps, or at least finding out what that gateway is doing
  3. Too early
  4. Too late
    • Not that common
    • Already covered here
  5. TxFreq / TxPower / Unknown
    • Misconfiguration / literally unknown. Not much we can do.
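
For the beacon collision case, a hedged sketch of what "skip beacon intervals for scheduling" could look like (the constants follow the LoRaWAN Class B timing of a 128 s beacon period with 2.120 s reserved at its start; the function name and its integration into the GS scheduler are assumptions):

```go
package scheduling

import "time"

const (
	beaconPeriod   = 128 * time.Second
	beaconReserved = 2120 * time.Millisecond
)

// overlapsBeaconWindow reports whether a transmission [startGPS, startGPS+duration)
// overlaps the beacon-reserved window at the start of a beacon period. It ignores
// the 3 s beacon guard before the next period for brevity.
func overlapsBeaconWindow(startGPS time.Time, duration time.Duration) bool {
	gpsEpoch := time.Date(1980, time.January, 6, 0, 0, 0, 0, time.UTC)
	offset := startGPS.Sub(gpsEpoch) % beaconPeriod
	// Either the transmission starts inside the reserved window, or it runs
	// past the end of the period into the next reserved window.
	return offset < beaconReserved || offset+duration > beaconPeriod
}
```

The scheduler could run this check only for gateways from which a fine timestamp (tmms) has been seen, and reject the attempt so that the NS picks another path.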

What I think we should do:

  • NS right now reports DownlinkSent only on TxAck success. I think we should also send DownlinkFailed upstream on failure, with the reason
  • PBA could subscribe to downlink state changes and report these as ttnpb.TxAcknowledgement via GsNs.ReportTxAck. This would allow the NS to handle ack/nack via PB, if we decide to do this in the future. Also we can push those upstream which is always nice
  • GS PB upstream could send these transmission acknowledgements to PBA such that PBA could update the downlink status accurately. This would allow PB to know the state of the downlink not only at scheduling, but also at transmission time
  • I was thinking that perhaps PBA could block for a short period of time, such that the forwarding gateway server could report a downlink scheduling failure. This is a bit of mismatch between synchronous (GS RPC) vs asynchronous (PB in general) so I wouldn't give this too much thought

Overall, I'd say that we can improve reporting, but transmission failures at gateway level are very hard to debug for us, and unless we get detailed reports of failures we'll have to be reactive.

This issue is also a bit stuck on the evolution of the ecosystem: if Basic Station continues to experience a large growth in adoption, and it doesn't add scheduling failure reporting, we may as well close this issue.

johanstokking commented 2 years ago

2. Collision Packet

  • This signals either a packet forwarder bug, or a scheduler bug. But given the design this shouldn't be possible

Or a gateway with multiple network servers.

What I think we should do:

  • NS right now reports DownlinkSent only on TxAck success. I think we should also send DownlinkFailed upstream on failure, with the reason

Yes

  • PBA could subscribe to downlink state changes and report these as ttnpb.TxAcknowledgement via GsNs.ReportTxAck. This would allow the NS to handle ack/nack via PB, if we decide to do this in the future. Also we can push those upstream which is always nice

Yes.

We'd have to marshal the downlink message in the downlink token, or keep a cluster wide cache for the downlink message and only pass a correlation ID.

  • GS PB upstream could send these transmission acknowledgements to PBA such that PBA could update the downlink status accurately. This would allow PB to know the state of the downlink not only at scheduling, but also at transmission time

Yes. PB API already supports the error codes. PBA would need to hear back from GS whether a TX acknowledgment is expected, so that PBA stays silent when scheduling succeeds and the downlink message delivery state comes from the GS-reported TX acknowledgment. Or we need another delivery state for downlink: delivered (at PBA), scheduled (at GS), transmitted or failed (at GW).
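
A small sketch of what such a delivery state could look like (purely illustrative names, not an existing PB or TTS type):

```go
package downlink

// DeliveryState models the states discussed above: delivered (at PBA),
// scheduled (at GS), transmitted or failed (at GW).
type DeliveryState int

const (
	// DeliveredToForwarder: the message reached the forwarding cluster's PBA.
	DeliveredToForwarder DeliveryState = iota
	// ScheduledByGatewayServer: the GS accepted and scheduled the transmission.
	ScheduledByGatewayServer
	// TransmittedByGateway: the gateway reported a successful TX acknowledgment.
	TransmittedByGateway
	// FailedAtGateway: the gateway reported a TX failure.
	FailedAtGateway
)
```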

  • I was thinking that perhaps PBA could block for a short period of time, such that the forwarding gateway server could report a downlink scheduling failure. This is a bit of mismatch between synchronous (GS RPC) vs asynchronous (PB in general) so I wouldn't give this too much thought

The scheduling is already awaited and reported back. We shouldn't wait for the TX acknowledgment; we can do that asynchronously.

The blocking would be between NS and PBA though. Currently, PBA always reports a successful schedule response, whereas it should wait until there is a downlink message delivery state. We don't do that because Packet Broker is currently last in the downlink attempts, and because with the current API we can't synchronously respond to an asynchronous message which we can't deliver to a specific PBA instance. We may want to address this when we want to try multiple downlink paths through Packet Broker, a scenario that is quite relevant on TTSCE/TTNV2 at the moment.

Overall, I'd say that we can improve reporting, but transmission failures at gateway level are very hard to debug for us, and unless we get detailed reports of failures we'll have to be reactive.

This issue is also a bit stuck on the evolution of the ecosystem: if Basic Station continues to experience a large growth in adoption, and it doesn't add scheduling failure reporting, we may as well close this issue.

Yes, on the gateway error side this is definitely the case.


OK, let's revisit this later. Looking at the error rates and the potential reasons, there's no action to be taken here. If anything, let's come up with a retry mechanism for downlink scheduled through PB. Whether that's really pending a gateway roundtrip is to be defined; we can more easily start with scheduling conflicts reported by the fNS.

adriansmares commented 2 years ago

We'd have to marshal the downlink message in the downlink token, or keep a cluster wide cache for the downlink message and only pass a correlation ID.

We do this for cluster-local TxAcks in the NS - the downlink is saved based on the correlation ID with a short TTL. When the ack comes back, the downlink is matched based on the correlation ID. So as long as the correlation ID can be retrieved back from PB, it can be done.
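
A rough sketch of that correlation-ID matching, assuming a simple in-memory TTL cache (the actual NS uses its own storage; all names here are placeholders):

```go
package txack

import (
	"sync"
	"time"
)

// downlinkCache stores scheduled downlinks keyed by correlation ID for a short
// TTL, so that a later TxAck can be matched back to the message it refers to.
type downlinkCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]cacheEntry
}

type cacheEntry struct {
	payload   []byte
	expiresAt time.Time
}

func newDownlinkCache(ttl time.Duration) *downlinkCache {
	return &downlinkCache{ttl: ttl, entries: make(map[string]cacheEntry)}
}

// Store remembers a scheduled downlink under its correlation ID.
func (c *downlinkCache) Store(correlationID string, payload []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[correlationID] = cacheEntry{payload: payload, expiresAt: time.Now().Add(c.ttl)}
}

// Match returns the downlink for a TxAck's correlation ID, if still fresh, and
// removes it from the cache.
func (c *downlinkCache) Match(correlationID string) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[correlationID]
	delete(c.entries, correlationID)
	if !ok || time.Now().After(e.expiresAt) {
		return nil, false
	}
	return e.payload, true
}
```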

The blocking would be between NS and PBA though. Currently, PBA always reports a successful schedule response, whereas it should wait until there is a downlink message delivery state. We don't do that because Packet Broker is currently last in the downlink attempts, and because with the current API we can't synchronously respond to an asynchronous message which we can't deliver to a specific PBA instance. We may want to address this when we want to try multiple downlink paths through Packet Broker, a scenario that is quite relevant on TTSCE/TTNV2 at the moment.

:+1: