TheThingsNetwork / lorawan-stack

The Things Stack, an Open Source LoRaWAN Network Server
https://www.thethingsindustries.com/stack/
Apache License 2.0

Investigate Gateway Server long schedule anytime delays #6218

Open adriansmares opened 1 year ago

adriansmares commented 1 year ago

Summary

We are sometimes experiencing very long transmission delays when class C downlinks are scheduled without an absolute time. We should investigate why this happens, as the numbers are generally absurd (days to months, and sometimes even years), while the maximum delay should be a little over an hour (that's the duty-cycle limit).

Steps to Reproduce

Unknown at this time.

Current Result

Delays of days to months.

Expected Result

Delays of at most one hour, and even those should be extremely rare. A 1-hour delay means that the GS decided to schedule in the next possible window.

Relevant Logs

N/A.

URL

No response

Deployment

The Things Stack Cloud

The Things Stack Version

3.25.1

Client Name and Version

No response

Other Information

No response

Proposed Fix

No response


johanstokking commented 1 year ago

If the culprit is indeed that GS's ScheduleAnytime() schedules downlinks far in the future, then the cause could still be an already scheduled downlink that GS is scheduling after. As indicated offline, see https://github.com/TheThingsNetwork/lorawan-stack/blob/v3.25/pkg/gatewayserver/scheduling/scheduler.go#L438-L442.
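
A minimal sketch (not the actual scheduler code) of how "schedule after whatever is already queued" can drift arbitrarily far into the future; the types and names here are illustrative only:

```go
package main

import (
	"fmt"
	"time"
)

// toySchedule keeps only the end of the last queued emission, which is all
// that is needed to show the drift.
type toySchedule struct {
	lastEnd time.Time
}

// scheduleAnytime places the downlink at the earliest time that is both after
// "now" and after the last queued emission, a simplified version of the logic
// referenced above.
func (s *toySchedule) scheduleAnytime(now time.Time, airtime time.Duration) time.Time {
	start := now
	if s.lastEnd.After(start) {
		start = s.lastEnd
	}
	s.lastEnd = start.Add(airtime)
	return start
}

func main() {
	now := time.Now()
	// One downlink is already queued a month ahead; every subsequent
	// "anytime" downlink now lands even further out.
	s := &toySchedule{lastEnd: now.Add(30 * 24 * time.Hour)}
	t := s.scheduleAnytime(now, 3*time.Second)
	fmt.Printf("scheduled %v in the future\n", t.Sub(now))
}
```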

The logs in #6221 will flag the downlink scheduled far in the future, but if the reason is that there's already one far in the future, we really would like to see that earlier downlink message logged.

Roaming could be another cause. Absolute time class C scheduling is not possible via roaming, but normal class C scheduling is.

Few things:

  1. Is the RX1 delay bounds checked? Packet Broker doesn't do it. At a quick glance, PBA doesn't do it either (https://github.com/TheThingsNetwork/lorawan-stack/blob/v3.25/pkg/packetbrokeragent/translation.go#L712-L713), nor is rx1_delay validated against a maximum value (https://github.com/TheThingsNetwork/lorawan-stack/blob/v3.25/api/lorawan.proto#L508-L509). This means that an RX1 delay of 1 year may be possible. I'm not sure, but I don't see why it wouldn't be.
  2. Lots of class C "anytime" downlinks that cause multiple duty-cycle windows in the future to fill up? If our NS doesn't do it, that's ok, but roaming doesn't forbid this.
  3. Building on (2): in a 0.1% duty-cycled sub-band, there are 3.6 s of airtime per hour available. The largest SF12 message takes 2.79 s afaik, so at most 1 per hour. You "only" need over 8750 of these class C downlink messages queued to push scheduling a year into the future (a rough check is sketched below).
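
A back-of-the-envelope check of these 0.1% duty-cycle figures (a sketch added for clarity, not taken from the stack):

```go
package main

import "fmt"

func main() {
	const (
		dutyCycle      = 0.001  // 0.1% sub-band
		secondsPerHour = 3600.0 // -> 3.6 s of airtime per hour
		sf12Airtime    = 2.79   // largest SF12 message, per the comment above
		hoursPerYear   = 24 * 365
	)
	// Only one full-size SF12 message fits into the hourly budget.
	perHour := int((dutyCycle * secondsPerHour) / sf12Airtime)
	fmt.Printf("airtime budget per hour: %.1f s, full-size SF12 messages per hour: %d\n",
		dutyCycle*secondsPerHour, perHour)
	// At one message per hour, a year's worth of windows is ~8760 messages.
	fmt.Printf("messages needed to push scheduling ~1 year out: %d\n", perHour*hoursPerYear)
}
```
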
adriansmares commented 1 year ago
  1. nor is rx1_delay validated against a maximum value (https://github.com/TheThingsNetwork/lorawan-stack/blob/v3.25/api/lorawan.proto#L508-L509). This means that an RX1 delay of 1 year may be possible. I'm not sure, but I don't see why it wouldn't be.

We validate that only values defined in the enum are allowed, and the enum goes up to 15. The RPC between PBA and GS wouldn't be possible if you used an undefined RX1 delay value.

Lots of class C "anytime" downlinks that cause multiple duty-cycle windows in the future to fill up? If our NS doesn't do it, that's ok, but roaming doesn't forbid this

~~There are no class C downlinks via roaming today to trigger this. It is not possible to schedule class C downlinks via Packet Broker.~~ Edit: It is theoretically possible to use an old downlink path, via a token, to schedule a class C downlink via Packet Broker. It's a bit weird, but not impossible. I doubt this is the trigger, but ok, let's not disqualify this yet.

Building on (2): in a 0.1% duty-cycled sub-band, there are 3.6 s of airtime per hour available. The largest SF12 message takes 2.79 s afaik, so at most 1 per hour. You "only" need over 8750 of these class C downlink messages queued to push scheduling a year into the future.

Class C downlinks in EU868 (where this was observed in practice) are done in the 10% duty cycle subband (the RX2 frequency). This means that you would need 875000 downlinks in a short period of time, which is a lot of class C downlinks, even for FUOTA.
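
The same back-of-the-envelope computation for the 10% sub-band (same assumptions as above); it comes out even slightly above the ~875000 quoted here, so the point stands:

```go
package main

import "fmt"

func main() {
	const (
		dutyCycle      = 0.10 // RX2 sub-band in EU868
		secondsPerHour = 3600.0
		sf12Airtime    = 2.79
		hoursPerYear   = 24 * 365
	)
	perHour := int((dutyCycle * secondsPerHour) / sf12Airtime) // ~129 full-size downlinks per hour
	fmt.Println(perHour * hoursPerYear)                        // ~1.1 million downlinks for a 1-year delay
}
```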


I can agree that our lead here is that maybe the large delays are legitimate. I don't know how we would verify this hypothesis, but I can see that it could happen.

Should we limit the delay on the NS side, and on the GS side only log these large delays? Any other thoughts on how to track this issue?

johanstokking commented 1 year ago

Edit: It is theoretically possible to use an old downlink path, via a token, to schedule a class C downlink via Packet Broker. It's a bit weird, but not impossible. I doubt this is the trigger, but ok, let's not disqualify this yet.

This is how Class C downlink scheduling via Backend Interfaces works.

Class C downlinks in EU868 (where this was observed in practice) are done in the 10% duty cycle subband (the RX2 frequency). This means that you would need 875000 downlinks in a short period of time, which is a lot of class C downlinks, even for FUOTA.

There's nothing that prevents you from scheduling class C downlinks in another sub-band.

Should we limit the delay on the NS side, and on the GS side only log these large delays? Any other thoughts on how to track this issue?

I would start by logging long delays in GS. Anything beyond the duty-cycle window is info-worthy for now.
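
A minimal sketch of the kind of check being proposed (the one-hour threshold and the plain standard-library logging are assumptions for illustration, not the actual PR):

```go
package main

import (
	"log"
	"time"
)

// dutyCycleWindow is an assumed threshold: anything scheduled further out than
// the duty-cycle window (roughly an hour) is worth an info log for now.
const dutyCycleWindow = time.Hour

// logLongDelay logs downlinks whose scheduled transmission time is suspiciously
// far in the future.
func logLongDelay(now, scheduledAt time.Time, gatewayUID string) {
	if delay := scheduledAt.Sub(now); delay > dutyCycleWindow {
		log.Printf("gateway %s: downlink scheduled %v in the future (beyond the duty-cycle window)",
			gatewayUID, delay)
	}
}

func main() {
	now := time.Now()
	logLongDelay(now, now.Add(36*time.Hour), "example-gateway")
}
```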

adriansmares commented 1 year ago

There's nothing that prevents you from scheduling class C downlinks in another sub-band.

Unlikely, but correct.

I would start by logging long delays in GS. Anything beyond the duty-cycle window is info-worthy for now.

I have updated the PR.


I'm starting to wonder if allowing the delay to grow indefinitely is a good strategy. What happens now is that a single downlink path may absorb the downlinks of multiple end devices, and its delay can grow basically indefinitely, even if alternative paths exist. The fact that we could, by design, grow to delays of a day just because the signal from a single gateway is really good seems slightly bogus to me.

johanstokking commented 1 year ago

I'm starting to wonder if allowing the delay to grow indefinitely is a good strategy. What happens now is that a single downlink path may absorb the downlinks of multiple end devices, and its delay can grow basically indefinitely, even if alternative paths exist. The fact that we could, by design, grow to delays of a day just because the signal from a single gateway is really good seems slightly bogus to me.

Yes indeed, we should definitely return a "resource exhausted" error from GS if scheduling would happen too far in the future. I would find it acceptable to make that part of closing this issue. I think we can conclude that there are valid reasons to schedule very far into the future, even if they are unlikely. BTW, I'm not surprised that the reasons are "unlikely", since we don't see this behavior that often, right?
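
A sketch of the cap being discussed; the threshold and the use of gRPC status codes are assumptions for illustration (the stack has its own errors package):

```go
package main

import (
	"fmt"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// maxScheduleAhead is an assumed cap on how far into the future an "anytime"
// class C downlink may be scheduled before GS rejects it.
const maxScheduleAhead = time.Hour

// checkScheduleDelay refuses to schedule instead of silently queueing the
// downlink far into the future.
func checkScheduleDelay(now, scheduledAt time.Time) error {
	if delay := scheduledAt.Sub(now); delay > maxScheduleAhead {
		return status.Errorf(codes.ResourceExhausted,
			"downlink would be scheduled %v in the future", delay)
	}
	return nil
}

func main() {
	now := time.Now()
	fmt.Println(checkScheduleDelay(now, now.Add(2*time.Hour)))
}
```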

adriansmares commented 1 year ago

BTW, I'm not surprised that the reasons are "unlikely", since we don't see this behavior that often, right?

That's right - it seems to be very rare.

adriansmares commented 1 year ago

A symptom of this issue, based on the logs, is that the gateways periodically reconnect. We observe that these very long scheduling delays occur after the gateway has reconnected.

When a gateway reconnects, the concentrator timestamp synchronization is erased.

It is also the case that class C downlinks are generally done via the uplink tokens of old uplinks sent by the end device. These uplink tokens contain old timestamps which are most likely too stale to be used for scheduling, but currently the GS will just use these timestamps in the absence of any other synchronization: https://github.com/TheThingsNetwork/lorawan-stack/blob/1c4220770c7889eab39b70c0281d19dce4a2ad85/pkg/gatewayserver/scheduling/scheduler.go#L256-L268
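
A simplified illustration (not the actual scheduler code) of the fallback described here: when the gateway clock is not synced, the timestamps carried inside the uplink token are used as-is, even though they may belong to a previous gateway session:

```go
package main

import (
	"fmt"
	"time"
)

// clock is a stand-in for the scheduler's notion of gateway time.
type clock struct {
	synced     bool
	serverTime time.Time
	timestamp  uint32 // concentrator timestamp observed at serverTime
}

// uplinkToken carries the timestamps of the uplink it was minted for.
type uplinkToken struct {
	serverTime time.Time
	timestamp  uint32
}

// syncFromTokenIfUnsynced prefers the live clock; otherwise it falls back to
// whatever the uplink token carries, with no way to tell whether that state is
// stale.
func syncFromTokenIfUnsynced(c *clock, tok uplinkToken) {
	if c.synced {
		return
	}
	c.serverTime, c.timestamp, c.synced = tok.serverTime, tok.timestamp, true
}

func main() {
	c := &clock{} // the gateway just reconnected: no synchronization yet
	stale := uplinkToken{serverTime: time.Now().Add(-6 * time.Hour), timestamp: 123456}
	syncFromTokenIfUnsynced(c, stale)
	fmt.Printf("scheduling against a clock reference that is %v old\n",
		time.Since(c.serverTime).Round(time.Minute))
}
```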

This mechanism of transferring timestamps via the uplink token exists in order to facilitate situations in which the uplink flow is running on a different Gateway Server instance than the one which has claimed the downlink flow. The problem is that it is hard to establish whether the timestamps themselves are stale, or even whether they are valid (there is no signature/encryption mechanism inside a cluster - we only encrypt the tokens that pass through roaming).

These timestamps (the 32-bit concentrator timestamp, and the 64-bit concentrator time which accounts for rollovers) are valid only within a single 'session' - if the gateway reconnects or restarts, the old timestamps are invalid and irrelevant.

Time-boxing the tokens is not really watertight, and it imposes more limitations than it solves. At the same time, we shouldn't do extra networking on the hot path of the GS (storing the clock state on uplink and retrieving it on downlink). A more resilient solution is to store the clock information in a shared cache, along with a subscription system.

We can save + publish the current clock state to a cache, asynchronously, and other instances will retrieve + subscribe when the gateway connects.


Implementation details:

  1. Remove the server_time, concentrator_time and gateway_time from the UplinkToken message.
  2. Create a GatewayClockPubSub with the following interface:

     // GatewayClockPubSub allows PubSub access to the gateway concentrator clock updates.
     type GatewayClockPubSub interface {
         // Publish stores the latest known clock state and publishes it to subscribers.
         Publish(
             ctx context.Context,
             ids *ttnpb.GatewayIdentifiers,
             timestamp uint32,
             concentratorTime scheduling.ConcentratorTime,
             serverTime time.Time,
             gatewayTime *time.Time,
         ) error
         // Subscribe retrieves the last known clock state and subscribes to state updates.
         Subscribe(
             ctx context.Context,
             ids *ttnpb.GatewayIdentifiers,
             f func(
                 ctx context.Context,
                 timestamp uint32,
                 concentratorTime scheduling.ConcentratorTime,
                 serverTime time.Time,
                 gatewayTime *time.Time,
             ) error,
         ) error
     }
  3. Add a new channel and background task, similar to how the gateway connection stats work, which stores and publishes the clock state. Updates should be published only if the uplink stream is active and the downlink stream is inactive (to avoid loops).
  4. When the gateway connects, another background task retrieves the last known state from the store and subscribes to updates. Updates should be subscribed to only if the downlink stream is active and the uplink stream is inactive (to avoid loops).

Only the UDP frontend needs to participate in this protocol. The stored last known state needs to have a TTL attached to it, in order to avoid storing the state forever.
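
A rough sketch of how the background tasks in steps (3) and (4) could gate publishing and subscribing; the connection and clockState types, and the clockPubSub interface, are simplified stand-ins for illustration, not the proposed API:

```go
package gatewayclock

import (
	"context"
	"time"
)

// clockState is a simplified bundle of the values listed in the proposal.
type clockState struct {
	timestamp        uint32
	concentratorTime int64
	serverTime       time.Time
	gatewayTime      *time.Time
}

// connection is a stand-in for the gateway connection held by a GS instance.
type connection interface {
	UplinkActive() bool
	DownlinkActive() bool
	ClockUpdates() <-chan clockState // fed by the uplink stream
	ApplyClock(clockState)           // resynchronizes the scheduler clock
}

// clockPubSub mirrors the shape of the proposed GatewayClockPubSub.
type clockPubSub interface {
	Publish(ctx context.Context, gatewayUID string, st clockState) error
	Subscribe(ctx context.Context, gatewayUID string, f func(context.Context, clockState) error) error
}

// publishClock runs next to the uplink stream: it forwards clock updates to
// the shared cache, but only while this instance is not also serving the
// downlink stream, which avoids publish/subscribe loops.
func publishClock(ctx context.Context, conn connection, ps clockPubSub, uid string) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case st := <-conn.ClockUpdates():
			if !conn.UplinkActive() || conn.DownlinkActive() {
				continue
			}
			if err := ps.Publish(ctx, uid, st); err != nil {
				return err
			}
		}
	}
}

// subscribeClock runs on the instance that serves only the downlink stream: it
// restores the last known state and keeps applying subsequent updates to the
// scheduler clock.
func subscribeClock(ctx context.Context, conn connection, ps clockPubSub, uid string) error {
	if !conn.DownlinkActive() || conn.UplinkActive() {
		return nil
	}
	return ps.Subscribe(ctx, uid, func(ctx context.Context, st clockState) error {
		conn.ApplyClock(st)
		return nil
	})
}
```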


Any thoughts and/or comments on this solution?

johanstokking commented 1 year ago

Sounds good.

KrishnaIyer commented 1 year ago

Proposal looks good. From how I read this, no two instances will share this state, right? The instance with the "connection" (upstream socket) writes to the cache when disconnecting, and the instance to which the gateway next "connects" reads the state. If this understanding is right, I prefer Save/Restore semantics instead of Pub/Sub, since the latter seems to indicate concurrent subscribers.

adriansmares commented 1 year ago

If this understanding is right, I prefer Save/Restore semantics instead of Pub/Sub, since the latter seems to indicate concurrent subscribers.

The idea is that the instance which serves the downlink stream really does subscribe to clock updates, as it does not have access to the timestamps at all - all of these come via the uplink stream. As such, you're not just saving and restoring these values once, but rather receiving their updates continuously on the downlink stream.
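
A tiny illustration of the distinction being made: a one-shot Save/Restore would recover the state only once, whereas the Subscribe callback keeps firing on every clock update for as long as the downlink stream is up (the types here are simplified stand-ins, not the proposed interface):

```go
package clocksync

import "context"

// state is a placeholder for the gateway clock values.
type state struct{ timestamp uint32 }

// store exposes both styles, to contrast them.
type store interface {
	Restore(ctx context.Context, uid string) (state, error)                                // one-shot
	Subscribe(ctx context.Context, uid string, f func(context.Context, state) error) error // continuous
}

// trackClock keeps the downlink-side scheduler in sync for the lifetime of the
// subscription, which is what a single Restore cannot provide.
func trackClock(ctx context.Context, s store, uid string, apply func(state)) error {
	return s.Subscribe(ctx, uid, func(ctx context.Context, st state) error {
		apply(st)
		return nil
	})
}
```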