TheThingsNetwork / lorawan-stack

The Things Stack, an Open Source LoRaWAN Network Server
https://www.thethingsindustries.com/stack/
Apache License 2.0

last_downlink_at and last_network_initiated_downlink_at get set in the future. #4990

Closed. onizmx closed this issue 2 years ago.

onizmx commented 2 years ago

Summary

Multicast downlinks get queued but are not consumed. When this happens, we see both of these fields on the multicast end device set in the future, as below:

"last_network_initiated_downlink_at": "2022-06-10T22:52:30.945586212Z" "last_downlink_at": "2022-06-10T22:52:30.945586212Z"

The frame counters do not match either: "last_n_f_cnt_down": 923122, "last_a_f_cnt_down": 923246

As a quick fix, we manually clear the queue and delete / recreate the device. ...
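
For reference, a minimal detection sketch in Go (a hypothetical helper, not part of the stack; the field names are taken from the device JSON shown above and may differ depending on how the devices are exported) that flags devices whose last_downlink_at lies in the future, given exported device JSON on stdin:

package main

import (
	"encoding/json"
	"fmt"
	"os"
	"time"
)

// device mirrors only the fields this sketch cares about.
type device struct {
	IDs struct {
		DeviceID string `json:"device_id"`
	} `json:"ids"`
	LastDownlinkAt *time.Time `json:"last_downlink_at"`
}

func main() {
	var devices []device
	if err := json.NewDecoder(os.Stdin).Decode(&devices); err != nil {
		panic(err)
	}
	now := time.Now()
	for _, d := range devices {
		if d.LastDownlinkAt != nil && d.LastDownlinkAt.After(now) {
			fmt.Printf("%s: last_downlink_at %s is in the future\n",
				d.IDs.DeviceID, d.LastDownlinkAt.Format(time.RFC3339))
		}
	}
}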

Steps to Reproduce

Not sure about the exact conditions. This has been happening about 2-3 times a week since upgrading to v3.16.0.

What do you see now?

These are all different devices.

Case 1: Happened around 2021-12-10T11:49:12Z "last_network_initiated_downlink_at": "2022-06-03T05:47:55.916153273Z", "last_downlink_at": "2022-06-03T05:47:55.916153273Z"

Case 2: Happened around 2021-12-11T13:42:57Z "last_network_initiated_downlink_at": "2023-10-04T02:07:36.917262852Z", "last_downlink_at": "2023-10-04T02:07:36.917262852Z"

Case 3: Happened around 2021-12-14T06:27:10Z "last_network_initiated_downlink_at": "2022-06-10T22:52:30.945586212Z", "last_downlink_at": "2022-06-10T22:52:30.945586212Z" ...

What do you want to see instead?

last_downlink_at & last_network_initiated_downlink_at not getting set to the future. Multicasts not getting blocked. ...

Environment

v3.16.0 ...

How do you propose to implement this?

Need guidance ...

How do you propose to test this?

Need guidance ...

Can you do this yourself and submit a Pull Request?

Need guidance ...

adriansmares commented 2 years ago

We need more details regarding this issue, as it is very hard to debug as is. Are these multicast absolute-time downlinks?

The two fields that get set into the (distant) future are based on the delay provided by the Gateway Server. It is unclear at the moment whether the Gateway Server promises to send the downlink in 2022 or the Network Server requests that the downlink is sent in 2022.

Can you look in your DEBUG logs for messages with the text Scheduled downlink? They contain the transmission_delay and transmit_at, and I'm curious what their values are.

onizmx commented 2 years ago

@adriansmares Yes, we do pass absolute_time for the multicast downlinks, and we were not able to find any outliers (mostly +10 to +60 seconds from now()). We have just enabled debug logging today, and we will send the logs when this happens again. It is highly likely to happen again soon, since we had one more occurrence after opening this ticket.

onizmx commented 2 years ago

@adriansmares I have the log; it looks like transmission_delay is way off:

"started_at": 1639763483.8760054, "transmission_delay": 57688384.03355046, "transmit_at": 1697451867.9144053

adriansmares commented 2 years ago

Sorry for the delayed response. I'm fairly confident that this is a bug, and given the (unfortunately) high reproduction rate, we can probably track it down.

The first step I recommend is to enhance the logging details for the downlink. This diff should add some extra log messages and fields:

diff --git a/pkg/gatewayserver/scheduling/scheduler.go b/pkg/gatewayserver/scheduling/scheduler.go
index 0e9c12eb4..af5a51943 100644
--- a/pkg/gatewayserver/scheduling/scheduler.go
+++ b/pkg/gatewayserver/scheduling/scheduler.go
@@ -299,6 +299,10 @@ func (s *Scheduler) ScheduleAt(ctx context.Context, opts Options) (Emission, err
                        medianRTT = &median
                }
        }
+       log.FromContext(ctx).WithFields(log.Fields(
+               "median_rtt", medianRTT,
+               "min_schedule_time", minScheduleTime,
+       )).Debug("Computed scheduling delays")
        var starts ConcentratorTime
        now, ok := s.clock.FromServerTime(s.timeSource.Now())
        if opts.Time != nil {
@@ -320,6 +324,10 @@ func (s *Scheduler) ScheduleAt(ctx context.Context, opts Options) (Emission, err
                if err != nil {
                        return Emission{}, err
                }
+               log.FromContext(ctx).WithFields(log.Fields(
+                       "toa", toa,
+                       "time", opts.Time,
+               )).Debug("Computed downlink time-on-air")
                starts -= ConcentratorTime(toa)
        } else {
                starts = s.clock.FromTimestampTime(opts.Timestamp)
@@ -330,6 +338,12 @@ func (s *Scheduler) ScheduleAt(ctx context.Context, opts Options) (Emission, err
                                "delta", delta,
                                "min", minScheduleTime,
                        )
+               } else {
+                       log.FromContext(ctx).WithFields(log.Fields(
+                               "now", now,
+                               "starts", starts,
+                               "delta", delta,
+                       )).Debug("Computed downlink start timestamp")
                }
        }
        sb, err := s.findSubBand(opts.Frequency)
diff --git a/pkg/networkserver/downlink.go b/pkg/networkserver/downlink.go
index 774f7d4bc..7fb14820a 100644
--- a/pkg/networkserver/downlink.go
+++ b/pkg/networkserver/downlink.go
@@ -980,6 +980,7 @@ func (ns *NetworkServer) scheduleDownlinkByPaths(ctx context.Context, req *sched
                logger.WithFields(log.Fields(
                        "transmission_delay", delay,
                        "transmit_at", transmitAt,
+                       "absolute_time", ttnpb.StdTime(req.TxRequest.AbsoluteTime),
                )).Debug("Scheduled downlink")
                queuedEvents = append(queuedEvents, events.Builders(append([]events.Builder{
                        successEvent.With(

In the meantime, any chance you could shed some light on your setup? Which gateway models are they, are they equipped with GPS, did you check whether the uplinks that the gateways receive have a good timestamp in settings/rx_metadata? What are the round-trip times?

You can check the RTTs using the CLI:

adriansmares@A98BCD2222F1:~/go/src/go.thethings.network/lorawan-stack$ ttn-lw-cli gateways get-connection-stats multitech-adrian-2
{
  "connected_at": "2021-12-20T15:33:16.357015427Z",
  "protocol": "ws",
  "last_status_received_at": "2021-12-20T15:33:16.390608761Z",
  "last_status": {
    "time": "2021-12-20T15:33:16.390394503Z",
    "versions": {
      "package": "",
      "platform": "mlinux - Firmware  - Protocol 2",
      "station": "2.0.5(mlinux/std)",
      "firmware": ""
    },
    "advanced": {
      "features": "rmtsh prod gps",
      "model": "mlinux"
    }
  },
  "last_uplink_received_at": "2021-12-20T20:44:27.362554376Z",
  "uplink_count": "1047",
  "round_trip_times": {
    "min": "0.046192544s",
    "max": "0.098604816s",
    "median": "0.060346122s",
    "count": 20
  }
}
virtualguy commented 2 years ago

We will try to get a patched build running with the additional logging shortly. It's happening approximately daily, for one multicast device at a time; no patterns are emerging yet (always different gateways, sites and multicast devices).

It would be great to catch the bug. It would also be good if the queue could be evaluated and messages with old/expired timestamps dropped, to avoid the queue blocking forever.

The affected gateways are all Tektelic Kona Macros with GPS timestamps enabled (they report tmms but not tmst, as Tektelic decided to ship one or the other).

Obviously these are not the RTTs from during the incident, but this is from one of the gateways at the site and is typical for our network. Note that we schedule 2-4 back-to-back multicast messages from each gateway using absolute times 1000 ms apart, to ensure coverage across the site (see the sketch after the stats output below).

ttn-lw-cli g get-connection-stats eui-647fdaxxxxxxxx
{
  "connected_at": "2021-12-16T05:14:24.012737112Z",
  "protocol": "udp",
  "last_status_received_at": "2021-12-21T11:33:00.022819028Z",
  "last_status": {
    "time": "2021-12-21T11:32:59Z",
    "versions": {
      "ttn-lw-gateway-server": "3.16.0"
    },
    "antenna_locations": [
      {
        "latitude": -38.0xxxx,
        "longitude": 175.3xxxx,
        "altitude": 94
      }
    ],
    "ip": [
      "124.248.xxx.xxx"
    ],
    "metrics": {
      "ackr": 100,
      "rxfw": 42,
      "rxin": 53,
      "rxok": 42,
      "temp": 26,
      "txin": 11,
      "txok": 9
    }
  },
  "last_uplink_received_at": "2021-12-21T11:33:17.713853948Z",
  "uplink_count": "802361",
  "last_downlink_received_at": "2021-12-21T11:33:17.982146049Z",
  "downlink_count": "177359",
  "round_trip_times": {
    "min": "0.044211425s",
    "max": "0.125106136s",
    "median": "0.062439571s",
    "count": 20
  },
  "sub_bands": [
    {
      "max_frequency": "18446744073709551615",
      "downlink_utilization_limit": 1,
      "downlink_utilization": 0.009626809
    }
  ]
}
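
For reference, the absolute-time staggering mentioned above can be sketched roughly as follows (a toy example: the gateway IDs and the 10-second lead time are made up, and class_b_c.absolute_time is assumed to be the downlink field the computed times end up in):

package main

import (
	"fmt"
	"time"
)

func main() {
	gateways := []string{"gw-site-a-1", "gw-site-a-2", "gw-site-a-3"} // hypothetical gateway IDs
	// Start a safe margin ahead of now, then stagger each gateway by 1000 ms.
	base := time.Now().UTC().Add(10 * time.Second).Truncate(time.Second)
	for i, gw := range gateways {
		absoluteTime := base.Add(time.Duration(i) * time.Second)
		// absoluteTime would be set on the queued downlink for this gateway.
		fmt.Printf("%s -> %s\n", gw, absoluteTime.Format(time.RFC3339))
	}
}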
adriansmares commented 2 years ago

It would be great to catch the bug. It would also be good if the queue could be evaluated and messages with old/expired timestamps dropped, to avoid the queue blocking forever.

This should already be happening - is it not? Keep in mind that in the current issue it's not necessarily an old message that stalls the queue, but rather the bogus timestamps that prevent the scheduling.

onizmx commented 2 years ago

@adriansmares Sorry for the late response. We are running 3.16.3 now and still seeing the issue around 2 times a week. Is the patch https://github.com/TheThingsNetwork/lorawan-stack/issues/4990#issuecomment-998257391 still applicable for 3.16.3?

adriansmares commented 2 years ago

@adriansmares Sorry for the late response. We are running 3.16.3 now and still seeing the issue around 2 times a week. Is the patch #4990 (comment) still applicable for 3.16.3?

Hi @onizmx - the patch should still be usable, as it only adds some logging statements and fields. Let me know if you cannot apply it.

onizmx commented 2 years ago

@adriansmares We were able to deploy the patched version yesterday, and here are the extra logs.

@ingestionTime | 1645480875538
INFO Computed downlink time-on-air {"gateway_eui": "647FDAFFFE00FFC6", "gateway_uid": "eui-647fdafffe00ffc6", "namespace": "gatewayserver/io/udp", "protocol": "udp", "remote_addr": "172.18.0.1:54292", "time": "2022-02-21T22:01:16Z", "toa": 0.123392}

@ingestionTime | 1645480875538
INFO Computed scheduling delays {"gateway_eui": "647FDAFFFE00FFC6", "gateway_uid": "eui-647fdafffe00ffc6", "median_rtt": 0.057939454, "min_schedule_time": 0.064437231, "namespace": "gatewayserver/io/udp", "protocol": "udp", "remote_addr": "172.18.0.1:54292"}

@ingestionTime | 1645480875538
INFO Scheduled downlink {"absolute_time": 1645480876, "attempt_rx1": false, "attempt_rx2": true, "band_id": "AU_915_928", "dev_addr": "01490755", "device_class": "CLASS_C", "device_uid": "halter.2b150007-a139-4737-8864-5198daa6ffab", "downlink_class": "CLASS_C", "downlink_priority": "ABOVE_NORMAL", "downlink_type": "data", "frequency_plan": "AU_915_928_FSB_2_AND_FSB_3", "frequency_plan_id": "AU_915_928_FSB_2_AND_FSB_3", "namespace": "networkserver", "rx2_data_rate": "lora:<bandwidth:500000 spreading_factor:10 > ", "rx2_frequency": 922100000, "started_at": 1645480874.7114365, "transmission_delay": 17466521.257917337, "transmit_at": 1662947395.9710467}

Nothing gets logged for Computed downlink start timestamp. transmission_delay looks pretty massive. Let us know if you need more details.

adriansmares commented 2 years ago

Nothing gets logged for Computed downlink start timestamp. transmission_delay looks pretty massive. Let us know if you need more details.

This should be impossible. It means that either you should have seen an err_too_late, which you didn't, or that the clock is marked as synced (s.clock.IsSynced()) but the server time is nil (since the message won't be printed if ok is false, and ok is false only if the server time is nil). That's again not possible since each time the clock is marked as synced, the server time is recorded.

@onizmx do you have this log? https://github.com/TheThingsNetwork/lorawan-stack/blob/78d2db0cb6b3e8458c44bea254acff82be5a2921/pkg/gatewayserver/io/io.go#L615-L620

The delay most likely should match the transmission_delay.

@johanstokking can you think of a trace in which the clock is marked as synced but there is no server time? How would that even happen?

The clock must be synced https://github.com/TheThingsNetwork/lorawan-stack/blob/78d2db0cb6b3e8458c44bea254acff82be5a2921/pkg/gatewayserver/scheduling/scheduler.go#L291-L293

But this branch is not executed https://github.com/TheThingsNetwork/lorawan-stack/blob/78d2db0cb6b3e8458c44bea254acff82be5a2921/pkg/gatewayserver/scheduling/scheduler.go#L335-L348

How would that be possible? I can't find any update to the clock that is not protected by the mutex. It doesn't make any sense.
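
To make the invariant concrete, here is a heavily simplified model (not the actual scheduler code, which lives in pkg/gatewayserver/scheduling) of why a clock that is marked as synced should always have a server time recorded:

package clocksketch

import "time"

// clock is a toy stand-in for the scheduler clock discussed above.
type clock struct {
	synced     bool
	serverTime *time.Time
}

// Sync records the server time before marking the clock as synced, which is
// why a synced clock with a nil server time should not be observable.
func (c *clock) Sync(serverTime time.Time) {
	c.serverTime = &serverTime
	c.synced = true
}

// FromServerTime mirrors the ok pattern used in ScheduleAt: ok is false only
// when no server time was ever recorded.
func (c *clock) FromServerTime(t time.Time) (time.Duration, bool) {
	if c.serverTime == nil {
		return 0, false
	}
	return t.Sub(*c.serverTime), true
}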

johanstokking commented 2 years ago

Nothing gets logged for Computed downlink start timestamp. transmission_delay looks pretty massive. Let us know if you need more details.

This should be impossible

Stupid question maybe, but is debug logging enabled?

onizmx commented 2 years ago

@johanstokking No, we were shipping too many production logs, so we ended up changing the requested log lines to INFO. Unfortunately, @adriansmares, that means we do not have the gatewayserver io debug log. Lately we've been a bit cautious about upgrading / redeploying the stack, so I can't confirm whether we can get the extra (gatewayserver io) logs soon. I will export the CloudWatch logs the next time this happens and send them by email.

To be clear, here is the commit for our current production https://github.com/halter-corp/lorawan-stack/commit/cf5cf1d03631ca37541fc5f4e12dac1f02adf372

adriansmares commented 2 years ago

Something is slightly weird about your commit path - the commit that adds the logging as DEBUG is https://github.com/halter-corp/lorawan-stack/commit/1ed51d346f5ec95634408603fdf30fb65af12d7f but it doesn't contain the log statement for Computed downlink start timestamp in ScheduleAt - @onizmx can you add that one to your next release?

onizmx commented 2 years ago

@adriansmares I have added that log as INFO here: https://github.com/halter-corp/lorawan-stack/commit/cf5cf1d03631ca37541fc5f4e12dac1f02adf372. And this (the last commit of 'halter-release') is what is in production.

onizmx commented 2 years ago

@adriansmares Ah, I see, I added the log in the wrong place. Sorry for the confusion. Yes, I will add the log in the next release.

onizmx commented 2 years ago

@adriansmares We managed to deploy 3.18.1 last week, and here are the logs.

Network Server
@ingestionTime | 1647582575119
INFO Scheduled downlink {"absolute_time": 1647582579, "attempt_rx1": false, "attempt_rx2": true, "band_id": "AU_915_928", "dev_addr": "01FE28FD", "device_class": "CLASS_C", "device_uid": "halter.fd0f383d-bcdd-4685-9375-d953bdd5ee5c", "downlink_class": "CLASS_C", "downlink_priority": "ABOVE_NORMAL", "downlink_type": "data", "frequency_plan": "AU_915_928_FSB_2_AND_FSB_3", "frequency_plan_id": "AU_915_928_FSB_2_AND_FSB_3", "namespace": "networkserver", "rx2_data_rate": "lora:<bandwidth:500000 spreading_factor:10 > ", "rx2_frequency": 922100000, "started_at": 1647582573.8024268, "transmission_delay": 21292051.162028152, "transmit_at": 1668874624.9708872}

Gateway Server
@ingestionTime | 1647582575119
INFO Computed scheduling delays {"gateway_eui": "647FDAFFFE01004D", "gateway_uid": "eui-647fdafffe01004d", "median_rtt": 0.072877255, "min_schedule_time": 0.08233109, "namespace": "gatewayserver/io/udp", "protocol": "udp", "remote_addr": "172.18.0.1:40570"}

@ingestionTime | 1647582575119
INFO Computed downlink time-on-air {"gateway_eui": "647FDAFFFE01004D", "gateway_uid": "eui-647fdafffe01004d", "namespace": "gatewayserver/io/udp", "protocol": "udp", "remote_addr": "172.18.0.1:40570", "time": "2022-03-18T05:49:39Z", "toa": 0.123392}

@ingestionTime | 1647582575119
INFO Computed downlink start timestamp {"delta": 21292051.162105657, "gateway_eui": "647FDAFFFE01004D", "gateway_uid": "eui-647fdafffe01004d", "namespace": "gatewayserver/io/udp", "now": 90253313642345, "protocol": "udp", "remote_addr": "172.18.0.1:40570", "starts": 21382304475748000}

@ingestionTime | 1647582575119
INFO Scheduled downlink {"class": "CLASS_C", "data_rate": "lora:<bandwidth:500000 spreading_factor:10 > ", "delay": 21292051.162028152, "duration": 0.123392, "frequency": 922100000, "gateway_eui": "647FDAFFFE01004D", "gateway_uid": "eui-647fdafffe01004d", "namespace": "gatewayserver/io/udp", "now": 90253313719848, "protocol": "udp", "remote_addr": "172.18.0.1:40570", "rx_window": 2, "starts": 21382304475748000}

Is https://github.com/TheThingsNetwork/lorawan-stack/pull/5318 the fix for this?

adriansmares commented 2 years ago

Given that the difference between starts and now in the Computed downlink start timestamp message is already quite bad (on the order of 10^16 ns), that PR won't help. There is a gap of ~246 days between the two timestamps, which is completely absurd. On a brighter note, we can be basically sure that this is caused by the clock synchronization system in TTS.

Based on the timestamps there seems to be a huge gap between the server time representation of the concentrator time (now) and the gateway representation of the concentrator time (starts). Normally the gap between starts and now should be just the delay between the current moment and the transmission + half RTT (since now is observed after half RTT).
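
To put numbers on it, here is a quick sketch using the values from the Computed downlink start timestamp log above (both are concentrator times in nanoseconds; for comparison, the requested absolute_time was only about 5 seconds after started_at):

package main

import (
	"fmt"
	"time"
)

func main() {
	now := time.Duration(90253313642345)       // ~25h4m: concentrator "now" at scheduling time
	starts := time.Duration(21382304475748000) // ~5939h: computed concentrator transmission time
	fmt.Println(starts - now)                  // 5914h27m31.162105655s, ~246 days instead of a few seconds
}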

Do you have access to the gateway logs near the moment when the incident happens? It would be great if we could see some concentrator timestamps / GPS timestamps near that point.

You can also apply the following diff:

diff --git a/pkg/gatewayserver/io/io.go b/pkg/gatewayserver/io/io.go
index c8689f010..0374e9f0f 100644
--- a/pkg/gatewayserver/io/io.go
+++ b/pkg/gatewayserver/io/io.go
@@ -236,22 +236,25 @@ func (c *Connection) HandleUp(up *ttnpb.UplinkMessage, frontendSync *FrontendClo
                        "concentrator_time", frontendSync.ConcentratorTime,
                        "server_time", frontendSync.ServerTime,
                        "gateway_time", frontendSync.GatewayTime,
-               )).Debug("Gateway clocks have been synchronized by the frontend")
+               )).Info("Gateway clocks have been synchronized by the frontend")
        case up.Settings.Time != nil:
-               ct = c.scheduler.SyncWithGatewayAbsolute(up.Settings.Timestamp, *ttnpb.StdTime(up.ReceivedAt), *ttnpb.StdTime(up.Settings.Time))
+               serverTime := *ttnpb.StdTime(up.ReceivedAt)
+               gatewayTime := *ttnpb.StdTime(up.Settings.Time)
+               ct = c.scheduler.SyncWithGatewayAbsolute(up.Settings.Timestamp, serverTime, gatewayTime)
                log.FromContext(c.ctx).WithFields(log.Fields(
                        "timestamp", up.Settings.Timestamp,
                        "concentrator_time", ct,
-                       "server_time", up.ReceivedAt,
-                       "gateway_time", *up.Settings.Time,
-               )).Debug("Synchronized server and gateway absolute time")
+                       "server_time", serverTime,
+                       "gateway_time", gatewayTime,
+               )).Info("Synchronized server and gateway absolute time")
        case up.Settings.Time == nil:
-               ct = c.scheduler.Sync(up.Settings.Timestamp, *ttnpb.StdTime(up.ReceivedAt))
+               serverTime := *ttnpb.StdTime(up.ReceivedAt)
+               ct = c.scheduler.Sync(up.Settings.Timestamp, serverTime)
                log.FromContext(c.ctx).WithFields(log.Fields(
                        "timestamp", up.Settings.Timestamp,
                        "concentrator_time", ct,
-                       "server_time", up.ReceivedAt,
-               )).Debug("Synchronized server absolute time only")
+                       "server_time", serverTime,
+               )).Info("Synchronized server absolute time only")
        default:
                panic("unreachable")
        }

This will turn the debug clock synchronization messages into info messages, and we can check what the server_time/gateway_time was for one of the affected gateways.

One way I can imagine this happening is that the GPS signal was no longer available on the gateway, and it then 'defaulted' back to whatever NTP server / local system time it has. We still sync the 'gateway time' in these moments, and maybe the gateway 'system time' is flawed somehow.
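
As a toy illustration of this hypothesis (made-up numbers, not the actual scheduler code), a gateway clock that has fallen months behind would shift an absolute-time downlink months into the future, since the scheduler anchors absolute times to the last synced gateway time:

package main

import (
	"fmt"
	"time"
)

func main() {
	realNow := time.Date(2022, 3, 18, 5, 49, 33, 0, time.UTC)
	gatewayTime := realNow.AddDate(0, -8, 0)     // bogus fallback clock, months behind the real time
	absoluteTime := realNow.Add(5 * time.Second) // the requested transmission time

	// The downlink ends up scheduled absoluteTime-gatewayTime after the sync
	// point: months in the future instead of ~5 seconds.
	fmt.Println(absoluteTime.Sub(gatewayTime)) // 5832h0m5s (~243 days)
}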

You can test this hypothesis by disabling the time fallback using the following diff:

diff --git a/pkg/ttnpb/udp/translation.go b/pkg/ttnpb/udp/translation.go
index 3e62f5104..7301de7cf 100644
--- a/pkg/ttnpb/udp/translation.go
+++ b/pkg/ttnpb/udp/translation.go
@@ -201,8 +201,6 @@ func convertUplink(rx RxPacket, md UpstreamMetadata) (ttnpb.UplinkMessage, error
        switch {
        case rx.Tmms != nil:
                goTime = gpstime.Parse(time.Duration(*rx.Tmms) * time.Millisecond)
-       case rx.Time != nil:
-               goTime = time.Time(*rx.Time)
        }
        if !goTime.IsZero() {
                for _, md := range up.RxMetadata {

This will make the Gateway Server synchronize the gateway time only when the fine GPS timestamp is available.

johanstokking commented 2 years ago

Indeed, to see what's really going on here we need the clock synchronization logs as well, all of them that lead up to the downlink scheduling itself. That is, the log messages that are promoted to Info by the patch from @adriansmares above (note that the first one isn't triggered by UDP gateways).

I also suspect that the gateway-reported absolute time is off and/or that it is interpreted wrongly.

onizmx commented 2 years ago

@adriansmares @johanstokking Thanks for the help. We've prepared 3.18.2 with the INFO changes. Not sure when this version will be deployed but I will keep you guys posted.

onizmx commented 2 years ago

We've found the issue on our gateway side. We had a process reading GPS positions via the serial port, and this was somehow messing with the GPS epoch. Last week another issue popped up where multicast started breaking because the gateway was transmitting 40 seconds off from the given absolute time. This happened right after changing the script, so we've fully removed the process. So far, no multicast issues for 5 days in a row.

Thanks a lot for the support and apologies for the time spent investigating this issue.