TheThingsNetwork / lorawan-stack

The Things Stack, an Open Source LoRaWAN Network Server
https://www.thethingsindustries.com/stack/
Apache License 2.0
last_status_received_at not available for basic station #3802

Closed virtualguy closed 4 months ago

virtualguy commented 3 years ago

Summary

Gateway stats appear to be missing in ttn-lw-cli for gateways running Basic Station. In particular, there is nothing to indicate 'last seen' the way the web console does.

Steps to Reproduce

Compare the output of ttn-lw-cli for a UDP gateway and a Basic Station gateway. Note that this is a Tektelic Macro.

ttn-lw-cli g get-connection-stats eui-647fdafffe009c15
{
  "connected_at": "2021-02-15T09:43:38.907265545Z",
  "protocol": "ws",
  "last_uplink_received_at": "2021-02-15T09:56:58.718329514Z",
  "uplink_count": "2605",
  "last_downlink_received_at": "2021-02-15T09:56:07.597691514Z",
  "downlink_count": "5",
  "sub_bands": [
    {
      "max_frequency": "18446744073709551615",
      "downlink_utilization_limit": 1,
      "downlink_utilization": 0.000071413335
    }
  ]
}

ttn-lw-cli g get-connection-stats eui-647fdafffe009c14
{
  "connected_at": "2021-02-15T09:46:49.231311171Z",
  "protocol": "udp",
  "last_status_received_at": "2021-02-15T09:57:19.537013051Z",
  "last_status": {
    "time": "2021-02-15T09:57:19Z",
    "boot_time": "0001-01-01T00:00:00Z",
    "versions": {
      "ttn-lw-gateway-server": "3.10.7"
    },
    "antenna_locations": [
      {
        "latitude": -37.68502,
        "longitude": 175.58456,
        "altitude": 47
      }
    ],
    "ip": [
      "103.254.135.190"
    ],
    "metrics": {
      "ackr": 98.8,
      "rxfw": 84,
      "rxin": 117,
      "rxok": 84,
      "temp": 29,
      "txin": 11,
      "txok": 11
    }
  },
  "last_uplink_received_at": "2021-02-15T09:57:41.872832697Z",
  "uplink_count": "2067",
  "last_downlink_received_at": "2021-02-15T09:57:40.998823876Z",
  "downlink_count": "175",
  "round_trip_times": {
    "min": "0.033299841s",
    "max": "0.038470815s",
    "median": "0.034325134s",
    "count": 20
  },
  "sub_bands": [
    {
      "max_frequency": "18446744073709551615",
      "downlink_utilization_limit": 1,
      "downlink_utilization": 0.0013503467
    }
  ]
}

Environment

The Things Network Command-line Interface: ttn-lw-cli Version: 3.10.7 Build date: 2021-01-14T12:34:23Z Git commit: ecf52d6 Go version: go1.15.6 OS/Arch: linux/amd64

How do you propose to implement this?

...

How do you propose to test this?

I'm happy to test

Can you do this yourself and submit a Pull Request?

No

johanstokking commented 3 years ago

@KrishnaIyer is Basic Station even sending status messages? If not, what do we do with this?

KrishnaIyer commented 3 years ago

This is expected behaviour IMO.

The LoRa Basics Station LNS protocol does not support periodic status messages.

There's an open issue on the LBS repo but there isn't much going on with that afaik.

We need to wait for the protocol to add support.

virtualguy commented 3 years ago

@KrishnaIyer Can you use the connection state of the websocket? In the console a gateway will show as blue and connected before uplinks come through. I'm not sure how quickly this times out to change back to disconnected but it does change state eventually.

Connected/Disconnected is enough for me, perhaps that's exposed somewhere else in ttn-lw-cli?

KrishnaIyer commented 3 years ago

The way stats work in V3 is that we create a GatewayConnectionStats entry when the gateway connects and remove that entry when the gateway disconnects. So if you check for connection stats (ttn-lw-cli g get-connection-stats I think) and you get a 404, that means the gateway has disconnected. Otherwise the gateway is connected.

I guess we can improve the UX on the console? But there is no additional need for using the WebSocket state as this is already taken care of in the way we handle gateway connections.
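
The connected/disconnected check described above can be sketched as a small helper. This is only an illustration, not part of The Things Stack: the function name `connection_state` is hypothetical, and how the response is obtained (CLI, HTTP, or gRPC) is deliberately left out. It assumes the caller already has an HTTP-style status code and, on success, the JSON body shown in the issue description.

```python
import json


def connection_state(http_status: int, body: str = "") -> dict:
    """Interpret a GetGatewayConnectionStats response (hypothetical helper).

    A 404 means no stats entry exists, i.e. the gateway is not connected
    to this cluster. A 200 means the entry exists, i.e. it is connected.
    """
    if http_status == 404:
        return {"connected": False, "connected_at": None, "protocol": None}
    if http_status == 200:
        stats = json.loads(body) if body else {}
        return {
            "connected": True,
            # Both fields appear in the sample outputs in this issue.
            "connected_at": stats.get("connected_at"),
            "protocol": stats.get("protocol"),
        }
    # Anything else (auth errors, server errors) says nothing about the gateway.
    raise ValueError(f"cannot infer connection state from status {http_status}")
```

Note that this works the same way for UDP and Basic Station gateways, since it relies only on the presence of the stats entry and not on `last_status_received_at`.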

johanstokking commented 3 years ago

So if you check for connection stats (ttn-lw-cli g get-connection-stats I think) and you get a 404, that means the gateway has disconnected. Otherwise the gateway is connected.

That's it indeed.

virtualguy commented 3 years ago

Right, that works. Knowing the time since disconnect would be a nice to have, but I'm happy with how it is. Feel free to close the ticket.

KrishnaIyer commented 3 years ago

Knowing the time since disconnect would be a nice to have

That's not a bad suggestion. But this cannot be a part of the stats itself in our design. This could be done via gateway disconnection events but that's a different discussion tracked internally. Closing this for now.

Thanks for reporting @virtualguy.

beitler commented 3 years ago

I feel we need to distinguish two things:

1) Status messages as aliveness indicator

It looks like people are interpreting the last_status_received_at as an indicator of aliveness. Coming from the connectionless world of the UDP packet forwarder, this certainly makes sense. However, with Basic Station, aliveness can be measured on the connection level using the TCP connection state. Using last_status_received_at as an aliveness indicator for Basic Station will not make much sense in the future. Status message intervals will be configurable and could potentially be configured to very long intervals (especially on bandwidth limited links).

The most efficient way to check aliveness is the TCP keepalive mechanism. This is exactly what Basic Station uses by default. Surely, respective techniques on upper layers are feasible as well (WS ping/pong, regular app-layer status messages sent by the gateway, status queries from the LNS), but will have their respective drawbacks.
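
For readers unfamiliar with TCP keepalive, here is a minimal sketch of enabling it on a socket. It is not Basic Station's code; the helper name `enable_keepalive` and the timing values are illustrative assumptions, and the fine-grained options are only set where the platform exposes them.

```python
import socket


def enable_keepalive(sock: socket.socket, idle_s: int = 60,
                     interval_s: int = 10, probes: int = 5) -> None:
    """Turn on TCP keepalive for sock (illustrative values, not Station's)."""
    # Ask the kernel to probe an idle connection instead of the application.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Tuning options are platform specific, so only set the ones that exist.
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # idle time before the first probe
        ("TCP_KEEPINTVL", interval_s),  # delay between probes
        ("TCP_KEEPCNT", probes),        # failed probes before giving up
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
```

With this in place, a dead peer is detected by the operating system and surfaces as a connection error, without any application-layer status traffic.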

2) Status messages for operational metrics

As for the actual status message, I think no matter what set of default metrics is defined, it will never satisfy the data hunger of all gateway operators. The most important metric is surely going to be highly gateway-platform-specific, and a generic gateway client like Basic Station cannot be expected to integrate custom code to fetch that metric via some bus from some component.

Therefore, the way Basic Station has addressed status messages from day one is via generic event messages. Station supports injecting arbitrary messages into the LNS protocol from the outside via named pipes (https://doc.sm.tc/station/conf.html?highlight=cmd#configuration-files). To create the named pipe (aka 'fifo'), type mkfifo cmd.fifo in station's home directory (station must be restarted after the fifo is created). Then, at runtime, any external process on the host system with write access to the named pipe can inject JSON messages towards the LNS, like this:

echo '{"msgtype":"event", ...}' > cmd.fifo

In practice, let's say a solar powered gateway could have a cron job firing a script which collects all the necessary metrics from the different places, then constructs a JSON-formatted event and pushes it to the LNS via:

echo '{"msgtype":"event", "type": "status", "battery": 52, "solar": 123, "temp": 4, "last_full": 4201}' > cmd.fifo

This requires that the LNS just forwards the msgtype:event message through to some event log or MQTT topic, etc. to make it available to the application.
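
The cron-driven script described above could look roughly like the following sketch. The function names `build_status_event` and `push_event` are hypothetical, the metric names are taken from the echo example, and collecting the actual metrics is left to the caller. It assumes cmd.fifo already exists and that station is running, since opening a named pipe for writing blocks until a reader has it open.

```python
import json

# Fields fixed by the event format in the echo example above.
RESERVED = ("msgtype", "type")


def build_status_event(metrics: dict) -> str:
    """Build a one-line JSON status event from arbitrary metrics."""
    clash = [key for key in metrics if key in RESERVED]
    if clash:
        raise ValueError(f"metric names clash with reserved fields: {clash}")
    event = {"msgtype": "event", "type": "status"}
    event.update(metrics)
    return json.dumps(event)


def push_event(line: str, fifo_path: str = "cmd.fifo") -> None:
    """Write one event to the named pipe, newline-terminated like echo does."""
    with open(fifo_path, "w") as fifo:
        fifo.write(line + "\n")
```

A cron job would then call something like `push_event(build_status_event({"battery": 52, "solar": 123, "temp": 4, "last_full": 4201}))` from station's home directory.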

While I agree that a default set of internal metrics sent autonomously by Station makes a lot of sense, I think with the method described above most needs for status reporting are already satisfied. Wouldn't you agree? So, instead of waiting for the final status message format, @KrishnaIyer, maybe it's worth considering handling msgtype:event messages on the LNS side?

virtualguy commented 3 years ago

From my point of view TTS is already determining connected vs disconnected and it would be great to expose this in a consistent way across both UDP and Basic Station. I.e. just replicate the blue connected indicator from the console. I appreciate that doesn't really belong in connection-stats.

That's also really interesting info about named pipes in Basic Station. It would be great for shipping arbitrary metrics and info, though having a standardized set of core metrics/status info would be better for consistency across vendors.

htdvisser commented 3 years ago

(1.) Let's make sure that we properly document what the last_status and last_status_received_at fields mean in the API reference, and also clarify that a successful response on GetGatewayConnectionStats means that the gateway is connected and a NotFound error means not connected (to this cluster).

(2.) I think it would be a good idea to detect these {"msgtype":"event", ...} messages. Perhaps we can allow gateways to send JSON-encoded google.protobuf.Any messages in there. If the Gateway Server then detects a GatewayStatus message, it will update last_status and last_status_received_at with that.

{
  "msgtype": "event",
  "payload": {
    "@type": "type.googleapis.com/ttn.lorawan.v3.GatewayStatus",
    "time": "0001-01-01T00:00:00Z",
    "boot_time": "0001-01-01T00:00:00Z",
    "metrics": {
      "cpu_percentage": 67.8,
      "load_1": 2.34,
      "load_5": 1.23,
      "load_15": 0.98,
      "temp": 34.5
    }
  }
}
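
To make the proposal concrete, the detection step could be sketched as below. This is a hypothetical illustration in Python, not the Gateway Server's implementation: `apply_event` and the plain-dict representation of the connection stats are assumptions, and only the type URL and field names come from the example above.

```python
GATEWAY_STATUS_TYPE = "type.googleapis.com/ttn.lorawan.v3.GatewayStatus"


def apply_event(stats: dict, message: dict, received_at: str) -> dict:
    """Update connection stats if message carries a GatewayStatus payload.

    Returns stats unchanged for any other message, otherwise a new dict
    with last_status and last_status_received_at filled in.
    """
    payload = message.get("payload")
    if message.get("msgtype") != "event" or not isinstance(payload, dict):
        return stats
    if payload.get("@type") != GATEWAY_STATUS_TYPE:
        return stats
    updated = dict(stats)
    # Drop the Any type marker and keep the status fields themselves.
    updated["last_status"] = {k: v for k, v in payload.items() if k != "@type"}
    updated["last_status_received_at"] = received_at
    return updated
```

Events without a recognized payload, such as the free-form battery example earlier in the thread, pass through without touching the stats and could still be forwarded elsewhere.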

KrishnaIyer commented 3 years ago

Thanks @beitler for the explanation. This is certainly doable.