eclipse-hono / hono

Eclipse Hono™ Project
https://eclipse.dev/hono
Eclipse Public License 2.0

Hono resource limits : connection limits edge case #1866

Open bordeuax opened 4 years ago

bordeuax commented 4 years ago

Environment

...
  "enabled": true,
  "customer": " Test Inc.",
  "resource-limits": {
    "max-connections": 10,
    "data-volume": {
      ....
      },
      "effective-since": "2019-07-27T14:30:00Z"
    }
  },
  ...
}

Actions

Real life behavior

Expected behavior

Note

Questions

sophokles73 commented 4 years ago

The PrometheusBasedResourceLimitChecks class determines the number of currently connected devices by querying the Prometheus server. So, if the data in the Prometheus server is stale and indicates that the maximum number of connections is used up, no additional devices will be able to connect until fresh data has been scraped. To reduce the lag until new connections are possible again, you could increase the frequency at which the Prometheus server scrapes the adapters. However, it probably doesn't make much sense to scrape as often as, say, every 2 seconds.
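To make the staleness effect concrete, here is a minimal sketch of a limit check that decides based on the last scraped value rather than the live one. It is not Hono's PrometheusBasedResourceLimitChecks; the class `StaleLimitCheck` and all of its methods are hypothetical names chosen for illustration, and the limit of 10 is taken from the tenant configuration above.

```java
/**
 * Hypothetical model of a limit check that relies on periodically scraped data.
 * This is NOT Hono's PrometheusBasedResourceLimitChecks; all names are made up.
 */
class StaleLimitCheck {
    private final int maxConnections;
    private int liveConnections;     // connections the adapters currently hold
    private int scrapedConnections;  // value recorded at the last scrape

    StaleLimitCheck(int maxConnections) {
        this.maxConnections = maxConnections;
    }

    void connect() { liveConnections++; }

    void disconnect() { liveConnections--; }

    /** Simulates the metrics server scraping the adapters. */
    void scrape() { scrapedConnections = liveConnections; }

    /** The decision uses the scraped value only, never the live one. */
    boolean isConnectionAllowed() { return scrapedConnections < maxConnections; }

    public static void main(String[] args) {
        StaleLimitCheck check = new StaleLimitCheck(10);
        for (int i = 0; i < 10; i++) check.connect();
        check.scrape();                                   // metrics now report 10
        for (int i = 0; i < 10; i++) check.disconnect();  // e.g. the adapter restarts
        System.out.println("before next scrape: " + check.isConnectionAllowed()); // false
        check.scrape();                                   // metrics now report 0
        System.out.println("after next scrape: " + check.isConnectionAllowed());  // true
    }
}
```

In this model, devices stay locked out for up to one full scrape interval after the connections have actually been released, which is the lag described above.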

To address the controlled rolling update scenario, we could probably improve the shutdown process of the protocol adapters:

  1. reject any new connection attempts from devices
  2. report 0 authenticated connections
  3. wait for at least the time it takes for the Prometheus server to scrape the latest metrics so that devices get a chance to connect to another adapter instance
  4. stop the TCP socket listener to trigger already connected devices to reconnect (to another adapter instance)
  5. shut down the adapter
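The five steps above could be sketched roughly as follows. This is only an illustration of the ordering, not existing Hono code: the `Adapter` interface, its method names, and the `RecordingAdapter` helper are hypothetical, and the wait is assumed to be one Prometheus scrape interval.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

class GracefulShutdownSketch {

    /** Hypothetical hooks; none of these names are taken from Hono. */
    interface Adapter {
        void rejectNewConnections();
        void reportZeroConnections();
        void stopListener();
        void shutDown();
    }

    /** Runs the proposed steps in order, waiting one scrape interval in between. */
    static void shutDownGracefully(Adapter adapter, Duration scrapeInterval) {
        adapter.rejectNewConnections();   // step 1
        adapter.reportZeroConnections();  // step 2
        try {
            Thread.sleep(scrapeInterval.toMillis()); // step 3
        } catch (InterruptedException e) {
            // Keep the interrupt flag set and continue with the shutdown.
            Thread.currentThread().interrupt();
        }
        adapter.stopListener();           // step 4
        adapter.shutDown();               // step 5
    }

    /** Test double that only records the order in which the hooks are called. */
    static class RecordingAdapter implements Adapter {
        final List<String> steps = new ArrayList<>();
        public void rejectNewConnections() { steps.add("reject"); }
        public void reportZeroConnections() { steps.add("report-zero"); }
        public void stopListener() { steps.add("stop-listener"); }
        public void shutDown() { steps.add("shut-down"); }
    }

    public static void main(String[] args) {
        RecordingAdapter adapter = new RecordingAdapter();
        shutDownGracefully(adapter, Duration.ofMillis(50));
        System.out.println(adapter.steps);
    }
}
```

In a real deployment the wait would have to be at least as long as the configured scrape interval, and the pod's termination grace period would need to cover it.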

WDYT? @kaniyan @bordeuax

bordeuax commented 4 years ago

@sophokles73, this proposal for a graceful shutdown will make the behavior more predictable. But I have one remark on the first point ("reject any new connection attempts from devices"): I think in this case we need to use health indicators like /liveness and /readiness. We need to keep /readiness in KO status and /liveness in OK status, so that the load balancer (which has a health check pointing to /readiness) redirects new traffic to the new pods. If we only reject the connections at the pod level, this will not help, because the LB will continue to forward traffic to this pod (or at least the probability is very high that new traffic will go to the old pod), and we will see exactly the same behavior of rejected connections.
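A minimal sketch of the health state described here, assuming the two endpoints are served over HTTP and that the load balancer treats 200 as OK and 503 as KO (the status codes and all names below are assumptions, not taken from Hono):

```java
/**
 * Hypothetical health state for an adapter that is draining before shutdown.
 * The mapping to HTTP status codes (200 = OK, 503 = KO) is an assumption.
 */
class HealthStateSketch {
    private volatile boolean draining = false;

    /** Called at the start of a graceful shutdown. */
    void startDraining() { draining = true; }

    /** /liveness stays OK so the pod is not killed while it drains. */
    int livenessStatus() { return 200; }

    /** /readiness turns KO so the load balancer stops routing new traffic here. */
    int readinessStatus() { return draining ? 503 : 200; }

    public static void main(String[] args) {
        HealthStateSketch health = new HealthStateSketch();
        System.out.println("readiness before draining: " + health.readinessStatus());
        health.startDraining();
        System.out.println("readiness while draining: " + health.readinessStatus());
        System.out.println("liveness while draining: " + health.livenessStatus());
    }
}
```

With this split, new connection attempts are steered away by the load balancer instead of being rejected by the pod itself.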