Hono resource limits : connection limits edge case

bordeuax commented 4 years ago

Environment

Kubernetes v1.15
Prometheus v2.16.0
Hono deployment v1.1.1
An implementation or a mock of the Hono Tenant API that returns 10 as the maximum number of connections:

...
  "enabled": true,
  "customer": " Test Inc.",
  "resource-limits": {
    "max-connections": 10,
    "data-volume": {
      ....
      },
      "effective-since": "2019-07-27T14:30:00Z"
    }
  },
  ...
}

Run 3 replicas of the MQTT or AMQP adapter, depending on which adapter you are testing (I think this is also reproducible with 1 or 2 replicas)
Connect 10 devices for the test tenant (using AMQP or MQTT)

Actions

Perform a Kubernetes rolling restart of the AMQP/MQTT adapter pods

Real life behavior

When my devices lose their connection and try to reconnect, they get connection rejected errors. Only after a while (anywhere from 30 seconds to 1 minute) can they reconnect successfully.

Expected behavior

During the rolling update, none of my devices should have any problem reconnecting.

Note

A rolling restart can be triggered by several factors: regular deployments, the K8s scheduler, ...
Possible causes of this misbehavior: the Prometheus scrape interval, the time needed to discover new pods, or inconsistent metrics data in the Prometheus database

Questions

Has the Hono community had similar experiences, or does it have know-how for optimizing such edge cases?

sophokles73 commented 4 years ago

The PrometheusBasedResourceLimitChecks class determines the number of currently connected devices by querying the Prometheus server. So, if the data in the Prometheus server is stale and indicates that the max number of connections is used up, then no additional devices will be able to connect. In order to reduce the lag until new connections are possible again, you could increase the Prometheus server's frequency of scraping the adapters. However, it probably doesn't make much sense to scrape, say, every 2 seconds.
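
To illustrate the lag, here is a toy model in Python. It is only a sketch under simplifying assumptions: a single gauge, a fixed scrape interval, and a restart that happens one second after a scrape. ScrapedGauge, connection_allowed and seconds_until_reconnect are hypothetical names for illustration and are not part of Hono or Prometheus.

```python
from dataclasses import dataclass


@dataclass
class ScrapedGauge:
    """Hypothetical stand-in for a metrics server that scrapes one value periodically."""
    scrape_interval: float   # seconds between scrapes
    last_scrape: float = 0.0
    stored_value: int = 0    # value as of the last scrape

    def maybe_scrape(self, now: float, live_value: int) -> None:
        # The stored value only changes when a scrape is due.
        if now - self.last_scrape >= self.scrape_interval:
            self.stored_value = live_value
            self.last_scrape = now


def connection_allowed(gauge: ScrapedGauge, max_connections: int) -> bool:
    # The limit check sees only the scraped value, never the live one.
    return gauge.stored_value < max_connections


def seconds_until_reconnect(scrape_interval: float, max_connections: int = 10) -> float:
    """Time between a restart and the first accepted reconnect attempt."""
    gauge = ScrapedGauge(scrape_interval=scrape_interval)
    live = max_connections
    # First scrape records that the tenant has used up its limit.
    gauge.maybe_scrape(now=scrape_interval, live_value=live)

    # Assumption: the rolling restart drops all devices 1 s after that scrape.
    restart_at = scrape_interval + 1.0
    live = 0

    now = restart_at
    while True:
        gauge.maybe_scrape(now, live)
        if connection_allowed(gauge, max_connections):
            return now - restart_at
        now += 1.0  # devices retry once per second
```

In this model the rejection window is roughly one scrape interval, which is in the same range as the 30 s to 1 minute reported above if the scrape interval is of that order. A real setup has more moving parts (several adapter instances, target discovery, query evaluation), so the numbers are illustrative only.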

In order to address the controlled rolling update scenario, we could probably improve the shutdown process of the protocol adapters:

1. reject any new connection attempts from devices
2. report 0 authenticated connections
3. wait for at least the time it takes for the Prometheus server to scrape the latest metrics so that devices get a chance to connect to another adapter instance
4. stop the TCP socket listener to trigger already connected devices to reconnect (to another adapter instance)
5. shut down the adapter
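
The steps above could be sketched roughly as follows. This is a Python model of the ordering only, not Hono code: AdapterShutdownSketch and all of its attributes are hypothetical, and the wait is injected as a function so the sequence can be exercised without actually sleeping.

```python
import time
from typing import Callable, List


class AdapterShutdownSketch:
    """Hypothetical adapter model; none of these names are Hono APIs."""

    def __init__(self, scrape_interval_s: float,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.scrape_interval_s = scrape_interval_s
        self._sleep = sleep
        self.accepting = True
        self.listener_open = True
        self.running = True
        self.connected_devices = 0
        self.reported_connections = 0  # the value a metrics scrape would see
        self.log: List[str] = []

    def connect(self) -> bool:
        # A device can only connect while the adapter accepts connections.
        if not (self.accepting and self.listener_open):
            return False
        self.connected_devices += 1
        self.reported_connections = self.connected_devices
        return True

    def shutdown(self) -> None:
        # 1. Reject any new connection attempts.
        self.accepting = False
        self.log.append("reject-new")
        # 2. Report 0 authenticated connections.
        self.reported_connections = 0
        self.log.append("report-zero")
        # 3. Wait at least one scrape interval so the metric is picked up.
        self._sleep(self.scrape_interval_s)
        self.log.append("waited")
        # 4. Stop the listener; connected devices are dropped and reconnect elsewhere.
        self.listener_open = False
        self.connected_devices = 0
        self.log.append("listener-stopped")
        # 5. Shut down the adapter.
        self.running = False
        self.log.append("stopped")
```

The point of the ordering is that the reported count drops to 0 a full scrape interval before the devices are actually disconnected, so the limit check should already see free capacity when they reconnect.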

WDYT? @kaniyan @bordeuax

bordeuax commented 4 years ago

@sophokles73, this graceful shutdown proposal will make the behavior more predictable. But I have one remark on the first point, "1. reject any new connection attempts from devices": I think in this case we need to use health indicators like /liveness and /readiness. We need to keep /readiness in KO status and /liveness in OK status, so that the load balancer (whose health check points to /readiness) redirects new traffic to the new pods. Rejecting connections at the pod level will not help, because the LB will continue to forward traffic to this pod (or at least it is very likely that new traffic will go to the old pod), and we would see exactly the same rejected connections.
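
A minimal Python sketch of that idea, assuming the load balancer only routes to pods whose /readiness check returns 200. Pod, probe and routable are hypothetical names for illustration, and the status codes 200 and 503 are my assumption for "OK" and "KO".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pod:
    """Hypothetical adapter pod with a draining flag set during shutdown."""
    name: str
    draining: bool = False

    def probe(self, path: str) -> int:
        # Liveness stays OK while draining so the pod is not killed early.
        if path == "/liveness":
            return 200
        # Readiness goes KO while draining so no new traffic is routed here.
        if path == "/readiness":
            return 503 if self.draining else 200
        return 404


def routable(pods: List[Pod]) -> List[str]:
    # Assumption: the load balancer forwards new connections only to ready pods.
    return [pod.name for pod in pods if pod.probe("/readiness") == 200]
```

With one draining pod and one fresh pod, only the fresh pod is returned by routable, while the draining pod still answers /liveness with 200.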

eclipse-hono / hono

Hono resource limits : connection limits edge case #1866