Closed by ThomasK33 3 years ago
Downtime caused by Kafka heartbeats timing out due to slow writes / write timeouts to InfluxDB
Redeploying Kafka with adjusted timeouts should resolve this issue.
After further investigation, it turned out that this issue was caused by InfluxDB either taking a long time to complete a write or not accepting the write at all and timing out.
As a result, each service writing to InfluxDB blocked for a long time until its write timed out. Meanwhile, the Kafka broker interpreted this as a consumer going stale (either livelocked or not responding at all).
After deactivating all writes to InfluxDB, the issue no longer occurred.
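One way to keep a slow InfluxDB write from stalling a Kafka consumer is to bound each write with a deadline shorter than the consumer's session timeout. The following is a minimal sketch of that idea, not the code used in this project. It uses only `asyncio`, with no real Kafka or InfluxDB client, and `write_with_deadline`, `handle_message`, `fast_write`, `slow_write` and the deadline values are all hypothetical names and numbers chosen for illustration.

```python
import asyncio


async def write_with_deadline(write, deadline_s):
    """Await write(), but give up after deadline_s seconds.

    Returns True if the write finished in time, False if it was abandoned.
    """
    try:
        await asyncio.wait_for(write(), timeout=deadline_s)
        return True
    except asyncio.TimeoutError:
        # The pending write is cancelled so the handler can return promptly.
        return False


async def handle_message(message, write_point, deadline_s=0.05):
    """Hypothetical per-message handler: process, then attempt a bounded write."""
    ok = await write_with_deadline(lambda: write_point(message), deadline_s)
    return "written" if ok else "skipped"


# Stand-ins for a metrics write (assumptions, not a real InfluxDB client).
async def fast_write(point):
    return None


async def slow_write(point):
    await asyncio.sleep(1.0)  # simulates a write that hangs


if __name__ == "__main__":
    print(asyncio.run(handle_message({"v": 1}, fast_write)))
    print(asyncio.run(handle_message({"v": 1}, slow_write, deadline_s=0.01)))
```

With this shape, a hanging write costs at most `deadline_s` per message, so the handler returns to the consumer loop well before the broker would consider the consumer stale. Whether a skipped metrics write is acceptable depends on the service; buffering and retrying outside the consumer loop would be an alternative.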
In 898fbeb, the API (https://api.fortify.gg/graphql?query=%7Bversion%7D) was down: