A number of containers launched on Titus are not configured to publish
metrics. This leads to two additional sources of noise in that
environment:
HTTP timeouts with three retries occur every 5 seconds, and there is a
failed attempt to parse a non-existent error response.
Since no metrics are sent, the daemon will abort every minute, leaving
behind core files, which will eventually fill up the disk.
This change removes the ensure_not_stuck watchdog from the main upkeep
loop, to prevent the aborts from occurring. We keep the function in the
code, so that we can rapidly re-enable it for snapshot builds, if we need
it for debugging.
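
To illustrate the idea of keeping the watchdog in the code while taking
it out of the default upkeep path, here is a minimal sketch in Python.
It is not the daemon's actual implementation: the `Upkeep` class, the
`enable_watchdog` flag, the 60-second threshold, and the use of an
exception in place of a process abort are all assumptions made for the
example. Only the name `ensure_not_stuck` comes from this change.

```python
import time


class Upkeep:
    """Hypothetical upkeep loop with an opt-in stuck watchdog."""

    def __init__(self, enable_watchdog=False, stuck_after_seconds=60.0,
                 clock=time.monotonic):
        # enable_watchdog is an assumed switch, e.g. set only for
        # snapshot builds when debugging.
        self.enable_watchdog = enable_watchdog
        self.stuck_after_seconds = stuck_after_seconds
        self.clock = clock
        self.last_send = clock()

    def record_send(self):
        # Called whenever metrics are successfully sent.
        self.last_send = self.clock()

    def ensure_not_stuck(self):
        # Stand-in for the abort: raise if nothing was sent recently.
        idle = self.clock() - self.last_send
        if idle > self.stuck_after_seconds:
            raise RuntimeError(f"no metrics sent for {idle:.0f}s")

    def run_once(self):
        # ... other upkeep work would go here ...
        if self.enable_watchdog:
            self.ensure_not_stuck()
```

With the flag off, a container that never publishes metrics passes
through `run_once` without aborting; turning the flag on restores the
old behavior for debugging.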
For HTTP timeouts, we now set http_code to -1, which matches the value
we set for metrics. The aggregator response handler covers this case
separately and does not produce additional logs.
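
The timeout handling described above can be sketched as follows. This
is an illustration under assumptions, not the real handler: the
function name `handle_aggregator_response`, the `HTTP_TIMEOUT`
constant, the return strings, and the `log` callable are hypothetical.
The only detail taken from this change is that a timeout is reported
with an http_code of -1 and handled without extra logging.

```python
HTTP_TIMEOUT = -1  # sentinel http_code for a request that timed out


def handle_aggregator_response(http_code, body, log):
    """Classify a response; `log` is any callable taking a message."""
    if http_code == HTTP_TIMEOUT:
        # No response arrived, so there is no error body to parse
        # and nothing further to log.
        return "timeout"
    if 200 <= http_code < 300:
        return "ok"
    # A real HTTP error: a body exists and is worth reporting.
    log(f"aggregator returned {http_code}: {body}")
    return "error"
```

Checking for the sentinel first means the error-parsing path only runs
when an error response actually exists.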