[X] I am using charts that are officially provided
Controller Version
0.9.3
Deployment Method
Helm
Checks
[X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
[X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
Deploy helm charts
Wait for the listener to restart
Describe the bug
Intermittently, some of our listener pods become unresponsive for 15-20 minutes. This surfaces as long queue times for Workflows. It occurs ~4 times a day and is usually correlated with load on the GHES server. It seems to happen in 'waves', impacting roughly ~90% of our listeners.
Observed behavior:
The listener throws context deadline exceeded (Client.Timeout exceeded while awaiting headers) - this error is repeated 3 times with 5-minute pauses between the events.
The listener throws: read tcp <REDACTED>:41054-><REDACTED>:443: read: connection timed out
One of the following occurs:
The controller restarts the listener pod and it comes back as healthy
No error message is thrown and the listener continues on as expected
The listener throws: Message queue token is expired during GetNextMessage, refreshing... and it continues on as expected.
During step 1 the listener is not functional, which causes 15-20 minutes of downtime (see the sketch below for how the 5-minute timeouts add up).
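For context, here is a minimal, self-contained Go sketch (not the actual listener code; the endpoint and the 5-minute timeout are assumptions inferred from the log messages above) showing how a long-poll loop with a fixed client timeout turns a silently dropped connection into roughly 15 minutes of lost polling when three consecutive attempts time out:

```go
// Minimal sketch, NOT the actual ARC listener implementation: it only
// illustrates how three back-to-back 5-minute client timeouts add up to
// ~15 minutes during which no messages are fetched.
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// Assumed value, inferred from the logs above; the real listener's
// timeout may differ.
const pollTimeout = 5 * time.Minute

func getNextMessage(ctx context.Context, client *http.Client, url string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	// On a dead-but-not-closed connection this blocks for the full client
	// timeout before failing with "context deadline exceeded
	// (Client.Timeout exceeded while awaiting headers)".
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return nil
}

func main() {
	client := &http.Client{Timeout: pollTimeout}
	// Hypothetical message-queue URL, used only for illustration.
	const url = "https://ghes.example.com/message-queue"

	for attempt := 1; attempt <= 3; attempt++ {
		start := time.Now()
		if err := getNextMessage(context.Background(), client, url); err != nil {
			fmt.Printf("attempt %d failed after %s: %v\n", attempt, time.Since(start).Round(time.Second), err)
			continue
		}
		fmt.Println("message received")
		break
	}
}
```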
Note: We do not observe any other connectivity issues with our GHES instance. We are investigating our connectivity to GHES, the resiliency of the server, and its compatibility with HTTP long polls. With that said, I think there may be an opportunity here to make the listeners more resilient to networking blips. Should this timeout be set to 1 minute? Is 5 minutes too long? A rough sketch of one such approach follows.
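If the 5-minute window is indeed the per-request client timeout, one possible direction (purely a sketch under assumptions, not the listener's current behavior or a confirmed configuration option) would be a shorter per-attempt timeout with retries and backoff, so a single dead connection costs about a minute rather than five before a fresh request is attempted:

```go
// Sketch only: a blip-tolerant polling strategy with a shorter per-attempt
// timeout and exponential backoff. The timeout values, retry counts, and
// endpoint are assumptions for illustration, not ARC's actual settings.
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

const (
	attemptTimeout = 1 * time.Minute // hypothetical shorter per-poll timeout
	maxAttempts    = 5
	baseBackoff    = 2 * time.Second
)

func pollOnce(parent context.Context, client *http.Client, url string) error {
	// Bound each individual long poll instead of relying on a single long
	// client-wide timeout.
	ctx, cancel := context.WithTimeout(parent, attemptTimeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return nil
}

func pollWithRetry(ctx context.Context, client *http.Client, url string) error {
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err := pollOnce(ctx, client, url); err != nil {
			lastErr = err
			// Back off briefly before retrying; a failed request is torn
			// down, so the retry can establish a fresh connection after a
			// transient network blip.
			time.Sleep(baseBackoff << attempt)
			continue
		}
		return nil
	}
	return fmt.Errorf("all %d poll attempts failed, last error: %w", maxAttempts, lastErr)
}

func main() {
	client := &http.Client{}
	// Hypothetical endpoint; the real listener polls the Actions message queue on GHES.
	if err := pollWithRetry(context.Background(), client, "https://ghes.example.com/message-queue"); err != nil {
		fmt.Println(err)
	}
}
```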
Describe the expected behavior
The listener is not restarted by the controller and doesn't become unresponsive for 15-20 minutes.
Additional Context
Controller Logs
Runner Pod Logs