Really hard to know what is going on with the information provided. A wild guess might be that Chrome crashes and the connection-retry mechanism in 4.1.2 kicks in and stays there for an extended time until it realizes it cannot connect.
We have changed the default so this retry mechanism is now avoided; that change will be part of 4.1.3. I can leave this issue open until 4.1.3 is out so you can try again.
Really hard to know what is going on with the information provided.
I agree. In part, I was hoping to get some tips on further things that could be done to debug the situation. I also agree with your guess that Chrome is crashing.
My attempts to put together a minimal reproduction have not been successful. I am able to put together a grid locally with docker-compose (roughly along the lines of the sketch below) and see the same Connection refused errors, but the hub does not become unresponsive when that happens.
Looking forward to 4.1.3. We'll upgrade to it when it is available.
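For context, the local grid I used was essentially the hub-and-node pattern from the docker-selenium README rather than anything exotic. A minimal sketch (not our exact compose file; the image tags and SE_EVENT_BUS_* variables are the ones that README documents):

#!/bin/bash
# stand up a minimal local grid: one hub, one chrome node
docker network create grid

docker run -d --net grid --name selenium-hub \
  -p 4442-4444:4442-4444 \
  selenium/hub:4.1

docker run -d --net grid --shm-size=2g \
  -e SE_EVENT_BUS_HOST=selenium-hub \
  -e SE_EVENT_BUS_PUBLISH_PORT=4442 \
  -e SE_EVENT_BUS_SUBSCRIBE_PORT=4443 \
  selenium/node-chrome:4.1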
I notice that the hub api stops responding to my requests when I call the graphql endpoint to get status details. 4.1.3 did not fix the issue for me.
This is exactly the same as we are experiencing. In the meantime, we have added a liveness probe against graphql so that the hub container is rotated out by Kubernetes whenever this happens.
4.1.3 was released a couple of days ago. Did you see anything different?
I moved my response to https://github.com/SeleniumHQ/selenium/issues/10404. Might all be the same issue.
Most likely this was resolved in 4.1.3 or 4.1.4; please comment if the issue is still happening.
Hello @diemol,
we did run an upgrade almost one month ago, after you recommended doing so. at the same time, we switched our container's liveness probe from GET /status to a graphql-based check, seen here:
#!/bin/bash
# graphql-based liveness check: exits non-zero if the hub stops answering graphql
set -e
curl \
  --connect-timeout 5 \
  --max-time 10 \
  -sSf \
  -d '{"operationName":"GetNodes","variables":{},"query":"{__typename}"}' \
  'http://localhost:4444/graphql'
this seems to have been a successful mitigation; our containers now just rotate out when this fails.
we reenabled the previous /status endpoint liveness probe and redeployed 4.1.4 at 9a this morning. by 12:30p, the grid was in its characteristic failing state once again. (both times pacific).
I managed to grab a heap dump. it is 78MB, and it seems there is a 25MB file size limit for attachments: https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/attaching-files. if you would like a copy, how do you recommend I send it to you?
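(if it helps, one option would be to gzip and split the dump into chunks under the limit; a sketch, with heap.hprof standing in for the actual dump file:)

#!/bin/bash
# hprof files compress well; split the result into chunks under the 25MB limit
gzip -9 heap.hprof                              # -> heap.hprof.gz
split -b 20m heap.hprof.gz heap.hprof.gz.part-
# reassemble with: cat heap.hprof.gz.part-* > heap.hprof.gz && gunzip heap.hprof.gz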
thank you, -Michael Evans
I am confused, is this the same issue? For liveness, we recommend using the /readyz endpoint.
is this the same issue?
my apologies, I should have led with the fact that @millerick and I are operating the same grid(s). by the "same failure mode", I mean
... the hub becomes unresponsive to new [session] requests coming into it. The hub continues to respond to its health checks and serves HTTP requests, but no new sessions begin across any of the nodes
these portions of the thread opener. to clarify:
serves HTTP requests
the hub serves the /status endpoint and all of the static frontend assets, but it serves neither session requests nor graphql queries.
(the session requests are coming from a webdriverio v7 nodejs client).
(the frontend still loads, but then spins because no data is returned).
we do see that the hub accepts tcp connections for all of these http requests, but it never sends back an HTTP response. the hub does not close the connections outright, and there are no corresponding logs for these incoming requests either. the hub goes strangely quiet while in this state (a quick way to observe this from inside the hub pod is sketched below).
I was unable to ascertain anything useful with a log level of FINEST. I also set up a jaeger instance to receive traces, and observed the same lack of useful information.
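for what it's worth, this is the kind of check that shows the behavior from inside the hub pod (a sketch; the 15-second cutoff is arbitrary):

#!/bin/bash
# in the bad state: curl -v reports the TCP connect succeeding almost
# immediately, then hangs with no HTTP response until --max-time fires
curl -v --max-time 15 \
  -d '{"query":"{__typename}"}' \
  'http://localhost:4444/graphql'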
some information about our setup:
we have two grid deployments ("sandbox" and "tools") across two different kubernetes clusters, one per aws account. (they do not share any infrastructural resources whatsoever, and the consumers of these grid deployments are likewise separated).
there is a difference in the number of network hops for consumers on sandbox and tools, but we have seen this failure mode in both deployments.
the distribution of browser sessions we serve is more peaky in sandbox than in tools.
we serve most of our browser sessions in tools.
we see this issue more often in tools than in sandbox.
in sandbox, the deployment consists of 1 hub pod and 8 chrome node pods. (no autoscaling)
in tools, the deployment consists of 1 hub pod and 6 chrome node pods. (no autoscaling)
our chrome nodes are sized to serve 2 chrome browser sessions each.
our images are based on https://github.com/SeleniumHQ/docker-selenium
the hub is built from selenium/hub:4.1, with some extra scripts to set up jaeger tracing.
the chrome nodes are built from selenium/node-chrome:4.1, with scripts to set up jaeger tracing, manage kube graceful termination, and tmpwatch to clean up orphaned browser session files.
(our pods' filesystems are memory-backed, so orphaned browser session files eat into our memory allotment).
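the orphaned-file cleanup is along these lines (a sketch, not our exact script; the temp-dir name patterns and the 2-hour cutoff are assumptions):

#!/bin/bash
# remove chrome/chromium temp profile directories that have not been
# modified in 2 hours, i.e. left behind by sessions that died uncleanly
find /tmp -maxdepth 1 \
  \( -name '.org.chromium.Chromium.*' -o -name '.com.google.Chrome.*' \) \
  -mmin +120 -exec rm -rf {} +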
For liveness, we recommend using the /readyz endpoint.
thank you. I was unaware of this endpoint. we have been using /status for all of our probes (save for the graphql hack we put in last month). I will make this adjustment and use /readyz for all of our probes.
I don't know what /readyz returns when the hub is in its failing state. currently, it returns Router is true for the hub.
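concretely, the probe script just becomes the earlier curl with the endpoint swapped (a sketch; the timeouts are whatever fits the probe configuration):

#!/bin/bash
# liveness check against the /readyz endpoint recommended above
set -e
curl \
  --connect-timeout 5 \
  --max-time 10 \
  -sSf \
  'http://localhost:4444/readyz'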
I will report back if I observe the same failing behavior, and will take another heap snapshot if the hub becomes unresponsive. if I don't see the hub become unresponsive in the next two weeks, I will report back as well.
in the meantime, I would be happy to provide any extra information about our setup that you require, or to inspect anything about the grid that would help.
in the event the grid ceases to serve sessions, do you have any recommendations for what I should do to gather information that will best aid your investigation?
thank you, -Michael Evans
Hello @diemol,
our grid once again got itself into an unresponsive state. the /readyz endpoint during this time showed Router is true, no change from when the hub is operating normally.
I managed to grab a heap dump once again. it is still quite large. but I did install a heap dump viewer, VisualVM, and am at least able to take a look around its contents. nothing particularly stands out about the objects in memory.
something does stand out on the threads page, though.
here are the threads from the hub in a good state, a few minutes after we restarted it:
here are the threads from the hub while in a bad state:
it seems that thread pool 5 from the bad snapshot has quite a few more threads than thread pool 5 from the good snapshot.
also, all of these threads are in a TIMED_WAITING state.
if I had to venture a guess, they seem to be deadlocked?
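one way to check whether these are genuine deadlocks rather than idle pool threads would be a live thread dump from the hub JVM (a sketch; it assumes a JDK with jcmd/jstack inside the hub container and that the hub is the only java process there):

#!/bin/bash
# capture a live thread dump from the hub JVM and look for reported deadlocks
PID=$(pgrep -f java | head -n 1)
jcmd "$PID" Thread.print > "threads-$(date +%s).txt"
# jstack -l prints "Found one Java-level deadlock:" sections if any exist
jstack -l "$PID" | grep -A 20 -i 'deadlock' || echo "no deadlock reported"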
-Michael Evans
OK, there was an issue where sometimes the full HTTP message was not being read, which caused the Netty pipeline to get stuck and not process any more requests. This was fixed in 4.1.4. We were able to fix it because a test case to reproduce it was provided.
If this is still happening, it would be great to get a way to reproduce it.
This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
What happened?
This is an issue that we have seen regularly since we began using Selenium Grid 4 a few months ago, and only recently were able to form a conjecture about the cause. We have noticed that when Chrome fails to start on one of the nodes, the hub becomes unresponsive to new requests coming into it. The hub continues to respond to its health checks and serves HTTP requests, but no new sessions begin across any of the nodes. We only see this behavior on our selenium grid that receives very frequent traffic and almost always has one or two sessions running on it.
We have unfortunately not been able to reproduce outside of our Kubernetes cluster. We also have not been able to detect a pattern with when the issue occurs. Sometimes it will happen within hours of the selenium grid being redeployed on the cluster, and sometimes we will go days without running into the issue.
We are also unsure of what more can be done to debug the situation.
How can we reproduce the issue?
These are the environment variables that are supplied to the Chrome node containers:
These are the environment variables supplied to the Hub:
Relevant log output
Operating System
Selenium Grid 4.1.2
Selenium version
Selenium Grid 4.1.2
What are the browser(s) and version(s) where you see this issue?
Chrome
What are the browser driver(s) and version(s) where you see this issue?
ChromeDriver 99.0.4844.51
Are you using Selenium Grid?
4.1.2