Open bhecquet opened 8 months ago
@bhecquet, thank you for creating this issue. We will troubleshoot it as soon as we can.
Triage this issue by using labels.
If information is missing, add a helpful comment and then the I-issue-template label.
If the issue is a question, add the I-question label.
If the issue is valid but there is no time to troubleshoot it, consider adding the help wanted label.
If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C), add the applicable G-* label, and it will provide the correct link and auto-close the issue.
After troubleshooting the issue, please add the R-awaiting answer label.
Thank you!
I may have a clue about the root cause. I have a mechanism (a custom SlotSelector) that prevents the node from running more than 2 sessions at a time unless I pass a specific capability, so that during a test I can handle the case where a browser (e.g. Chrome) starts another browser (e.g. Edge). This feature seems to break the distributor, because it believes it has free slots whereas we prevent it from using them. I will see whether a correction on my side solves the problem.
Hello, after investigating and correcting my implementation, I think there is still a weakness in the LocalDistributor implementation. Imagine you have
Then all creation threads will be stuck, and new session requests will keep arriving at regular intervals because one node can still accept sessions. The LocalDistributor.sessionCreatorExecutor will then contain session requests that may have already expired.
Increasing the thread pool size could be a workaround, but would there be a way to check that the sessionCreatorExecutor has room for creating sessions?
I see that you have few resources, and when a Node is slow in creating a session, it affects the rest because not many threads are available to process requests.
You could increase --max-threads and see if it helps.
Hello @diemol ,
Thanks for your reply
Correct me if I'm wrong, but the --max-threads option, read by BaseServerOptions.getMaxServerThreads(), is not used in the code (or I'm missing something), and it has no influence on the LocalDistributor thread pool size. Anyway, I will try the --newsession-threadpool-size option.
You are correct; newsession-threadpool-size should be the one. I just mixed up the flags.
Hello,
increasing the thread pool size and also adding a filter on the sessionCreatorExecutor queue, so that no new session requests are added while some are still waiting in the queue, seems to work pretty well. The most important thing is that if session creation becomes slow, we wait for the queue to be empty before sending more session requests to LocalDistributor.
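The queue filter described above can be sketched as follows. This is a minimal illustration using only JDK classes, not the actual workaround code and not part of Selenium: the class `BackpressureSubmitter` and its methods `hasRoom` and `trySubmit` are hypothetical names, and the pool stands in for sessionCreatorExecutor.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Selenium: accepts new work only when
// the pool has no backlog, so pending requests stay with the caller.
class BackpressureSubmitter {
  private final ThreadPoolExecutor executor;

  BackpressureSubmitter(int threads) {
    // Fixed-size pool with an unbounded queue, similar in shape to
    // what Executors.newFixedThreadPool(threads) creates.
    this.executor = new ThreadPoolExecutor(
        threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
  }

  // True when no previously submitted task is still waiting for a worker.
  boolean hasRoom() {
    return executor.getQueue().isEmpty();
  }

  // Submits the task only if nothing is waiting; returns false otherwise.
  boolean trySubmit(Runnable task) {
    if (!hasRoom()) {
      return false;
    }
    executor.execute(task);
    return true;
  }

  void shutdown() {
    executor.shutdownNow();
  }
}
```

With this shape, a request that is refused stays in the new session queue, where the existing timeout handling can still remove it. Note that the check and the submit are not atomic, so concurrent callers could each queue one task; a stricter version would need a lock around `trySubmit`.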
Thank you for sharing that information.
Do you think we need to do something else or we can close this issue?
As I said, I think something could be improved in the handling of new session requests, especially when the hub or nodes are slow, namely to avoid creating sessions that have already timed out. For example, one could add a job like the one in LocalNewSessionQueue that would remove timed-out session requests before trying to create them.
But if I'm the only one to have this problem, then I can live with my workaround, which does this (and a bit more).
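The idea of dropping timed-out requests before creation could look roughly like this. It is a sketch under stated assumptions: `ExpiringRequestQueue`, `Request`, and `nextLive` are hypothetical names invented for illustration, and none of this is Selenium's LocalNewSessionQueue code.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;

// Hypothetical sketch, not Selenium code: a queue that never hands out
// a request whose timeout has already elapsed.
class ExpiringRequestQueue {

  // Illustrative stand-in for a queued session request.
  static final class Request {
    final String id;
    final Instant enqueuedAt;

    Request(String id, Instant enqueuedAt) {
      this.id = id;
      this.enqueuedAt = enqueuedAt;
    }
  }

  private final Deque<Request> queue = new ArrayDeque<>();
  private final Duration timeout;

  ExpiringRequestQueue(Duration timeout) {
    this.timeout = timeout;
  }

  synchronized void add(Request request) {
    queue.addLast(request);
  }

  // Polls until a request that is still within its timeout is found.
  // Expired requests are dropped; a real grid would also notify the client.
  synchronized Optional<Request> nextLive(Instant now) {
    Request request;
    while ((request = queue.pollFirst()) != null) {
      Duration waited = Duration.between(request.enqueuedAt, now);
      if (waited.compareTo(timeout) < 0) {
        return Optional.of(request);
      }
    }
    return Optional.empty();
  }
}
```

Passing `now` as a parameter instead of calling `Instant.now()` inside the method keeps the expiry rule easy to test with fixed timestamps.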
But if a session request has timed out, the Distributor should not be able to retrieve it. Maybe I do not understand the scenario.
In the normal process:
In case of high load (tens of sessions received by the hub and few nodes to handle them) AND session-timeout (in my case 30 secs) set on NewSessionQueue:
The Distributor takes the session request through 'getNextAvailable', so it is removed from LocalNewSessionQueue and 'timeoutSessions' won't handle it anymore until it returns to LocalNewSessionQueue.
But the Distributor only takes more session requests if there is a stereotype (slot) available on the Node. That is why I am confused.
I think you talk about this portion of code:
This only checks whether there is at least one slot available. If that is the case, then (as far as I understand the code) all the slots are returned. Imagine you have a node that can handle:
@bhecquet starting more sessions than slots is prevented later in these lines: https://github.com/SeleniumHQ/selenium/blob/b7d831db8cfeac9dfbf441c623c6d4bf580f2cc5/java/src/org/openqa/selenium/grid/distributor/local/LocalDistributor.java#L550-L565
@joerg1985, yes, sure, but the problem is the delay in starting sessions when nodes / distributor are slow, not the fact that more sessions are created than expected
@bhecquet what my answer meant to say: rejecting the new session requests that exceed the number of slots happens pretty fast and should not add too much overhead in processing.
@joerg1985, you're right, this step is quick. But if a slot is free at this moment, a session will be created even though the request may already have expired, which takes time on an overloaded LocalDistributor. And just after that, the session will be removed because it has already timed out.
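One way to avoid starting a browser for a request that expired while waiting in the executor is to re-check the deadline at the moment the worker thread actually picks the task up. The following is only a sketch of that idea: `DeadlineGuard` and its parameters are hypothetical names, and LocalDistributor does not necessarily work this way.

```java
import java.time.Clock;
import java.time.Instant;

// Hypothetical sketch, not Selenium code: wraps the expensive session
// creation so it is skipped when the request deadline has already passed.
class DeadlineGuard {

  // The deadline is evaluated when the returned Runnable runs, not when it
  // is built, so time spent waiting in an executor queue is accounted for.
  static Runnable guard(
      Clock clock, Instant deadline, Runnable createSession, Runnable onExpired) {
    return () -> {
      if (clock.instant().isAfter(deadline)) {
        onExpired.run(); // e.g. report the timeout, start no browser
      } else {
        createSession.run();
      }
    };
  }
}
```

A caller would submit `DeadlineGuard.guard(Clock.systemUTC(), deadline, create, reject)` to the executor instead of the raw creation task. The `Clock` parameter exists so the behaviour can be verified with `Clock.fixed`.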
Hello folks, especially to our distinguished hardworking contributors,
I was executing 50 tests + 30 every minute for 5 minutes. The first 3 minutes were flawless. Each batch of tests in the first 3 minutes completed on average in under 3 minutes without any issues. The 4th and 5th batches got stuck because the distributor and router consumed too many resources and starved the rest of the components, including themselves.
Netty eats up a lot of memory (HashMap$TreeNode.find and JsonOutput.getMethod), and usage never goes down until it is bounced, while the Java component of the distributor consumes a lot of CPU for the threads. Local Distributor - Session Creation and Local Distributor - New Session Queue get locked up while HttpClient-*-SelectorManager waits patiently; from there everything goes haywire.
In addition to timed-out sessions not being removed from the queue, it also continually provisions more browser nodes even if Queue size: 0 for a long time.
Below is what it looks like at the end of the run. There were no more tests in the queue, but it went up as high as 831 idle browser nodes.
I can reproduce this consistently on 4.23.0 (revision 77010cd) and would be willing to demo live.
Please provide your email if interested for a working session and see more details.
Thanks!
@rx4476 the leaking HttpClient-*-SelectorManager threads should have been fixed, but these fixes are not released yet.
Commits 97d56d04e1b4ab4f8e527f8849b777c1e91d13f7, a5de3775db76dde54b9c6a94fbe3f2b816eacb80, ed3edee0ac4f09a5555aec7c7ab20609d8b394f2, 5bac4795ff6acc9f1ca1a6436aea0970ff98fb07 and 7612405e34d282992b26a22d7bee921753020026 are all related to this, so you could check the nightly snapshots.
@joerg1985 thank you for the update.
HttpClient-*-SelectorManager and JdkHttpClient are more victims than culprits. They have been waiting patiently for garbage collection and suspension to be over.
Garbage collection and memory allocation (Compared to LocalDistributor$NewSessionRunnable.run, SelectorManager allocation/utilization is negligible):
Waiting threads:
Here is where the thread gets locked:
Hello Folks,
Just wanted to provide an update. I installed 4.25.0 (revision 030fcf7918) and ran an equal number of tests with scaledjobs vs scaledobjects. The latter worked like a charm: out of 1,625 tests it passed 100%, with 2 requeues (meaning the same test is sent again when a session can't be created for it).
When configured as scaledjobs, it crashes even when distributor memory utilization is still at <6 GiB (this didn't happen in 4.23). In the latest version, given the number and frequency of tests executed at a given time, scaledobjects seems to be the way to go, as the pod doesn't need to go through a build/tear down cycle every time a test is run. It takes more time in v4.25 to perform all those pod phase cycles for a job than in 4.23, eating up resources until they get terminated.
The distributor and router no longer eat up as much memory as they can. In v4.23 or lower they reached as much as 18 GiB and 16 GiB respectively before restarting (which didn't fix the issue). I had to build and deploy a pod cleaner to remove and start a new distributor/router pod to bring it back to a working condition.
Here's what it looks like at starting resource utilization vs peak: it goes back down to the starting resource utilization when browser nodes are scaled down or terminated.
v4.25 fixed most of the performance and stress issues, especially when socket is closed after a bad test run. And with that I'd like to express my sincerest gratitude to all the contributors and maintainers. You all are fantastic!
-r
One fix regarding session creation under high load has been made in https://github.com/SeleniumHQ/selenium/commit/e4ab299ea4d16943c18e8c31e9db1f7738ed9493. This was sometimes failing the StressTest, where a short session timeout is used.
There has been another improvement with https://github.com/SeleniumHQ/selenium/commit/fc03c5e85699d9f572a04b37525e83f539b90ef7, to detect stale sessions faster.
One open point is to check whether the session requests have been canceled before returning them from LocalNewSessionQueue.getNextAvailable.
What happened?
Hello,
on our setup (a hub with 15 nodes), under high load, we see that the hub tries to create sessions that are already timed out.
I can reproduce this with a rather smaller setup:
--session-request-timeout 30 --session-retry-interval 1000 --reject-unsupported-caps true
Hub is running on a small server (1 CPU)
Expected results: Grid continues to create sessions according to its capacity
Current result: No more sessions are created. This is due to #12848 (which is a good point), which kills the browser immediately if the session has timed out. At this point, the client (see the logs) receives the message "New session request timed out" as expected
Reading the code in LocalNewSessionQueue, I can't see why this can happen because there are 2 guards
I've added some logs to grid and to capabilities to follow the session request flow
Looking at the logs, I can see 2 problems:
Do you confirm my analysis ?
A workaround would be to increase the CPU number / or thread pool of the executor, but this would only be a workaround
I'll try to imagine a correction, but don't hesitate to suggest ideas
Bertrand
How can we reproduce the issue?
Relevant log output
Operating System
Linux
Selenium version
4.11.0
What are the browser(s) and version(s) where you see this issue?
Chrome / not related to browser
What are the browser driver(s) and version(s) where you see this issue?
not related to browser
Are you using Selenium Grid?
4.16.1