[🐛 Bug]: Selenium client does not terminate after Quit() and Dispose(), causing subsequent clients to fail due to a timeout

zaneclaes commented 7 months ago

What happened?

Please refer to this StackOverflow issue. Specifically, consider that variations of this solution have worked for many of the 100+ reporters of the bug:

So somewhere I read that after disposing the driver I should wait a few seconds before starting the next test, so I added the following line to dispose method:

    driver?.Quit();
    driver?.Dispose();
    Thread.Sleep(3000);

With this sleep modification I no longer get the timeout error, and there are no unnecessarily open chromedriver.exe and chrome.exe processes.

How can we reproduce the issue?

Run Selenium in a Linux-based Docker container. Use any working Selenium code, but after Quit/Dispose, immediately run the code again. Because the first Selenium client has not actually exited, subsequent attempts to create a new driver fail with the timeout error shown in the log output. Adding the Thread.Sleep fixes the issue, but it is impossible to guess the correct amount of time to sleep: most of the time 3 seconds is sufficient, but sometimes 30 seconds is not. It depends on when precisely the client actually exits, so the only concrete fix available to an end user appears to be some kind of process inspector that waits for Selenium to finish its job and the process to ACTUALLY exit.
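
A minimal sketch of what such a process inspector could look like, using only System.Diagnostics. The ProcessWaiter class and its WaitForExit method are hypothetical names, not part of Selenium, and the process names "chromedriver" and "chrome" are assumptions about what the container is running. Note that it matches by name, so it would also wait on unrelated browser sessions in the same container.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper (not a Selenium API): polls the process table until no
// process with any of the given names remains, or until the timeout elapses.
public static class ProcessWaiter
{
    // Returns true if every matching process exited before the timeout.
    public static bool WaitForExit(string[] processNames, TimeSpan timeout, TimeSpan pollInterval)
    {
        Stopwatch clock = Stopwatch.StartNew();
        while (true)
        {
            int running = 0;
            foreach (string name in processNames)
            {
                Process[] found = Process.GetProcessesByName(name);
                running += found.Length;
                foreach (Process p in found)
                {
                    p.Dispose(); // release the handle; we only needed the count
                }
            }

            if (running == 0) return true;               // everything has exited
            if (clock.Elapsed >= timeout) return false;  // gave up waiting
            Thread.Sleep(pollInterval);
        }
    }
}

// Possible usage in place of the fixed Thread.Sleep(3000), after driver.Quit():
//   bool exited = ProcessWaiter.WaitForExit(
//       new[] { "chromedriver", "chrome" },   // assumed process names
//       TimeSpan.FromSeconds(60),
//       TimeSpan.FromMilliseconds(250));
```

Compared with a fixed sleep, this returns as soon as the processes are gone and reports failure explicitly when they are not, which may also help narrow down whether lingering processes are really the cause.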

Relevant log output

OpenQA.Selenium.WebDriverException: The HTTP request to the remote WebDriver server for URL http://localhost:33573/session/f21974e68d1a91919579c9b93fa35039/url timed out after 60 seconds.
 ---> System.Threading.Tasks.TaskCanceledException: The request was canceled due to the configured HttpClient.Timeout of 60 seconds elapsing.
 ---> System.TimeoutException: The operation was canceled.
 ---> System.Threading.Tasks.TaskCanceledException: The operation was canceled.
 ---> System.IO.IOException: Unable to read data from the transport connection: Operation canceled.
 ---> System.Net.Sockets.SocketException (125): Operation canceled

Operating System

Linux (Ubuntu) + Docker

Selenium version

4.17.0

What are the browser(s) and version(s) where you see this issue?

Chromium (all versions)

What are the browser driver(s) and version(s) where you see this issue?

Chromium (all versions)

Are you using Selenium Grid?

No response

nvborisenko commented 7 months ago

Everybody here executes a lot of tests, even in parallel.

@zaneclaes when you say:

Try to run Selenium in a Linux-based Docker container. Use any working Selenium code, but after Quit/Dispose, immediately run the code again.

can you provide exact code which fails?

zaneclaes commented 7 months ago

Everybody here executes a lot of tests, even in parallel.

I am aware of that. Nonetheless, in the virtualized machine conditions in this ticket, the bug appears, as is demonstrated by the many comments. Perhaps I should have stated in the subject that the problem is specific to Linux + Docker? The bug does not appear at all on any of our non-virtualized Mac and Windows machines.

can you provide exact code which fails?

Sorry, I thought the multiple examples I cited in the linked Stack Overflow thread would be enough. Here's my own:

  public async Task LoadPage() {
    ChromeOptions options = new ChromeOptions();
    options.AddArgument("headless");
    options.AddArgument("--no-sandbox"); 
    options.AddArgument("--disable-infobars"); 
    options.AddArgument("--disable-extensions");
    options.AddArgument("--disable-gpu"); 
    options.AddArgument("--disable-dev-shm-usage"); 

    ChromeDriver driver = new ChromeDriver(options);
    driver.Navigate().GoToUrl("https://google.com");
    await Task.Delay(TimeSpan.FromSeconds(30)); // delay to simulate some work
    driver.Quit();
    driver.Dispose();
  }

Then just call await LoadPage() twice in immediate succession.

But perhaps the more relevant bit is that the code must be running in a Docker container, i.e.,

ARG DOCKER_REGISTRY="mcr.microsoft.com"
ARG DOTNET_VERSION="8.0"
ARG DOTNET_RUNTIME="bookworm-slim"

FROM $DOCKER_REGISTRY/dotnet/aspnet:$DOTNET_VERSION-$DOTNET_RUNTIME

# ... application code ...

# Install Chrome (for Selenium)
RUN apt-get update && apt-get install -y \
 apt-transport-https \
 ca-certificates \
 curl \
 gnupg \
 hicolor-icon-theme \
 libcanberra-gtk* \
 libgl1-mesa-dri \
 libgl1-mesa-glx \
 libpango1.0-0 \
 libpulse0 \
 libv4l-0 \
 fonts-symbola \
 --no-install-recommends \
 && curl -sSL https://dl.google.com/linux/linux_signing_key.pub | apt-key add - \
 && echo "deb [arch=amd64] https://dl.google.com/linux/chrome/deb/ stable main" > /etc/apt/sources.list.d/google.list \
 && apt-get update && apt-get install -y \
 google-chrome-stable \
 --no-install-recommends \
 && apt-get purge --auto-remove -y curl \
 && rm -rf /var/lib/apt/lists/*

zaneclaes commented 7 months ago

Well... this is strange. I was trying to create a minimal repro (since I'm headed on PTO this weekend), and the exact same code running on the same physical Docker host does not trigger the issue. I can only conclude there are some confounding factors, i.e., something else in the application is somehow interfering with Selenium's ability to gracefully shut down. I guess I'm not as confident any more that this is actually a Selenium bug, though I'm not sure how to go about diagnosing the root cause in our app. Nonetheless, it feels appropriate to close this issue...

titusfortner commented 7 months ago

That SO post is about Firefox and an implementation that is no longer used. The problems people are reporting there are kind of all over the place, so I don't think there are hundreds of people with the same issue.
You shouldn't need to quit & dispose? Quit should do everything necessary.
Chrome by default creates a new process for each new session and can do many at the same time; you don't even need to quit the previous session for the next session request to work.
A timeout during a call to the url endpoint is almost always related to the site not completely loading. You can try setting the pageLoadStrategy to "Eager" or "None" to see if that is the issue
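
For reference, a minimal sketch of trying the "Eager" strategy in the .NET bindings, assuming the PageLoadStrategy property on ChromeOptions in Selenium 4. The flags and URL are borrowed from the repro earlier in this thread; everything else is illustrative.

```csharp
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

ChromeOptions options = new ChromeOptions();
options.AddArgument("--headless");
options.AddArgument("--no-sandbox");
options.AddArgument("--disable-dev-shm-usage");

// Eager: navigation returns once the DOM is ready, without waiting for
// images, stylesheets, and subframes. PageLoadStrategy.None does not wait at all.
options.PageLoadStrategy = PageLoadStrategy.Eager;

ChromeDriver driver = new ChromeDriver(options);
try
{
    driver.Navigate().GoToUrl("https://google.com");
}
finally
{
    driver.Quit(); // always shut the session down, even if navigation throws
}
```

If the 60-second timeout on the url endpoint disappears with Eager or None, that would point toward slow page loading rather than a lingering driver process.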

nvborisenko commented 7 months ago

Lol, I couldn't reproduce the issue on WSL, works smoothly. In any case, come back to us with a reproducible sample.

zaneclaes commented 7 months ago

That SO post is about Firefox and an implementation that is no longer used

While that is true of the OP, it is not true of many of the responses/answers, where there are 25+ upvotes for both Firefox and Chrome requiring a timeout to work successfully.

The problems people are reporting there are kind of all over the place, so I don't think there are hundreds of people with the same issue.

I'm frustrated by this too, but it's unsurprising considering the error message is so broad and uninformative that it is effectively useless.

Quit should do everything necessary.

This is an anti-pattern. Per the Microsoft definition of the Dispose pattern, implementing IDisposable implies that there are additional resources to be cleaned up (beyond ad-hoc methods like Quit) and thus failure to dispose a disposable can cause memory leaks. Personally I started with only a Dispose() but added the Quit() in my many attempts to figure out what could be going wrong.
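
To make the Dispose-pattern point concrete, here is a small self-contained sketch. FakeDriver is a hypothetical stand-in, not a Selenium type, and whether Selenium's own Quit() and Dispose() relate this way is an assumption of the sketch, not something established in this thread. It only illustrates that an idempotent Dispose makes Quit-then-Dispose safe, and that a using block runs cleanup even when the body throws.

```csharp
using System;

// Hypothetical stand-in for a driver; not a Selenium type.
public sealed class FakeDriver : IDisposable
{
    private bool _disposed;

    // Counts how many times the real cleanup work ran.
    public int CleanupRuns { get; private set; }

    // Ad-hoc shutdown method that simply delegates to Dispose.
    public void Quit() => Dispose();

    public void Dispose()
    {
        if (_disposed) return; // idempotent: repeated calls are no-ops
        _disposed = true;
        CleanupRuns++;         // stands in for releasing processes, sockets, etc.
    }
}

public static class DisposeDemo
{
    // Returns the number of times cleanup actually ran.
    public static int Run()
    {
        FakeDriver driver = new FakeDriver();
        try
        {
            using (driver)
            {
                driver.Quit(); // explicit shutdown
                throw new InvalidOperationException("simulated failure");
            } // Dispose is called here as well, despite the exception
        }
        catch (InvalidOperationException)
        {
            // swallowed for the demo
        }
        return driver.CleanupRuns;
    }
}
```

In this sketch cleanup runs exactly once even though both Quit() and the using block's Dispose() are invoked.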

Chrome by default creates a new process for each new session and can do many at the same time; you don't even need to quit the previous session for the next session request to work.

Yet, waiting after a Quit before starting a second session always fixes the problem. If the issue is unrelated to multiple sessions as you seem to be suggesting... how can we possibly explain the fix?

A timeout during a call to the url endpoint is almost always related to the site not completely loading. You can try setting the pageLoadStrategy to "Eager" or "None" to see if that is the issue

Good to know!

Lol, I couldn't reproduce the issue on WSL,

Nobody ever said the issue appeared on WSL, and in fact I explicitly stated (several times now, including in the OP) it only occurred in a docker environment... so I'm not sure what you're laughing at.

github-actions[bot] commented 6 months ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.