Randomly getting "chrome not reachable" when creating new webdriver sessions

JonWallsten commented 3 years ago

🐛 Bug Report

To Reproduce

It's hard to reproduce since it happens maybe every 10-20th full test run where we run 3 applications in parallel with 3 test files in parallel in each application using Protractors shardTestFiles.

I'm not looking for a solution handed on a silver platter, but I have been investigating this for years in our project and I would like to get to the bottom with in. I have checked every relevant issue everywhere containing "chrome not reachable" leading me nowhere. I have been debugging with Node, but since it rarely happens I have not yet had it happen while debugging. I can only the see the error afterward.

06:28:58  [06:28:58] I/testLogger - [chrome #01-47] PID: 4095
06:28:58  [chrome #01-47] Specs: /opt/oas/jenkins/slave/workspace/EB-Components_OAS_oas-web_master/packages/web-app-edit/test/e2e/suite/part/alias.e2e-spec.ts
06:28:58  [chrome #01-47] 
06:28:58  [chrome #01-47] [06:27:58] I/direct - Using ChromeDriver directly...
06:28:58  [chrome #01-47] 
06:28:58  [chrome #01-47] /opt/oas/jenkins/slave/workspace/EB-Components_OAS_oas-web_master/node_modules/selenium-webdriver/lib/error.js:546
06:28:58  [chrome #01-47]         throw new ctor(message);
06:28:58  [chrome #01-47]               ^
06:28:58  [chrome #01-47] WebDriverError: chrome not reachable
06:28:58  [chrome #01-47]   (Driver info: chromedriver=89.0.4389.23 (61b08ee2c50024bab004e48d2b1b083cdbdac579-refs/branch-heads/4389@{#294}),platform=Linux 3.10.0-327.36.3.el7.x86_64 x86_64)
06:28:58  [chrome #01-47]     at Object.checkLegacyResponse (/opt/oas/jenkins/slave/workspace/EB-Components_OAS_oas-web_master/node_modules/selenium-webdriver/lib/error.js:546:15)
06:28:58  [chrome #01-47]     at parseHttpResponse (/opt/oas/jenkins/slave/workspace/EB-Components_OAS_oas-web_master/node_modules/selenium-webdriver/lib/http.js:509:13)
06:28:58  [chrome #01-47]     at /opt/oas/jenkins/slave/workspace/EB-Components_OAS_oas-web_master/node_modules/selenium-webdriver/lib/http.js:441:30
06:28:58  [chrome #01-47]     at processTicksAndRejections (internal/process/task_queues.js:93:5)
06:28:58  [chrome #01-47] From: Task: WebDriver.createSession()
06:28:58  [chrome #01-47]     at Function.createSession (/opt/oas/jenkins/slave/workspace/EB-Components_OAS_oas-web_master/node_modules/selenium-webdriver/lib/webdriver.js:771:26)
06:28:58  [chrome #01-47]     at Function.createSession (/opt/oas/jenkins/slave/workspace/EB-Components_OAS_oas-web_master/node_modules/selenium-webdriver/chrome.js:761:15)
06:28:58  [chrome #01-47]     at Direct.getNewDriver (/opt/oas/jenkins/slave/workspace/EB-Components_OAS_oas-web_master/node_modules/protractor/built/driverProviders/direct.js:77:33)
06:28:58  [chrome #01-47]     at Runner.createBrowser (/opt/oas/jenkins/slave/workspace/EB-Components_OAS_oas-web_master/node_modules/protractor/built/runner.js:195:43)
06:28:58  [chrome #01-47]     at /opt/oas/jenkins/slave/workspace/EB-Components_OAS_oas-web_master/node_modules/protractor/built/runner.js:339:29
06:28:58  [chrome #01-47]     at _fulfilled (/opt/oas/jenkins/slave/workspace/EB-Components_OAS_oas-web_master/node_modules/q/q.js:834:54)
06:28:58  [chrome #01-47]     at /opt/oas/jenkins/slave/workspace/EB-Components_OAS_oas-web_master/node_modules/q/q.js:863:30
06:28:58  [chrome #01-47]     at Promise.promise.promiseDispatch (/opt/oas/jenkins/slave/workspace/EB-Components_OAS_oas-web_master/node_modules/q/q.js:796:13)
06:28:58  [chrome #01-47]     at /opt/oas/jenkins/slave/workspace/EB-Components_OAS_oas-web_master/node_modules/q/q.js:556:49
06:28:58  [chrome #01-47]     at runSingle (/opt/oas/jenkins/slave/workspace/EB-Components_OAS_oas-web_master/node_modules/q/q.js:137:13)

I had an idea patching the code to make it retry if the creation fails, but it seems like the code for creating the driver is synchronous even though the execution in the end is async. I have a hard time understanding where the actual session is created with all the flows and executions going in. The driver instance is returned as a ControlFlow promise and not the driver itself. Really confusing for someone not familiar with the code.

Can someone familiar with this project help me with ideas how to debug what's going on? I've read suggestions that settings a custom --user-data-dir could fix this because the issue could be that profiles/sessions in chrome are reused. But I added a random user dir for each instance, and it still happens. I have also read that it could be because ports are reused and that one driver uses a already running drivers port, so when the first is closed the second loses it connection. I would like to test this but I don't know where to set the custom port for this.

I guess this could also be the fault of ChromeDriver?

Thanks!

Environment

OS: Windows 10 1909 Browser: Puppeteer/Chromium (headless) Browser version: Chromium 89.0.4389.0 Browser Driver version: ChromeDriver 89.0.4389.23 Selenium-webdriver: 3.6.0

JonWallsten commented 3 years ago

Would it be possible to get more detailed information about why chrome is not reachable? Is chromedriver sending back that message?

I'm going to log the full response object here:

Also logging host/port here when we send the newSession command:

Then I can see if the port was used by someone else in close proximity.

JonWallsten commented 3 years ago

So this is probbaly not Seleniums fault.

I logged the full error object instead of just the message and reproduced the error.

{
  sessionId: '846e0e28fefcf9f84dd8aa6c79e3fe81',
  status: 100,
  value: {
    message: 'chrome not reachable\n' +
      '  (Driver info: chromedriver=89.0.4389.23 (61b08ee2c50024bab004e48d2b1b083cdbdac579-refs/branch-heads/4389@{#294}),platform=Linux 3.10.0-327.36.3.el7.x86_64 x86_64)'
  }
}

It seems like Chrome is crashing according to this commit comment: https://github.com/SeleniumHQ/selenium/commit/e85b830caf0f8342e7427d4d7a8136e312f24086

Chrome has its own specific error codes (see status.h). An example is
error code 100 which means that Chrome has crashed. Even though 10x
codes are not allowed by JSON wire protocol, let's make bindings a bit
less restrictive so that end user see driver error instead of a cryptic
`KeyError`.

Here is the source of kChromeNotReachable: https://chromium.googlesource.com/experimental/chromium/src/+/refs/heads/master/chrome/test/chromedriver/chrome/status.h

// WebDriver standard status codes.
enum StatusCode {
  kOk = 0,
  kInvalidSessionId = 6,
  kNoSuchElement = 7,
  kNoSuchFrame = 8,
  kUnknownCommand = 9,
  kStaleElementReference = 10,
  kElementNotVisible = 11,
  kInvalidElementState = 12,
  kUnknownError = 13,
  kJavaScriptError = 17,
  kXPathLookupError = 19,
  kTimeout = 21,
  kNoSuchWindow = 23,
  kInvalidCookieDomain = 24,
  kUnableToSetCookie = 25,
  kUnexpectedAlertOpen = 26,
  kNoSuchAlert = 27,
  kScriptTimeout = 28,
  kInvalidSelector = 32,
  kSessionNotCreated = 33,
  kMoveTargetOutOfBounds = 34,
  kElementNotInteractable = 60,
  kInvalidArgument = 61,
  kNoSuchCookie = 62,
  kElementClickIntercepted = 64,
  kUnsupportedOperation = 405,
  // Chrome-specific status codes.
  kChromeNotReachable = 100,
  kNoSuchExecutionContext,
  kDisconnected,
  kForbidden = 103,
  kTabCrashed,
  kTargetDetached,
  kUnexpectedAlertOpen_Keep,
};

Here is the usage: https://chromium.googlesource.com/experimental/chromium/src/+/refs/heads/master/chrome/test/chromedriver/chrome/devtools_http_client.cc

Status DevToolsHttpClient::Init(const base::TimeDelta& timeout) {
  if (!endpoint_.IsValid()) {
    return Status(kChromeNotReachable);
  }
  browser_info_.debugger_endpoint = endpoint_;
  base::TimeTicks deadline = base::TimeTicks::Now() + timeout;
  std::string version_url = endpoint_.GetVersionUrl();
  std::string data;
  while (!FetchUrlAndLog(version_url, &data) || data.empty()) {
    if (base::TimeTicks::Now() > deadline)
      return Status(kChromeNotReachable);
    base::PlatformThread::Sleep(base::TimeDelta::FromMilliseconds(50));
  }
  return ParseBrowserInfo(data, &browser_info_);
}
Status DevToolsHttpClient::GetWebViewsInfo(WebViewsInfo* views_info) {
  std::string data;
  if (!FetchUrlAndLog(endpoint_.GetListUrl(), &data))
    return Status(kChromeNotReachable);
  return internal::ParseWebViewsInfo(data, views_info);
}

So for some reason when creating a new session chrome sometimes crashes while running many processed in parallel. I ran 2 branches with 3 applications each running maximum of 5 parallels test files, so about 30 chromium processes active at any time.

I also double checked there wasn't a conflict with the ports.

I reproduced it in the second run. So it still happens relatively often. Some kind of retry functionality would have been nice. The I have no idea how to do that with the flow pattern in Selenium 3.6.0 since everything is promises.

JonWallsten commented 3 years ago

Created an issue here: https://bugs.chromium.org/p/chromedriver/issues/detail?id=3788

diemol commented 3 years ago

Thank you for debugging the issue and providing insightful info, @JonWallsten!

Yes, Selenium WebDriver passes the response returned from ChromeDriver. Sometimes this might be related to available resources, as Chrome is quite heavy during start up.

I will close this issue as you have already create one at the Chromium issue tracker. Thank you.

SeleniumHQ / selenium