Closed rsimha closed 6 years ago
/cc @timhaines in case he's aware of the root cause. I suspect this has to do with how selenium / chrome behave on Travis CI in the presence of xvfb
(?). There are so many moving parts here that I'm not yet sure where the problem lies.
@rsimha this is something that's just started happening recently, and it's intermittent. Is that correct? So we don't know how to get a reliable reproduction right now?
My first thought is to try running some old builds and see if the run successfully, and then try bisecting to find a change that broke things. That will be difficult if this is intermittent though.
@timhaines Yes, this started recently, and is intermittent, but happens often enough that we rarely have a few successful builds in a row.
The solution in https://stackoverflow.com/a/46165690 looks promising. I'll try to switch from the chromedriver-helper
ruby gem to the one built in to selenium-webdriver
The solution in https://github.com/teamcapybara/capybara/blob/master/README.md#selecting-the-driver looks promising. I'll try to switch from the :chrome
driver to one of :selenium_chrome
or :selenium_chrome_headless
.
Edit: The real solution seems to be in the documentation at https://github.com/teamcapybara/capybara/blob/master/lib/capybara/selenium/driver.rb#L46 It turns out that the Chrome flags were not being set up correctly. We must use Selenium::WebDriver::Chrome::Options.add_argument
instead of Selenium::WebDriver::Remote::Capabilities.chrome
@rsimha I'm skeptical you'll see things improve by moving to those options.
That is just swapping the driver to use these built in ones: https://github.com/teamcapybara/capybara/blob/63a12e7ba064fe326b36d014ede8c9269855a6af/lib/capybara.rb#L465-L482
If they do work, it might be because you're currently setting some add_emulation
options and no-sandbox
and disable-extensions
browser flags here: https://github.com/ampproject/amphtml/blob/master/build-system/tasks/visual-diff.rb#L228-L253
@rsimha I'm not sure if it's related, but I did see this comment thread about people having issues on the first test run due to the browser trying to request a page, and the server still spinning up.
@djones yeah, looks like swapping drivers won't really help. I do wait for the server to fully spin up and fetch /
from the server before trying to fetch any webpages with assets, so I'd have imagined that will avoid server timeout issues. Anyhow, I'll dig in some more tomorrow.
On further investigation, it turns out that the Chrome flags were not being set up correctly. We must use Selenium::WebDriver::Chrome::Options.add_argument
instead of Selenium::WebDriver::Remote::Capabilities.chrome
, based on the documentation in https://github.com/teamcapybara/capybara/blob/master/lib/capybara/selenium/driver.rb#L46. I'll try this and see if the flakiness goes away.
The selenium options fix in #13705 didn't work: https://travis-ci.org/ampproject/amphtml/jobs/347529170#L651 (although it now ensures that we correctly use headless Chrome)
On to trying the suggestion in https://github.com/ampproject/amphtml/issues/13700#issuecomment-369094066
@timhaines @djones Unfortunately, we're still seeing timeouts during Selenium load. See this build from today, for example: https://travis-ci.org/ampproject/amphtml/jobs/351431084#L650
Looks like the suggestion in https://github.com/ampproject/amphtml/issues/13700#issuecomment-369094066 didn't do the trick.
It turns out this issue is no longer restricted to just push builds (with >100 snapshots). It has started rearing its head during PR builds (12 snapshots total). See https://travis-ci.org/ampproject/amphtml/jobs/353015572#L688 for example.
@timhaines @djones Any idea what I could try next to debug / work around this issue? Right now, the Percy workflow for AMP is fairly problematic due to this issue and https://github.com/percy/percy-capybara/issues/51.
@rsimha Thanks for the update. I emailed through a short message earlier. Do you know off hand if this error is happening on the first test, or if it's a subsequent test run? Would we be able to setup a branch that reduces the CI run to one test/snapshot, and loop that a few times to see if the error occurs in that scenario?
Hi @rsimha,
Nothing conclusive yet, but I thought I'd share the observations of my work yesterday.
I forked amphtml to our Percy GitHub account, and configured it to run on our travis and Percy accounts.
At first, I ran it with a script of gulp build && gulp visual-diff --headless
. This ran successfully for me 13 times without failure.
I then adjusted the script tag to gulp build && gulp visual-diff --headless --master
and got 3 failures among 8 builds.
The failure I received both times was:
/home/travis/.rvm/gems/ruby-2.4.1/gems/selenium-webdriver-3.11.0/lib/selenium/webdriver/remote/response.rb:69:in `assert_ok': unknown error: cannot determine loading status (Selenium::WebDriver::Error::UnknownError)
from unknown error: unknown sessionId
(Session info: headless chrome=65.0.3325.162)
(Driver info: chromedriver=2.36.540471 (9c759b81a907e70363c6312294d30b6ccccc2752),platform=Linux 4.4.0-101-generic x86_64)
from /home/travis/.rvm/gems/ruby-2.4.1/gems/selenium-webdriver-3.11.0/lib/selenium/webdriver/remote/response.rb:32:in `initialize'
When I tailed the chromedriver log, it showed that several page visits had run successfully, and then when the problem occurred, it pretty much gave the same output as shown in the error message.
[1521215531.295][INFO]: Waiting for pending navigations...
[1521215531.296][INFO]: Done waiting for pending navigations. Status: ok
[1521215531.297][INFO]: Waiting for pending navigations...
[1521215531.297][INFO]: Done waiting for pending navigations. Status: ok
[1521215531.297][INFO]: RESPONSE GetUrl "http://localhost:8000/examples/visual-tests/amp-by-example/index.html"
[1521215531.298][INFO]: COMMAND GetSource {
}
[1521215531.298][INFO]: Waiting for pending navigations...
[1521215531.298][INFO]: Done waiting for pending navigations. Status: ok
[1521215531.319][INFO]: Waiting for pending navigations...
[1521215531.320][INFO]: Done waiting for pending navigations. Status: ok
[1521215531.320][INFO]: RESPONSE GetSource "\u003C!DOCTYPE html>\u003Chtml xmlns=\"http://www.w3.org/1999/xhtml\" ⚡=\"\" amp-version=\"1520462477230\" class=\"i-amphtml-singledoc i-amphtml-standalone\" style=\"padding-top: 0px;\">\u003Chead>\u003Cstyle amp-runtime=\"\">h..."
[1521215531.757][INFO]: COMMAND DeleteAllCookies {
}
[1521215531.757][INFO]: Waiting for pending navigations...
[1521215531.758][INFO]: Done waiting for pending navigations. Status: ok
[1521215531.772][INFO]: Waiting for pending navigations...
[1521215531.773][INFO]: Done waiting for pending navigations. Status: ok
[1521215531.773][INFO]: RESPONSE DeleteAllCookies
[1521215531.776][INFO]: COMMAND Navigate {
"url": "http://localhost:8000/examples/visual-tests/amp-by-example/components/amp-access/index.html"
}
[1521215531.776][INFO]: Waiting for pending navigations...
[1521215531.776][INFO]: Done waiting for pending navigations. Status: ok
[1521215531.800][INFO]: Waiting for pending navigations...
[1521215531.973][INFO]: Done waiting for pending navigations. Status: ok
[1521215531.973][INFO]: RESPONSE Navigate
[1521215531.974][INFO]: COMMAND FindElements {
"using": "css selector",
"value": ".i-amphtml-loader-dot"
}
[1521215531.974][INFO]: Waiting for pending navigations...
[1521215531.982][INFO]: Done waiting for pending navigations. Status: ok
[1521215531.995][INFO]: Waiting for pending navigations...
[1521215532.027][INFO]: Done waiting for pending navigations. Status: ok
[1521215532.027][INFO]: RESPONSE FindElements [ ]
[1521215532.028][INFO]: COMMAND GetUrl {
}
[1521215532.028][INFO]: Waiting for pending navigations...
[1521215532.031][INFO]: Done waiting for pending navigations. Status: unknown error: cannot determine loading status
from unknown error: unknown sessionId
[1521215532.031][INFO]: RESPONSE GetUrl unknown error: cannot determine loading status
from unknown error: unknown sessionId
(Session info: headless chrome=65.0.3325.162)
[1521215532.132][INFO]: COMMAND Quit {
}
[1521215532.182][INFO]: RESPONSE Quit
After some searching, I've seen this reported a couple of times in the last few days -- it looks like a chromedriver/chrome update may have broken things: https://github.com/SeleniumHQ/selenium/issues/5609
I'd like to see if I can lock the chromedriver and chrome versions back to the previous versions, and see if this resolves the problem.
Also somewhat surprising was that I never saw the rbuf_fill': Net::ReadTimeout (Net::ReadTimeout)
error that you've observed here. Maybe you've fixed this error with your recent changes, and we're now only seeing/revealing the one I described above?
@timhaines Thanks for looking into this. I too did some homework to see if others are seeing the rbuf_fill': Net::ReadTimeout (Net::ReadTimeout)
error in the wild with this search. There were a few that looked similar, but nothing conclusive, like you said.
I'll set aside some time today to add chromedriver logging to our existing runs so we can see if anything different is happening with our builds.
@rsimha I noticed chromedriver 2.37 is being used by default this morning too. It's a new release that doesn't appear to be documented yet. https://sites.google.com/a/chromium.org/chromedriver/downloads
@timhaines Maybe it's time for me to add a gemfile.lock
? I've meant to do that, but never got down to it.
@rsimha I think that would be helpful generally. This is how you lock to a specific chromedriver version though. https://github.com/percy/amphtml/commit/f4f935bb7ae4661b7d66f9fda1bd74e9581fd0c1
@rsimha Since locking the chromedriver back to 2.35 with that previously linked commit, I've ran 5 builds with no failures. I'll keep re-running it today, but I'm optimistic this has fixed it. Would you like to try and lock to chromedriver 2.35 too, as it doesn't seem to introduce new problems.
In theory, chromedriver 2.35 isn't recommend for Chrome 65 (Travis's chrome-stable) - but it seems to work just fine for your builds. At some point in the future as Chrome versions progress, it'll probably need to be advanced from 2.35 to a future version.
If it is a bug in chromedriver, maybe the chromedriver team can fix it? The other person who reported the issue also said it was intermittent. https://github.com/SeleniumHQ/selenium/issues/5609
@timhaines That's really helpful. I'll send out a PR to lock down the chromedriver
version.
@rsimha Ha, just did. 👍 It's run successfully on all 11 of the builds I've tried since making the change. Previously it failed on 3 out of 8.
@timhaines 5 out of 5 builds on master
completed successfully after #14075 was merged. I don't think we've had 5 successful builds in quite a while before the PR, so I'm going to consider this issue fixed. Thanks a lot for the debugging help!
Seems like I spoke too soon :(
https://travis-ci.org/ampproject/amphtml/jobs/354998024#L657
@rsimha Interesting (and ugh!). The error I was seeing (and I think we've resolved by locking Google chromedriver to an older version) was always the unknown sessionId
error.
In the ~30 or so test runs I did, I never saw the rbuf_fill': Net::ReadTimeout (Net::ReadTimeout)
error that's occurred here. I'll do some more reading on it.
@rsimha Could you enable the chromedriver logging and the output of the last 100 lines of it too please? I think that may give us some more information when this error occurs.
@rsimha A little more information - 31 of the 127 snapshots were sent up to our API on that build before the error occurred.
@timhaines I've sent you a PR that prints chromedriver.log
when a visual build fails. #14113
@rsimha Would you consider this resolved now?
I think one possible follow-on action item for your team is to follow-up with the Google chromedriver team and see if they can resolve the chromedriver bug that's been introduced in newer versions that caused the SessionID error.
@timhaines All builds over the weekend were successful. Thanks again for the help!
@timhaines Sadly, we have to reopen this as we're seeing several failures on master
. Here's an example failure, with chromedriver
logs printed after the failure: https://travis-ci.org/ampproject/amphtml/jobs/359884973#L657
@rsimha Thanks for the ping and the link. The chromedriver log is helpful here. It looks like selenium was trying to navigate to http://localhost:8000/examples/visual-tests/amp-by-example/components/amp-o2-player/index.html
and chromedriver didn't offer a response in time.
I noticed in some of your other master builds today, some of the integration tests timed out. For example: https://travis-ci.org/ampproject/amphtml/jobs/359905017
Do you know what caused those? Could the two issues be related?
@timhaines during push builds, the integration tests run on a different shard from the visual tests, so I'd say the timeouts are unrelated.
Do the logs indicate how long chromedriver waited to load http://localhost:8000/examples/visual-tests/amp-by-example/components/amp-o2-player/index.html
? Do you know of a way to increase the timeout? Or maybe add a retry?
Looking at this at a higher level, am I right in understanding that this is a problem with how chromedriver works with the rest of the Percy / Capybara stack? (I don't recall seeing timeouts like this when we used phantomjs.)
And finally, do you have a recommended, documented way to use chromedriver with Percy? (Meaning: could I be using it in a non-standard way?)
@rsimha Perhaps unrelated - but I'm wondering if there's something in the example pages - some AMP components, or the way they're being served - which is causing both chromedriver and the integration tests to time out.
And yes, it waited 2 minutes to load that page. You can tell from timestamps in the the command navigate and response navigate lines.
$ irb
2.3.0 :001 > Time.at 1522336662.399
=> 2018-03-29 08:17:42 -0700
2.3.0 :002 > Time.at 1522336792.508
=> 2018-03-29 08:19:52 -0700
2.3.0 :003 >
You could increase the timeout higher than 2 minutes - to 10 minutes or whatever you wanted. It would just be adjusting the '120' in the last PR I submitted. I'm not sure that increasing it is going to help though. If 2 minutes isn't enough for one page, 10 minutes is unlikely to be.
I think this timeout you've linked to is either your components or serving app blocking and not returning a response, or is a bug in either chromedriver or chrome. You can see some similar (but definitely different) bugs have been reported in chromedriver/chrome recently: https://bugs.chromium.org/p/chromedriver/issues/detail?id=2239
This isn't a bug in Percy at all - it's occurring during the process of selenium/chromedriver/chrome trying to navigate to a page, to then capture the dom snapshot, and later send to Percy. When you were using phantomJS, chromedriver+chrome weren't involved, which I think further suggests it's an issue with chromedriver/chrome. Do you have internal access to talk to the chromedriver team about it?
We don't have a documented way of using chromedriver/chrome with Percy. What you're doing, using Capybara to navigate pages, is probably closest to our percy-anywhere documentation, which suggests to use the poltergeist gem and phantomjs, and we haven't had anyone report this sort of problem with it. Having said that, I can understand phantomjs may not be well suited to what you're doing with AMP.
Lets finish up the commercial process with Malte, and then we'll be able to help out in more depth with this.
BTW, the other client we have that can similarly be used to navigate ad-hoc pages is our webdriver-io client, and I'm fairly certain that would use chromedriver if you selected chrome as your browser too - so you'd probably encounter this same error with that.
Appreciate the explanation, @timhaines! Happy to wait until the commercial process is complete before diving further into this :)
This is a high priority issue but it hasn't been updated in awhile. @rsimha Do you have any updates?
No longer relevant as we switched from Selenium to Puppeteer
This is causing
master
to go red often.For example, see https://travis-ci.org/ampproject/amphtml/jobs/346992913#L656