ampproject / amphtml

The AMP web component framework.
https://amp.dev
Apache License 2.0
14.89k stars 3.89k forks source link

Selenium frequently times out while loading test pages during Percy builds on master #13700

Closed rsimha closed 6 years ago

rsimha commented 6 years ago

This is causing master to go red often.

For example, see https://travis-ci.org/ampproject/amphtml/jobs/346992913#L656

pr-check.js: Running gulp visual-diff --headless --master...
[21:33:59] Using gulpfile ~/build/ampproject/amphtml/gulpfile.js
[21:33:59] Starting 'visual-diff'...
INFO: Started Percy build 550164...
/home/travis/.rvm/rubies/ruby-2.4.1/lib/ruby/2.4.0/net/protocol.rb:176:in `rbuf_fill': Net::ReadTimeout (Net::ReadTimeout)
    from /home/travis/.rvm/rubies/ruby-2.4.1/lib/ruby/2.4.0/net/protocol.rb:154:in `readuntil'
    from /home/travis/.rvm/rubies/ruby-2.4.1/lib/ruby/2.4.0/net/protocol.rb:164:in `readline'
    from /home/travis/.rvm/rubies/ruby-2.4.1/lib/ruby/2.4.0/net/http/response.rb:40:in `read_status_line'
    from /home/travis/.rvm/rubies/ruby-2.4.1/lib/ruby/2.4.0/net/http/response.rb:29:in `read_new'
    from /home/travis/.rvm/rubies/ruby-2.4.1/lib/ruby/2.4.0/net/http.rb:1446:in `block in transport_request'
    from /home/travis/.rvm/rubies/ruby-2.4.1/lib/ruby/2.4.0/net/http.rb:1443:in `catch'
    from /home/travis/.rvm/rubies/ruby-2.4.1/lib/ruby/2.4.0/net/http.rb:1443:in `transport_request'
    from /home/travis/.rvm/rubies/ruby-2.4.1/lib/ruby/2.4.0/net/http.rb:1416:in `request'
    from /home/travis/.rvm/rubies/ruby-2.4.1/lib/ruby/2.4.0/net/http.rb:1409:in `block in request'
    from /home/travis/.rvm/rubies/ruby-2.4.1/lib/ruby/2.4.0/net/http.rb:877:in `start'
    from /home/travis/.rvm/rubies/ruby-2.4.1/lib/ruby/2.4.0/net/http.rb:1407:in `request'
    from /home/travis/.rvm/gems/ruby-2.4.1/gems/selenium-webdriver-3.9.0/lib/selenium/webdriver/remote/http/default.rb:121:in `response_for'
    from /home/travis/.rvm/gems/ruby-2.4.1/gems/selenium-webdriver-3.9.0/lib/selenium/webdriver/remote/http/default.rb:76:in `request'
    from /home/travis/.rvm/gems/ruby-2.4.1/gems/selenium-webdriver-3.9.0/lib/selenium/webdriver/remote/http/common.rb:59:in `call'
    from /home/travis/.rvm/gems/ruby-2.4.1/gems/selenium-webdriver-3.9.0/lib/selenium/webdriver/remote/bridge.rb:164:in `execute'
    from /home/travis/.rvm/gems/ruby-2.4.1/gems/selenium-webdriver-3.9.0/lib/selenium/webdriver/remote/oss/bridge.rb:579:in `execute'
    from /home/travis/.rvm/gems/ruby-2.4.1/gems/selenium-webdriver-3.9.0/lib/selenium/webdriver/remote/oss/bridge.rb:50:in `get'
    from /home/travis/.rvm/gems/ruby-2.4.1/gems/selenium-webdriver-3.9.0/lib/selenium/webdriver/common/navigation.rb:30:in `to'
    from /home/travis/.rvm/gems/ruby-2.4.1/gems/capybara-2.18.0/lib/capybara/selenium/driver.rb:49:in `visit'
    from /home/travis/.rvm/gems/ruby-2.4.1/gems/capybara-2.18.0/lib/capybara/session.rb:274:in `visit'
    from build-system/tasks/visual-diff.rb:361:in `block in snapshot_webpages'
    from build-system/tasks/visual-diff.rb:349:in `each'
    from build-system/tasks/visual-diff.rb:349:in `snapshot_webpages'
    from build-system/tasks/visual-diff.rb:334:in `block in generate_snapshots'
    from build-system/tasks/visual-diff.rb:323:in `each'
    from build-system/tasks/visual-diff.rb:323:in `generate_snapshots'
    from build-system/tasks/visual-diff.rb:274:in `run_visual_tests'
    from build-system/tasks/visual-diff.rb:496:in `main'
    from build-system/tasks/visual-diff.rb:501:in `<main>'
rsimha commented 6 years ago

/cc @timhaines in case he's aware of the root cause. I suspect this has to do with how selenium / chrome behave on Travis CI in the presence of xvfb (?). There are so many moving parts here that I'm not yet sure where the problem lies.

rsimha commented 6 years ago

Here are a couple of things we do in .travis.yml:

export DISPLAY=:99.0
sh -e /etc/init.d/xvfb start

And here are the flags we use to launch Chrome during the visual diff tests:

--no-sandbox --disable-extensions --headless --disable-gpu
timhaines commented 6 years ago

@rsimha this is something that's just started happening recently, and it's intermittent. Is that correct? So we don't know how to get a reliable reproduction right now?

timhaines commented 6 years ago

My first thought is to try running some old builds and see if the run successfully, and then try bisecting to find a change that broke things. That will be difficult if this is intermittent though.

rsimha commented 6 years ago

@timhaines Yes, this started recently, and is intermittent, but happens often enough that we rarely have a few successful builds in a row.

The solution in https://stackoverflow.com/a/46165690 looks promising. I'll try to switch from the chromedriver-helper ruby gem to the one built in to selenium-webdriver

The solution in https://github.com/teamcapybara/capybara/blob/master/README.md#selecting-the-driver looks promising. I'll try to switch from the :chrome driver to one of :selenium_chrome or :selenium_chrome_headless.

Edit: The real solution seems to be in the documentation at https://github.com/teamcapybara/capybara/blob/master/lib/capybara/selenium/driver.rb#L46 It turns out that the Chrome flags were not being set up correctly. We must use Selenium::WebDriver::Chrome::Options.add_argument instead of Selenium::WebDriver::Remote::Capabilities.chrome

djones commented 6 years ago

@rsimha I'm skeptical you'll see things improve by moving to those options.

That is just swapping the driver to use these built in ones: https://github.com/teamcapybara/capybara/blob/63a12e7ba064fe326b36d014ede8c9269855a6af/lib/capybara.rb#L465-L482

If they do work, it might be because you're currently setting some add_emulation options and no-sandbox and disable-extensions browser flags here: https://github.com/ampproject/amphtml/blob/master/build-system/tasks/visual-diff.rb#L228-L253

timhaines commented 6 years ago

@rsimha I'm not sure if it's related, but I did see this comment thread about people having issues on the first test run due to the browser trying to request a page, and the server still spinning up.

rsimha commented 6 years ago

@djones yeah, looks like swapping drivers won't really help. I do wait for the server to fully spin up and fetch / from the server before trying to fetch any webpages with assets, so I'd have imagined that will avoid server timeout issues. Anyhow, I'll dig in some more tomorrow.

rsimha commented 6 years ago

On further investigation, it turns out that the Chrome flags were not being set up correctly. We must use Selenium::WebDriver::Chrome::Options.add_argument instead of Selenium::WebDriver::Remote::Capabilities.chrome, based on the documentation in https://github.com/teamcapybara/capybara/blob/master/lib/capybara/selenium/driver.rb#L46. I'll try this and see if the flakiness goes away.

rsimha commented 6 years ago

The selenium options fix in #13705 didn't work: https://travis-ci.org/ampproject/amphtml/jobs/347529170#L651 (although it now ensures that we correctly use headless Chrome)

On to trying the suggestion in https://github.com/ampproject/amphtml/issues/13700#issuecomment-369094066

rsimha commented 6 years ago

@timhaines @djones Unfortunately, we're still seeing timeouts during Selenium load. See this build from today, for example: https://travis-ci.org/ampproject/amphtml/jobs/351431084#L650

Looks like the suggestion in https://github.com/ampproject/amphtml/issues/13700#issuecomment-369094066 didn't do the trick.

rsimha commented 6 years ago

It turns out this issue is no longer restricted to just push builds (with >100 snapshots). It has started rearing its head during PR builds (12 snapshots total). See https://travis-ci.org/ampproject/amphtml/jobs/353015572#L688 for example.

@timhaines @djones Any idea what I could try next to debug / work around this issue? Right now, the Percy workflow for AMP is fairly problematic due to this issue and https://github.com/percy/percy-capybara/issues/51.

timhaines commented 6 years ago

@rsimha Thanks for the update. I emailed through a short message earlier. Do you know off hand if this error is happening on the first test, or if it's a subsequent test run? Would we be able to setup a branch that reduces the CI run to one test/snapshot, and loop that a few times to see if the error occurs in that scenario?

timhaines commented 6 years ago

Hi @rsimha,

Nothing conclusive yet, but I thought I'd share the observations of my work yesterday.

I forked amphtml to our Percy GitHub account, and configured it to run on our travis and Percy accounts.

At first, I ran it with a script of gulp build && gulp visual-diff --headless. This ran successfully for me 13 times without failure.

I then adjusted the script tag to gulp build && gulp visual-diff --headless --master and got 3 failures among 8 builds.

The failure I received both times was:

/home/travis/.rvm/gems/ruby-2.4.1/gems/selenium-webdriver-3.11.0/lib/selenium/webdriver/remote/response.rb:69:in `assert_ok': unknown error: cannot determine loading status (Selenium::WebDriver::Error::UnknownError)
from unknown error: unknown sessionId
  (Session info: headless chrome=65.0.3325.162)
  (Driver info: chromedriver=2.36.540471 (9c759b81a907e70363c6312294d30b6ccccc2752),platform=Linux 4.4.0-101-generic x86_64)
    from /home/travis/.rvm/gems/ruby-2.4.1/gems/selenium-webdriver-3.11.0/lib/selenium/webdriver/remote/response.rb:32:in `initialize'

When I tailed the chromedriver log, it showed that several page visits had run successfully, and then when the problem occurred, it pretty much gave the same output as shown in the error message.

[1521215531.295][INFO]: Waiting for pending navigations...
[1521215531.296][INFO]: Done waiting for pending navigations. Status: ok
[1521215531.297][INFO]: Waiting for pending navigations...
[1521215531.297][INFO]: Done waiting for pending navigations. Status: ok
[1521215531.297][INFO]: RESPONSE GetUrl "http://localhost:8000/examples/visual-tests/amp-by-example/index.html"
[1521215531.298][INFO]: COMMAND GetSource {
}
[1521215531.298][INFO]: Waiting for pending navigations...
[1521215531.298][INFO]: Done waiting for pending navigations. Status: ok
[1521215531.319][INFO]: Waiting for pending navigations...
[1521215531.320][INFO]: Done waiting for pending navigations. Status: ok
[1521215531.320][INFO]: RESPONSE GetSource "\u003C!DOCTYPE html>\u003Chtml xmlns=\"http://www.w3.org/1999/xhtml\" ⚡=\"\" amp-version=\"1520462477230\" class=\"i-amphtml-singledoc i-amphtml-standalone\" style=\"padding-top: 0px;\">\u003Chead>\u003Cstyle amp-runtime=\"\">h..."
[1521215531.757][INFO]: COMMAND DeleteAllCookies {
}
[1521215531.757][INFO]: Waiting for pending navigations...
[1521215531.758][INFO]: Done waiting for pending navigations. Status: ok
[1521215531.772][INFO]: Waiting for pending navigations...
[1521215531.773][INFO]: Done waiting for pending navigations. Status: ok
[1521215531.773][INFO]: RESPONSE DeleteAllCookies
[1521215531.776][INFO]: COMMAND Navigate {
   "url": "http://localhost:8000/examples/visual-tests/amp-by-example/components/amp-access/index.html"
}
[1521215531.776][INFO]: Waiting for pending navigations...
[1521215531.776][INFO]: Done waiting for pending navigations. Status: ok
[1521215531.800][INFO]: Waiting for pending navigations...
[1521215531.973][INFO]: Done waiting for pending navigations. Status: ok
[1521215531.973][INFO]: RESPONSE Navigate
[1521215531.974][INFO]: COMMAND FindElements {
   "using": "css selector",
   "value": ".i-amphtml-loader-dot"
}
[1521215531.974][INFO]: Waiting for pending navigations...
[1521215531.982][INFO]: Done waiting for pending navigations. Status: ok
[1521215531.995][INFO]: Waiting for pending navigations...
[1521215532.027][INFO]: Done waiting for pending navigations. Status: ok
[1521215532.027][INFO]: RESPONSE FindElements [  ]
[1521215532.028][INFO]: COMMAND GetUrl {
}
[1521215532.028][INFO]: Waiting for pending navigations...
[1521215532.031][INFO]: Done waiting for pending navigations. Status: unknown error: cannot determine loading status
from unknown error: unknown sessionId
[1521215532.031][INFO]: RESPONSE GetUrl unknown error: cannot determine loading status
from unknown error: unknown sessionId
  (Session info: headless chrome=65.0.3325.162)
[1521215532.132][INFO]: COMMAND Quit {
}
[1521215532.182][INFO]: RESPONSE Quit

After some searching, I've seen this reported a couple of times in the last few days -- it looks like a chromedriver/chrome update may have broken things: https://github.com/SeleniumHQ/selenium/issues/5609

I'd like to see if I can lock the chromedriver and chrome versions back to the previous versions, and see if this resolves the problem.

Also somewhat surprising was that I never saw the rbuf_fill': Net::ReadTimeout (Net::ReadTimeout) error that you've observed here. Maybe you've fixed this error with your recent changes, and we're now only seeing/revealing the one I described above?

rsimha commented 6 years ago

@timhaines Thanks for looking into this. I too did some homework to see if others are seeing the rbuf_fill': Net::ReadTimeout (Net::ReadTimeout) error in the wild with this search. There were a few that looked similar, but nothing conclusive, like you said.

I'll set aside some time today to add chromedriver logging to our existing runs so we can see if anything different is happening with our builds.

timhaines commented 6 years ago

@rsimha I noticed chromedriver 2.37 is being used by default this morning too. It's a new release that doesn't appear to be documented yet. https://sites.google.com/a/chromium.org/chromedriver/downloads

rsimha commented 6 years ago

@timhaines Maybe it's time for me to add a gemfile.lock? I've meant to do that, but never got down to it.

timhaines commented 6 years ago

@rsimha I think that would be helpful generally. This is how you lock to a specific chromedriver version though. https://github.com/percy/amphtml/commit/f4f935bb7ae4661b7d66f9fda1bd74e9581fd0c1

timhaines commented 6 years ago

@rsimha Since locking the chromedriver back to 2.35 with that previously linked commit, I've ran 5 builds with no failures. I'll keep re-running it today, but I'm optimistic this has fixed it. Would you like to try and lock to chromedriver 2.35 too, as it doesn't seem to introduce new problems.

In theory, chromedriver 2.35 isn't recommend for Chrome 65 (Travis's chrome-stable) - but it seems to work just fine for your builds. At some point in the future as Chrome versions progress, it'll probably need to be advanced from 2.35 to a future version.

If it is a bug in chromedriver, maybe the chromedriver team can fix it? The other person who reported the issue also said it was intermittent. https://github.com/SeleniumHQ/selenium/issues/5609

rsimha commented 6 years ago

@timhaines That's really helpful. I'll send out a PR to lock down the chromedriver version.

timhaines commented 6 years ago

@rsimha Ha, just did. 👍 It's run successfully on all 11 of the builds I've tried since making the change. Previously it failed on 3 out of 8.

rsimha commented 6 years ago

@timhaines 5 out of 5 builds on master completed successfully after #14075 was merged. I don't think we've had 5 successful builds in quite a while before the PR, so I'm going to consider this issue fixed. Thanks a lot for the debugging help!

rsimha commented 6 years ago

Seems like I spoke too soon :(

https://travis-ci.org/ampproject/amphtml/jobs/354998024#L657

timhaines commented 6 years ago

@rsimha Interesting (and ugh!). The error I was seeing (and I think we've resolved by locking Google chromedriver to an older version) was always the unknown sessionId error.

In the ~30 or so test runs I did, I never saw the rbuf_fill': Net::ReadTimeout (Net::ReadTimeout) error that's occurred here. I'll do some more reading on it.

timhaines commented 6 years ago

@rsimha Could you enable the chromedriver logging and the output of the last 100 lines of it too please? I think that may give us some more information when this error occurs.

timhaines commented 6 years ago

@rsimha A little more information - 31 of the 127 snapshots were sent up to our API on that build before the error occurred.

rsimha commented 6 years ago

@timhaines I've sent you a PR that prints chromedriver.log when a visual build fails. #14113

timhaines commented 6 years ago

@rsimha Would you consider this resolved now?

I think one possible follow-on action item for your team is to follow-up with the Google chromedriver team and see if they can resolve the chromedriver bug that's been introduced in newer versions that caused the SessionID error.

rsimha commented 6 years ago

@timhaines All builds over the weekend were successful. Thanks again for the help!

rsimha commented 6 years ago

@timhaines Sadly, we have to reopen this as we're seeing several failures on master. Here's an example failure, with chromedriver logs printed after the failure: https://travis-ci.org/ampproject/amphtml/jobs/359884973#L657

timhaines commented 6 years ago

@rsimha Thanks for the ping and the link. The chromedriver log is helpful here. It looks like selenium was trying to navigate to http://localhost:8000/examples/visual-tests/amp-by-example/components/amp-o2-player/index.html and chromedriver didn't offer a response in time.

I noticed in some of your other master builds today, some of the integration tests timed out. For example: https://travis-ci.org/ampproject/amphtml/jobs/359905017

Do you know what caused those? Could the two issues be related?

rsimha commented 6 years ago

@timhaines during push builds, the integration tests run on a different shard from the visual tests, so I'd say the timeouts are unrelated.

Do the logs indicate how long chromedriver waited to load http://localhost:8000/examples/visual-tests/amp-by-example/components/amp-o2-player/index.html? Do you know of a way to increase the timeout? Or maybe add a retry?

Looking at this at a higher level, am I right in understanding that this is a problem with how chromedriver works with the rest of the Percy / Capybara stack? (I don't recall seeing timeouts like this when we used phantomjs.)

And finally, do you have a recommended, documented way to use chromedriver with Percy? (Meaning: could I be using it in a non-standard way?)

timhaines commented 6 years ago

@rsimha Perhaps unrelated - but I'm wondering if there's something in the example pages - some AMP components, or the way they're being served - which is causing both chromedriver and the integration tests to time out.

And yes, it waited 2 minutes to load that page. You can tell from timestamps in the the command navigate and response navigate lines.

$ irb
2.3.0 :001 > Time.at 1522336662.399
 => 2018-03-29 08:17:42 -0700
2.3.0 :002 > Time.at 1522336792.508
 => 2018-03-29 08:19:52 -0700
2.3.0 :003 >

You could increase the timeout higher than 2 minutes - to 10 minutes or whatever you wanted. It would just be adjusting the '120' in the last PR I submitted. I'm not sure that increasing it is going to help though. If 2 minutes isn't enough for one page, 10 minutes is unlikely to be.

I think this timeout you've linked to is either your components or serving app blocking and not returning a response, or is a bug in either chromedriver or chrome. You can see some similar (but definitely different) bugs have been reported in chromedriver/chrome recently: https://bugs.chromium.org/p/chromedriver/issues/detail?id=2239

This isn't a bug in Percy at all - it's occurring during the process of selenium/chromedriver/chrome trying to navigate to a page, to then capture the dom snapshot, and later send to Percy. When you were using phantomJS, chromedriver+chrome weren't involved, which I think further suggests it's an issue with chromedriver/chrome. Do you have internal access to talk to the chromedriver team about it?

We don't have a documented way of using chromedriver/chrome with Percy. What you're doing, using Capybara to navigate pages, is probably closest to our percy-anywhere documentation, which suggests to use the poltergeist gem and phantomjs, and we haven't had anyone report this sort of problem with it. Having said that, I can understand phantomjs may not be well suited to what you're doing with AMP.

Lets finish up the commercial process with Malte, and then we'll be able to help out in more depth with this.

timhaines commented 6 years ago

BTW, the other client we have that can similarly be used to navigate ad-hoc pages is our webdriver-io client, and I'm fairly certain that would use chromedriver if you selected chrome as your browser too - so you'd probably encounter this same error with that.

rsimha commented 6 years ago

Appreciate the explanation, @timhaines! Happy to wait until the commercial process is complete before diving further into this :)

ampprojectbot commented 6 years ago

This is a high priority issue but it hasn't been updated in awhile. @rsimha Do you have any updates?

danielrozenberg commented 6 years ago

No longer relevant as we switched from Selenium to Puppeteer