GoogleChrome / lighthouse-ci

Automate running Lighthouse for every commit, viewing the changes, and preventing regressions
Apache License 2.0
6.34k stars, 635 forks

Feature: Specific exit codes depending on failure #381

Open ardevelop opened 4 years ago

ardevelop commented 4 years ago

Context

Occasionally Lighthouse fails with a PROTOCOL_TIMEOUT error (which, according to https://github.com/GoogleChrome/lighthouse/issues/6512, happens when the tool does not get a response from the browser within a 30s interval):

LHError: PROTOCOL_TIMEOUT
  at Timeout._onTimeout (/home/lighthouse/node_modules/lighthouse/lighthouse-core/gather/driver.js:352:21)
  at listOnTimeout (internal/timers.js:549:17)
  at processTimers (internal/timers.js:492:7) 

Consequences

When the error occurs, the CI check fails.

Workaround

CI tools could inspect the exit code and, for this specific case, trigger a retry. For instance, Buildkite supports this through its automatic retry attributes: https://buildkite.com/docs/pipelines/command-step#automatic-retry-attributes
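For CI systems without built-in retry rules, the same idea can be sketched as a small wrapper script. This is only an illustration: Lighthouse CI does not currently define a dedicated exit code for PROTOCOL_TIMEOUT (that is what this issue requests), so the code 75 below and the name run_with_retry are hypothetical.

```python
import subprocess
import time

# Hypothetical: Lighthouse CI does not define an exit code for PROTOCOL_TIMEOUT.
# 75 is a placeholder for whatever code a future release might use.
RETRYABLE_EXIT_CODES = frozenset({75})


def run_with_retry(cmd, max_attempts=3, delay_seconds=0.0,
                   retryable=RETRYABLE_EXIT_CODES):
    """Run cmd, retrying only when it exits with a retryable code.

    Returns a tuple (exit_code, attempts_used).
    """
    attempts = 0
    while True:
        attempts += 1
        exit_code = subprocess.run(cmd).returncode
        # Stop on success, on a non-retryable failure, or when out of attempts.
        if exit_code not in retryable or attempts >= max_attempts:
            return exit_code, attempts
        time.sleep(delay_seconds)
```

A pipeline step could call run_with_retry(["lhci", "autorun"]) and pass the returned exit code to sys.exit, so that assertion failures still fail the build on the first attempt.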

patrickhulce commented 4 years ago

Thanks for filing @ardevelop! Lighthouse CI already retries each Lighthouse run up to 3 times internally when it encounters PROTOCOL_TIMEOUT and other related failure types, so if we finally surfaced it to you, something was pretty rough with the build environment.

Surfacing the specific exit code from Lighthouse is a good idea though! I'm open to a PR for this if anyone is interested :)

dimension85 commented 4 years ago

Hi Patrick, AR notes 'Occasionally Lighthouse fails with PROTOCOL_TIMEOUT error' - however, this is not occasional as the problem is completely repeatable programmatically, but as an online browser test it works fine. Also you note that 'rough with the build environment' - which build environment is this? Is this something in my environment or one of the installed components? I am running Node with v6.1.1 and a pretty basic test that reproduces this every time. How do I identify the cause of the problem for this specific site so I can get it resolved? Phil

patrickhulce commented 4 years ago

however, this is not occasional as the problem is completely repeatable programmatically

Please file an issue in Lighthouse with the exact repro steps if this is repeatable. Thus far, everyone who has ever encountered this problem in the past 2 years (myself included) has been unable to provide steps to reliably reproduce the issue on any given machine :(

Also you note that 'rough with the build environment' - which build environment is this?

If the machine lacks the resources to properly run Chrome, then you are bound to run into this issue of Chrome not responding within 30s at some point. Sometimes a CI machine that gets assigned to your task is utterly bogged down such that it cannot handle running Chrome. In this case you could say that PROTOCOL_TIMEOUT is working as intendedish.

How do I identify the cause of the problem for this specific site so I can get it resolved?

If it's truly repeatable, run Lighthouse by itself and report the protocol method that gets logged as the culprit to the tracking issue. If it's clearStorage, then use the --collect.settings.disableStorageReset flag in Lighthouse CI.

dimension85 commented 4 years ago

Thank you for your insights - this has given me something to investigate further, as so far I have only run this using the recommended programmatic approach.

I have been unable to find, although it is possible I missed it, any doc that defines the minimum server requirements for running Lighthouse programmatically. So far I have run almost 1000 tests against several different sites, and although I have had other PROTOCOL_TIMEOUT issues, they have been resolved by applying changes observed from the main thread. Is there a recommended server config (CPU, memory, etc.) for a Node-based implementation?

Once I have concluded some further tests I will document the scenario in an issue as requested.

patrickhulce commented 4 years ago

The lighthouse core docs describe minimum recommendations.

If you do end up finding truly reproducible steps that would be awesome :)

dimension85 commented 4 years ago

I checked out the core doc and found that the server I was using was undersized on memory, but I have found that neither CPU nor memory is exhausted when this failure occurs. To take this out of the mix I spun up the following:

The following Lighthouse CLI command was run:

lighthouse https://www.etihad.com/ --chrome-flags="--headless" --output json --output html --output-path ./myfile.json --collect.settings.disableStorageReset --verbose

The result is always the same. [image]

I then created a new instance and repeated the above to prove it was repeatable - same result occurred.

On 3 different environments (AWS small and large Lightsail instances and an AWS EC2 instance) the same problem occurs for this software build. Let me know if you need anything else.

patrickhulce commented 4 years ago

Thanks for sharing those steps @dimension85, I really appreciate the effort you've gone to in helping find a repro here! Those logs indicate that Chrome is crashing in the middle of the run (the abrupt "Disconnecting from browser..." means the process exited), so PROTOCOL_TIMEOUT is kind of masking the underlying cause here.

I've filed https://github.com/GoogleChrome/lighthouse/issues/11124. Would you mind chiming in there with which versions of Chromium and Linux you are running in EC2?

For Chromium 83.0.4103.61 and Ubuntu 18.04.4 LTS, I cannot reproduce this behavior :/

viniagostini commented 2 years ago

Hello folks, any news related to the specific exit codes? I have a CI pipeline and I need to handle assert failures and runtime errors differently. I am currently relying on the execution logs, but this is not a very robust solution.
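Until specific exit codes exist, the log-based approach described above might look roughly like the sketch below. It is fragile by nature, as noted. The LHError and PROTOCOL_TIMEOUT markers come from the stack trace earlier in this thread; the "Assertion failed" marker, the Outcome enum, and classify_run are assumptions made for illustration and should be checked against the real log output.

```python
from enum import Enum


class Outcome(Enum):
    PASSED = "passed"
    ASSERTION_FAILURE = "assertion_failure"
    RUNTIME_ERROR = "runtime_error"
    UNKNOWN_FAILURE = "unknown_failure"


# Markers taken from the stack trace quoted in this thread.
RUNTIME_MARKERS = ("LHError", "PROTOCOL_TIMEOUT")
# Assumption: the exact wording of an assertion failure in the logs may differ.
ASSERTION_MARKERS = ("Assertion failed",)


def classify_run(exit_code, log_text):
    """Guess why a run failed from its exit code and captured log text."""
    if exit_code == 0:
        return Outcome.PASSED
    # Check runtime errors first: a crashed run is not a genuine regression.
    if any(marker in log_text for marker in RUNTIME_MARKERS):
        return Outcome.RUNTIME_ERROR
    if any(marker in log_text for marker in ASSERTION_MARKERS):
        return Outcome.ASSERTION_FAILURE
    return Outcome.UNKNOWN_FAILURE
```

A pipeline could then retry or soft-fail on Outcome.RUNTIME_ERROR and fail hard on Outcome.ASSERTION_FAILURE, treating Outcome.UNKNOWN_FAILURE conservatively as a hard failure.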