foo-software / lighthouse-check-action

GitHub Action for running @GoogleChromeLabs Lighthouse audits with all the bells and whistles 🔔 Multiple audits, Slack notifications, and more!
https://github.com/marketplace/actions/lighthouse-check
MIT License

protocolTimeout occurs intermittently since upgrading from v8.0.0 to v9.1.0 #109

Closed. arenoir closed this issue 1 year ago.

arenoir commented 1 year ago

Describe the bug: We use GitHub Actions to spin up a local static site and run Lighthouse against it using this tool. It has been working great for the last year or so. Recently we upgraded this package from v8.0.0 to v9.1.0 to take care of some GitHub Actions deprecations, and we are now seeing it intermittently fail with a protocolTimeout error (see below).

It will usually pass if we re-run the action a couple of times; it's about 50/50. Note that the failure below happens after testing four pages successfully. There are 12 tests in total, 6 on mobile and 6 on desktop, and it is random which one fails.

Is Lighthouse running on port 4000 and not responding?

Thanks for creating a very useful tool and product.

lighthouse-check: Auditing mobile (1/6): http://localhost:8080/en-us
lighthouse-check: Auditing mobile (2/6):  http://localhost:8080/en-us/dog-dna-tests/essential
lighthouse-check: Auditing mobile (3/6):  http://localhost:8080/en-us/dog-dna-tests/premium
lighthouse-check: Auditing mobile (4/6):  http://localhost:8080/en-gb
lighthouse-check: Error below caught on retry 0 of 5. Error: connect ECONNREFUSED 127.0.0.1:4000
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1161:16) {
  errno: -111,
  code: 'ECONNREFUSED',
  syscall: 'connect',
  address: '127.0.0.1',
  port: 4000
} Trying again...
/home/runner/work/_actions/foo-software/lighthouse-check-action/v9.1.0/node_modules/lighthouse/lighthouse-core/gather/driver.js:348
      asyncTimeout = setTimeout(reject, timeout, new LHError(LHError.errors.PROTOCOL_TIMEOUT, {
                                                 ^

LighthouseError [LHError]: PROTOCOL_TIMEOUT
    at /home/runner/work/_actions/foo-software/lighthouse-check-action/v9.1.0/node_modules/lighthouse/lighthouse-core/gather/driver.js:348:50
    at new Promise (<anonymous>)
    at Driver.sendCommandToSession (/home/runner/work/_actions/foo-software/lighthouse-check-action/v9.1.0/node_modules/lighthouse/lighthouse-core/gather/driver.js:346:28)
    at Driver.sendCommand (/home/runner/work/_actions/foo-software/lighthouse-check-action/v9.1.0/node_modules/lighthouse/lighthouse-core/gather/driver.js:369:17)
    at Object.clearDataForOrigin (/home/runner/work/_actions/foo-software/lighthouse-check-action/v9.1.0/node_modules/lighthouse/lighthouse-core/gather/driver/storage.js:58:19)
    at Function.disposeDriver (/home/runner/work/_actions/foo-software/lighthouse-check-action/v9.1.0/node_modules/lighthouse/lighthouse-core/gather/gather-runner.js:179:39)
    at Function.run (/home/runner/work/_actions/foo-software/lighthouse-check-action/v9.1.0/node_modules/lighthouse/lighthouse-core/gather/gather-runner.js:532:20)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async Function._gatherArtifactsFromBrowser (/home/runner/work/_actions/foo-software/lighthouse-check-action/v9.1.0/node_modules/lighthouse/lighthouse-core/runner.js:264:23) {
  code: 'PROTOCOL_TIMEOUT',
  friendlyMessage: {
    i18nId: 'lighthouse-core/lib/lh-error.js | protocolTimeout',
    values: {
      errorCode: 'PROTOCOL_TIMEOUT',
      protocolMethod: 'Storage.clearDataForOrigin'
    },
    formattedDefault: 'Waiting for DevTools protocol response has exceeded the allotted time. (Method: Storage.clearDataForOrigin)'
  },
  lhrRuntimeError: true,
  protocolMethod: 'Storage.clearDataForOrigin'
}
adamhenson commented 1 year ago

Hello - I'm sorry you are having problems. In order to debug your issue, I would need an easy way to reproduce it. Based on this, I have no option but to assume there is an issue with your app, as you are getting the Lighthouse PROTOCOL_TIMEOUT error. This tells me that the GitHub Action ran Lighthouse correctly, but for some reason the websites being tested are timing out with that Lighthouse error. I'm sorry to say I need to close this as it seems unrelated to this GitHub Action until you can provide an easily reproducible example.

arenoir commented 1 year ago

Thanks @adamhenson, I would agree that it is an application error, except that the timeout is on port 4000, which has something to do with the Lighthouse dev tooling. We are rolling back to lighthouse-check-action@v8.0.0 for now.

adamhenson commented 1 year ago

I agree that is unusual, because the error references PROTOCOL_TIMEOUT, which is an error that Lighthouse responds with… meaning it did run. But then we see the connection refused on port 4000, which I agree is weird. I can't remember which port Lighthouse uses. Any chance any part of your stack is trying to use port 4000? Can you provide your GitHub workflow configuration?
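
If it would help to rule that out, one quick check (a hypothetical diagnostic script, not part of this action) would be to try connecting to 127.0.0.1:4000 in a step that runs before the audits, for example with Node's net module:

// check-port.js - hypothetical diagnostic, not part of lighthouse-check-action.
// Attempts a TCP connection to 127.0.0.1:4000 and reports whether anything answers.
const net = require('net');

const PORT = 4000;
const socket = net.createConnection({ host: '127.0.0.1', port: PORT });

socket.setTimeout(2000);

socket.on('connect', () => {
  console.log(`something is already listening on port ${PORT}`);
  socket.end();
});

socket.on('timeout', () => {
  console.log(`connection attempt to port ${PORT} timed out`);
  socket.destroy();
});

socket.on('error', (error) => {
  // ECONNREFUSED here means nothing is bound to the port, which is what
  // we'd expect before the action starts anything of its own.
  console.log(`no listener on port ${PORT} (${error.code})`);
});

If that reports a listener before the action runs, something else on the runner owns the port.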

alyaothman14 commented 1 year ago

@adamhenson I have seen this error intermittently as well, and have had no luck finding out more info or fixing it.

adamhenson commented 1 year ago

Again, I would need an easy way to reproduce in order to help. Please provide an easily reproducible example, @alyaothman14 or @arenoir.

alyaothman14 commented 1 year ago

@adamhenson Try testing with more than 4 URLs; it is a 50/50 chance. Here is the GitHub Action step:

      - id: lighthouseCheck
        name: audit urls using Lighthouse
        uses: foo-software/lighthouse-check-action@v9.1.0
        with:
          device: mobile
          gitHubAccessToken: ${{ secrets.GITHUB_TOKEN }}
          maxRetries: 5
          outputDirectory: ${{ github.workspace }}/test
          prCommentEnabled: true
          prCommentSaveOld: false
          timeout: 600_000
          urls: 'https://web-app-${{ steps.setup_project.outputs.deployment_slug }}.reviewapp.url/${{env.INDY_USER_NAME}},https://web-app-${{ steps.setup_project.outputs.deployment_slug }}.reviewapp.url/${{env.INDY_USER_NAME}}/services,https://web-app-${{ steps.setup_project.outputs.deployment_slug }}.reviewapp.url/${{env.CLIENT_USER_NAME}}/opportunities,https://web-app-${{ steps.setup_project.outputs.deployment_slug }}.reviewapp.url/${{env.INDY_USER_NAME}}/recommendations,https://web-app-${{ steps.setup_project.outputs.deployment_slug }}.reviewapp.url/p/${{env.INDY_PROJECT_SLUG}},https://web-app-${{ steps.setup_project.outputs.deployment_slug }}.reviewapp.url/s/${{env.INDY_SERVICE_SLUG}}'
          verbose: true
          wait: true
arenoir commented 1 year ago

@adamhenson I will try and put together a public test repo later today. Thanks!

arenoir commented 1 year ago

@adamhenson here is an example repo with a configuration. Notice the failure here. The test site is just Bootstrap example pages. Thanks for taking a look.

adamhenson commented 1 year ago

Thanks @arenoir - I appreciate the reproduction. Things look a bit complicated there with the Docker factor. I wouldn't be surprised if the GitHub runner is exceeding a limit and shutting everything down, with the by-product being that error. I'm not sure, but this helps. I've re-opened, but honestly I'm not sure when I'll be able to spend time on this... maybe over the weekend.

arenoir commented 1 year ago

@adamhenson sounds good. The Docker image is really light; it's just nginx serving static content. I have downgraded lighthouse-check-action to v8.0.0 here and it works great. I am wondering if it is a Chrome issue. If I get some spare time I may try to dig into it. Thanks again for building a great tool.

adamhenson commented 1 year ago

Right, I just mean the Docker infrastructure in general. Not much has changed between 8 and 9 except the retry logic. Maybe it's somehow related. I'll take a closer look when I can.

adamhenson commented 1 year ago

tl;dr:

I'm closing this as it seems to be solved in 9.1.1. Please let me know if it persists.


@arenoir thanks for providing the detailed example. It was very helpful, and I was able to reproduce the problem in the lighthouse-check package used under the hood. The problem is quite unusual and I believe it is caused by a combination of Node 16, Chrome Launcher, and Lighthouse, but I was able to work around it by adding a 1 second delay in between consecutive audits, as described in the new release, 9.1.1, which you should try. Ultimately, I think Chrome Launcher isn't losing communication with downstream consumers. It's difficult to pinpoint, and I don't want to waste my energy opening an issue with GoogleChrome, as I assume it would just grow stale.
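
The workaround amounts to something like the sketch below; runAudit is a hypothetical stand-in for the internals of lighthouse-check, not the actual source:

// Simplified sketch of the 9.1.1 workaround: pause between consecutive audits
// so the previous Chrome instance has time to be fully released.
// `runAudit` is a hypothetical stand-in for the real lighthouse-check internals.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const auditSequentially = async (urls, runAudit) => {
  const results = [];
  for (const url of urls) {
    results.push(await runAudit(url));
    // 1 second breather before the next audit.
    await delay(1000);
  }
  return results;
};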

My theory is that this issue began when we bumped our GitHub Action runner from Node 14 to Node 16. I'm pretty sure that occurred in this new version, but it's difficult to tell because GitHub Actions require node_modules to be checked in, which makes inspecting diffs nearly impossible. The error should not occur, but I think it's happening because Chrome Launcher isn't correctly signaling to consumers that it has been launched or killed, and this causes an issue with consecutive Lighthouse runs. When it does occur it should be caught by our try / catch / retry logic, which it seems to be in Node 14 GitHub runners but not in Node 16.

Well, I hope this fixes the issue. It was quite a frustrating Saturday for me, as I spent most of my day on it thanks to these different dependencies (Lighthouse, Chrome Launcher, GitHub runners). It's disappointing that GoogleChrome doesn't prioritize issues like this one, which is almost 5 years old. If that issue didn't exist, the solution would be something like this... retrying when chrome.pid is undefined.
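
Roughly, that hypothetical fix would look like the sketch below, using the public chrome-launcher API; none of this is code from this repo:

// Hypothetical sketch: relaunch Chrome when chrome-launcher returns an
// instance without a pid, which would indicate the launch silently failed.
const chromeLauncher = require('chrome-launcher');

const launchChromeWithRetry = async (maxAttempts = 3) => {
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    const chrome = await chromeLauncher.launch({
      chromeFlags: ['--headless', '--no-sandbox'],
    });
    if (chrome.pid !== undefined) {
      return chrome;
    }
    // Launch reported no pid; clean up and try again.
    try {
      await chrome.kill();
    } catch (error) {
      // Ignore: the launch was already in a bad state.
    }
    console.log(`Chrome launch attempt ${attempt} had no pid, retrying...`);
  }
  throw new Error('Unable to launch Chrome with a valid pid');
};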