Serverless (function-as-a-service) & lighthouse (NO_SCREENSHOTS, Load timeout)

doteric commented 1 year ago

Hello 👋 First of all thank you for maintaining this awesome open source tool!

I'm writing this issue to collect the problems I've run into so far while running Lighthouse on AWS Lambda.

The setup I'm using is:

- @sparticuz/chromium (as an AWS Lambda Layer) with the following Chrome flags: https://github.com/Sparticuz/chromium/blob/master/source/index.ts#L144 plus --disable-gpu
- chrome-launcher (also tried puppeteer instead)
- Lambda memory: 1536 MB

Problems I've faced with running lighthouse on AWS Lambda:

~~1. Load timeout. This happens every time, for every test. It looks like one (or more) of the following conditions never passes in a serverless environment, so the timeout (45s by default) is always reached: https://github.com/GoogleChrome/lighthouse/blob/main/core/gather/driver/wait-for-condition.js#L405 I haven't yet investigated which one exactly is causing the issue, but maybe you have already? If not, I could probably find some time next week to investigate.~~ Read: https://github.com/GoogleChrome/lighthouse/issues/14955#issuecomment-1503085397

2. NO_SCREENSHOTS. This problem happens very rarely, and it prevents the performance score (along with the Speed Index) from being calculated. It's very hard to pinpoint the exact cause, as it seems very random. It might be related to the running Chrome instance: I've noticed that on one instance none of the tests had this error, while on a different one all of them did. However, I cannot confirm this theory. If you could point me to the code that actually takes the screenshots, and to what could potentially fail in that process, then maybe I could investigate that as well.

I am fully aware of the "Avoid function-as-a-service infrastructure (Lambda, GCF, etc)" advice in https://github.com/GoogleChrome/lighthouse/blob/main/docs/variability.md#run-on-adequate-hardware , but I would like to know the reasoning behind it, and whether it would be possible to actually support serverless, since it's used very often. I'm guessing that someone has already investigated this, so I'd like to avoid duplicating the work, hear why serverless infrastructure isn't supported, and learn whether there are any possible fixes for the issues listed above. If you lack the time to investigate particular parts but think something should be possible, let us (the community) know, so that maybe somebody can help out.

I've also noticed the following issues, but none of them have provided a perfect solution to the above problems:

Appreciate any replies and help 💪

doteric commented 1 year ago

@adamraine, would you by any chance have some time to check this issue? 🙏 Would appreciate it 🙇

doteric commented 1 year ago

As for point 1 I think I've found the main reason:

const resolveOnCriticalNetworkIdle = waitForNetworkIdle(session, networkMonitor, {
  networkQuietThresholdMs,
  busyEvent: 'network-critical-busy',
  idleEvent: 'network-critical-idle',
  isIdle: recorder => recorder.isCriticalIdle(),
});

inside https://github.com/GoogleChrome/lighthouse/blob/main/core/gather/driver/wait-for-condition.js#L440 Without it, the timeout does not seem to happen. However, I also noticed that this does not happen on most websites, only on particular ones (the ones I want to test, for example), so this is not strictly a serverless problem. I'm therefore removing point 1 from this issue, as it should be treated separately. As for the problem itself, it seems that an auth check that runs periodically keeps the test from finishing successfully, so I will try blocking that request at the LH level and see if that works.
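
For anyone hitting the same load timeout, here is a minimal sketch of blocking a polling request at the Lighthouse level. It assumes the Node API and the `blockedUrlPatterns` setting (the CLI equivalent is `--blocked-url-patterns`). The URL pattern and the helper names are hypothetical; substitute whichever request keeps the network busy on your page.

```javascript
// Hypothetical pattern for the periodic auth check; replace with the real URL.
const AUTH_POLL_PATTERN = '*://example.com/api/auth/check*';

// Build Lighthouse flags that block the given URL patterns during the run.
function buildFlags(port, blockedUrlPatterns = [AUTH_POLL_PATTERN]) {
  return {
    port, // debugging port of an already-launched Chrome
    output: 'json',
    blockedUrlPatterns, // matching requests are blocked from loading
  };
}

// Run Lighthouse with the blocked patterns. Assumes the `lighthouse` package
// is installed; it is imported lazily so buildFlags() works without it.
async function runWithBlockedAuth(url, port) {
  const {default: lighthouse} = await import('lighthouse');
  const result = await lighthouse(url, buildFlags(port));
  return result.lhr;
}
```

Blocking the request changes what the page loads, so the results describe the page without that request.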

So now only point 2 remains (as the serverless problem), which I think is the more important one overall. I haven't found a good starting point for investigating it yet. I have a question though: where and how are the screenshots kept during the test? Maybe they are written to a path that Lambda doesn't support, so even though writing the files works, they get cleaned up almost instantly 🤔 (Just a guess; I haven't investigated this.)

paulirish commented 1 year ago

Are you using headless=new ?

doteric commented 1 year ago

Hey @paulirish, thank you for the reply 💪 I've tried both --headless='new' and --headless, and the behavior seems pretty similar. The odds might be slightly different (possibly because I haven't run enough tests); it seems pretty random whether a test will be good or bad.

doteric commented 1 year ago

@paulirish Would you be able to point me to the place where the screenshots are gathered? I guess it's in the gatherers, but I couldn't find how exactly it is done :/ Maybe I could try to debug it.

doteric commented 1 year ago

Bump on this topic. @paulirish @connorjclark @adamraine @brendankenny Really sorry for bothering you, but could any of you give me some more details on what could be happening, and on whether this has been investigated before? If not, then with some extra details I could try investigating it myself. I'm guessing this could be an issue in Chrome itself, with the DevTools protocol not returning all the needed artifacts? Appreciate it 🙇

connorjclark commented 1 year ago

re: timeout. There is the --max-wait-for-load option, which defaults to 45000 (45s). You could set it higher for machines with variable load.
re: NO_SCREENSHOTS. Maybe you need xvfb. See what we do in GHCI: https://github.com/GoogleChrome/lighthouse/blob/59c6d8e59cab9e2ed79d1db3770c7f928c6ae5b6/.github/workflows/smoke.yml#L54-L57
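
When Lighthouse is driven from Node rather than the CLI, the same timeout can be raised through the `maxWaitForLoad` flag. A minimal sketch follows, with a hypothetical helper name and an arbitrary multiplier chosen only for illustration:

```javascript
// Lighthouse's default load timeout in milliseconds (45s, as stated above).
const DEFAULT_MAX_WAIT_FOR_LOAD_MS = 45_000;

// Return flags with a longer load timeout. The multiplier is an illustrative
// choice, not a recommended value; tune it for your environment.
function flagsWithLongerTimeout(port, multiplier = 2) {
  if (!(multiplier >= 1)) {
    throw new RangeError('multiplier must be a number >= 1');
  }
  return {
    port, // debugging port of an already-launched Chrome
    output: 'json',
    maxWaitForLoad: DEFAULT_MAX_WAIT_FOR_LOAD_MS * multiplier,
  };
}
```

On Lambda, the function's own timeout has to be longer than this value plus Chrome startup time, or the invocation is killed before Lighthouse gives up.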

doteric commented 1 year ago

Thanks @connorjclark for the reply 🙇

1. I've already dealt with this, hence the strikethrough :D
2. That would be very interesting, as it doesn't always fail, only sometimes (roughly 50/50). I will check whether some Lambda containers might have something installed that others don't, but that would be very weird... What do you think? I'll also try adding xvfb and check whether that helps. Please also keep in mind that the final screenshot (the full-page screenshot) is always created fine; it's only the screenshots taken during the loading process that seem to be missing 🤷‍♂️

doteric commented 1 year ago

I've recently started working with Lighthouse user flows, and I noticed that some LH navigation tests work fine while others result in the exact same NO_SCREENSHOTS error within the same user flow. That means the same browser and all the same settings, yet something still goes wrong. I initially thought it only happened on the first run and was always fine on later runs, but then I got a result where the 1st attempt was fine, the 2nd and 3rd failed with NO_SCREENSHOTS, and the 4th and 5th were fine again.
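
To see which steps of a flow are affected, something like the following sketch can help. It assumes the flow result has the shape `{steps: [{name, lhr}]}` and that a failed audit carries the error code inside its `errorMessage` text (for example ending in "(NO_SCREENSHOTS)"). The helper name is hypothetical.

```javascript
// Return the names of flow steps in which at least one audit failed with an
// errorMessage mentioning the given error code (e.g. 'NO_SCREENSHOTS').
function stepsWithAuditError(flowResult, code) {
  return flowResult.steps
    .filter(step =>
      Object.values(step.lhr.audits).some(audit =>
        (audit.errorMessage || '').includes(code)
      )
    )
    .map(step => step.name);
}
```

Logging this per invocation, together with an identifier for the Lambda container, would show whether the failures cluster on particular instances.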

Examples:

FYI. @connorjclark / @paulirish 🙇‍♂️

connorjclark commented 7 months ago

Could you extract the traces / LH artifacts from these bad runs and upload them here? We don't support running LH on lambda so I can't promise any resolution here, but I can take a look at the trace/artifacts and see if there is anything obviously wrong.
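
A quick way to check a saved trace before uploading it is to count its screenshot events. This sketch assumes that Chrome records screenshot frames as trace events named `Screenshot` in the `disabled-by-default-devtools.screenshot` category. The helper names are hypothetical, and the trace file path depends on how the run was saved.

```javascript
// Category under which Chrome records screenshot frames in a trace
// (an assumption stated in the paragraph above).
const SCREENSHOT_CATEGORY = 'disabled-by-default-devtools.screenshot';

// Count screenshot events in a parsed trace. Accepts either a bare array of
// events or an object of the form {traceEvents: [...]}.
function countScreenshotEvents(trace) {
  const events = Array.isArray(trace) ? trace : trace.traceEvents || [];
  return events.filter(event =>
    event.name === 'Screenshot' &&
    typeof event.cat === 'string' &&
    event.cat.includes(SCREENSHOT_CATEGORY)
  ).length;
}

// Read a trace JSON file from disk and count its screenshot events.
// `tracePath` is whichever trace file your run wrote to disk.
async function countScreenshotsInFile(tracePath) {
  const {readFile} = await import('node:fs/promises');
  return countScreenshotEvents(JSON.parse(await readFile(tracePath, 'utf8')));
}
```

A count of zero on a failed run would suggest the frames never reached the trace, rather than being lost later during auditing.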

doteric commented 7 months ago

Thank you @connorjclark for replying 🙇 Sure, I can get some failed-run artifacts for you. Just to be precise, do you mean the RunnerResult.artifacts object as JSON, or something else? Also, can it contain any sensitive information apart from what's on the actual website? In other words, can I post it here publicly? I don't have the time to look through it.
Thank you for the help btw 💪

doteric commented 6 months ago

@connorjclark ping 🙏
If you have some time

connorjclark commented 6 months ago

> do you mean the RunnerResult.artifacts object as a JSON to be precise or something else?

Yes, but a zip of latest-run would be better; that is the folder of artifacts generated when you use the -G flag.

> Also can it contain any sensitive information apart from what's on the actual website btw?

Treat it like it's giving someone full access to anything the browser devtools can show you. In general, this is not an issue.
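
For the Lambda case, a rough sketch of gather mode through the Node API is below. It assumes that the `gatherMode` flag accepts a folder path in the same way `-G` optionally does on the CLI. It writes to `/tmp` because that is the writable directory in the Lambda filesystem. The helper names are hypothetical.

```javascript
// Folder for the saved artifacts; /tmp is the writable location on Lambda.
const DEFAULT_ARTIFACTS_DIR = '/tmp/latest-run';

// Build flags that ask Lighthouse to gather artifacts and save them to disk.
// Assumption: a string value for gatherMode is treated as the save path.
function gatherFlags(port, saveDir = DEFAULT_ARTIFACTS_DIR) {
  return {
    port, // debugging port of an already-launched Chrome
    gatherMode: saveDir,
  };
}

// Run a gather-only pass and return the folder to zip and upload afterwards.
// Assumes the `lighthouse` package is installed; imported lazily.
async function gatherToDisk(url, port, saveDir = DEFAULT_ARTIFACTS_DIR) {
  const {default: lighthouse} = await import('lighthouse');
  await lighthouse(url, gatherFlags(port, saveDir));
  return saveDir;
}
```

The folder still has to be zipped and copied out of the function (for example to object storage) before the invocation ends.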

doteric commented 6 months ago

@connorjclark Big thanks for the reply

Below is a json of the artifacts. example-fail-artifacts.json

Whenever I have time, I will also do a run in gatherMode as you suggested, which would produce the .zip file. It will require some fiddling with the AWS Lambda, so it's not that straightforward, but hopefully I'll have time to do it this week.
Appreciate it btw 🙇

doteric commented 3 months ago

Hello @connorjclark, sorry for taking so long; I kind of forgot about this and never had enough time to come back to it. Today, though, I decided to return to this topic and grab the artifacts you asked for.

Artifacts of the failed run in zip: latest-run.zip

Hopefully that will help you identify the problem.

Cheers and big thanks again 🙇
