GoogleChrome / lighthouse

Automated auditing, performance metrics, and best practices for the web.
https://developer.chrome.com/docs/lighthouse/overview/
Apache License 2.0

PSI API giving ERRORED_DOCUMENT_REQUEST error for some urls that worked recently #15989

Closed pushkarbh closed 4 weeks ago

pushkarbh commented 1 month ago

URL

https://www.realtor.com/realestateandhomes-search/Chicago_IL

What happened?

The URL https://www.realtor.com/realestateandhomes-search/Chicago_IL and some other valid URLs from the same domain have started failing in PSI API calls. We used the PSI API for these URLs successfully for a long time, but have been seeing these errors for the past couple of weeks. Here is the error:

[Lighthouse returned error: ERRORED_DOCUMENT_REQUEST. Lighthouse was unable to reliably load the page you requested. Make sure you are testing the correct URL and that the server is properly responding to all requests. (Status code: 403)]

All of these failing URLs continue to work on https://pagespeed.web.dev. I checked bug reports for similar errors, but most of those are for Lighthouse as opposed to the PSI API. I see some possible causes listed in https://github.com/GoogleChrome/lighthouse/issues/2784, but I'm curious why the same URLs work on the PSI site. We run the API from a Python script, but the same error can be reproduced by calling the API from Postman as well.

Please suggest what can be done to resolve this.

What did you expect?

As mentioned earlier, these URLs worked until a couple of weeks ago. We expect the API to give us web vitals data from field and lab metrics, very similar to what we can still see on https://pagespeed.web.dev.

What have you tried?

Tested different urls and validated on https://pagespeed.web.dev. Other urls from different sites we use in our test suite continue to work. Just the urls from this domain stopped working recently.

How were you running Lighthouse?

PageSpeed Insights, Other

Lighthouse Version

11.5.0

Chrome Version

119.0.0.0

Node Version

No response

OS

Linux & Mac

Relevant log output

{
    "error": {
        "code": 400,
        "message": "Lighthouse returned error: ERRORED_DOCUMENT_REQUEST. Lighthouse was unable to reliably load the page you requested. Make sure you are testing the correct URL and that the server is properly responding to all requests. (Status code: 403)",
        "errors": [
            {
                "message": "Lighthouse returned error: ERRORED_DOCUMENT_REQUEST. Lighthouse was unable to reliably load the page you requested. Make sure you are testing the correct URL and that the server is properly responding to all requests. (Status code: 403)",
                "domain": "lighthouse",
                "reason": "lighthouseUserError"
            }
        ]
    }
}
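For scripts that call the API, the payload above can be reduced to the three values that matter when triaging: the API's own status (400), the Lighthouse error code, and the status the target site returned (403). The following is a minimal sketch; `summarize_psi_error` is a hypothetical helper, and the regular expressions assume the message wording shown in this log, which is not a documented contract.

```python
import re


def summarize_psi_error(payload: dict) -> dict:
    """Extract the key fields from a PSI error payload.

    Hypothetical helper: the message patterns below are assumptions based
    on the log output in this issue, not on a documented format.
    """
    error = payload.get("error", {})
    message = error.get("message", "")
    # e.g. "Lighthouse returned error: ERRORED_DOCUMENT_REQUEST."
    lh_code = re.search(r"Lighthouse returned error: ([A-Z_]+)", message)
    # e.g. "(Status code: 403)" is the status the tested page returned
    status = re.search(r"\(Status code: (\d+)\)", message)
    return {
        "api_status": error.get("code"),
        "lighthouse_error": lh_code.group(1) if lh_code else None,
        "document_status": int(status.group(1)) if status else None,
    }
```

A `document_status` of 403 points at the tested site refusing the request rather than at a problem with the API call itself.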

connorjclark commented 1 month ago

Does this still occur with 12.0 (we just updated PSI API)?

I just tried a few times and it seems to work for me. It may be an intermittent error.

pushkar-bh commented 1 month ago

I just tried the endpoint we've been using, "https://www.googleapis.com/pagespeedonline/v5/runPagespeed", and I'm still getting the same error.

Here is the curl command:

curl --location 'https://www.googleapis.com/pagespeedonline/v5/runPagespeed?key=<API-KEY>&url=https%3A%2F%2Fwww.realtor.com%2Frealestateandhomes-search%2FChicago_IL&strategy=mobile'
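Since the original report mentions a Python script, the same request can be sketched with the standard library alone. The names `build_psi_url` and `run_psi` are hypothetical, and the API key is a placeholder; only the endpoint and the `key`, `url`, and `strategy` parameters come from the curl command in this comment.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

PSI_ENDPOINT = "https://www.googleapis.com/pagespeedonline/v5/runPagespeed"


def build_psi_url(page_url: str, api_key: str, strategy: str = "mobile") -> str:
    """Build the request URL, percent-encoding the page URL as curl needs it."""
    query = urllib.parse.urlencode(
        {"key": api_key, "url": page_url, "strategy": strategy}
    )
    return f"{PSI_ENDPOINT}?{query}"


def run_psi(page_url: str, api_key: str, strategy: str = "mobile") -> dict:
    """Call the API and return the decoded JSON body, including error bodies."""
    try:
        with urllib.request.urlopen(
            build_psi_url(page_url, api_key, strategy), timeout=120
        ) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # Error responses such as the 400 in this issue carry a JSON body.
        # This assumes the body is JSON; a non-JSON body would raise here.
        return json.loads(err.read().decode("utf-8"))
```

Returning the error body instead of raising keeps the Lighthouse error message available to the caller, which is the part that reveals the 403.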

How do I test this api with 12.0? Using v12 as opposed to v5 gives a 404 error.

connorjclark commented 1 month ago

Thanks. I'll look further tomorrow.

> How do I test this api with 12.0? Using v12 as opposed to v5 gives a 404 error.

You already are. There's only one PSI version (v5), but we update the LH version there (which is now 12).

pushkar-bh commented 1 month ago

Hopefully you're able to reproduce the issue. Let me know if not. Thanks!

connorjclark commented 1 month ago

I overlooked the 403 in your error message. I get the same locally when using the API, and also via plain usage of curl:

curl https://www.realtor.com/realestateandhomes-search/Chicago_IL -I

Seems your webserver is blocking UAs that indicate curl was used (or rather, that a web browser is not being used), which would explain failures of the API from programmatic usage.
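One way to check the User-Agent theory is to send the same HEAD request (the equivalent of `curl -I`) with different User-Agent strings and compare status codes. This is only a sketch: `build_head_request` and `probe_status` are hypothetical names, the User-Agent strings passed in are up to the caller, and a bot-protection layer may key on signals other than the User-Agent, so matching results would not rule out blocking.

```python
import urllib.error
import urllib.request


def build_head_request(url: str, user_agent: str) -> urllib.request.Request:
    """Build a HEAD request that announces the given User-Agent."""
    return urllib.request.Request(
        url, headers={"User-Agent": user_agent}, method="HEAD"
    )


def probe_status(url: str, user_agent: str) -> int:
    """Return the HTTP status the server gives for this User-Agent."""
    try:
        with urllib.request.urlopen(
            build_head_request(url, user_agent), timeout=30
        ) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; the code is what we want.
        return err.code
```

Calling `probe_status` once with a curl-style User-Agent and once with a browser-style one shows whether the server treats them differently.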

The 403 error is coming from a machine at Google making requests to your webserver, which IIUC should be the same whether curl kicks off the API request or the webserver does.... so actually I'm really unsure why this could be happening. @paulirish mentions that perhaps X-Forwarded-For is what varies. Is your server perhaps checking that, or any other request headers, and blocking access to some bots?

pushkar-bh commented 1 month ago

I tried curl https://www.realtor.com and it returns an error page with "This page requires JavaScript!" in the HTML response.

I don't work for realtor.com, so I won't be able to find out what has changed. But it seems like they've recently added some defense to non-browser accesses. This used to work, so must be a recent change.

Is there any way to make this work by sending custom headers to the PSI API? Thanks for looking into this.

pushkar-bh commented 1 month ago

I think the options of using PSI for the mentioned domain are limited given the bot control mechanism put in place. Can the CrUX API or CrUX History API be used to fetch the aggregated data from BigQuery without reaching the origin url?

connorjclark commented 1 month ago

We have some planned changes to the PSI API that preclude spending time now on making it still return the CrUX parts of the response even if the Lighthouse part fails. For now, any error in the Lighthouse part will fail the entire request.

Is what you're looking for not part of these APIs? https://developer.chrome.com/docs/crux/methodology/tools#tool-crux-api or https://developer.chrome.com/docs/crux/methodology/tools#tool-crux-history-api
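For reference, a CrUX API lookup does not load the page at all, so the site's bot protection should not come into play. The sketch below uses the `records:queryRecord` endpoint with a `url` and `formFactor` in the JSON body; `build_crux_request` and `query_crux` are hypothetical names, the API key is a placeholder, and the History API would use `records:queryHistoryRecord` in place of `records:queryRecord`.

```python
import json
import urllib.request

CRUX_ENDPOINT = "https://chromeuxreport.googleapis.com/v1/records:queryRecord"


def build_crux_request(
    page_url: str, api_key: str, form_factor: str = "PHONE"
) -> urllib.request.Request:
    """Build the POST request for a URL-level CrUX record."""
    body = json.dumps({"url": page_url, "formFactor": form_factor}).encode("utf-8")
    return urllib.request.Request(
        f"{CRUX_ENDPOINT}?key={api_key}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def query_crux(page_url: str, api_key: str, form_factor: str = "PHONE") -> dict:
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(
        build_crux_request(page_url, api_key, form_factor), timeout=30
    ) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Swapping the `url` field for `origin` in the body would request origin-level data instead of data for a single page.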

pushkar-bh commented 1 month ago

It would be great to have PSI api to return CrUX part despite Lighthouse failures. Do you have a rough idea when these changes may be available? Is it like 1-2 quarters or longer?

For now I'm going to see if we can use the CrUX or CrUX History api. Thanks!

paulirish commented 4 weeks ago

> It would be great to have PSI api to return CrUX part despite Lighthouse failures. Do you have a rough idea when these changes may be available? Is it like 1-2 quarters or longer?

Unlikely.

> For now I'm going to see if we can use the CrUX or CrUX History api. Thanks!

Good plan. :)