HTTPArchive / custom-metrics

Custom metrics to use with WebPageTest agents
Apache License 2.0

Investigate impact of first response body not being HTML #15

Closed rviscomi closed 2 years ago

rviscomi commented 2 years ago

Some custom metrics rely on $WPT_BODIES[0] being the main HTML document. However, we've seen some edge cases (on up to 15% of pages) where the first request does not correspond to the main document. These custom metrics would presumably be processing the wrong response body.

Investigate whether this is actually happening and how to fix it, if so.
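If the first body did turn out to be something other than the HTML, one defensive option would be to pick the document by type rather than by position. The following is only a sketch under assumptions: `findMainDocument` is a hypothetical helper, the sample data is made up, and the `response_body` field name is assumed rather than confirmed in this thread (only `url` and `type` are shown here).

```javascript
// Hypothetical helper: pick the main HTML document out of a $WPT_BODIES-like array.
// Assumes each entry has `url` and `type` fields; `response_body` is an assumed field name.
function findMainDocument(bodies) {
  if (!Array.isArray(bodies) || bodies.length === 0) {
    return null;
  }
  // Prefer the first entry whose type is "Document".
  const doc = bodies.find((body) => body && body.type === 'Document');
  // Otherwise fall back to the first entry, which is what existing metrics assume.
  return doc || bodies[0];
}

// Illustrative data only; not taken from a real test.
const sampleBodies = [
  { url: 'http://example.com/ca.crt', type: 'Other', response_body: '' },
  { url: 'https://example.com/', type: 'Document', response_body: '<!doctype html><title>Hi</title>' },
];

const mainDocument = findMainDocument(sampleBodies);
console.log(mainDocument.url);
```

Inside an actual custom metric, the array passed in would be $WPT_BODIES itself. Falling back to index 0 keeps the current behavior when no entry is typed as a document.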

rviscomi commented 2 years ago

It doesn't seem like the $WPT_BODIES object is affected by this issue.

Here is a test from the 2022_04_01_desktop crawl in which the URL was incorrectly parsed as http://cacerts.digitalcertvalidation.com/TrustAsiaTLSRSACA.crt. The request at entries[0] in the HAR corresponds to the certificate and _is_base_page is set to true.

{
    "_full_url": "http://cacerts.digitalcertvalidation.com/TrustAsiaTLSRSACA.crt",
    "_is_base_page": true,
    "_index": 0
}
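To look for other tests affected in the same way, one rough approach is to find the HAR entry flagged as the base page and compare its host with the page that was tested. This is a sketch, not part of the actual pipeline: `checkBasePage` is a hypothetical helper, the `log.entries` layout follows the standard HAR structure, and the hostname comparison is a crude heuristic that would also flag legitimate cross-host redirects.

```javascript
// Hypothetical check: does the HAR entry flagged as the base page
// belong to the same host as the URL that was tested?
function checkBasePage(har, testedUrl) {
  const entries = (har && har.log && har.log.entries) || [];
  const index = entries.findIndex((entry) => entry._is_base_page === true);
  if (index === -1) {
    return { found: false, index: -1, url: null, matchesTestedHost: false };
  }
  const url = entries[index]._full_url;
  let matchesTestedHost = false;
  try {
    matchesTestedHost = new URL(url).hostname === new URL(testedUrl).hostname;
  } catch (err) {
    // An unparsable URL is treated as a mismatch.
    matchesTestedHost = false;
  }
  return { found: true, index, url, matchesTestedHost };
}

// Entry values copied from the HAR excerpt in this comment; the wrapper object is assumed.
const sampleHar = {
  log: {
    entries: [
      {
        _full_url: 'http://cacerts.digitalcertvalidation.com/TrustAsiaTLSRSACA.crt',
        _is_base_page: true,
        _index: 0,
      },
    ],
  },
};

const baseCheck = checkBasePage(sampleHar, 'https://52.mk/');
console.log(JSON.stringify(baseCheck));
```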

I reran the test with a custom metric that outputs the full $WPT_BODIES object. The object schema is not quite the same as the HAR but it's clear that the first item is the HTML document itself, not the cert:

{
    "url": "https://52.mk/",
    "type": "Document"
}

So I think we're ok.
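For anyone repeating this check, a diagnostic metric along these lines might be enough. It is a sketch rather than the metric actually used in the rerun: `summarizeBodies` is a hypothetical name, and `response_body` is an assumed field name that is reduced to a length to keep the output small.

```javascript
// Hypothetical diagnostic: reduce a $WPT_BODIES-like array to the fields
// needed to see which response comes first.
function summarizeBodies(bodies) {
  return (bodies || []).map((body, index) => ({
    index,
    url: body.url,
    type: body.type,
    // `response_body` is an assumed field name; report only its length.
    bodyLength: typeof body.response_body === 'string' ? body.response_body.length : 0,
  }));
}

// Illustrative input; the first entry mirrors the url/type values shown above.
const exampleBodies = [
  { url: 'https://52.mk/', type: 'Document', response_body: '<html></html>' },
  { url: 'https://52.mk/app.js', type: 'Script', response_body: 'void 0;' },
];

const summary = summarizeBodies(exampleBodies);
console.log(JSON.stringify(summary, null, 2));
```

In a custom metric, the function would be called with $WPT_BODIES and the summary returned as the metric value.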

pmeenan commented 2 years ago

The waterfall and main request data are post-processed using the netlog trace events. $WPT_BODIES (and $WPT_REQUESTS) use the DevTools request details, which don't see things like OCSP checks, so the first request should (hopefully) always be the actual navigation.

rviscomi commented 2 years ago

Great, thanks for confirming.