HTTPArchive / custom-metrics

Custom metrics to use with WebPageTest agents
Apache License 2.0

New Custom metric to check for valid head #12

Closed csliva closed 2 years ago

csliva commented 2 years ago

According to these webmaster guidelines, invalid HTML elements inside the head node terminate the head early. This is worth tracking because Googlebot ends the head at the first invalid element, which can cause SEO issues.

This additional custom metric returns a boolean indicating whether any invalid elements were found, along with a list of those elements.

Tests

URL with broken head: https://crawler-test.com/other/non_head_tag_in_head
WPT test URL: https://www.webpagetest.org/details.php?test=220518_AiDcAF_DFS&run=1&cached=0
Output: {"invalidElements":["div"],"invalidHead":true}

URL with correct head: https://developer.mozilla.org/en-US/
WPT test URL: https://www.webpagetest.org/details.php?test=220518_AiDcP0_DQW&run=1&cached=0
Output: {"invalidElements":[],"invalidHead":false}

URL with broken head: https://crawlgo.fly.dev/badhead
WPT test URL: https://www.webpagetest.org/details.php?test=220525_BiDcTF_FW7&run=1&cached=0
Output: {"invalidElements":["div", "p"],"invalidHead":true}

tunetheweb commented 2 years ago

Hey @csliva, could you run this new custom metric on WebPageTest against a sample of common web pages (ideally including one with a broken HEAD that it detects) and post links to the results as a comment in the PR?

More details here: https://github.com/HTTPArchive/custom-metrics#testing

csliva commented 2 years ago

Good call @tunetheweb, tests failed. WPT moves invalid elements back into the body so I think I'll have to use $WPT_BODIES.

andydavies commented 2 years ago

@csliva It's the browser that's doing it rather than WPT - when it builds the DOM, it truncates the head at the first invalid element.

$WPT_BODIES should get you there but is probably a bit more work
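Because the browser has already repaired the head by the time the DOM exists, the check has to run against the raw HTML source rather than `document.head`. The following is a minimal sketch of that idea, not the code from this PR: `findInvalidHeadElements` and `VALID_HEAD_ELEMENTS` are hypothetical names, the function takes the raw HTML as a plain string (in a custom metric that string would presumably come from the main document's entry in `$WPT_BODIES`), and the regex scan is a simplification that assumes an explicit `</head>` tag and does not look inside `noscript` or `template`.

```javascript
// Tag names the HTML spec allows as children of <head> (metadata content).
const VALID_HEAD_ELEMENTS = new Set([
  'base', 'link', 'meta', 'noscript', 'script', 'style', 'template', 'title',
]);

// Hypothetical helper: scan the raw (unparsed) HTML source for tags inside
// <head> that are not in the allow-list above.
function findInvalidHeadElements(rawHtml) {
  const result = { invalidElements: [], invalidHead: false };

  // Take the text between <head ...> and </head> exactly as written.
  const headMatch = /<head\b[^>]*>([\s\S]*?)<\/head\s*>/i.exec(rawHtml);
  if (!headMatch) return result;

  const head = headMatch[1]
    // Drop comments so commented-out markup is not counted.
    .replace(/<!--[\s\S]*?-->/g, '')
    // Drop the contents of elements whose body is not scanned here,
    // e.g. "<div>" appearing inside a script string.
    .replace(/<(script|style|title|noscript|template)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, '<$1>');

  // Collect every opening tag name, lower-cased, de-duplicated, in order.
  const openTag = /<([a-z][a-z0-9-]*)\b[^>]*>/gi;
  for (const match of head.matchAll(openTag)) {
    const name = match[1].toLowerCase();
    if (!VALID_HEAD_ELEMENTS.has(name) && !result.invalidElements.includes(name)) {
      result.invalidElements.push(name);
    }
  }

  result.invalidHead = result.invalidElements.length > 0;
  return result;
}
```

Called on `'<head><title>x</title><div>a</div></head>'`, this sketch returns `{ invalidElements: ['div'], invalidHead: true }`. A real metric would likely want a proper parser rather than regular expressions, since attribute values containing `>` or a missing `</head>` will confuse this version.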

csliva commented 2 years ago

URL with broken head: https://crawler-test.com/other/non_head_tag_in_head
WPT test URL: https://www.webpagetest.org/details.php?test=220518_AiDcAF_DFS&run=1&cached=0
Output: {"invalidElements":["div"],"invalidHead":true}

URL with correct head: https://developer.mozilla.org/en-US/
WPT test URL: https://www.webpagetest.org/details.php?test=220518_AiDcP0_DQW&run=1&cached=0
Output: {"invalidElements":[],"invalidHead":false}

rviscomi commented 2 years ago

@csliva ping, a couple of comments still outstanding

jroakes commented 2 years ago

This looks really great! One note, we may want to force lower-case on tag name matching.
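The reason this matters is that tag names can arrive in different cases depending on how the markup is parsed: `Element.tagName` is upper-case in HTML documents, while an XML parser preserves the case written in the source. A minimal sketch of the normalization, with `ALLOWED_IN_HEAD` and `isAllowedInHead` as hypothetical names rather than anything from this PR:

```javascript
// Allow-list of <head> children, stored in lower case.
const ALLOWED_IN_HEAD = new Set([
  'base', 'link', 'meta', 'noscript', 'script', 'style', 'template', 'title',
]);

// Normalize the incoming tag name before the lookup so that
// "META", "Meta" and "meta" are all treated the same way.
function isAllowedInHead(tagName) {
  return ALLOWED_IN_HEAD.has(String(tagName).toLowerCase());
}

console.log(isAllowedInHead('META')); // true
console.log(isAllowedInHead('Div'));  // false
```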

csliva commented 2 years ago

Additional tests with a test site I built to make sure XML parsing worked.

URL with broken head: https://crawlgo.fly.dev/badhead
WPT test URL: https://www.webpagetest.org/details.php?test=220525_BiDcTF_FW7&run=1&cached=0
Output: {"invalidElements":["div", "p"],"invalidHead":true}