HTTPArchive / legacy.httparchive.org

<<THIS REPOSITORY IS DEPRECATED>> The HTTP Archive provides information about website performance such as # of HTTP requests, use of gzip, and amount of JavaScript. This information is recorded over time revealing trends in how the Internet is performing. Built using Open Source software, the code and data are available to everyone allowing researchers large and small to work from a common base.
https://legacy.httparchive.org

New metrics and bug fixes #172

Closed: foxdavidj closed this 4 years ago

foxdavidj commented 4 years ago

I've looked through all of the metrics from 2019 and created new functions to minimize the use of regex this year.

Here are the summaries:

Metric to Function mapping

| Metric | Function to use |
| --- | --- |
| 04_04.sql | inline_svg_stats |
| 04_06.sql | images.(total_*) |
| 04_06b.sql | images.sizes_values |
| 04_08.sql | image_stats.total_pictures_with_img |
| 04_09a.sql | meta-nodes should already take care of what is portable here |
| 04_09b.sql | meta-nodes should already take care of what is portable here |
| 04_09c.sql | meta-nodes should already take care of what is portable here |
| 04_09d.sql | meta-nodes and images.sizes_values |
| 04_19.sql | videos |
| 04_26.sql | images.images_using_loading_prop |
| 08_32.sql | Using headers |
| 08_33.sql | Using headers |
| 08_35-37.sql | Using headers |
| 08_39.sql | scripts and link-nodes |
| 08_42.sql | Using headers |
| 09_02.sql | nodes_using_role |
| 09_04.sql | nodes_using_role |
| 09_05.sql | nodes_using_role |
| 09_05b.sql | nodes_using_role |
| 09_06.sql | total_nodes_with_duplicate_ids |
| 09_11.sql | heading-order from Axe: https://dequeuniversity.com/rules/axe/3.0/heading-order |
| 09_13.sql | headings_order |
| 09_14.sql | shortcuts_stats |
| 09_15.sql | shortcuts_stats |
| 09_16.sql | Can use input-elements |
| 09_16b.sql | Can use input-elements |
| 09_17.sql | nodes_using_role |
| 09_19a.sql | attributes_used_on_elements |
| 09_19b.sql | attributes_used_on_elements |
| 09_19c.sql | attributes_used_on_elements |
| 09_22.sql | body_node |
| 09_24.sql | attributes_used_on_elements |
| 09_32b.sql | images.alt_tag_lengths |
| 10_02.sql | html_node |
| 10_04a.sql | link-nodes |
| 10_04b.sql | link-nodes |
| 10_07b.sql | document_title |
| 10_16.sql | length_of_h1s |
| 14_02.sql | meta-nodes |
| 14_03.sql | meta-nodes |

Metrics unable to create a function for

| Metric | Reason |
| --- | --- |
| 01_18.sql | It's scanning JavaScript |
| 01_24.sql | It's scanning JavaScript |
| 01_25.sql | It's scanning JavaScript |
| 02_11.sql | Can't grab CSS comments from the DOM |
| 02_45.sql | Possible, but it would bloat the custom metrics. The result would be an array of numbers, each giving the number of classes used on one element, so the list would be gigantic. But because we want percentiles, there's no other way to store this information. Thoughts? |
| 11_06.sql | It's scanning JavaScript |
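For 02_45.sql, one possible middle ground (a suggestion, not something in this PR): emit a histogram of classes-per-element counts instead of the raw per-element array. Percentiles can still be derived from a histogram, and its size is bounded by the number of distinct counts rather than the number of elements.

```javascript
// Sketch: summarize classes-per-element as a histogram rather than a raw
// per-element array, keeping the custom-metric payload small.
// `doc` is the page's Document (anything exposing querySelectorAll).
function classCountHistogram(doc) {
  const counts = {};
  for (const el of doc.querySelectorAll('*')) {
    const n = el.classList.length;      // number of classes on this element
    counts[n] = (counts[n] || 0) + 1;   // tally elements by class count
  }
  return counts;
}
```

In a WPT custom metric this would be `return JSON.stringify(classCountHistogram(document));`.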

Metrics which were moved

  1. has_picture_img is now images.total_pictures_with_img. Instead of reporting a boolean, it now reports how many occurrences there were.
rviscomi commented 4 years ago

If we really needed to scrape the JavaScript responses, maybe this would help: https://timkadlec.com/remembers/2020-04-16-webpagetest-custom-metrics-with-request-data/

Edit: Actually, @Tiggerito isn't this almost exactly the use case you wanted for SEO? (comparing initial SSR HTML state with dynamic CSR state)

> The second custom metric, num-total-scripts, should look familiar—that grabs how many script elements appear in the final version of the document, after the page has been loaded and all JavaScript has run.
>
> The first custom metric, num-initial-scripts, counts the number of script elements in the initial HTML. First it grabs the body of the first response using the $WPT_BODIES placeholder. Since that returns a string, we then convert it to HTML so we can parse it more easily. Finally, once we have HTML, we query it as we did the original document.

```
[num-initial-scripts]
let html = $WPT_BODIES[0].response_body;
let wrapper = document.createElement('div');
wrapper.innerHTML = html;

return wrapper.querySelectorAll('script').length;

[num-total-scripts]
return document.querySelectorAll('script').length;
```
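Building on the num-initial-scripts pattern above, a hedged sketch of the SSR-vs-CSR comparison mentioned for SEO: pull a value out of the raw first-response HTML and compare it against the rendered document. The regex-based `initialTitle` helper is illustrative, not from the article.

```javascript
// Illustrative helper (not from the article): extract the <title> text from
// the raw server-rendered HTML string, e.g. $WPT_BODIES[0].response_body.
function initialTitle(rawHtml) {
  const m = rawHtml.match(/<title[^>]*>([\s\S]*?)<\/title>/i);
  return m ? m[1].trim() : null;
}
```

A custom metric could then `return initialTitle($WPT_BODIES[0].response_body) !== document.title.trim();` to flag titles rewritten by client-side JavaScript.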
rviscomi commented 4 years ago

@obto this looks amazing! Really nice work!

I'll do a thorough code review soon, but for now could you test this on a bunch of websites and share the WPT results? See https://github.com/HTTPArchive/legacy.httparchive.org/pull/170#issuecomment-661658511 for example.

FYI @HTTPArchive/analysts: this mega PR may handle many of the custom metrics needed for your chapters.

Tiggerito commented 4 years ago

> If we really needed to scrape the JavaScript responses, maybe this would help: https://timkadlec.com/remembers/2020-04-16-webpagetest-custom-metrics-with-request-data/
>
> Edit: Actually, @Tiggerito isn't this almost exactly the use case you wanted for SEO? (comparing initial SSR HTML state with dynamic CSR state)
>
> The second custom metric, num-total-scripts, should look familiar—that grabs how many script elements appear in the final version of the document, after the page has been loaded and all JavaScript has run. The first custom metric, num-initial-scripts, counts the number of script elements in the initial HTML. First it grabs the body of the first response using the $WPT_BODIES placeholder. Since that returns a string, we then convert it to HTML so we can parse it more easily. Finally, once we have HTML, we query it as we did the original document.
>
> ```
> [num-initial-scripts]
> let html = $WPT_BODIES[0].response_body;
> let wrapper = document.createElement('div');
> wrapper.innerHTML = html;
>
> return wrapper.querySelectorAll('script').length;
>
> [num-total-scripts]
> return document.querySelectorAll('script').length;
> ```

$WPT_BODIES could be a great solution. I've just done something similar to get http headers. So much easier to analyse everything in one place. 😎
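A sketch of that headers idea (hedged: the exact shape of $WPT_BODIES entries isn't documented in this thread; a `response_headers` name/value object on each entry is an assumption):

```javascript
// Sketch (assumption): each $WPT_BODIES entry is treated as carrying request
// metadata alongside response_body; the response_headers field name is assumed,
// not confirmed by this thread.
function responseHeader(bodies, name) {
  if (!bodies || !bodies.length) return null;
  const headers = bodies[0].response_headers || {};
  // Header names are case-insensitive, so match ignoring case.
  const key = Object.keys(headers).find(k => k.toLowerCase() === name.toLowerCase());
  return key === undefined ? null : headers[key];
}
```

In a custom metric: `return responseHeader($WPT_BODIES, 'x-robots-tag');`.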

Tiggerito commented 4 years ago

@obto @rviscomi I've already manually reviewed and updated our scripts for 2020 in a draft pull request. And with the $WPT_BODIES idea I think I can get rid of the last evil one 😁

Tiggerito commented 4 years ago

Sorry @obto. Just realised this is for the almanac.js file. I also have a draft pull request for that which has quite a bit of overlap.

How should I go about pulling things together?

pmeenan commented 4 years ago

Just a quick word of caution about WPT_REQUESTS and WPT_BODIES. I'm not sure how they will do at HA-scale on some of the more extreme pages (5000+ requests, 50+MB of uncompressed scripts, etc). If possible I'd recommend finding a few of the extreme sites and testing the custom metrics just to make sure it doesn't accidentally break the test.

If multiple scripts are together in one custom metric, it would be worthwhile to have a single global instance of WPT_BODIES (and no WPT_REQUESTS since bodies includes the requests) to make sure we're not multiplying the impact out.

The bodies/requests are serialized to JSON and inserted into the custom metric. The custom metric is executed by sending the script over the websocket dev tools interface where it needs to be evaluated and executed. I'm not sure how well it will handle hundreds of MB of script.
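The single-instance suggestion could be sketched like this, with one `bodies` variable (standing in for the lone `$WPT_BODIES` reference) feeding every derived stat:

```javascript
// Sketch: compute all request-level stats from one shared copy of the bodies
// array instead of expanding $WPT_BODIES in several custom metrics.
// The response_body field follows the num-initial-scripts example above.
function requestSummary(bodies) {
  let totalBodyBytes = 0;
  for (const r of bodies) {
    totalBodyBytes += r.response_body ? r.response_body.length : 0;
  }
  return { requestCount: bodies.length, totalBodyBytes };
}
```

On a 5000-request page the bodies still get serialized over the DevTools websocket, but only once.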

foxdavidj commented 4 years ago

@Tiggerito I've tried to modularize a lot of the data we're collecting and make it multi-purpose instead of specific per chapter (e.g., see the 'images' function in the PR).

After taking a quick scan of yours, it looks like there's a good deal of overlap because of this. Think it might be easiest if I pull in your changes to this PR and have you review it. What do you think?

foxdavidj commented 4 years ago

Or alternatively, we could review this PR and merge it. Then I can open up another PR to add the missing data that you've added in your PR.

What do you think @rviscomi? Merge this PR then bring in the additional SEO changes? Or get it all done at once?

Sorry about all this @Tiggerito. Hope I'm not stepping on your toes too much here 💦

rviscomi commented 4 years ago

> What do you think @rviscomi? Merge this PR then bring in the additional SEO changes? Or get it all done at once?

I'll defer to whatever workflow is easiest for you and @Tiggerito to resolve the impending merge conflicts.

rviscomi commented 4 years ago

> If multiple scripts are together in one custom metric, it would be worthwhile to have a single global instance of WPT_BODIES (and no WPT_REQUESTS since bodies includes the requests) to make sure we're not multiplying the impact out.

That's a good point. So: if a page has so much data that it makes $WPT_BODIES totally infeasible, hopefully we can still collect metrics in almanac.js. Even if not, those few cases would be outliers.

Tiggerito commented 4 years ago

> > What do you think @rviscomi? Merge this PR then bring in the additional SEO changes? Or get it all done at once?
>
> I'll defer to whatever workflow is easiest for you and @Tiggerito to resolve the impending merge conflicts.

I presume the style used by @obto is more correct than mine and his is closer to being merged. I'd say merge that in now and I can rework mine based on the new version. I presume that means I create a brand new fork and edit that (I'm new to this)?

While I wait for that merge I can continue adding features and testing. It looks like I have permission to use $WPT_BODIES which is great.

With the changes @obto made, I presume we are free to refactor the script in areas used by other chapters. Some SEO scripts were used by others last year so I deliberately did not change their structure.

foxdavidj commented 4 years ago

@Tiggerito Sounds good. So I can help you out as much as possible, can you make a bullet list of all the stats you were writing custom metrics for?

I can show you which metrics already exist in mine and help you write the rest then :)

Tiggerito commented 4 years ago

> @Tiggerito Sounds good. So I can help you out as much as possible, can you make a bullet list of all the stats you were writing custom metrics for?
>
> I can show you which metrics already exist in mine and help you write the rest then :)

My pull request comment should be up to date with the metrics and what I need to still do.

I'll continue working on my version and evolve it towards your structure so it should be easier to combine them.

In the end is it easiest for me to create a brand new fork from your merged changes and re-apply my code to that? I'm guessing there is no way to update my fork to the latest main?

I'm also considering doing a smaller pull request to make the script more robust, adding a way to log errors so that in the future we can query the almanac data to see why things broke. At the moment an error causes almanac = null!
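That robustness idea could look roughly like this (a sketch, not the PR's code): run each collector in a try/catch so a single failure is logged instead of nulling the whole almanac object.

```javascript
// Sketch: isolate each metric collector so one throwing collector no longer
// loses every other metric; failures are recorded for later querying.
function collectAlmanac(collectors) {
  const almanac = { errors: [] };
  for (const name of Object.keys(collectors)) {
    try {
      almanac[name] = collectors[name]();
    } catch (e) {
      almanac.errors.push({ metric: name, message: String(e && e.message ? e.message : e) });
    }
  }
  return almanac;
}
```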

rviscomi commented 4 years ago

@Tiggerito GitHub tip: use "Start a review" to batch your comments (and their subsequent email notifications) into one 😅

Tiggerito commented 4 years ago

> @Tiggerito GitHub tip: use "Start a review" to batch your comments (and their subsequent email notifications) into one 😅

Should I have done the start a review on the first comment then? 🤭

foxdavidj commented 4 years ago

Yeah, but no worries. This is great feedback, I'll get to making the changes.

> Can we manually fix things on a merge or should we continue with the idea of me re-applying my changes after this is merged?

What I'd recommend is: once my PR is merged into master, you can rebase your branch on top of the new master, which will essentially replay all of your changes on top of mine 👍
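For reference, that rebase workflow is roughly the following (the remote name `upstream` is the conventional choice, not mandated; run inside a clone of your fork, on your feature branch):

```shell
# Sketch of the "rebase onto the new master" workflow described above.
# $1 is the upstream repo URL, e.g. https://github.com/HTTPArchive/legacy.httparchive.org.git
sync_fork_with_upstream() {
  git remote add upstream "$1"   # register the original repo as a second remote
  git fetch upstream             # download its branches without touching yours
  git rebase upstream/master     # replay your commits on top of upstream's master
}
```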

foxdavidj commented 4 years ago

One additional change I may make is editing '10.5' to extract all the JSON-LD schema data and renaming the metric accordingly.

Tiggerito commented 4 years ago

> Yeah, but no worries. This is great feedback, I'll get to making the changes.
>
> > Can we manually fix things on a merge or should we continue with the idea of me re-applying my changes after this is merged?
>
> What I'd recommend is: once my PR is merged into master, you can rebase your branch on top of the new master, which will essentially replay all of your changes on top of mine 👍

Cool. I get to try another new thing rebasing.

Tiggerito commented 4 years ago

> One additional change I may make is editing '10.5' to extract all the JSON-LD schema data and renaming the metric accordingly.

I completely rewrote that one 🤯

Tiggerito commented 4 years ago

$WPT_BODIES worked really well. I now have data from the headers, raw html and the rendered html. I think I've implemented everything I can. Now testing.

hemanth commented 4 years ago

I feel we should be pretty good with these metrics!

Thank you @obto! 🙏

foxdavidj commented 4 years ago

All of @Tiggerito's feedback has been resolved. Thanks!!

I've tested the script on 13 large sites, all completed successfully, and I documented the results here: https://docs.google.com/spreadsheets/d/1FliIat7uXauPVCpWcl1GU-InAOWXW7e7HHJuv4c7URQ/edit?usp=sharing

The last order of business we have is deciding if we're alright with increasing the size of the JSON file by exporting many more nodes (img and a11y nodes particularly).

After analyzing the results, most sites generate around 35 KB of data, and the resulting JSON file is still 10-30% of the size of the firstHtml file. But as you can see from the results, some large sites like Amazon, which have tons of images and a11y nodes, make quite a large file (171 KB).

These larger files will make querying more expensive, but the breadth and depth of what we can query without needing more custom metrics is fantastic.

I think the trade-off is worth it, but what do you think @rviscomi?

Tiggerito commented 4 years ago

> All of @Tiggerito's feedback has been resolved. Thanks!!
>
> I've tested the script on 13 large sites, all completed successfully, and I documented the results here: https://docs.google.com/spreadsheets/d/1FliIat7uXauPVCpWcl1GU-InAOWXW7e7HHJuv4c7URQ/edit?usp=sharing
>
> The last order of business we have is deciding if we're alright with increasing the size of the JSON file by exporting many more nodes (img and a11y nodes particularly).
>
> After analyzing the results, most sites generate around 35 KB of data, and the resulting JSON file is still 10-30% of the size of the firstHtml file. But as you can see from the results, some large sites like Amazon, which have tons of images and a11y nodes, make quite a large file (171 KB).
>
> These larger files will make querying more expensive, but the breadth and depth of what we can query without needing more custom metrics is fantastic.
>
> I think the trade-off is worth it, but what do you think @rviscomi?

I was thinking about this the other day. The almanac property seems to be quite a small part of the payload. Is it possible to trim the payload down to just what is used?

rviscomi commented 4 years ago

> The last order of business we have is deciding if we're alright with increasing the size of the JSON file by exporting many more nodes (img and a11y nodes particularly).
>
> After analyzing the results, most sites generate around 35 KB of data, and the resulting JSON file is still 10-30% of the size of the firstHtml file. But as you can see from the results, some large sites like Amazon, which have tons of images and a11y nodes, make quite a large file (171 KB).
>
> These larger files will make querying more expensive, but the breadth and depth of what we can query without needing more custom metrics is fantastic.
>
> I think the trade-off is worth it, but what do you think @rviscomi?

Yeah the size increase is concerning. Beyond these few popular sites, the edge cases are where we're going to get hit the hardest. Looking at the 2019 Markup chapter results, the biggest page has over a million elements on it. I worry about how pages like that would affect the test durability and storage/query costs.

Tiggerito commented 4 years ago

> > The last order of business we have is deciding if we're alright with increasing the size of the JSON file by exporting many more nodes (img and a11y nodes particularly). After analyzing the results, most sites generate around 35 KB of data, and the resulting JSON file is still 10-30% of the size of the firstHtml file. But as you can see from the results, some large sites like Amazon, which have tons of images and a11y nodes, make quite a large file (171 KB). These larger files will make querying more expensive, but the breadth and depth of what we can query without needing more custom metrics is fantastic. I think the trade-off is worth it, but what do you think @rviscomi?
>
> Yeah the size increase is concerning. Beyond these few popular sites, the edge cases are where we're going to get hit the hardest. Looking at the 2019 Markup chapter results, the biggest page has over a million elements on it. I worry about how pages like that would affect the test durability and storage/query costs.

We could include array/data size limits to try and protect us.

e.g. I only output the first heading and crop it at 100 characters. There have been cases where the heading wraps around the whole page's content!
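The cropping idea as a sketch (the 100-character limit matches the comment above; the heading selector list is an assumption):

```javascript
// Sketch of a size-limited collector: keep only the first heading and crop it,
// so a heading that wraps the whole page body can't balloon the payload.
function firstHeadingSnippet(doc, maxLength) {
  const h = doc.querySelector('h1, h2, h3, h4, h5, h6');
  if (!h) return null;
  return h.textContent.trim().substring(0, maxLength || 100);
}
```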

rockeynebhwani commented 4 years ago

@obto - Looks like, after all, I will need a custom metric for App links. I think the following should do. Are you able to add this to your PR? @rviscomi - I'm not sure if you ended up pursuing the idea of using 'fetch' in a custom metric for other things as per our Slack discussion, but this seems to be working as far as I can see.

```
[AndroidAppLinks]
return fetch('/.well-known/assetlinks.json').then(function(r) {
  if (!r.redirected && r.status === 200) {
    return "true";
  } else {
    return "false";
  }
});

[iOSUniversalLinks]
return fetch('/.well-known/apple-app-site-association').then(function(r) {
  if (!r.redirected && r.status === 200) {
    return "true";
  } else {
    return "false";
  }
});
```

I have tested this on a few sites. A status code of 200 (without any redirects) means the site is using App deep links. Any other status suggests the site is not using App deep links.

| Site | Scenario | WebPageTest Link |
| --- | --- | --- |
| https://direct.asda.com/ | Has AndroidAppLinks but no iOSUniversalLinks | https://www.webpagetest.org/custom_metrics.php?test=200726_JB_7502ce8407c36d69d48a9013b81a57d4&run=1&cached=0 |
| https://www.myntra.com | Has both AndroidAppLinks and iOSUniversalLinks | https://www.webpagetest.org/custom_metrics.php?test=200726_KV_049c75301d9ecc4a06c374870436e706&run=1&cached=0 |
| https://www.johnlewis.com | Both AndroidAppLinks and iOSUniversalLinks not present | https://www.webpagetest.org/custom_metrics.php?test=200726_WJ_0f97e04403ce27ee312f1993ce33b087&run=1&cached=0 |
| https://www.burberry.com/ | Both not present, and this site does redirect | https://www.webpagetest.org/custom_metrics.php?test=200726_NN_53f746a80422d95c2fde58f906cde5fa&run=1&cached=0 |

Tiggerito commented 4 years ago

> @obto - Looks like, after all, I will need a custom metric for App links. I think the following should do. Are you able to add this to your PR? @rviscomi - I'm not sure if you ended up pursuing the idea of using 'fetch' in a custom metric for other things as per our Slack discussion, but this seems to be working as far as I can see.

Oooo. The SEO chapter would love to be able to test the robots.txt file. Would I be allowed to do that?

Tiggerito commented 4 years ago

> @obto - Looks like, after all, I will need a custom metric for App links. I think the following should do. Are you able to add this to your PR? @rviscomi - I'm not sure if you ended up pursuing the idea of using 'fetch' in a custom metric for other things as per our Slack discussion, but this seems to be working as far as I can see.
>
> Oooo. The SEO chapter would love to be able to test the robots.txt file. Would I be allowed to do that?

I've just submitted a new custom metric to my pull request: robots_txt.js. It uses the fetch idea to gather some basic data from the robots.txt file for a domain. Would be nice if that can be included.
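A hedged sketch of what such a robots.txt metric could gather (the actual robots_txt.js isn't shown in this thread; `parseRobots` is a hypothetical helper, and the fetch pattern mirrors the app-links metrics above):

```javascript
// Hypothetical helper: count a few basic directives in a robots.txt body.
function parseRobots(text) {
  const lines = text.split('\n').map(l => l.trim());
  return {
    user_agents: lines.filter(l => /^user-agent:/i.test(l)).length,
    disallows: lines.filter(l => /^disallow:/i.test(l)).length,
    sitemaps: lines.filter(l => /^sitemap:/i.test(l)).length
  };
}

// In a WPT custom metric, following the app-links fetch pattern:
// return fetch('/robots.txt')
//   .then(r => (r.status === 200 && !r.redirected) ? r.text() : null)
//   .then(text => text ? JSON.stringify(parseRobots(text)) : null);
```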

Tiggerito commented 4 years ago

I checked a few of the "Metrics unable to create a function for".

As my pull request will add access to the raw HTML, we can do the text searches and regexes that these queries need. It looks like just a bunch of patterns we would need to flag as true or false.

I could add them to my pull request?
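A sketch of those true/false pattern flags over the raw HTML (the flag names and regexes are illustrative, not taken from the SQL files):

```javascript
// Sketch: boolean pattern flags computed from the raw first-response HTML,
// e.g. rawHtml = $WPT_BODIES[0].response_body. Patterns are examples only.
function patternFlags(rawHtml) {
  return {
    has_conditional_comment: /<!--\[if\s/i.test(rawHtml),
    has_amp_attribute: /<html[^>]*\samp[\s>=]/i.test(rawHtml)
  };
}
```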

foxdavidj commented 4 years ago

The main change here is the addition of filtering options for parseNodes and parseNode. The code documents the possibilities and should be self-explanatory.

Besides that there are 2 other changes:

  1. track elements are now being tracked in the video metric
  2. nodes_with_aria is removed. It has too high a chance of bloating, and doesn't provide much value since a11y's queries are mostly position-focused (what element was this one wrapped inside of, etc.)
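As a purely hypothetical illustration of the filtering idea mentioned above (the PR's actual parseNodes/parseNode API may look quite different), attribute filtering for a node parser can be sketched like this:

```javascript
// Hypothetical sketch, not the PR's code: serialize a node, keeping only the
// attributes named in options.attributes (or all attributes if none given).
function parseNode(node, options) {
  const opts = options || {};
  const out = { name: node.nodeName.toLowerCase(), attributes: {} };
  for (const attr of node.attributes) {
    if (!opts.attributes || opts.attributes.includes(attr.name)) {
      out.attributes[attr.name] = attr.value;
    }
  }
  return out;
}
```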

After testing on Amazon again, the JSON size is 30.3 KB instead of 171 KB.

Ready for review @rviscomi

foxdavidj commented 4 years ago

@rviscomi Which comments? I should have resolved or left comments on all of them.

I haven't retested the script on all of the sites again. I can do that when I get up tomorrow.

rviscomi commented 4 years ago

> @rviscomi Which comments? I should have resolved or left comments on all of them.

Scrolling through https://github.com/HTTPArchive/legacy.httparchive.org/pull/172/files, there are about half a dozen unresolved comments. Please close out anything that doesn't need to be addressed.

foxdavidj commented 4 years ago

Huh, I never saw those while looking through before. I should be able to resolve all of these quickly.

foxdavidj commented 4 years ago

Updated results: https://docs.google.com/spreadsheets/d/1FliIat7uXauPVCpWcl1GU-InAOWXW7e7HHJuv4c7URQ/edit?usp=sharing

The median JSON size has been reduced by half, and the average size is 3× smaller.

foxdavidj commented 4 years ago

@rviscomi Removed the overlapping function. Think we're good to go?

rviscomi commented 4 years ago

LGTM! 🚀