HTTPArchive / custom-metrics

Custom metrics to use with WebPageTest agents
Apache License 2.0

Extending the HTTPArchive Data #112

Open nrllh opened 8 months ago

nrllh commented 8 months ago

I'm curious about the possibility of enhancing our current dataset by incorporating additional data points.

These types of data are often crucial for web measurement studies. Do you think it's feasible to enrich our data with these?

rviscomi commented 7 months ago

@pmeenan @tunetheweb any thoughts on the DNS question?

As for JS calls, we've tried to capture that info for specific functions of interest with the observers.js script, but it was breaking some pages' functionality. I never got the time to thoroughly debug it. My hunch is that we'll need to rewrite it to use Proxy instead of Object.defineProperty. Not sure if timelineStack does what we need. Do you have an example of its output?
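
A minimal sketch of the Proxy idea mentioned above, assuming the goal is only to record that a function was called while forwarding the call unchanged. `observeCalls` and `fakeApi` are hypothetical names for illustration; this is not the observers.js implementation, and the assignment would fail on properties that are not writable.

```javascript
// Hypothetical helper: wraps target[name] in a Proxy so each call is logged.
// A sketch only, not the actual observers.js code.
function observeCalls(target, name, log) {
  const original = target[name];
  if (typeof original !== 'function') {
    return false; // nothing to wrap
  }
  target[name] = new Proxy(original, {
    // The apply trap runs on every call to the wrapped function.
    apply(fn, thisArg, args) {
      log.push({ name, argCount: args.length });
      // Forward the call with the original `this` and arguments so the
      // page's behavior is not changed.
      return Reflect.apply(fn, thisArg, args);
    },
  });
  return true;
}

// Usage with a stand-in object (a real page would pass a browser API object).
const fakeApi = {
  prefix: 'hello ',
  greet(who) {
    return this.prefix + who;
  },
};
const calls = [];
observeCalls(fakeApi, 'greet', calls);
const greeting = fakeApi.greet('world');
```

Because the trap forwards `thisArg` through `Reflect.apply`, methods that rely on `this` keep working, which is one of the places where property redefinition tends to break pages.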

pmeenan commented 7 months ago

What DNS resolution information are you looking for? We have the basic timings from Chrome itself, but we also collect DNS information for the origin of the main page (CNAMEs, authoritative DNS server list and PTR records). They are stored in the page-level JSON as base_page_ip_ptr, base_page_cname, and base_page_dns_server.

We have full access to the netlog, so we can capture anything Chrome does that is DNS-related, but it's important to remember that it is being measured in a lab environment, so we only have visibility into the DNS path that we are using.

nrllh commented 7 months ago

@rviscomi Unfortunately, I don't have a specific example; I found that API in the documentation.

Ideally, we should have this information at the request level. Please review this subset, where we have DNS-related information for each request. We can also limit the scope to the origin level, since the results for foo.com/bar1 and foo.com/bar2 are the same, but not for bar.foo.com. This helps to understand various techniques within the ecosystem. For example, some players exploit CNAME records to circumvent ad blockers and to sync cookies.
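
To illustrate the origin-level scoping, here is a small sketch that groups request URLs by origin, so DNS data would be attached once per origin rather than once per request. `groupByOrigin` is a hypothetical helper and the URLs are placeholders.

```javascript
// Groups request URLs by origin (scheme + host + port).
function groupByOrigin(urls) {
  const groups = new Map();
  for (const url of urls) {
    const origin = new URL(url).origin;
    if (!groups.has(origin)) {
      groups.set(origin, []);
    }
    groups.get(origin).push(url);
  }
  return groups;
}

// foo.com/bar1 and foo.com/bar2 share an origin; bar.foo.com does not.
const grouped = groupByOrigin([
  'https://foo.com/bar1',
  'https://foo.com/bar2',
  'https://bar.foo.com/',
]);
```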

max-ostapenko commented 6 months ago

@pmeenan in the netlog I see a DNS overview in HOST_RESOLVER tasks. For example, the script request to consent.cookiebot.com contains:

{
  "aliases": [
    "consent.cookiebot.com",
    "consent.cookiebot.com-v2.edgekey.net",
    "e110990.dsca.akamaiedge.net"
  ],
  "canonical_names": [
    "e110990.dsca.akamaiedge.net"
  ],
  "endpoint_metadatas": [],
  "expiration": "13358745242163846",
  "host_ports": [],
  "hostname_results": [],
  "ip_endpoints": [
    {
      "endpoint_address": "88.221.221.75",
      "endpoint_port": 0
    },
    {
      "endpoint_address": "88.221.221.147",
      "endpoint_port": 0
    }
  ],
  "text_records": []
}

canonical_names is the best fit for privacy analysis. It would be nice to have it in requests (or at least aggregated under page data). If the code is public, please point me to where this extension can be added.

max-ostapenko commented 6 months ago

@pmeenan I imagine the process to look like this:

  1. net log
    1.1. save to json/variable
    1.2. from the log parse $.events where type=5 (HOST_RESOLVER_MANAGER_CACHE_HIT) or 18 (HOST_RESOLVER_DNS_TASK)
    1.3. out of aliases and canonical_names construct a map:
      cname_by_alias.set('consent.cookiebot.com', 'e110990.dsca.akamaiedge.net')
  2. use the map to extend WPT_REQUESTS:
    request.cname = cname_by_alias.get(request.origin)
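
Steps 1.3 and 2 could be sketched roughly as follows. This assumes the resolver results have already been pulled out of the netlog as objects shaped like the sample above, and that each request carries a `url` field; `buildCnameByAlias`, `addCnames`, the `url` field name and the request URLs are assumptions for illustration.

```javascript
// Step 1.3: map every alias to the canonical name of its resolver result.
// `results` is assumed to be a list of objects with `aliases` and
// `canonical_names` arrays, as in the netlog sample.
function buildCnameByAlias(results) {
  const cnameByAlias = new Map();
  for (const result of results) {
    const canonical = (result.canonical_names || [])[0];
    if (!canonical) continue;
    for (const alias of result.aliases || []) {
      // Skip the canonical name itself; it is not an alias of anything.
      if (alias !== canonical) cnameByAlias.set(alias, canonical);
    }
  }
  return cnameByAlias;
}

// Step 2: annotate each request with the CNAME of its hostname, if any.
// The `url` field is an assumed name, not a confirmed WPT_REQUESTS field.
function addCnames(requests, cnameByAlias) {
  return requests.map((request) => {
    const host = new URL(request.url).hostname;
    return { ...request, cname: cnameByAlias.get(host) || null };
  });
}

const sample = {
  aliases: [
    'consent.cookiebot.com',
    'consent.cookiebot.com-v2.edgekey.net',
    'e110990.dsca.akamaiedge.net',
  ],
  canonical_names: ['e110990.dsca.akamaiedge.net'],
};
const cnameByAlias = buildCnameByAlias([sample]);
const annotated = addCnames(
  [
    { url: 'https://consent.cookiebot.com/script.js' }, // illustrative path
    { url: 'https://example.com/' },
  ],
  cnameByAlias
);
```

The lookup uses the hostname rather than the full origin because the map keys in step 1.3 are bare hostnames.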

You think it's feasible?

Or I could help with the implementation, just don't know where to start.

pmeenan commented 6 months ago

Sorry about that - got distracted by some other stuff for a bit. I just added all of the info from HOST_RESOLVER_DNS_TASK to _dns_info in the request records that include the dns timings. There shouldn't be HOST_RESOLVER_MANAGER_CACHE_HIT in the case of the agents since they start with a clean state/profile (I tested www.google.com just to be sure there weren't any origins the browser may have cached).

I can add the cache hit case as well if it is needed, but it makes sense to only include the DNS info when a DNS request went out on the wire (at least from how WebPageTest does things).

If you're curious, the change was really small: https://github.com/HTTPArchive/wptagent/commit/8332636743ee12c1315db060c98e5289d3d2aca2

It should be live on the HTTP Archive instance now.

Sample:

"_dns_info": {
    "secure": false,
    "transactions_needed": [
        {
            "dns_query_type": "A"
        },
        {
            "dns_query_type": "HTTPS"
        }
    ],
    "results": {
        "aliases": [
            "www.cloudflare.com"
        ],
        "canonical_names": [
            "www.cloudflare.com"
        ],
        "endpoint_metadatas": [
            {
                "endpoint_metadata_value": {
                    "ech_config_list": "",
                    "supported_protocol_alpns": [
                        "h3",
                        "h2",
                        "http/1.1"
                    ],
                    "target_name": "www.cloudflare.com"
                },
                "endpoint_metadata_weight": 1
            }
        ],
        "expiration": "13362070409287164",
        "host_ports": [],
        "hostname_results": [],
        "ip_endpoints": [
            {
                "endpoint_address": "104.16.124.96",
                "endpoint_port": 0
            },
            {
                "endpoint_address": "104.16.123.96",
                "endpoint_port": 0
            }
        ],
        "text_records": []
    }
},
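
As a rough illustration of reading this record, the sketch below collects the ALPN protocol IDs listed under `endpoint_metadatas`. `advertisedAlpns` is a hypothetical helper, and the input is a trimmed copy of the sample above.

```javascript
// Collects the distinct ALPN protocol IDs found in a `_dns_info` record.
function advertisedAlpns(dnsInfo) {
  const metadatas =
    (dnsInfo.results && dnsInfo.results.endpoint_metadatas) || [];
  const alpns = new Set();
  for (const metadata of metadatas) {
    const value = metadata.endpoint_metadata_value || {};
    for (const alpn of value.supported_protocol_alpns || []) {
      alpns.add(alpn);
    }
  }
  return [...alpns];
}

// Trimmed copy of the www.cloudflare.com sample.
const dnsInfo = {
  results: {
    endpoint_metadatas: [
      {
        endpoint_metadata_value: {
          supported_protocol_alpns: ['h3', 'h2', 'http/1.1'],
          target_name: 'www.cloudflare.com',
        },
        endpoint_metadata_weight: 1,
      },
    ],
  },
};
const alpns = advertisedAlpns(dnsInfo);
const advertisesH3 = alpns.includes('h3');
```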

max-ostapenko commented 6 months ago

I wasn't sure about DNS cache, thanks for clarifying and extending the object.

@pmeenan Would you be able to add this info to $WPT_REQUESTS? Otherwise I'm not sure how to access it.

pmeenan commented 6 months ago

Right now it would be in the requests data directly in BigQuery. If you need to process it to create a custom metric on the client, I can put it into $WPT_REQUESTS as well. However, $WPT_REQUESTS doesn't have good detail on which request triggered a DNS lookup, so I'd probably have to add it to every request that matches the host (so there would be duplicate DNS data for every request to the same host).

The other option would be to expose it in $WPT_DNS and provide a host-name-indexed dictionary of the info records.

pmeenan commented 6 months ago

I decided to go with $WPT_DNS instead of adding it to every request. It is a dictionary of the dns_info entries indexed by origin (since the DNS lookups include protocol not just hostname).

Should be live on the HA instance now.

Trying with a custom metric for cloudflare.com:

[dns_info]
return $WPT_DNS;

Returns:

{
  "https://www.cloudflare.com": {
    "secure": false,
    "transactions_needed": [
      {
        "dns_query_type": "A"
      },
      {
        "dns_query_type": "HTTPS"
      }
    ],
    "results": {
      "aliases": [
        "www.cloudflare.com"
      ],
      "canonical_names": [
        "www.cloudflare.com"
      ],
      "endpoint_metadatas": [
        {
          "endpoint_metadata_value": {
            "ech_config_list": "",
            "supported_protocol_alpns": [
              "h3",
              "h2",
              "http/1.1"
            ],
            "target_name": "www.cloudflare.com"
          },
          "endpoint_metadata_weight": 1
        }
      ],
      "expiration": "13362084114309777",
      "host_ports": [],
      "hostname_results": [],
      "ip_endpoints": [
        {
          "endpoint_address": "104.16.124.96",
          "endpoint_port": 0
        },
        {
          "endpoint_address": "104.16.123.96",
          "endpoint_port": 0
        }
      ],
      "text_records": []
    }
  },
  "https://cf-assets.www.cloudflare.com": {
    "secure": false,
    "transactions_needed": [
      {
        "dns_query_type": "A"
      },
      {
        "dns_query_type": "HTTPS"
      }
    ],
    "results": {
      "aliases": [
        "cf-assets.www.cloudflare.com"
      ],
      "canonical_names": [
        "cf-assets.www.cloudflare.com"
      ],
      "endpoint_metadatas": [
        {
          "endpoint_metadata_value": {
            "ech_config_list": "",
            "supported_protocol_alpns": [
              "h3",
              "h2",
              "http/1.1"
            ],
            "target_name": "cf-assets.www.cloudflare.com"
          },
          "endpoint_metadata_weight": 1
        }
      ],
      "expiration": "13362084114309777",
      "host_ports": [],
      "hostname_results": [],
      "ip_endpoints": [
        {
          "endpoint_address": "104.16.123.96",
          "endpoint_port": 0
        },
        {
          "endpoint_address": "104.16.124.96",
          "endpoint_port": 0
        }
      ],
      "text_records": []
    }
  },
  "https://static.cloudflareinsights.com": {
    "secure": false,
    "transactions_needed": [
      {
        "dns_query_type": "A"
      },
      {
        "dns_query_type": "HTTPS"
      }
    ],
    "results": {
      "aliases": [
        "static.cloudflareinsights.com"
      ],
      "canonical_names": [
        "static.cloudflareinsights.com"
      ],
      "endpoint_metadatas": [
        {
          "endpoint_metadata_value": {
            "ech_config_list": "",
            "supported_protocol_alpns": [
              "h2",
              "http/1.1"
            ],
            "target_name": "static.cloudflareinsights.com"
          },
          "endpoint_metadata_weight": 1
        }
      ],
      "expiration": "13362084114309777",
      "host_ports": [],
      "hostname_results": [],
      "ip_endpoints": [
        {
          "endpoint_address": "104.16.79.73",
          "endpoint_port": 0
        },
        {
          "endpoint_address": "104.16.80.73",
          "endpoint_port": 0
        }
      ],
      "text_records": []
    }
  },
  "https://performance.radar.cloudflare.com": {
    "secure": false,
    "transactions_needed": [
      {
        "dns_query_type": "A"
      },
      {
        "dns_query_type": "HTTPS"
      }
    ],
    "results": {
      "aliases": [
        "performance.radar.cloudflare.com"
      ],
      "canonical_names": [
        "performance.radar.cloudflare.com"
      ],
      "endpoint_metadatas": [
        {
          "endpoint_metadata_value": {
            "ech_config_list": "",
            "supported_protocol_alpns": [
              "h3",
              "h2",
              "http/1.1"
            ],
            "target_name": "performance.radar.cloudflare.com"
          },
          "endpoint_metadata_weight": 1
        }
      ],
      "expiration": "13362084114309777",
      "host_ports": [],
      "hostname_results": [],
      "ip_endpoints": [
        {
          "endpoint_address": "104.18.30.78",
          "endpoint_port": 0
        },
        {
          "endpoint_address": "104.18.31.78",
          "endpoint_port": 0
        }
      ],
      "text_records": []
    }
  },
  "https://fastly.cedexis-test.com": {
    "secure": false,
    "transactions_needed": [
      {
        "dns_query_type": "A"
      },
      {
        "dns_query_type": "HTTPS"
      }
    ],
    "results": {
      "aliases": [
        "fastly.cedexis-test.com",
        "prod.cedexis-ssl.map.fastly.net"
      ],
      "canonical_names": [
        "prod.cedexis-ssl.map.fastly.net"
      ],
      "endpoint_metadatas": [],
      "expiration": "13362084114309777",
      "host_ports": [],
      "hostname_results": [],
      "ip_endpoints": [
        {
          "endpoint_address": "151.101.2.6",
          "endpoint_port": 0
        },
        {
          "endpoint_address": "151.101.66.6",
          "endpoint_port": 0
        },
        {
          "endpoint_address": "151.101.130.6",
          "endpoint_port": 0
        },
        {
          "endpoint_address": "151.101.194.6",
          "endpoint_port": 0
        }
      ],
      "text_records": []
    }
  },
  "https://www.google.com": {
    "secure": false,
    "transactions_needed": [
      {
        "dns_query_type": "A"
      },
      {
        "dns_query_type": "HTTPS"
      }
    ],
    "results": {
      "aliases": [
        "www.google.com"
      ],
      "canonical_names": [
        "www.google.com"
      ],
      "endpoint_metadatas": [
        {
          "endpoint_metadata_value": {
            "ech_config_list": "",
            "supported_protocol_alpns": [
              "h2",
              "h3",
              "http/1.1"
            ],
            "target_name": "www.google.com"
          },
          "endpoint_metadata_weight": 1
        }
      ],
      "expiration": "13362084114309777",
      "host_ports": [],
      "hostname_results": [],
      "ip_endpoints": [
        {
          "endpoint_address": "172.253.62.147",
          "endpoint_port": 0
        },
        {
          "endpoint_address": "172.253.62.99",
          "endpoint_port": 0
        },
        {
          "endpoint_address": "172.253.62.106",
          "endpoint_port": 0
        },
        {
          "endpoint_address": "172.253.62.104",
          "endpoint_port": 0
        },
        {
          "endpoint_address": "172.253.62.105",
          "endpoint_port": 0
        },
        {
          "endpoint_address": "172.253.62.103",
          "endpoint_port": 0
        }
      ],
      "text_records": []
    }
  },
  "https://ptcfc.com": {
    "secure": false,
    "transactions_needed": [
      {
        "dns_query_type": "A"
      },
      {
        "dns_query_type": "HTTPS"
      }
    ],
    "results": {
      "aliases": [
        "ptcfc.com"
      ],
      "canonical_names": [
        "ptcfc.com"
      ],
      "endpoint_metadatas": [],
      "expiration": "13362084114309777",
      "host_ports": [],
      "hostname_results": [],
      "ip_endpoints": [
        {
          "endpoint_address": "104.16.80.67",
          "endpoint_port": 0
        },
        {
          "endpoint_address": "104.16.81.67",
          "endpoint_port": 0
        }
      ],
      "text_records": []
    }
  },
  "https://cedexis-test.akamaized.net": {
    "secure": false,
    "transactions_needed": [
      {
        "dns_query_type": "A"
      },
      {
        "dns_query_type": "HTTPS"
      }
    ],
    "results": {
      "aliases": [
        "a1851.dscw121.akamai.net",
        "cedexis-test.akamaized.net"
      ],
      "canonical_names": [
        "a1851.dscw121.akamai.net"
      ],
      "endpoint_metadatas": [],
      "expiration": "13362084114309776",
      "host_ports": [],
      "hostname_results": [],
      "ip_endpoints": [
        {
          "endpoint_address": "23.200.3.235",
          "endpoint_port": 0
        },
        {
          "endpoint_address": "23.200.3.238",
          "endpoint_port": 0
        }
      ],
      "text_records": []
    }
  },
  "https://benchmark.1e100cdn.net": {
    "secure": false,
    "transactions_needed": [
      {
        "dns_query_type": "A"
      },
      {
        "dns_query_type": "HTTPS"
      }
    ],
    "results": {
      "aliases": [
        "benchmark.1e100cdn.net"
      ],
      "canonical_names": [
        "benchmark.1e100cdn.net"
      ],
      "endpoint_metadatas": [],
      "expiration": "13362084114309777",
      "host_ports": [],
      "hostname_results": [],
      "ip_endpoints": [
        {
          "endpoint_address": "35.190.26.57",
          "endpoint_port": 0
        }
      ],
      "text_records": []
    }
  },
  "https://p29.cedexis-test.com": {
    "secure": false,
    "transactions_needed": [
      {
        "dns_query_type": "A"
      },
      {
        "dns_query_type": "HTTPS"
      }
    ],
    "results": {
      "aliases": [
        "d1inq1x5xtur5k.cloudfront.net",
        "p29.cedexis-test.com"
      ],
      "canonical_names": [
        "d1inq1x5xtur5k.cloudfront.net"
      ],
      "endpoint_metadatas": [],
      "expiration": "13362084114309777",
      "host_ports": [],
      "hostname_results": [],
      "ip_endpoints": [
        {
          "endpoint_address": "18.165.98.99",
          "endpoint_port": 0
        },
        {
          "endpoint_address": "18.165.98.12",
          "endpoint_port": 0
        },
        {
          "endpoint_address": "18.165.98.15",
          "endpoint_port": 0
        },
        {
          "endpoint_address": "18.165.98.126",
          "endpoint_port": 0
        }
      ],
      "text_records": []
    }
  },
  "https://essl-cdxs.edgekey.net": {
    "secure": false,
    "transactions_needed": [
      {
        "dns_query_type": "A"
      },
      {
        "dns_query_type": "HTTPS"
      }
    ],
    "results": {
      "aliases": [
        "e31668.a.akamaiedge.net",
        "essl-cdxs.edgekey.net"
      ],
      "canonical_names": [
        "e31668.a.akamaiedge.net"
      ],
      "endpoint_metadatas": [],
      "expiration": "13362084114309777",
      "host_ports": [],
      "hostname_results": [],
      "ip_endpoints": [
        {
          "endpoint_address": "104.70.120.186",
          "endpoint_port": 0
        },
        {
          "endpoint_address": "104.70.120.211",
          "endpoint_port": 0
        }
      ],
      "text_records": []
    }
  },
  "https://testingcf.jsdelivr.net": {
    "secure": false,
    "transactions_needed": [
      {
        "dns_query_type": "A"
      },
      {
        "dns_query_type": "HTTPS"
      }
    ],
    "results": {
      "aliases": [
        "testingcf.jsdelivr.net",
        "testingcf.jsdelivr.net.cdn.cloudflare.net"
      ],
      "canonical_names": [
        "testingcf.jsdelivr.net.cdn.cloudflare.net"
      ],
      "endpoint_metadatas": [],
      "expiration": "13362084114309776",
      "host_ports": [],
      "hostname_results": [],
      "ip_endpoints": [
        {
          "endpoint_address": "104.18.186.31",
          "endpoint_port": 0
        },
        {
          "endpoint_address": "104.18.187.31",
          "endpoint_port": 0
        }
      ],
      "text_records": []
    }
  },
  "https://fastly.jsdelivr.net": {
    "secure": false,
    "transactions_needed": [
      {
        "dns_query_type": "A"
      },
      {
        "dns_query_type": "HTTPS"
      }
    ],
    "results": {
      "aliases": [
        "fastly.jsdelivr.net",
        "jsdelivr.map.fastly.net"
      ],
      "canonical_names": [
        "jsdelivr.map.fastly.net"
      ],
      "endpoint_metadatas": [],
      "expiration": "13362084114309777",
      "host_ports": [],
      "hostname_results": [],
      "ip_endpoints": [
        {
          "endpoint_address": "151.101.1.229",
          "endpoint_port": 0
        },
        {
          "endpoint_address": "151.101.65.229",
          "endpoint_port": 0
        },
        {
          "endpoint_address": "151.101.129.229",
          "endpoint_port": 0
        },
        {
          "endpoint_address": "151.101.193.229",
          "endpoint_port": 0
        }
      ],
      "text_records": []
    }
  },
  "https://p16999.cedexis-test.com": {
    "secure": false,
    "transactions_needed": [
      {
        "dns_query_type": "A"
      },
      {
        "dns_query_type": "HTTPS"
      }
    ],
    "results": {
      "aliases": [
        "cedexis-ssl.wpc.apr-b30d.edgecastdns.net",
        "cs482.wpc.edgecastcdn.net",
        "p16999.cedexis-test.com"
      ],
      "canonical_names": [
        "cs482.wpc.edgecastcdn.net"
      ],
      "endpoint_metadatas": [],
      "expiration": "13362084114309777",
      "host_ports": [],
      "hostname_results": [],
      "ip_endpoints": [
        {
          "endpoint_address": "192.229.210.104",
          "endpoint_port": 0
        }
      ],
      "text_records": []
    }
  },
  "https://stackpath-map3.cedexis-test.com": {
    "secure": false,
    "transactions_needed": [
      {
        "dns_query_type": "A"
      },
      {
        "dns_query_type": "HTTPS"
      }
    ],
    "results": {
      "aliases": [
        "cds.x7t9n8c4.hwcdn.net",
        "stackpath-map3.cedexis-test.com"
      ],
      "canonical_names": [
        "cds.x7t9n8c4.hwcdn.net"
      ],
      "endpoint_metadatas": [],
      "expiration": "13362084114309776",
      "host_ports": [],
      "hostname_results": [],
      "ip_endpoints": [
        {
          "endpoint_address": "69.16.175.42",
          "endpoint_port": 0
        },
        {
          "endpoint_address": "69.16.175.10",
          "endpoint_port": 0
        }
      ],
      "text_records": []
    }
  },
  "https://p17003.cedexis-test.com": {
    "secure": false,
    "transactions_needed": [
      {
        "dns_query_type": "A"
      },
      {
        "dns_query_type": "HTTPS"
      }
    ],
    "results": {
      "aliases": [
        "cedexis-1.s.llnwi.net",
        "cedexis-1.vo.llnwd.net",
        "p17003.cedexis-test.com"
      ],
      "canonical_names": [
        "cedexis-1.s.llnwi.net"
      ],
      "endpoint_metadatas": [],
      "expiration": "13362084114309777",
      "host_ports": [],
      "hostname_results": [],
      "ip_endpoints": [
        {
          "endpoint_address": "69.28.134.67",
          "endpoint_port": 0
        },
        {
          "endpoint_address": "69.28.134.65",
          "endpoint_port": 0
        }
      ],
      "text_records": []
    }
  },
  "https://serverless-benchmarks-js.compute-pipe.com": {
    "secure": false,
    "transactions_needed": [
      {
        "dns_query_type": "A"
      },
      {
        "dns_query_type": "HTTPS"
      }
    ],
    "results": {
      "aliases": [
        "serverless-benchmarks-js.compute-pipe.com"
      ],
      "canonical_names": [
        "serverless-benchmarks-js.compute-pipe.com"
      ],
      "endpoint_metadatas": [],
      "expiration": "13362084114309777",
      "host_ports": [],
      "hostname_results": [],
      "ip_endpoints": [
        {
          "endpoint_address": "104.18.1.248",
          "endpoint_port": 0
        },
        {
          "endpoint_address": "104.18.0.248",
          "endpoint_port": 0
        }
      ],
      "text_records": []
    }
  },
  "https://uniquely-peaceful-hagfish.edgecompute.app": {
    "secure": false,
    "transactions_needed": [
      {
        "dns_query_type": "A"
      },
      {
        "dns_query_type": "HTTPS"
      }
    ],
    "results": {
      "aliases": [
        "ecp.map.fastly.net",
        "uniquely-peaceful-hagfish.edgecompute.app"
      ],
      "canonical_names": [
        "ecp.map.fastly.net"
      ],
      "endpoint_metadatas": [],
      "expiration": "13362084114309776",
      "host_ports": [],
      "hostname_results": [],
      "ip_endpoints": [
        {
          "endpoint_address": "151.101.1.51",
          "endpoint_port": 0
        },
        {
          "endpoint_address": "151.101.65.51",
          "endpoint_port": 0
        },
        {
          "endpoint_address": "151.101.129.51",
          "endpoint_port": 0
        },
        {
          "endpoint_address": "151.101.193.51",
          "endpoint_port": 0
        }
      ],
      "text_records": []
    }
  },
  "https://serverless-benchmarks-rust.compute-pipe.com": {
    "secure": false,
    "transactions_needed": [
      {
        "dns_query_type": "A"
      },
      {
        "dns_query_type": "HTTPS"
      }
    ],
    "results": {
      "aliases": [
        "serverless-benchmarks-rust.compute-pipe.com"
      ],
      "canonical_names": [
        "serverless-benchmarks-rust.compute-pipe.com"
      ],
      "endpoint_metadatas": [],
      "expiration": "13362084114309777",
      "host_ports": [],
      "hostname_results": [],
      "ip_endpoints": [
        {
          "endpoint_address": "104.18.0.248",
          "endpoint_port": 0
        },
        {
          "endpoint_address": "104.18.1.248",
          "endpoint_port": 0
        }
      ],
      "text_records": []
    }
  },
  "https://exactly-huge-arachnid.edgecompute.app": {
    "secure": false,
    "transactions_needed": [
      {
        "dns_query_type": "A"
      },
      {
        "dns_query_type": "HTTPS"
      }
    ],
    "results": {
      "aliases": [
        "ecp.map.fastly.net",
        "exactly-huge-arachnid.edgecompute.app"
      ],
      "canonical_names": [
        "ecp.map.fastly.net"
      ],
      "endpoint_metadatas": [],
      "expiration": "13362084114309777",
      "host_ports": [],
      "hostname_results": [],
      "ip_endpoints": [
        {
          "endpoint_address": "151.101.1.51",
          "endpoint_port": 0
        },
        {
          "endpoint_address": "151.101.65.51",
          "endpoint_port": 0
        },
        {
          "endpoint_address": "151.101.129.51",
          "endpoint_port": 0
        },
        {
          "endpoint_address": "151.101.193.51",
          "endpoint_port": 0
        }
      ],
      "text_records": []
    }
  }
}
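
A dictionary of this shape can be reduced to just the origins that resolve through a CNAME. The sketch below is one possible way to do that; `cnamesByOrigin` is a hypothetical helper and the input is a trimmed copy of two entries from the output above.

```javascript
// Reduces a $WPT_DNS-shaped dictionary to origin -> canonical name, keeping
// only origins whose canonical name differs from the requested hostname.
function cnamesByOrigin(wptDns) {
  const out = {};
  for (const [origin, info] of Object.entries(wptDns)) {
    const hostname = new URL(origin).hostname;
    const names = (info.results && info.results.canonical_names) || [];
    const cname = names.find((name) => name !== hostname);
    if (cname) out[origin] = cname;
  }
  return out;
}

// Trimmed copy of two entries from the output above.
const wptDns = {
  'https://www.cloudflare.com': {
    results: { canonical_names: ['www.cloudflare.com'] },
  },
  'https://fastly.cedexis-test.com': {
    results: { canonical_names: ['prod.cedexis-ssl.map.fastly.net'] },
  },
};
const cnames = cnamesByOrigin(wptDns);
```

In a custom metric, the function body could be inlined and called with `$WPT_DNS` in place of `wptDns`, assuming the agent substitutes the variable as it does in the `[dns_info]` example.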

max-ostapenko commented 6 months ago

@pmeenan In BigQuery, will the DNS info be in the payload column (50 TB as of May 2024) or the summary column (10 TB)? If the latter, I would prefer to query it directly in BQ to avoid storing additional data in the custom metrics column (8.5 TB).

pmeenan commented 6 months ago

payload unfortunately.