HTTPArchive / cwv-tech-report

Core Web Vitals Technology Report
https://cwvtech.report

Group primary/secondary pages as of May 2022 #19

Closed rviscomi closed 2 years ago

rviscomi commented 2 years ago

Now that HTTP Archive has technology detections from secondary pages in the May 2022 dataset, we can use them to improve the coverage/accuracy of detections in the CWV Tech Report.

CrUX data is aggregated at the origin level. For an origin to be considered an adopter of a given technology, that technology must be detected on at least one of the origin's pages. Stats like the median Lighthouse scores and page weights may continue to be aggregated at page-level granularity.
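As a rough illustration of the "at least one page" rule, here is a minimal Python sketch (not the actual BigQuery query). The `detections` rows, the example origins, and the `origins_by_technology` helper are all hypothetical and only stand in for the real HTTP Archive tables.

```python
from collections import defaultdict

# Hypothetical detection rows: (origin, page_url, technology).
# Real data would come from the HTTP Archive technology detections.
detections = [
    ("https://a.example", "https://a.example/", "WordPress"),
    ("https://a.example", "https://a.example/blog", "WordPress"),
    ("https://a.example", "https://a.example/blog", "jQuery"),
    ("https://b.example", "https://b.example/shop", "jQuery"),
]


def origins_by_technology(rows):
    """Map each technology to the set of origins with at least one detection."""
    adopters = defaultdict(set)
    for origin, _page_url, technology in rows:
        # A set keeps each origin once, no matter how many of its pages match.
        adopters[technology].add(origin)
    return dict(adopters)


adoption = origins_by_technology(detections)
```

In this sketch an origin counts once per technology whether the detection came from its home page, a secondary page, or both.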

Open to alternate suggestions to avoid over-representing sites that use a technology on multiple pages. For example, would it be more meaningful if we averaged the page-level stats together before taking the origin-level medians?
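To make the over-representation concern concrete, here is a small Python sketch comparing the two aggregations. The scores and origins are made-up illustration data, and `page_level_median` and `origin_level_median` are hypothetical helper names, not anything from the report's query.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (origin, Lighthouse score) rows, one per tested page.
# a.example contributes three pages; the other origins contribute one each.
page_scores = [
    ("https://a.example", 90),
    ("https://a.example", 80),
    ("https://a.example", 70),
    ("https://b.example", 40),
    ("https://c.example", 50),
]


def page_level_median(rows):
    """Median over all pages; origins with more pages carry more weight."""
    return median(score for _origin, score in rows)


def origin_level_median(rows):
    """Average each origin's pages first, then take the median across origins."""
    by_origin = defaultdict(list)
    for origin, score in rows:
        by_origin[origin].append(score)
    return median(mean(scores) for scores in by_origin.values())
```

With this sample data the page-level median is 70, pulled up by the three a.example pages, while the origin-level median is 50, since each origin contributes a single averaged value.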

The httparchive.all dataset is still being developed, but we should eventually migrate the monthly query to pull from there to simplify all the JOIN operations.

Note: This query won't return any results until the 202205 CrUX dataset is released on June 14. For testing, you can set the CrUX dataset to 202204 with HA data from 2022_05_12 (home+secondary pages).

rviscomi commented 2 years ago

Just worried that COUNT(DISTINCT root_page_url) kills query performance compared to COUNT(0). May need to find a different (more complicated) approach.

rviscomi commented 2 years ago

Went with the averaging-per-origin approach, since pre-aggregating the data by origin was necessary to avoid the COUNT DISTINCT performance issue described above. Otherwise, the query times out after 6 hours.
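The idea can be sketched in Python (an analogy for the SQL, not the query itself): once rows are collapsed to one per technology and origin, a plain row count gives the same answer a distinct count would, and the medians are taken over per-origin averages. The sample rows and the `pre_aggregate` and `summarize` helpers are hypothetical; `root_page_url` is used here as the origin key, which is an assumption based on the comment above.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical page-level rows: (technology, root_page_url, page_weight_kb).
rows = [
    ("WordPress", "https://a.example/", 1000),
    ("WordPress", "https://a.example/", 3000),
    ("WordPress", "https://b.example/", 500),
    ("jQuery", "https://a.example/", 3000),
]


def pre_aggregate(page_rows):
    """Collapse to one row per (technology, origin), averaging page weight."""
    grouped = defaultdict(list)
    for technology, root_page_url, page_weight in page_rows:
        grouped[(technology, root_page_url)].append(page_weight)
    return [
        (technology, root_page_url, mean(weights))
        for (technology, root_page_url), weights in grouped.items()
    ]


def summarize(pre_aggregated):
    """Per technology: count rows (one per origin) and take the median weight."""
    per_tech = defaultdict(list)
    for technology, _root_page_url, avg_weight in pre_aggregated:
        per_tech[technology].append(avg_weight)
    return {
        technology: {"origins": len(weights), "median_weight": median(weights)}
        for technology, weights in per_tech.items()
    }


summary = summarize(pre_aggregate(rows))
```

Because each origin appears exactly once per technology after `pre_aggregate`, the simple `len` in `summarize` plays the role of COUNT(0) while still matching what a distinct count over the raw rows would report.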