HTTPArchive / httparchive.org

The HTTP Archive website hosted on App Engine
https://httparchive.org
Apache License 2.0

Syncing with CrUX origins broken since July #196

Closed rviscomi closed 4 years ago

rviscomi commented 4 years ago

I discovered that HTTP Archive's January 2020 dataset is actually based on the origins from the May 2019 CrUX dataset.

```sql
#standardSQL
SELECT
  month,
  COUNT(DISTINCT url) AS matching,
  (SELECT COUNT(DISTINCT url) FROM `httparchive.summary_pages.2020_01_01_*`) AS total,
  COUNT(DISTINCT url) / (SELECT COUNT(DISTINCT url) FROM `httparchive.summary_pages.2020_01_01_*`) AS pct
FROM
  `httparchive.summary_pages.2020_01_01_*`
JOIN
  (SELECT _TABLE_SUFFIX AS month, origin FROM `chrome-ux-report.all.*` GROUP BY 1,2)
ON
  url = CONCAT(origin, '/')
GROUP BY
  month
ORDER BY
  month DESC
```
| month | matching | total | pct |
| --- | --- | --- | --- |
| 201912 | 3,543,367 | 5,638,060 | 62.85% |
| 201911 | 3,784,646 | 5,638,060 | 67.13% |
| 201910 | 3,861,812 | 5,638,060 | 68.50% |
| 201909 | 4,063,960 | 5,638,060 | 72.08% |
| 201908 | 4,221,138 | 5,638,060 | 74.87% |
| 201907 | 4,221,279 | 5,638,060 | 74.87% |
| 201906 | 4,429,930 | 5,638,060 | 78.57% |
| 201905 | 5,638,060 | 5,638,060 | 100.00% |
| 201904 | 4,505,869 | 5,638,060 | 79.92% |
| 201903 | 4,252,904 | 5,638,060 | 75.43% |
| 201902 | 3,946,804 | 5,638,060 | 70.00% |
| 201901 | 3,746,554 | 5,638,060 | 66.45% |
| 201812 | 3,290,654 | 5,638,060 | 58.37% |
| 201811 | 3,243,256 | 5,638,060 | 57.52% |
| 201810 | 2,997,375 | 5,638,060 | 53.16% |
| 201809 | 2,938,831 | 5,638,060 | 52.12% |
| 201808 | 2,820,693 | 5,638,060 | 50.03% |
| 201807 | 2,639,681 | 5,638,060 | 46.82% |
| 201806 | 2,564,259 | 5,638,060 | 45.48% |
| 201805 | 2,473,843 | 5,638,060 | 43.88% |
| 201804 | 2,290,644 | 5,638,060 | 40.63% |
| 201803 | 1,994,691 | 5,638,060 | 35.38% |
| 201802 | 1,719,115 | 5,638,060 | 30.49% |
| 201801 | 1,600,701 | 5,638,060 | 28.39% |
| 201712 | 1,063,023 | 5,638,060 | 18.85% |
| 201711 | 723,059 | 5,638,060 | 12.82% |
| 201710 | 6,736 | 5,638,060 | 0.12% |

In other words, the only CrUX dataset the 2020_01_01 dataset has 100% parity with is 201905. If the sync were working, we would expect YYYYMM-2 (201911) to be the most recent CrUX dataset with 100% parity.

The syncing feature was built in #133.

rviscomi commented 4 years ago

From #133, the order of operations is:

  1. Schedule two queries to run every month on the 15th and export the latest desktop and mobile CrUX URLs to the `httparchive.urls.latest_crux_*` tables (e.g. `latest_crux_desktop`, `latest_crux_mobile`)
  2. Trigger a Cloud Pub/Sub event when the second query completes
  3. Handle the Pub/Sub event with a Cloud Function that exports the BigQuery tables to Cloud Storage
  4. Before each crawl, download the latest CSV files and import them into the appropriate urls MySQL tables
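
The export in step 3 can be sketched roughly as follows. This is a hypothetical reconstruction, not the function's actual source (which lives in #133): the dataset, table, and bucket names are assumptions.

```python
# Rough sketch of the step-3 Cloud Function (hypothetical names throughout;
# the real dataset/table/bucket identifiers may differ).
DATASET = "urls"
TABLES = ("latest_crux_desktop", "latest_crux_mobile")  # assumed table names
BUCKET = "httparchive"  # assumed bucket name

def destination_uri(bucket: str, table: str) -> str:
    """GCS path that a given BigQuery table is exported to."""
    return f"gs://{bucket}/{table}.csv"

def crux_to_gcs(event, context):
    """Pub/Sub-triggered entry point: export each latest-CrUX table as CSV."""
    # Lazy import so the helper above is usable without GCP credentials.
    from google.cloud import bigquery
    client = bigquery.Client()
    for table in TABLES:
        job = client.extract_table(
            f"{client.project}.{DATASET}.{table}",
            destination_uri(BUCKET, table),
        )
        job.result()  # block until each export job finishes
```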

Steps 1 and 2 appear to be working normally. The `latest_crux_*` tables are correctly populated with URLs corresponding to the most recent CrUX dataset, 201912, and the scheduled query is properly configured to trigger a Pub/Sub topic.


The Cloud Function may not be working properly, because the CSV files on GCS were last modified on June 30, 2019.
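
A periodic staleness check on the exported files would have surfaced this sooner. A minimal sketch, assuming hypothetical bucket and object names:

```python
# Flag exported CrUX CSVs on GCS that haven't been refreshed recently
# (hypothetical bucket/prefix; adjust to the real export paths).
from datetime import datetime, timedelta, timezone

def is_stale(last_modified: datetime, now: datetime, max_age_days: int = 35) -> bool:
    """A monthly export older than ~a month suggests the sync is broken."""
    return now - last_modified > timedelta(days=max_age_days)

def stale_exports(bucket_name: str = "httparchive", prefix: str = "crux") -> list:
    """Return names of stale CSV exports (requires google-cloud-storage)."""
    from google.cloud import storage  # lazy import: needs GCP credentials
    client = storage.Client()
    now = datetime.now(timezone.utc)
    return [
        blob.name
        for blob in client.list_blobs(bucket_name, prefix=prefix)
        if blob.name.endswith(".csv") and is_stale(blob.updated, now)
    ]
```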


The Cloud Function appears to be correctly configured to run when the crux-updated topic is published.


Manually testing the Cloud Function succeeded, and the last modified dates of the CSV files on GCS were updated.


However, from the Pub/Sub topic's perspective, there are no subscriptions.


So from the Pub/Sub topic's config page, I created a new Cloud Function named crux-to-gcs2 with the same code. Publishing a test message to the topic correctly invoked the Cloud Function and updated the last modified date on GCS.
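
The same end-to-end check can be scripted by publishing a test message to the topic. A sketch assuming default credentials; the topic name comes from this issue, but the project ID here is an assumption:

```python
# Publish a test message to the crux-updated topic to confirm the new
# subscription fires (project ID "httparchive" is an assumption).
def topic_path(project: str, topic: str) -> str:
    """Fully qualified Pub/Sub topic path."""
    return f"projects/{project}/topics/{topic}"

def publish_test(project: str = "httparchive") -> str:
    # Lazy import so topic_path() works without google-cloud-pubsub installed.
    from google.cloud import pubsub_v1
    publisher = pubsub_v1.PublisherClient()
    future = publisher.publish(topic_path(project, "crux-updated"), b"test")
    return future.result()  # message ID once the publish is acknowledged
```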

We just missed the scheduled cron job that loads the URLs into the test server's database for the February crawl, so I'm running that script manually now (`./sync_crux.sh`).

I'll check on the flow again in a month to make sure it's still running smoothly.