HTTPArchive / httparchive.org

The HTTP Archive website hosted on App Engine
https://httparchive.org
Apache License 2.0

Syncing with CrUX origins broken since July #196

Closed rviscomi closed 4 years ago

rviscomi commented 4 years ago

I discovered that HTTP Archive's January 2020 dataset is actually based on the origins from the May 2019 CrUX dataset.

```sql
#standardSQL
SELECT
  month,
  COUNT(DISTINCT url) AS matching,
  (SELECT COUNT(DISTINCT url) FROM `httparchive.summary_pages.2020_01_01_*`) AS total,
  COUNT(DISTINCT url) / (SELECT COUNT(DISTINCT url) FROM `httparchive.summary_pages.2020_01_01_*`) AS pct
FROM
  `httparchive.summary_pages.2020_01_01_*`
JOIN
  (SELECT _TABLE_SUFFIX AS month, origin FROM `chrome-ux-report.all.*` GROUP BY 1,2)
ON
  url = CONCAT(origin, '/')
GROUP BY
  month
ORDER BY
  month DESC
```
| month | matching | total | pct |
| --- | --- | --- | --- |
| 201912 | 3,543,367 | 5,638,060 | 62.85% |
| 201911 | 3,784,646 | 5,638,060 | 67.13% |
| 201910 | 3,861,812 | 5,638,060 | 68.50% |
| 201909 | 4,063,960 | 5,638,060 | 72.08% |
| 201908 | 4,221,138 | 5,638,060 | 74.87% |
| 201907 | 4,221,279 | 5,638,060 | 74.87% |
| 201906 | 4,429,930 | 5,638,060 | 78.57% |
| 201905 | 5,638,060 | 5,638,060 | 100.00% |
| 201904 | 4,505,869 | 5,638,060 | 79.92% |
| 201903 | 4,252,904 | 5,638,060 | 75.43% |
| 201902 | 3,946,804 | 5,638,060 | 70.00% |
| 201901 | 3,746,554 | 5,638,060 | 66.45% |
| 201812 | 3,290,654 | 5,638,060 | 58.37% |
| 201811 | 3,243,256 | 5,638,060 | 57.52% |
| 201810 | 2,997,375 | 5,638,060 | 53.16% |
| 201809 | 2,938,831 | 5,638,060 | 52.12% |
| 201808 | 2,820,693 | 5,638,060 | 50.03% |
| 201807 | 2,639,681 | 5,638,060 | 46.82% |
| 201806 | 2,564,259 | 5,638,060 | 45.48% |
| 201805 | 2,473,843 | 5,638,060 | 43.88% |
| 201804 | 2,290,644 | 5,638,060 | 40.63% |
| 201803 | 1,994,691 | 5,638,060 | 35.38% |
| 201802 | 1,719,115 | 5,638,060 | 30.49% |
| 201801 | 1,600,701 | 5,638,060 | 28.39% |
| 201712 | 1,063,023 | 5,638,060 | 18.85% |
| 201711 | 723,059 | 5,638,060 | 12.82% |
| 201710 | 6,736 | 5,638,060 | 0.12% |

In other words, the only CrUX dataset the 2020_01_01 dataset has 100% parity with is 201905. If the sync were working, we would expect YYYYMM-2 (201911) to be the most recent CrUX dataset with 100% parity.

The syncing feature was built in #133.

rviscomi commented 4 years ago

From #133, the order of operations is:

  1. Schedule two queries to run every month on the 15th and export the latest desktop and mobile CrUX URLs to the `httparchive.urls.latest_crux_*` tables (e.g. `latest_crux_desktop`, `latest_crux_mobile`)
  2. Trigger a Cloud Pub/Sub event when the second query completes
  3. Handle the Pub/Sub event with a Cloud Function that exports the BigQuery tables to Cloud Storage
  4. Before each crawl, download the latest CSV files and import them into the appropriate urls MySQL tables
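
The export in step 3 can be sketched roughly as follows. This is a hypothetical reconstruction, not the function's actual source (which lives in #133): the dataset, table, and bucket names are assumptions.

```python
# Rough sketch of the step-3 Cloud Function (hypothetical names throughout;
# the real dataset/table/bucket identifiers may differ).
DATASET = "urls"
TABLES = ("latest_crux_desktop", "latest_crux_mobile")  # assumed table names
BUCKET = "httparchive"  # assumed bucket name

def destination_uri(bucket: str, table: str) -> str:
    """GCS path that a given BigQuery table is exported to."""
    return f"gs://{bucket}/{table}.csv"

def crux_to_gcs(event, context):
    """Pub/Sub-triggered entry point: export each latest-CrUX table as CSV."""
    # Lazy import so the helper above is usable without GCP credentials.
    from google.cloud import bigquery
    client = bigquery.Client()
    for table in TABLES:
        job = client.extract_table(
            f"{client.project}.{DATASET}.{table}",
            destination_uri(BUCKET, table),
        )
        job.result()  # block until each export job finishes
```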

Steps 1 and 2 appear to be working normally. The `latest_crux_*` tables are correctly populated with URLs corresponding to the most recent CrUX dataset, 201912, and the scheduled query is properly configured to trigger a Pub/Sub topic.


The Cloud Function may not be working properly, because the CSV files on GCS were last modified on June 30, 2019.
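
A periodic staleness check on the exported files would have surfaced this sooner. A minimal sketch, assuming hypothetical bucket and object names:

```python
# Flag exported CrUX CSVs on GCS that haven't been refreshed recently
# (hypothetical bucket/prefix; adjust to the real export paths).
from datetime import datetime, timedelta, timezone

def is_stale(last_modified: datetime, now: datetime, max_age_days: int = 35) -> bool:
    """A monthly export older than ~a month suggests the sync is broken."""
    return now - last_modified > timedelta(days=max_age_days)

def stale_exports(bucket_name: str = "httparchive", prefix: str = "crux") -> list:
    """Return names of stale CSV exports (requires google-cloud-storage)."""
    from google.cloud import storage  # lazy import: needs GCP credentials
    client = storage.Client()
    now = datetime.now(timezone.utc)
    return [
        blob.name
        for blob in client.list_blobs(bucket_name, prefix=prefix)
        if blob.name.endswith(".csv") and is_stale(blob.updated, now)
    ]
```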


The Cloud Function appears to be correctly configured to run when the crux-updated topic is published.


Manually testing the Cloud Function succeeded, and the last modified dates of the CSV files on GCS were updated.


However, from the Pub/Sub topic's perspective, there are no subscriptions.


So from the Pub/Sub topic's config page, I created a new Cloud Function named crux-to-gcs2 with the same code. Publishing a test message to the topic correctly invoked the Cloud Function and updated the last modified date on GCS.
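
The same end-to-end check can be scripted by publishing a test message to the topic. A sketch assuming default credentials; the topic name comes from this issue, but the project ID here is an assumption:

```python
# Publish a test message to the crux-updated topic to confirm the new
# subscription fires (project ID "httparchive" is an assumption).
def topic_path(project: str, topic: str) -> str:
    """Fully qualified Pub/Sub topic path."""
    return f"projects/{project}/topics/{topic}"

def publish_test(project: str = "httparchive") -> str:
    # Lazy import so topic_path() works without google-cloud-pubsub installed.
    from google.cloud import pubsub_v1
    publisher = pubsub_v1.PublisherClient()
    future = publisher.publish(topic_path(project, "crux-updated"), b"test")
    return future.result()  # message ID once the publish is acknowledged
```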

We just missed the scheduled cron job that loads the URLs into the test server's database for the February crawl, so I'm running that script manually now (`./sync_crux.sh`).

I'll check on the flow again in a month to make sure it's still running smoothly.