Closed rviscomi closed 4 years ago
From #133, the order of operations is:
Steps 1 and 2 appear to be working normally. The latest_crux_* tables are correctly populated with URLs corresponding to the most recent CrUX dataset, 201912. The scheduled query is properly configured to trigger a Pub/Sub topic.
The Cloud Function may not be working properly, because the CSV files on GCS were last modified on June 30, 2019.
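The staleness reasoning above can be written as a small check. This is only a sketch: the function name `export_is_stale` and its inputs are hypothetical and not part of the actual pipeline. It relies on one conservative assumption: a CSV last modified before the first day of the latest dataset's month cannot contain that dataset.

```python
from datetime import date


def export_is_stale(last_modified: date, latest_dataset: str) -> bool:
    """Return True if the export predates the latest dataset's month.

    latest_dataset is a YYYYMM string such as "201912" (hypothetical input;
    in practice it would come from the latest_crux_* tables).
    """
    year, month = int(latest_dataset[:4]), int(latest_dataset[4:6])
    # Conservative check: anything modified before the dataset's month
    # began cannot include that dataset.
    return last_modified < date(year, month, 1)


# The situation described in this issue: files last modified June 30, 2019,
# while the most recent CrUX dataset is 201912.
print(export_is_stale(date(2019, 6, 30), "201912"))
```

A `False` result does not prove the export is current, only that this particular check cannot rule it out.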
The Cloud Function appears to be correctly configured to run when the crux-updated topic is published.
A manual test of the Cloud Function completed successfully, and the last modified date of the CSV files on GCS was updated.
However, the Pub/Sub topic itself reports that it has no subscriptions.
So from the Pub/Sub topic's config page, I created a new Cloud Function named crux-to-gcs2 with the same code. Triggering a test message for the topic correctly invoked the Cloud Function and updated the last modified date on GCS.
We just missed the scheduled cron job to load the URLs into the test server's database for the February crawl, so I'm running that script manually now (~$ ./sync_crux.sh).
I'll check on the flow again in a month to make sure it's still running smoothly.
I discovered that HTTP Archive's January 2020 dataset is actually based on the origins from the May 2019 CrUX dataset.
In other words, 201905 is the most recent CrUX dataset with which the 2020_01_01 dataset has 100% parity. If the sync were working, we would expect YYYYMM-2 (201911) to be the most recent CrUX dataset with 100% parity.
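To make the two-month lag concrete, here is a minimal sketch of the date arithmetic. The helper names `expected_crux_dataset` and `months_behind` are hypothetical and used only for illustration; the code assumes crawl names in the form YYYY_MM_DD and CrUX dataset names in the form YYYYMM, as used in this issue.

```python
def expected_crux_dataset(crawl: str, lag_months: int = 2) -> str:
    """Return the CrUX dataset (YYYYMM) expected to match a crawl (YYYY_MM_DD)."""
    year, month, _day = (int(part) for part in crawl.split("_"))
    # Count months from year 0 so that subtraction rolls over year boundaries.
    index = year * 12 + (month - 1) - lag_months
    return f"{index // 12:04d}{index % 12 + 1:02d}"


def months_behind(expected: str, actual: str) -> int:
    """Return how many months the actual dataset trails the expected one."""
    def to_index(yyyymm: str) -> int:
        return int(yyyymm[:4]) * 12 + int(yyyymm[4:6]) - 1

    return to_index(expected) - to_index(actual)


expected = expected_crux_dataset("2020_01_01")
print(expected)                           # 201911
print(months_behind(expected, "201905"))  # 6
```

By this arithmetic, the 2020_01_01 crawl was six months behind the CrUX dataset it should have been based on.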
The syncing feature was built in #133.