tunetheweb closed this 2 months ago
@tunetheweb for `DECLARE _YYYYMMDD DATE DEFAULT '2024-02-01';`, is the date for the current month or the previous month?
The previous month, as it needs both the HTTP Archive data and the CrUX data, and the latter is only released on the second Tuesday of the following month.
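For illustration, the "previous month" date that would go into the `_YYYYMMDD` declare can be computed like this (a stdlib-only sketch; the helper name is hypothetical, not from the actual pipeline):

```python
from datetime import date

def previous_month_start(today: date) -> date:
    """Return the first day of the month before `today`,
    i.e. the month the report data refers to."""
    first_of_this_month = today.replace(day=1)
    # Step back one day to land in the previous month, then snap to day 1.
    last_of_prev_month = date.fromordinal(first_of_this_month.toordinal() - 1)
    return last_of_prev_month.replace(day=1)

# For a run in March 2024: previous_month_start(date(2024, 3, 12)) → 2024-02-01,
# matching the DECLARE default above.
```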
Tagging in @max-ostapenko, who knows GCP and therefore might have a better suggestion of how to run this in a Dataform workflow, where we can see it log its progress and issues, and then presumably be able to kick off the next step to populate the Firestore caches after that (in future, when Martin has that set up).
/assign @max-ostapenko so I can get back to this
@max-ostapenko is the trigger and full automation hooked up now (including triggering off of the "done" pubsub message) or was this closed prematurely?
@pmeenan yeah, the PR closed automatically - reopened.
So you suggest we keep using a PubSub `done` event as a trigger?
Do we have a `started` event, for when CrUX data becomes available?
We currently publish a "done" message to the `crawl-complete` queue. I'm happy to post a "started" message, use a different queue for status, or use other triggers. I can also kick off the workflows directly from the crawl manager. Whatever is easiest.
@pmeenan I see there is a `crux-updated` topic.
Can I rely on this one to trigger the tech report tables? What sends a message here?
I don't know. I'm not sure it is even hooked up. Can you tell if any messages were published in the last month? If not, I could publish to it.
A message was published on 16 Aug, 2am CEST. There is also one subscriber: `gcf-crux-to-gcs2-crux-updated`.
I created a workflow, `cwv-tech-report`; it's just missing a Pub/Sub trigger. You can test it and see the updated table.
@pmeenan and is there an automated trigger for the crawl itself?
If not, what is it dependent on? (my guess was `chrome-ux-report.all.202407` availability)
If manual, how long does it usually take between the dependencies being available and the crawl starting?
The main crawl controller (crawl.py) checks the CrUX BigQuery table metadata hourly to see when it was last updated. ~6 hours after the update (just to be safe) it kicks off the crawl.
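The hourly check described above boils down to comparing the table's last-modified time against a safety delay. A simplified sketch (not the actual crawl.py code; the month-comparison heuristic here is an assumption about how "new data" is detected):

```python
from datetime import datetime, timedelta

SAFETY_DELAY = timedelta(hours=6)

def should_start_crawl(crux_last_modified: datetime,
                       last_crawl_month: str,
                       now: datetime) -> bool:
    """Kick off the crawl if the CrUX table was updated for a new month
    and at least SAFETY_DELAY has passed since that update."""
    new_data = crux_last_modified.strftime("%Y-%m") != last_crawl_month
    settled = now - crux_last_modified >= SAFETY_DELAY
    return new_data and settled
```

A cron job could call this hourly and start the crawl on the first `True`.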
OK, I see an issue with relying on crawl.py (https://github.com/HTTPArchive/crawl/blob/main/crawl.py#L357).

CrUX `chrome-ux-report.experimental.global` was created on Aug 13, 2024, 8:20:04 AM UTC+2.

The tech report depends on the following tables:

- `chrome-ux-report.materialized.country_summary`, last updated on Aug 13, 2024, 10:08:32 AM UTC+2
- `chrome-ux-report.materialized.device_summary`, last updated on Aug 13, 2024, 6:15:17 PM UTC+2

The `device_summary` table becomes available ~10 hours later, so the triggered workflow would fail.
And I have doubts that the crawl controller would be a good place for unrelated checks. Do you have an alternative solution for this trigger in place?
I guess we need another poller (similar to the one used by crawl.py) to kick this off once the last table has completed?
I don't have anything currently in place but it would be trivial to pull the logic out of crawl.py into a separate script specific to this workflow that checks for all of the necessary tables to be ready before starting the workflow (or sending a pubsub message). It would run on a cron job of its own.
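Pulling that logic out into a standalone poller could look roughly like this. The table names come from this thread; the readiness check itself is a sketch (the real script would fetch last-modified timestamps from the BigQuery tables API before calling it):

```python
from datetime import datetime
from typing import Mapping, Optional

# Tables the tech report workflow depends on (from the discussion above).
REQUIRED_TABLES = (
    "chrome-ux-report.materialized.country_summary",
    "chrome-ux-report.materialized.device_summary",
)

def all_tables_ready(last_modified: Mapping[str, Optional[datetime]],
                     month_start: datetime) -> bool:
    """True once every required table has been updated on or after
    the expected CrUX release for this month."""
    return all(
        (ts := last_modified.get(table)) is not None and ts >= month_start
        for table in REQUIRED_TABLES
    )
```

A cron-driven script would poll this and, once it returns `True`, start the workflow (or publish the Pub/Sub message).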
> I guess we need another poller (similar to the one used by crawl.py) to kick this off once the last table has completed?
lol - yeah, what Barry said.
ok, will add these triggers
@pmeenan please, could you adjust the published `crawl-complete` message so that it sends JSON:

```json
{
  "name": "crawl_complete",
  ...
}
```
Here is a line trying to get the trigger name.
Sorry, just looking at this now. Sure, I can change the payload. Right now it sends the path to the bucket where the HARs are written, but that's really not important now that we stream the data directly.
https://github.com/HTTPArchive/cwv-tech-report/blob/main/sql/monthly.sql
As discussed @pmeenan, this should run once we have the CrUX data, so at the start of the crawl.
Would be nice to have monitoring to know if it succeeded or timed out, if that's possible too.