HTTPArchive / cwv-tech-report

Core Web Vitals Technology Report
https://cwvtech.report

Automate populating tech report SQL #36

Closed tunetheweb closed 2 months ago

tunetheweb commented 2 months ago

https://github.com/HTTPArchive/cwv-tech-report/blob/main/sql/monthly.sql

As discussed, @pmeenan, this should run once we have the CrUX data, so at the start of the crawl.

Would be nice to have monitoring to know if it succeeded or timed out, if that's possible too.

pmeenan commented 2 months ago

@tunetheweb for DECLARE _YYYYMMDD DATE DEFAULT '2024-02-01';, is the date for the current month or the previous month?

tunetheweb commented 2 months ago

The previous month, as it needs the HTTP Archive data and the CrUX data and the latter is only released on the second Tuesday of the following month.
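The date rule described above can be sketched in a few lines. This is only an illustration of the arithmetic, not code from the repository; the function names `previous_month_start` and `second_tuesday` are made up for this example.

```python
from datetime import date, timedelta


def previous_month_start(today: date) -> date:
    """Return the first day of the month before `today`."""
    first_of_this_month = today.replace(day=1)
    last_of_previous_month = first_of_this_month - timedelta(days=1)
    return last_of_previous_month.replace(day=1)


def second_tuesday(year: int, month: int) -> date:
    """Return the second Tuesday of the given month."""
    first = date(year, month, 1)
    # date.weekday() uses Monday=0, so Tuesday=1.
    days_to_first_tuesday = (1 - first.weekday()) % 7
    return first + timedelta(days=days_to_first_tuesday + 7)


if __name__ == "__main__":
    # A run in March 2024 would use 2024-02-01 as the report date.
    print(previous_month_start(date(2024, 3, 12)))
    # CrUX release day for August 2024.
    print(second_tuesday(2024, 8))
```

A scheduler could use `second_tuesday` as the earliest day on which it makes sense to start checking for new CrUX data.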

tunetheweb commented 2 months ago

Tagging in @max-ostapenko, who knows GCP and therefore might have a better suggestion for how to run this, perhaps in a Dataform workflow where we can see it log its progress and any issues, and then presumably kick off the next step to populate the Firestore caches after that (in future, when Martin has that set up).

max-ostapenko commented 2 months ago

/assign @max-ostapenko so I can get back to this

pmeenan commented 2 months ago

@max-ostapenko is the trigger and full automation hooked up now (including triggering off of the "done" pubsub message) or was this closed prematurely?

max-ostapenko commented 2 months ago

@pmeenan yeah, the PR closed this issue automatically. Reopened.

So you suggest we keep using a PubSub done event as a trigger? Do we have a started event, when CrUX data becomes available?

pmeenan commented 2 months ago

We currently publish a "done" message to the crawl-complete queue. I'm happy to post a started message or use a different queue for status or use other triggers. I can also kick off the workflows directly from the crawl manager. Whatever is easiest.

max-ostapenko commented 2 months ago

@pmeenan I see there is a crux-updated topic. Can I rely on this one to trigger the tech report tables? What sends a message here?

pmeenan commented 2 months ago

I don't know. I'm not sure it is even hooked up. Can you tell if any messages were published in the last month? If not, I could publish to it.

On Wed, Aug 28, 2024 at 7:54 PM Max Ostapenko @.***> wrote:

@pmeenan https://github.com/pmeenan I see there is crux-updated topic. Can I rely on this one to trigger tech report tables? What sends a message here?

max-ostapenko commented 2 months ago

There was a message on 16 Aug at 2am CEST. There is also one subscriber, gcf-crux-to-gcs2-crux-updated.

I created a workflow, cwv-tech-report. It's just missing a pubsub trigger. You can test it and see the updated table.

max-ostapenko commented 2 months ago

@pmeenan and is there an automated trigger for the crawl itself? If not, what is it dependent on? (my guess was chrome-ux-report.all.202407 availability)

If manual, how long does it usually take between the dependencies becoming available and the crawl starting?

pmeenan commented 2 months ago

The main crawl controller (crawl.py) checks the CrUX BigQuery table metadata hourly to see when the table was last updated. About 6 hours after the update (just to be safe), it kicks off the crawl.
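The check described above might look roughly like the sketch below. This is not the actual crawl.py logic; `should_start_crawl`, `SAFETY_DELAY`, and the `last_started_for` bookkeeping are assumptions made for illustration, and fetching the table metadata from BigQuery is left out so the decision logic stays self-contained.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# "~6 hours after the update (just to be safe)"
SAFETY_DELAY = timedelta(hours=6)


def should_start_crawl(
    last_modified: datetime,
    now: datetime,
    last_started_for: Optional[datetime] = None,
) -> bool:
    """Decide whether an hourly check should kick off the crawl.

    last_modified: when the CrUX table was last updated.
    last_started_for: the update timestamp that already triggered a crawl
    (hypothetical bookkeeping, so one update starts only one crawl).
    """
    if last_started_for is not None and last_modified <= last_started_for:
        return False
    return now - last_modified >= SAFETY_DELAY


if __name__ == "__main__":
    # 8:20:04 AM UTC+2 expressed in UTC.
    updated = datetime(2024, 8, 13, 6, 20, 4, tzinfo=timezone.utc)
    for hours in (1, 5, 6, 7):
        print(hours, should_start_crawl(updated, updated + timedelta(hours=hours)))
```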

On Thu, Aug 29, 2024 at 3:09 PM Max Ostapenko @.***> wrote:

@pmeenan https://github.com/pmeenan and is there an automated trigger for the crawl itself? If not, what is it dependent on? (my guess was chrome-ux-report.all.202407 availability)

If manual, how long does it usually take between dependencies available and crawl start?

max-ostapenko commented 2 months ago

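OK, I see an issue with relying on crawl.py (https://github.com/HTTPArchive/crawl/blob/main/crawl.py#L357). CrUX chrome-ux-report.experimental.global was created on Aug 13, 2024, 8:20:04 AM UTC+2.

Tech report depends on the following tables:

  • chrome-ux-report.materialized.country_summary last updated on Aug 13, 2024, 10:08:32 AM UTC+2
  • chrome-ux-report.materialized.device_summary last updated on Aug 13, 2024, 6:15:17 PM UTC+2

The device_summary table becomes available about 10 hours later, so the triggered workflow will fail.

And I doubt that the crawl controller would be a good place for unrelated checks. Do you have alternative solutions for this trigger in place?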

tunetheweb commented 2 months ago

I guess we need another poller (similar to that used by crawl.py) to kick this off once the last table has completed?

pmeenan commented 2 months ago

I don't have anything currently in place but it would be trivial to pull the logic out of crawl.py into a separate script specific to this workflow that checks for all of the necessary tables to be ready before starting the workflow (or sending a pubsub message). It would run on a cron job of its own.
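A standalone poller along these lines could be sketched as follows. It is a minimal illustration, not an existing script: `all_tables_ready`, `release_start`, and the injected `get_last_modified` callable are hypothetical names. In a real job, `get_last_modified` would presumably wrap a BigQuery metadata lookup; here it is passed in so the readiness check can run without any cloud access.

```python
from datetime import datetime, timezone
from typing import Callable, Iterable

# Tables the tech report depends on, as listed in this thread.
REQUIRED_TABLES = (
    "chrome-ux-report.materialized.country_summary",
    "chrome-ux-report.materialized.device_summary",
)


def all_tables_ready(
    tables: Iterable[str],
    get_last_modified: Callable[[str], datetime],
    release_start: datetime,
) -> bool:
    """True only if every table was updated at or after `release_start`."""
    return all(get_last_modified(table) >= release_start for table in tables)


if __name__ == "__main__":
    utc = timezone.utc
    release_start = datetime(2024, 8, 13, tzinfo=utc)  # assumed cutoff
    # Snapshot taken before device_summary was refreshed; the July
    # timestamp is a made-up placeholder for "last month's update".
    snapshot = {
        REQUIRED_TABLES[0]: datetime(2024, 8, 13, 8, 8, 32, tzinfo=utc),
        REQUIRED_TABLES[1]: datetime(2024, 7, 9, 16, 0, 0, tzinfo=utc),
    }
    print(all_tables_ready(REQUIRED_TABLES, snapshot.__getitem__, release_start))
```

Run from cron, a check like this would start the workflow (or publish a pubsub message) only once it returns true, and would need to remember that it has already fired for the current month.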

On Thu, Aug 29, 2024 at 3:41 PM Max Ostapenko @.***> wrote:

OK, I see issue relying on crawl.py https://github.com/HTTPArchive/crawl/blob/main/crawl.py#L357. CrUX chrome-ux-report.experimental.global was created on Aug 13, 2024, 8:20:04 AM UTC+2.

Tech report depends on the following tables:

  • chrome-ux-report.materialized.country_summary last updated on Aug 13, 2024, 10:08:32 AM UTC+2
  • chrome-ux-report.materialized.device_summary last updated on Aug 13, 2024, 6:15:17 PM UTC+2

device_summary table becomes available 10h later - the triggered workflow will fail.

And I have doubts that crawl controller would be a good place for non-related checks. Do you have alternative solutions for this trigger in place?

pmeenan commented 2 months ago

I guess we need another poller (similar to that used by crawl.py) to kick this off once the last table has completed?

lol - yeah, what Barry said.

max-ostapenko commented 2 months ago

ok, will add these triggers

max-ostapenko commented 2 months ago

@pmeenan could you please adjust the published 'crawl-complete' message so that it sends JSON:

{
  "name": "crawl_complete",
  ...
}

Here is a line trying to get the trigger name.
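For illustration, reading the trigger name out of such a message might look like the sketch below. It assumes the message body arrives base64-encoded, as Pub/Sub delivers it to push endpoints and functions; `encode_message` and `trigger_name` are hypothetical helpers, not the workflow's actual code, and only the `name` field shown above is used.

```python
import base64
import json
from typing import Optional


def encode_message(payload: dict) -> str:
    """Serialize a payload the way a base64-encoded message body would carry it."""
    return base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")


def trigger_name(data_b64: str) -> Optional[str]:
    """Return the "name" field of a base64-encoded JSON message, or None.

    None is returned for bodies that are not a JSON object (for example a
    plain path string), so the caller can ignore or log them.
    """
    try:
        payload = json.loads(base64.b64decode(data_b64).decode("utf-8"))
    except ValueError:  # covers bad base64, bad UTF-8, and invalid JSON
        return None
    if not isinstance(payload, dict):
        return None
    return payload.get("name")


if __name__ == "__main__":
    print(trigger_name(encode_message({"name": "crawl_complete"})))
```

Returning `None` rather than raising keeps the consumer tolerant of older non-JSON messages while the publisher is being changed.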

pmeenan commented 2 months ago

Sorry, just looking at this now. Sure, I can change the payload. Right now it sends the path to the bucket where the HARs are written to but that's really not important now that we stream the data directly.