max-ostapenko opened 1 month ago
They've been hacked and hacked and hacked and are now a mess. Mostly my hacking. Open to better ideas here (dataform?).
Basically they need to be able to do this:

- Run monthly histograms SQLs when crawl is finished.
- Run longer term time series SQLs when crawl is finished.
- Be able to run the time series in an incremental fashion (download existing JSON, run for any months since then, and merge them together, which leads to HUGE timing savings), which currently requires injecting additional SQL clauses (using different data formats if it's HTTP Archive legacy, `all`, or CrUX).
- Handle different lenses (Top X, WordPress, Drupal, Magento), which currently requires injecting additional SQL clauses into the current queries.
- Handle CrUX reports (monthly histograms and time series) having to run later.
- Be able to upload to cloud storage in GCP so reports can be hosted on our CDN and served to our site or others.
- Be able to run only the reports that are missing (histograms) or have missing dates (time series).
- Be able to force rerun (overriding any existing reports).
- Be able to run a subset of reports (e.g. all `*crux*` ones, or just the `numURLs` one because it was wrong and needs a force rerun).

Thoughts on how best to tackle this WITHOUT hacky bash scripts?
Considering the requirement list, it's no surprise it's a hacky one :)
We could move it to Dataform, but I'm not sure it would give any advantages.
Here's how it would look in Dataform:
> Basically they need to be able to do this:
>
> - Run monthly histograms SQLs when crawl is finished

Assign a tag to histogram report table jobs, and include it in the run triggered by the `crawl_complete` event.
> - Run longer term time series SQLs when crawl is finished

Assign a tag to time series report table jobs, and include it in the run triggered by the `crawl_complete` event.
> - Be able to run the time series in an incremental fashion (download existing JSON, run for any months since then, and merge them together, which leads to HUGE timing savings) which currently requires injecting additional SQL clauses (using different data formats if it's HTTP Archive legacy, `all`, or CrUX).
I think we'd be better off moving this logic to SQL completely: have a table per report and export the whole table to GCS after each table change. This way all the logic stays in BQ/Dataform/SQL, and downstream steps can be operated independently, using BQ-native log events as triggers (exports can't be implemented using SQL anyway).
> - Handle different lenses (Top X, WordPress, Drupal, Magento) which currently requires injecting additional SQL clauses into the current queries. We should make this natively supported in the current SQL with an optional param.
Seems like a templating scenario; it can be implemented with Node.js in Dataform.
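As a rough sketch of that templating idea (the lens names come from the list above, but the WHERE fragments and function name here are invented placeholders, not the real lens definitions), a Dataform-style JS include could inject the lens clause into each query:

```javascript
// Hypothetical sketch of lens templating, e.g. in a Dataform includes/ file.
// The WHERE fragments below are placeholders, not the real lens definitions.
const LENS_CLAUSES = {
  wordpress: "app = 'WordPress'",
  drupal: "app = 'Drupal'",
  magento: "app = 'Magento'",
  top1k: "rank <= 1000",
};

// Returns an extra WHERE predicate for the given lens,
// or a no-op predicate when no lens is selected (all sites).
function lensFilter(lens) {
  if (!lens) return "TRUE"; // no lens: keep every row
  const clause = LENS_CLAUSES[lens];
  if (!clause) throw new Error(`Unknown lens: ${lens}`);
  return clause;
}

module.exports = { lensFilter };
```

In a `.sqlx` file this would then be interpolated as something like `WHERE ${lensFilter(params.lens)}`, making the lens an optional parameter instead of bash-injected SQL.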
> - Handle CrUX reports (monthly histograms and time series) having to run later.

Different tag, different trigger.
> - Be able to upload to cloud storage in GCP to allow it to be hosted on our CDN (example json) and be served to our site (example report) or others.

Related to the incrementality note above, this needs to be separated (a Cloud Function?). Bonus points if we could have event-driven integration (subscribe to BQ logs and export JSON whenever a job writing to a report table completes successfully).
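To make the event-driven idea concrete (purely a sketch: the log-entry shape and dataset name below are simplified and hypothetical, as the real BigQuery audit log schema is more deeply nested), the core of such a Cloud Function would be a check like this:

```javascript
// Sketch of the event-driven export trigger: a function subscribed to BigQuery
// audit logs decides, per log entry, whether a finished job wrote to a report
// table and should therefore trigger a JSON export to cloud storage.
// NOTE: this entry shape is a simplified stand-in, not the real audit log schema.
const REPORT_DATASET = "reports"; // hypothetical dataset holding report tables

function shouldExport(entry) {
  return (
    entry.jobStatus === "DONE" && // job finished
    !entry.jobError && // and finished successfully
    entry.destinationDataset === REPORT_DATASET // and wrote to a report table
  );
}

// A real handler would then call the BigQuery client to extract the
// destination table to a GCS object served by the CDN.
module.exports = { shouldExport };
```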
> - Be able to run and only run reports missing (histograms) or missing dates (time series)

Missing why? If any jobs fail in the Dataform workflow we'll get an alert, check the execution logs, fix the issues, and rerun selected items manually in a dev workspace. Date range and other parameter adjustments need to be done in the templates.
> - Be able to force rerun (to override any existing reports). Note that for time series this means running from the beginning of time, so it can take a long time (especially for some of the expensive ones!), but we don't do that often.

Select a report in the dev workspace, adjust the template parameters and hit run.
> - Be able to run a subset of reports (e.g. all `*crux*` ones, or just the `numURLs` one because it was wrong and needs a force rerun).

Can be done using additional tags. But I hope we can get to stable runs, and then the UI is very helpful for manual scenarios.
Yeah, it seems Dataform matches the requirements well (simple and native).
So I suppose it's a manual step today? Deployed somewhere or running locally? How much time does it take?
Deployed on a VM in the cloud. We do a "git pull" to release a new report; it's kicked off from cron every morning and bails early if tables aren't ready, then runs each report when data is available, and then checks (but does not run) each report every other day of that month. A bit wasteful to be honest!
It takes a full day to run all reports for all lenses.
> Missing why?
Ideally no missing data. But sometimes it fails and I don't get around to fixing it that month. The way it works currently is it checks the last date and then runs the report with `AND date > LAST_GOOD_DATE`, which means it automatically fills in the missing data for the time series without me having to run it twice for each month. Not essential, but a kinda nice feature of the current script.
Oh, so it's the queries that I see running today (with the VM service account)? Based on the logs it takes more than 2 days (PWA scores aren't available while it's running) and 17% of jobs fail.
And looking at the bytes estimate for the first histogram report, we are doing 35x more processing than would be required with an incremental table.
I would focus on rewriting the queries to the new `all` schema so that we can deprecate legacy.
And if you think it will break the current process and you can help rewrite the bash into JS templates, we can go for a full migration.
BTW, if I understood correctly, the current solution is constantly checking and trying to process missing reports without much manual interference today; that's not the case with Dataform.
When jobs fail in Dataform we'll need to fix and re-run them manually (a correct DAG being the solution). Queuing reports and retrying failures later doesn't work with this tool.
> Oh, so it's the queries that I see running today (with the VM service account)?
Yup.
> Based on the logs it takes more than 2 days (PWA scores aren't available while it's running)
Oh, that's a good point! That one no longer exists so it will never complete. But it's been running every day trying. I've deleted that file so it won't be run anymore. Will delete it from GitHub too.
> and 17% of jobs fail.
As well as odd examples like the above, CrUX ones will fail until that data is ready. And some don't work with lenses (but we don't have a good way of excluding them). As those run every day (without success), that probably inflates the failure rate. But yeah, it would be nice to be able to schedule these better so we only have real failures.
> And looking at the bytes estimate for the first histogram report, we are doing 35x more processing than would be required with an incremental table.
This is because the Lighthouse report will now be JSON, so we don't need to process the whole report to get a very small subset of that data? That will indeed be a very nice improvement!
> I would focus on rewriting the queries to the new `all` schema so that we can deprecate legacy. And if you think it will break the current process and you can help rewrite the bash into JS templates, we can go for a full migration.
Yeah I started on that. Will try to finish it out.
> BTW, if I understood correctly, the current solution is constantly checking and trying to process missing reports without much manual interference today; that's not the case with Dataform.
Yes. But that also has downsides as per above! It was very low tech. We can do better here so we don't need to do that.
> When jobs fail in Dataform we'll need to fix and re-run them manually (a correct DAG being the solution). Queuing reports and retrying failures later doesn't work with this tool.
That shouldn't be an issue if we make this more robust.
> And looking at the bytes estimate for the first histogram report, we are doing 35x more processing than would be required with an incremental table.
>
> This is because the Lighthouse report will now be JSON, so we don't need to process the whole report to get a very small subset of that data? That will indeed be a very nice improvement!
No, I meant we don't need to process all the historical data every month, only the last crawl. It's 2 rows per lens monthly.
We may even deduplicate some processing and atomically create a wide metrics table:
- date
- timestamp
- client
- lens
- a11yButtonName
...
- bootupJs
...
- webSocketStream
The reports in the `/histograms/` folder only run for the month needed: `${YYYY_MM_DD}` is replaced by the actual date in the bash script.
Oh, true. I was talking about time series: https://github.com/HTTPArchive/bigquery/blob/master/sql/timeseries/a11yButtonName.sql
Yeah, but then I add an `AND SUBSTR(_TABLE_SUFFIX, 0, 10) > \"$max_date\"` type clause; that's the incremental mode I talked about. So it won't reprocess all the historical months (unless we use the `force` option).
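That incremental-mode clause could be ported from bash into a small JS helper (a sketch only; the function name is invented, and it assumes the same `_TABLE_SUFFIX` layout and `force` semantics described above):

```javascript
// Sketch of the incremental clause the bash script injects into time series
// queries. Given the last date already present in the report (or null when
// the report is empty) and a force flag, build the extra SQL predicate.
// Mirrors the AND SUBSTR(_TABLE_SUFFIX, 0, 10) > "<max_date>" clause above.
function incrementalClause(maxDate, force = false) {
  // force (or an empty report) means reprocessing from the beginning of time
  if (force || !maxDate) return "";
  return `AND SUBSTR(_TABLE_SUFFIX, 0, 10) > "${maxDate}"`;
}

module.exports = { incrementalClause };
```

This keeps the "only process months since the last good date" behaviour visible in one place instead of buried in string concatenation in bash.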
Oh, I expected something like this. Looks like magic unless you understand the script...
Ok, let me create a branch with some drafts, so that you can get a feeling about the workflow and templating.
Yeah the script started simple, and then grew and grew and grew - well beyond what we should use a bash script for!
But hey, it works!
For the script located in https://github.com/HTTPArchive/bigquery/tree/master/sql:
I can help with the SQL rewrite, but I'm not familiar with the nuances of these bash scripts.