max-ostapenko opened 1 month ago
They've been hacked and hacked and hacked and are now a mess. Mostly my hacking. Open to better ideas here (dataform?).
Basically they need to be able to do this:

- Run monthly histograms SQLs when crawl is finished.
- Run longer term time series SQLs when crawl is finished.
- Be able to run the time series in an incremental fashion (download existing JSON, run for any months since then, and merge them together, which leads to HUGE timing savings), which currently requires injecting additional SQL clauses (using different data formats if it's HTTP Archive legacy, `all`, or CrUX).
- Handle different lenses (Top X, WordPress, Drupal, Magento), which currently requires injecting additional SQL clauses into the current queries.
- Handle CrUX reports (monthly histograms and time series) having to run later.
- Be able to upload to cloud storage in GCP so reports can be hosted on our CDN and served to our site or others.
- Be able to run only the reports that are missing (histograms) or have missing dates (time series).
- Be able to force rerun (overriding any existing reports).
- Be able to run a subset of reports (e.g. all `*crux*` ones, or just the `numURLs` one because it was wrong and needs a force rerun).

Thoughts on how best to tackle this WITHOUT hacky bash scripts?
Considering the requirement list, it's no surprise it's a hacky one :)
We could move it to Dataform, but I'm not sure it would give any advantages.
Here's how it would look in Dataform:
> Basically they need to be able to do this:
>
> - Run monthly histograms SQLs when crawl is finished

Assign a tag to histogram report table jobs, and include it in the run triggered by the `crawl_complete` event.
> - Run longer term time series SQLs when crawl is finished

Assign a tag to time series report table jobs, and include it in the run triggered by the `crawl_complete` event.
> - Be able to run the time series in an incremental fashion (download existing JSON, run for any months since then, and merge them together, which leads to HUGE timing savings) which currently requires injecting additional SQL clauses (using different data formats if it's HTTP Archive legacy, `all`, or CrUX).
I think we'd be better off moving this logic to SQL completely: have a table per report and export the whole table to GCS after each table change. This way all the logic stays in BQ/Dataform/SQL, and downstream steps can be operated independently, using BQ-native log events as triggers (exports can't be implemented using SQL anyway).
> - Handle different lenses (Top X, WordPress, Drupal, Magento) which currently requires injecting additional SQL clauses into the current queries. We should make this natively supported in the current SQL with an optional param.
Seems like a templating scenario; it can be implemented with Node.js in Dataform.
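As a rough sketch of that templating idea (the lens names come from the list above, but the WHERE fragments and function name here are invented placeholders, not the real lens definitions), a Dataform-style JS include could inject the lens clause into each query:

```javascript
// Hypothetical sketch of lens templating, e.g. in a Dataform includes/ file.
// The WHERE fragments below are placeholders, not the real lens definitions.
const LENS_CLAUSES = {
  wordpress: "app = 'WordPress'",
  drupal: "app = 'Drupal'",
  magento: "app = 'Magento'",
  top1k: "rank <= 1000",
};

// Returns an extra WHERE predicate for the given lens,
// or a no-op predicate when no lens is selected (all sites).
function lensFilter(lens) {
  if (!lens) return "TRUE"; // no lens: keep every row
  const clause = LENS_CLAUSES[lens];
  if (!clause) throw new Error(`Unknown lens: ${lens}`);
  return clause;
}

module.exports = { lensFilter };
```

In a `.sqlx` file this would then be interpolated as something like `WHERE ${lensFilter(params.lens)}`, making the lens an optional parameter instead of bash-injected SQL.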
> - Handle CrUX reports (monthly histograms and time series) having to run later.

Different tag, different trigger.
> - Be able to upload to cloud storage in GCP to allow it to be hosted on our CDN (example json) and be served to our site (example report) or others.

Related to the incrementality note above, this needs to be separated (a Cloud Function?). Bonus points if we could have event-driven integration (subscribe to BQ logs and export JSON whenever a job writing to a report table completes successfully).
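To make the event-driven idea concrete (purely a sketch: the log-entry shape and dataset name below are simplified and hypothetical, as the real BigQuery audit log schema is more deeply nested), the core of such a Cloud Function would be a check like this:

```javascript
// Sketch of the event-driven export trigger: a function subscribed to BigQuery
// audit logs decides, per log entry, whether a finished job wrote to a report
// table and should therefore trigger a JSON export to cloud storage.
// NOTE: this entry shape is a simplified stand-in, not the real audit log schema.
const REPORT_DATASET = "reports"; // hypothetical dataset holding report tables

function shouldExport(entry) {
  return (
    entry.jobStatus === "DONE" && // job finished
    !entry.jobError && // and finished successfully
    entry.destinationDataset === REPORT_DATASET // and wrote to a report table
  );
}

// A real handler would then call the BigQuery client to extract the
// destination table to a GCS object served by the CDN.
module.exports = { shouldExport };
```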
> - Be able to run and only run reports missing (histograms) or missing dates (time series)

Missing why? If any jobs fail in the Dataform workflow we'll get an alert, check the execution logs, fix the issues, and rerun selected items manually in a dev workspace. Date range and other parameter adjustments need to be done in the templates.
> - Be able to force rerun (to override any existing reports). Note that for time series this means running from the beginning of time, so it can take a long time (especially for some of the expensive ones!), but we don't do that often.

Select a report in the dev workspace, adjust the template parameters and hit run.
> - Be able to run a subset of reports (e.g. all `*crux*` ones, or just the `numURLs` one because it was wrong and needs a force rerun).

Can be done using additional tags. But I hope we can get to stable runs, and then the UI is very helpful for manual scenarios.
Yeah, it seems Dataform matches the requirements well (simple and native).
So I suppose it's a manual step today? Deployed somewhere or running locally? How much time does it take?
Deployed on a VM in the cloud. We do a "git pull" to release a new report; it's kicked off from cron every morning and bails early if tables aren't ready, then runs each report when data is available, and then checks (but does not run) each report every other day of that month. A bit wasteful to be honest!
It takes a full day to run all reports for all lenses.
> Missing why?
Ideally no missing data. But sometimes it fails and I don't get around to fixing it that month. The way it works currently is it checks the last date and then runs the report with `AND date > LAST_GOOD_DATE`, which means it automatically fills in the missing data for the time series without me having to run it twice for each month. Not essential, but a kinda nice feature of the current script.
Oh, so it's the queries that I see running today (with the VM service account)? Based on the logs it takes more than 2 days (PWA scores aren't available while it's running) and 17% of jobs fail.
And looking at the bytes estimate for the first histogram report, we are doing 35x more processing than would be required with an incremental table.
I would focus on rewriting the queries to the new `all` schema so that we can deprecate legacy.
And if you think it will break the current process and you can help rewrite the bash into JS templates, we can go for a full migration.
BTW, if I understood correctly, the current solution is constantly checking and trying to process missing reports without much manual interference today; that's not the case with Dataform.
When jobs fail in Dataform we'll need to fix and re-run them manually (a correct DAG being the solution). Queuing reports and retrying failures later doesn't work with this tool.
> Oh, so it's the queries that I see running today (with the VM service account)?
Yup.
> Based on the logs it takes more than 2 days (PWA scores aren't available while it's running)
Oh, that's a good point! That one no longer exists so it will never complete. But it's been running every day trying. I've deleted that file so it won't be run anymore. Will delete it from GitHub too.
> and 17% of jobs fail.
As well as odd examples like the above, CrUX ones will fail until that data is ready. And some don't work with lenses (but we don't have a good way of excluding them). As those run every day (without success), that probably inflates the failure rate. But yeah, it would be nice to be able to schedule these better so we only have real failures.
> And looking at the bytes estimate for the first histogram report, we are doing 35x more processing than would be required with an incremental table.
This is because the Lighthouse report will now be JSON, so we don't need to process the whole report to get a very small subset of that data? That will indeed be a very nice improvement!
> I would focus on rewriting the queries to the new `all` schema so that we can deprecate legacy. And if you think it will break the current process and you can help rewrite the bash into JS templates, we can go for a full migration.
Yeah I started on that. Will try to finish it out.
> BTW, if I understood correctly, the current solution is constantly checking and trying to process missing reports without much manual interference today; that's not the case with Dataform.
Yes. But that also has downsides as per above! It was very low tech. We can do better here so we don't need to do that.
> When jobs fail in Dataform we'll need to fix and re-run them manually (a correct DAG being the solution). Queuing reports and retrying failures later doesn't work with this tool.
That shouldn't be an issue if we make this more robust.
> And looking at the bytes estimate for the first histogram report, we are doing 35x more processing than would be required with an incremental table.
>
> This is because the Lighthouse report will now be JSON, so we don't need to process the whole report to get a very small subset of that data? That will indeed be a very nice improvement!
No, I meant we don't need to process all the historical data every month, only the last crawl. It's 2 rows per lens monthly.
We may even deduplicate some processing and atomically create a wide metrics table:
- date
- timestamp
- client
- lens
- a11yButtonName
...
- bootupJs
...
- webSocketStream
The reports in the `/histograms/` folder only run for the month needed: `${YYYY_MM_DD}` is replaced by the actual date in the bash script.
Oh, true. I was talking about time series: https://github.com/HTTPArchive/bigquery/blob/master/sql/timeseries/a11yButtonName.sql
Yeah, but then I add an `AND SUBSTR(_TABLE_SUFFIX, 0, 10) > \"$max_date\"` type clause; that's the incremental mode I talked about. So it won't reprocess all the historical months (unless we use the `force` option).
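That incremental-mode clause could be ported from bash into a small JS helper (a sketch only; the function name is invented, and it assumes the same `_TABLE_SUFFIX` layout and `force` semantics described above):

```javascript
// Sketch of the incremental clause the bash script injects into time series
// queries. Given the last date already present in the report (or null when
// the report is empty) and a force flag, build the extra SQL predicate.
// Mirrors the AND SUBSTR(_TABLE_SUFFIX, 0, 10) > "<max_date>" clause above.
function incrementalClause(maxDate, force = false) {
  // force (or an empty report) means reprocessing from the beginning of time
  if (force || !maxDate) return "";
  return `AND SUBSTR(_TABLE_SUFFIX, 0, 10) > "${maxDate}"`;
}

module.exports = { incrementalClause };
```

This keeps the "only process months since the last good date" behaviour visible in one place instead of buried in string concatenation in bash.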
Oh, I expected something like this. Looks like magic unless you understand the script...
Ok, let me create a branch with some drafts, so that you can get a feeling about the workflow and templating.
Yeah the script started simple, and then grew and grew and grew - well beyond what we should use a bash script for!
But hey, it works!
For the script located in https://github.com/HTTPArchive/bigquery/tree/master/sql:
I can help with the SQL rewrite, but I'm not familiar with the nuances of these bash scripts.