HTTPArchive / custom-metrics

Custom metrics to use with WebPageTest agents

Parsed CSS custom metric #44

Closed: rviscomi closed this 2 years ago

rviscomi commented 2 years ago

This script tokenizes all of the CSS on a page for later analysis, including external stylesheets, inline <style> blocks, and inline [style] attribute values.
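For readers unfamiliar with custom metrics, here's a minimal sketch of how those three sources could be collected in the browser. The function name and the dummy-rule wrapping of `[style]` attributes are illustrative assumptions, not the actual script:

```js
async function collectCSS() {
  // 1. External stylesheets: fetch the text of each linked sheet.
  //    (Cross-origin fetches may fail; fall back to an empty string.)
  const links = [...document.querySelectorAll('link[rel="stylesheet"][href]')];
  const external = await Promise.all(
    links.map((link) => fetch(link.href).then((r) => r.text()).catch(() => ''))
  );

  // 2. Inline <style> blocks.
  const blocks = [...document.querySelectorAll('style')].map((s) => s.textContent);

  // 3. Inline [style] attribute values, wrapped in a dummy rule so a
  //    stylesheet-level parser can consume them.
  const attrs = [...document.querySelectorAll('[style]')].map(
    (el) => `[x-attr] { ${el.getAttribute('style')} }`
  );

  return { external, blocks, attrs };
}
```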

Everything after line 11 is taken from the Rework CSS script that we run in BigQuery for the Web Almanac.
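For context, a Rework-style parser turns CSS text into a JSON AST. A simplified illustration of the output shape (positions omitted; `parse` here stands for the function that script exposes):

```js
// Parse a tiny stylesheet into a Rework-style AST.
const ast = parse('a { color: red; }');
// {
//   type: 'stylesheet',
//   stylesheet: {
//     rules: [{
//       type: 'rule',
//       selectors: ['a'],
//       declarations: [{ type: 'declaration', property: 'color', value: 'red' }]
//     }]
//   }
// }
```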

rviscomi commented 2 years ago

I'm not totally sure we want to do this in a custom metric, as opposed to post-processing in Dataflow or continuing to process in BigQuery. To summarize my thoughts:

Pros:

Cons:

pmeenan commented 2 years ago

This feels a bit weird. If there are concrete things that we use the tokenized CSS to calculate, then those should probably be boiled down into custom metrics. But storing the tokenized version adds a LOT of data and likely needs to be treated like bodies and split out in the processing (and should JS and HTML also be tokenized?)

rviscomi commented 2 years ago

Yeah, it's weird. The concrete things we want to do are similar to these queries, but my concern is that there are ~130 of them, which adds a lot of complexity to the custom metric.

We have two decisions to make: where to do the work of parsing the CSS, and where to do the analysis. In both cases, WPT, Dataflow, and BigQuery are viable options. Having the parser written in JS (~600 LOC) makes it a lot easier to do the parsing in either WPT or BigQuery (with a UDF). Having the queries written in SQL (~15,000 LOC) makes it a lot easier to keep the analysis in BigQuery. Any other combination would require porting the logic to JS or Python.
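To make the tradeoff concrete, here's a hedged sketch of the kind of per-page analysis those queries perform, expressed as JS over a Rework-style AST. The function is hypothetical and only walks top-level rules for brevity:

```js
// Count how many times each CSS property is declared on a page.
function countProperties(ast) {
  const counts = {};
  for (const rule of ast.stylesheet.rules) {
    // Skip @media, comments, etc.; only plain rules for this sketch.
    if (rule.type !== 'rule') continue;
    for (const decl of rule.declarations || []) {
      if (decl.type !== 'declaration') continue;
      counts[decl.property] = (counts[decl.property] || 0) + 1;
    }
  }
  return counts;
}
```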

rviscomi commented 2 years ago

> storing the tokenized version adds a LOT of data and likely needs to be treated like bodies and split out in the processing (and should JS and HTML also be tokenized?)

Forgot to address this question. I do think we should omit the tokenized CSS from the HAR payloads in the pages dataset. We can do this as part of the Dataflow pipeline, similar to the way we drop large Lighthouse fields from the LHR. Rather than dropping the tokenized CSS outright, we'd write it to a separate BQ table for future analysis, or do more of the processing in the Dataflow step itself (more complex, but ideal).
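A rough sketch of what that split could look like, independent of the pipeline framework. The field names (`_custom_metrics`, `parsed-css`) are assumptions for illustration, not the actual HAR schema:

```js
// Remove the parsed CSS from the page's HAR and emit it as its own record.
function splitParsedCSS(har) {
  const metrics = har._custom_metrics || {};
  const parsedCSS = metrics['parsed-css'];
  delete metrics['parsed-css'];  // keep the pages payload small
  return {
    page: har,  // HAR record minus the heavy field
    cssRow: parsedCSS ? { page: har.url, css: parsedCSS } : null,  // row for the separate BQ table
  };
}
```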

rviscomi commented 2 years ago

I've added the following limits to guard against blowing up the HAR size / worker memory utilization:

The last one was kind of arbitrary, open to better suggestions.
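As a hypothetical illustration of the guard pattern only (the thresholds below are made up, not the actual values):

```js
// Made-up caps: skip any oversized stylesheet, and stop tokenizing once a
// total output budget is exhausted, so one page can't blow up the HAR.
const MAX_SHEET_BYTES = 100 * 1024;
const MAX_TOTAL_BYTES = 1024 * 1024;

let totalBytes = 0;
function tokenizeIfWithinLimits(cssText, results) {
  if (cssText.length > MAX_SHEET_BYTES) return;               // too big on its own
  if (totalBytes + cssText.length > MAX_TOTAL_BYTES) return;  // budget exhausted
  totalBytes += cssText.length;
  results.push(parse(cssText));  // `parse` as in the Rework sketch above
}
```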

Tests:

tunetheweb commented 2 years ago

Does this mean you're no longer going with the Python data-pipeline route?

rviscomi commented 2 years ago

That's right. See the thread in the 10x infra channel starting with "Because I love pain..."

pmeenan commented 2 years ago

Results and protections SGTM. Are we doing anything in the processing so they aren't included in the HAR tables?

rviscomi commented 2 years ago

Not yet. We'll also need to update that pipeline code in a new PR, which I'll start drafting.