Open LeaVerou opened 4 years ago
How do we calculate this metric? I suppose the first step is to calculate which properties are used together, regardless of whether they are related or not. I wrote some initial code for this, which operates on the AST of a single stylesheet:
Pen: https://codepen.io/leaverou/pen/dyGrLvy
Page in this repo that accepts URL parameters for the CSS URL and the usage threshold (i.e. discard pairs used together fewer than N times): https://leaverou.github.io/css-almanac/rework-code/frequently-used-together.html?url=https://lea.verou.me/wp-content/themes/lv2020/style.css&limit=4
Dropping this here. @karger pointed out that there are far more sophisticated algorithms in data science to draw meaningful conclusions from the data than simply counting rules in which the properties appear together.
Links:
So what should we do here? @rviscomi @dooman87 thoughts?
I love the meatiness of this problem statement; it has a lot of potential for insight. I'll need more time to review the specific algorithms, but one bit of guidance that may help narrow the options would be to select a JS library that we can run as UDFs in BigQuery.
I'm going to mark this as "has analysis plan" even though we haven't 100% decided on the algorithm, so I don't keep coming back to it.
I finally pushed improved code for this. I'd say this was by far the hardest (non-custom) metric to compute. First, I tried the Apriori algorithm. I tried two different libraries but did not get great (or even meaningful) results. Once the property sets in the training data were more than a dozen or so, it would produce no association rules, even with very low min_support and confidence.
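To make the min_support and confidence thresholds mentioned above concrete, here is a minimal sketch with made-up data (the property sets and numbers are hypothetical, not from the actual training data):

```javascript
// Illustration of Apriori's two thresholds (made-up data):
// support(X) = fraction of rules containing every property in X
// confidence(A -> B) = support(A and B together) / support(A)
const rules = [
  ["display", "width"],
  ["display", "width", "height"],
  ["display", "color"],
  ["width", "height"],
];

const support = combo =>
  rules.filter(r => combo.every(p => r.includes(p))).length / rules.length;

console.log(support(["width", "height"]));                      // 0.5
console.log(support(["width", "height"]) / support(["width"])); // confidence(width -> height)
```

A rule is only emitted when both numbers clear the configured thresholds, which is presumably why large, diverse property sets produced nothing: every individual combination is rare relative to the whole corpus.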
So I implemented an ad hoc custom algorithm somewhat inspired by Apriori. I will try to document it here before I forget how it works myself 😛
- For properties that appear in at least MIN_TOTAL_FOR_DESCENT rules, we find the properties they appear together with, and store the rules that contain both.
- Combinations that appear in <= MAX_TOTAL_FOR_PRUNE rules, or in <= MAX_PERCENT_FOR_PRUNE * 100% of the rules the individual properties appear in, are pruned.
- We only keep descending into subsets that appear in at least MIN_TOTAL_FOR_DESCENT rules, and only when the new combination (with the addition of one property) appears in at least MIN_PERCENT_FOR_DESCENT * 100% of the rules the subset appears in.

After this process, we have an object literal like this (from Smashing's CSS).
We then traverse this object and store the combinations in an object literal, with values being the % of total rules the combination appears in. Currently this structure is not pruned since I figured aggregation will do so anyway, but it can easily be pruned for efficiency. The end result is a literal like this.
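The descend/prune control flow can be sketched roughly like this. This is a simplified, self-contained model, not the actual almanac code: the rule sets and thresholds are made up, and the prune step here only compares against the newly added property rather than all individual properties in the combination.

```javascript
// Simplified sketch of the descend/prune idea; NOT the actual almanac code.
// Each rule is modeled as a Set of the property names it declares.
const rules = [
  new Set(["width", "height", "color"]),
  new Set(["width", "height"]),
  new Set(["width", "height", "margin"]),
  new Set(["color"]),
];

const MIN_TOTAL_FOR_DESCENT = 2;    // min rules a combination needs before we try to extend it
const MIN_PERCENT_FOR_DESCENT = .5; // extension must retain this fraction of the parent's rules
const MAX_TOTAL_FOR_PRUNE = 1;      // combinations in this many rules or fewer are pruned
const MAX_PERCENT_FOR_PRUNE = .1;   // ...or in this fraction (or less) of the new property's rules

// Number of rules that contain every property in `combo`
const support = combo => rules.filter(r => combo.every(p => r.has(p))).length;

const allProps = [...new Set(rules.flatMap(r => [...r]))].sort();
const frequent = {}; // "prop, prop, ..." -> number of rules it appears in

function descend(combo, comboSupport) {
  for (const p of allProps) {
    if (p <= combo[combo.length - 1]) continue; // keep combos sorted; avoids duplicates
    const extended = [...combo, p];
    const s = support(extended);
    // prune rare combinations
    if (s <= MAX_TOTAL_FOR_PRUNE || s <= MAX_PERCENT_FOR_PRUNE * support([p])) continue;
    frequent[extended.join(", ")] = s;
    // only descend further when the extension is still common enough
    if (s >= MIN_TOTAL_FOR_DESCENT && s >= MIN_PERCENT_FOR_DESCENT * comboSupport) {
      descend(extended, s);
    }
  }
}

for (const p of allProps) {
  const s = support([p]);
  if (s >= MIN_TOTAL_FOR_DESCENT) descend([p], s);
}

console.log(frequent); // { "height, width": 3 }
```

The descent conditions are what keep this from enumerating every subset: a branch is abandoned as soon as extending it stops being common relative to its parent.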
@LeaVerou I'm unable to get the query for this working; the UDF appears to be running out of memory. I've tried limiting the size of the parsed CSS to 100 KB, which usually works, but this algorithm seems to use a lot of memory.
It looks like there are a few constants to help tune the memory consumption. Could you recommend a config that might help?
Also, while writing this query I changed the values from percents to frequency counts so the absolute counts of pair combos can be aggregated across the dataset (usage[property] = u.total;). Does that sound ok?
Yeah, this is probably the most computationally intensive of all of them, by far.
You could save on some memory consumption by decreasing the FOR_PRUNE constants (so the tree is pruned more) and increasing the FOR_DESCENT constants (so fewer branches are descended).
Try tweaking the percents first instead of the absolute numbers.
Avoid tweaking MIN_TOTAL_FOR_DESCENT unless tweaking the other three constants doesn't work.
Does this help?
Does that sound ok?
It's this line that you need to change to have counts instead:
frequent[path.join(", ")] = 100 * u.total / totalRules;
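For illustration, the percent-to-count change would presumably look like this (the surrounding values are made up; only the variable names come from the thread):

```javascript
// Hypothetical illustration of the percent -> count change; made-up data.
const totalRules = 200;
const u = { total: 14 };              // rules this combination appears in
const path = ["background", "color"]; // example property combination
const frequent = {};

// Original behavior: percent of total rules.
frequent[path.join(", ")] = 100 * u.total / totalRules;
console.log(frequent["background, color"]); // 7

// Counts version: raw totals can simply be summed when aggregating across sites.
frequent[path.join(", ")] = u.total;
console.log(frequent["background, color"]); // 14
```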
Still no luck :(
Here's the most extreme configuration I tried:
// Min number of rules a combination must appear in before we descend and
// try to extend the subset by one
const MIN_TOTAL_FOR_DESCENT = 8;
// Min fraction of the parent combination's rules a child property must
// appear in to try to extend the subset by one
const MIN_PERCENT_FOR_DESCENT = .8;
// Properties or combinations used at most this many times will be pruned
const MAX_TOTAL_FOR_PRUNE = 2;
// Child properties used in at most this fraction of the parent combination's rules will be pruned
const MAX_PERCENT_FOR_PRUNE = .01;
I've also restricted the input AST to 10 KB.
Do you think it's feasible to optimize this algorithm for better memory usage? Alternatively, would the results be too far skewed if we further constrain the input?
It's this line that you need to change to have counts instead:
Whoops, that's the line I changed but I copied the wrong one.
I just pushed 26-property-pairs.js, which is based on the earlier, naïve algorithm. That only produces pairs, but I think it should be much lighter in memory consumption. Can you try it?
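The pairs-only approach can be sketched like this (a simplified model with made-up data, not the actual 26-property-pairs.js). Counting co-occurring pairs bounds the state at one counter per distinct property pair, instead of an exponentially growing combination tree:

```javascript
// Simplified sketch of pairwise co-occurrence counting; made-up data.
const rules = [
  ["width", "height", "color"],
  ["width", "height"],
  ["color", "background"],
];

const pairs = {}; // "propA, propB" -> number of rules containing both

for (const props of rules) {
  // dedupe and sort so each pair gets one canonical key
  const sorted = [...new Set(props)].sort();
  for (let i = 0; i < sorted.length; i++) {
    for (let j = i + 1; j < sorted.length; j++) {
      const key = sorted[i] + ", " + sorted[j];
      pairs[key] = (pairs[key] || 0) + 1;
    }
  }
}

console.log(pairs);
// { "color, height": 1, "color, width": 1, "height, width": 2, "background, color": 1 }
```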
Yes, this version works! Results here
Now that you've identified the pairs, can you manually prune to the top/most interesting pairs and use that to seek noticeable triples?
IIUC I don't think we can draw conclusions about triples transitively. For example, property pairs A:B and B:C may be popular, but we don't know that A:C or A:B:C are popular.
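A tiny counterexample with made-up properties A, B, C shows why the transitive inference fails:

```javascript
// Hypothetical counterexample: both pairs are popular, but A and C never co-occur.
const rules = [["A", "B"], ["A", "B"], ["B", "C"], ["B", "C"]];
const count = combo => rules.filter(r => combo.every(p => r.includes(p))).length;

console.log(count(["A", "B"]));      // 2 (popular pair)
console.log(count(["B", "C"]));      // 2 (popular pair)
console.log(count(["A", "C"]));      // 0
console.log(count(["A", "B", "C"])); // 0
```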
It would be especially cool to see which non-related properties are often used together.
(idea by @karger)