LeaVerou / css-almanac

Repo for planning & voting on which stats to study
https://projects.verou.me/mavoice/?repo=leaverou/css-almanac&labels=proposed%20stat

Which properties are most commonly used together? #26

Open LeaVerou opened 4 years ago

LeaVerou commented 4 years ago

It would be especially cool to see which non-related properties are often used together.

(idea by @karger)

LeaVerou commented 3 years ago

How do we calculate this metric? I suppose the first step is to calculate which properties are used together, regardless of whether they are related. I wrote some initial code for that, which operates on the AST of a single stylesheet:

- Pen: https://codepen.io/leaverou/pen/dyGrLvy
- Page in this repo that accepts URL parameters for the CSS URL and the usage threshold (i.e. discard pairs used together fewer than N times): https://leaverou.github.io/css-almanac/rework-code/frequently-used-together.html?url=https://lea.verou.me/wp-content/themes/lv2020/style.css&limit=4
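
The naïve counting step could be sketched like this (a minimal illustration, not the actual Pen code; the rule representation and names are assumptions):

```javascript
// Count how often each unordered pair of properties appears in the same rule.
// Input: an array of rules, each a list of declared property names.
function countPropertyPairs(rules, limit = 0) {
  const counts = {};
  for (const props of rules) {
    // De-duplicate and sort so each pair is counted once per rule,
    // under a canonical "a, b" key.
    const unique = [...new Set(props)].sort();
    for (let i = 0; i < unique.length; i++) {
      for (let j = i + 1; j < unique.length; j++) {
        const key = unique[i] + ", " + unique[j];
        counts[key] = (counts[key] || 0) + 1;
      }
    }
  }
  // Discard pairs used together fewer than `limit` times.
  return Object.fromEntries(
    Object.entries(counts).filter(([, n]) => n >= limit)
  );
}

const rules = [
  ["display", "align-items", "justify-content"],
  ["display", "align-items"],
  ["color", "background"],
];
console.log(countPropertyPairs(rules, 2));
// "align-items, display" survives the limit, with a count of 2
```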

LeaVerou commented 3 years ago

Dropping this here. @karger pointed out that there are far more sophisticated algorithms in data science to draw meaningful conclusions from the data than simply counting rules in which the properties appear together.

Links:

LeaVerou commented 3 years ago

So what should we do here? @rviscomi @dooman87 thoughts?

rviscomi commented 3 years ago

I love the meatiness of this problem statement; it has a lot of potential for insights. I'll need more time to review the specific algorithms, but one bit of guidance that may help narrow the options: we should select a JS library that we can run as a UDF in BigQuery.

LeaVerou commented 3 years ago

I'm going to mark this as "has analysis plan" even though we haven't 100% decided on the algorithm, so I don't keep coming back to it.

LeaVerou commented 3 years ago

I finally pushed improved code for this. I'd say this was by far the hardest (non-custom) metric to compute. First, I tried the Apriori algorithm. I tried two different libraries but did not get great (or even meaningful) results. Once the property sets in the training data were more than a dozen or so, it would produce no association rules, even with very low min_support and confidence.
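
For reference, the support and confidence thresholds that Apriori libraries expose boil down to simple frequency ratios (a generic sketch of the definitions, not either library's API):

```javascript
// support(X): fraction of transactions (here: rules) containing every item in X.
function support(transactions, itemset) {
  const hits = transactions.filter(t => itemset.every(i => t.includes(i)));
  return hits.length / transactions.length;
}

// confidence(A => B): support(A ∪ B) / support(A).
function confidence(transactions, antecedent, consequent) {
  return support(transactions, [...antecedent, ...consequent]) /
         support(transactions, antecedent);
}

const rules = [
  ["width", "height"],
  ["width", "height", "border"],
  ["color"],
];
console.log(support(rules, ["width", "height"]));      // ≈ 0.667 (2 of 3 rules)
console.log(confidence(rules, ["width"], ["height"])); // 1
```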

So I implemented a custom ad hoc algorithm, somewhat inspired by Apriori. I will try to document it here before I forget how it works myself 😛

After this process, we have an object literal like this (from Smashing's CSS).

We then traverse this object and store the combinations in an object literal, with values being the % of total rules the combination appears in. Currently this structure is not pruned since I figured aggregation will do so anyway, but it can easily be pruned for efficiency. The end result is a literal like this.
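
The final conversion is straightforward (a sketch, assuming `combinations` maps each comma-joined property combination to the number of rules it appears in):

```javascript
// Convert raw co-occurrence counts to % of total rules.
// `combinations` and `totalRules` are assumed to come from the tree traversal.
function toPercents(combinations, totalRules) {
  const frequent = {};
  for (const [path, total] of Object.entries(combinations)) {
    frequent[path] = 100 * total / totalRules;
  }
  return frequent;
}

console.log(toPercents({ "width, height": 30 }, 120)); // { "width, height": 25 }
```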

rviscomi commented 3 years ago

@LeaVerou I'm unable to get the query for this working; the UDF appears to be running out of memory. I've tried limiting the size of the parsed CSS to 100 KB, which usually works, but this algorithm seems to use a lot of memory.

It looks like there are a few constants to help tune the memory consumption. Could you recommend a config that might help?

Also, while writing this query I changed the values from percents to frequency counts so the absolute counts of pair combos can be aggregated across the dataset: usage[property] = u.total;. Does that sound ok?

LeaVerou commented 3 years ago

Yeah, this is probably the most computationally intensive of all of them, by far. You could save on some memory consumption by decreasing the FOR_PRUNE constants (so the tree is pruned more) and increasing the FOR_DESCENT constants (so fewer branches are descended). Try tweaking the percents first instead of the absolute numbers. Avoid tweaking MIN_TOTAL_FOR_DESCENT unless tweaking the other three constants doesn't work. Does this help?

> Does that sound ok?

It's this line that you need to change to have counts instead:

    frequent[path.join(", ")] = 100 * u.total / totalRules;
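
Presumably the count version would then drop the normalization, something like this (a sketch; `frequent`, `path`, and `u` stand in for the variables in the surrounding loop):

```javascript
// Hypothetical stand-ins for the loop variables in the real code:
const frequent = {};
const path = ["width", "height"];
const u = { total: 7 };

// Percent version (original):
//   frequent[path.join(", ")] = 100 * u.total / totalRules;

// Count version, so absolute pair counts can be summed across sites:
frequent[path.join(", ")] = u.total;
console.log(frequent); // { "width, height": 7 }
```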
rviscomi commented 3 years ago

Still no luck :(

Here's the most extreme configuration I tried:

    // Total number of rules to descend and try to extend the subset by one
    const MIN_TOTAL_FOR_DESCENT = 8;

    // Min % of rules child property appears in compared to parent combination to
    // try to extend the subset by one
    const MIN_PERCENT_FOR_DESCENT = .8;

    // Properties or combinations used this many times will be pruned
    const MAX_TOTAL_FOR_PRUNE = 2;

    // Child properties used in that many instances of the parent combination will be pruned
    const MAX_PERCENT_FOR_PRUNE = .01;

I've also restricted the input AST to 10 KB.

Do you think it's feasible to optimize this algorithm for better memory usage? Alternatively, would the results be too far skewed if we further constrain the input?

> It's this line that you need to change to have counts instead:

Whoops, that's the line I changed but I copied the wrong one.

LeaVerou commented 3 years ago

I just pushed 26-property-pairs.js, which is based on the earlier, naïve algorithm. It only produces pairs, but it should be much lighter in memory consumption. Can you try it?

rviscomi commented 3 years ago

Yes, this version works! Results here

karger commented 3 years ago

Now that you've identified the pairs, can you manually prune to the top/most interesting pairs and use that to seek noticeable triples?

rviscomi commented 3 years ago

IIUC I don't think we can draw conclusions about triples transitively. For example, property pairs A:B and B:C may be popular, but we don't know that A:C or A:B:C are popular.
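
A tiny synthetic counterexample makes the point (illustrative data, not from the dataset):

```javascript
// Rules where A:B and B:C are each common, but A:C and A:B:C are not.
const rules = [
  ["A", "B"], ["A", "B"], ["A", "B"],
  ["B", "C"], ["B", "C"], ["B", "C"],
  ["A", "B", "C"],
];

// Number of rules containing every property in `subset`.
const count = subset =>
  rules.filter(r => subset.every(p => r.includes(p))).length;

console.log(count(["A", "B"]));      // 4
console.log(count(["B", "C"]));      // 4
console.log(count(["A", "C"]));      // 1
console.log(count(["A", "B", "C"])); // 1
```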