Investigate results of 02_45

rviscomi commented 5 years ago

From metric 02_45 in the CSS analysis results sheet:

Read as: "On desktop pages, when the class attribute is defined the median number of classes is 1"

client	p10	p25	p50	p75	p90
desktop	1	1	1	2	3
mobile	1	1	1	2	3

@argyleink left this comment:

this data doesn't look right. the p90 values should be much higher. i expected a range of 0-12

look at tailwind or any OOCSS lib, and you can't get anything done without at least 5 classes on an element. something smelly about these results imo.

The query is at https://github.com/HTTPArchive/almanac.httparchive.org/blob/master/sql/2019/02_CSS/02_45.sql

Need to investigate whether the results are accurate.

rviscomi commented 5 years ago

I made a histogram of the frequencies of each classList length:

#standardSQL
SELECT
  client,
  classes,
  COUNT(0) AS freq,
  SUM(COUNT(0)) OVER (PARTITION BY client) AS total,
  ROUND(COUNT(0) * 100 / SUM(COUNT(0)) OVER (PARTITION BY client), 2) AS pct
FROM (
  SELECT
    client,
    ARRAY_LENGTH(REGEXP_EXTRACT_ALL(value, '([^\\s]+)(?:\\s+|$)')) AS classes
  FROM
    `httparchive.almanac.summary_response_bodies`,
    UNNEST(REGEXP_EXTRACT_ALL(body, '(?i)class=[\'"]([^\'"]+)')) AS value
  WHERE
    firstHtml)
GROUP BY
  client,
  classes
ORDER BY
  freq / total DESC

classes	desktop	mobile
0	0.11%	0.11%
1	64.27%	63.39%
2	20.18%	20.52%
3	7.44%	7.71%
4	4.31%	4.49%
5	1.71%	1.75%
6	0.74%	0.74%
7	0.38%	0.39%
8	0.22%	0.22%
9	0.19%	0.19%
10	0.10%	0.10%
11	0.06%	0.07%
12	0.05%	0.05%
13	0.04%	0.04%
14	0.03%	0.03%
15	0.02%	0.03%

Methodology note: This is an analysis of the static home page markup, not accounting for classes added dynamically in JS.
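To make the counting logic in the query easier to poke at locally, here is a minimal Python sketch of the same two regex steps. It assumes Python's `re` behaves like BigQuery's RE2 for these particular patterns; the names `class_counts`, `histogram`, and `sample` are hypothetical and not part of the Almanac codebase.

```python
import re
from collections import Counter

# Python translations of the two patterns used in the query above.
# Assumption: for these patterns, Python's `re` matches the same text as RE2.
CLASS_ATTR = re.compile(r"""(?i)class=['"]([^'"]+)""")
CLASS_NAME = re.compile(r"([^\s]+)(?:\s+|$)")


def class_counts(html: str) -> list[int]:
    """Return the number of class names in each class attribute found in html."""
    return [len(CLASS_NAME.findall(value)) for value in CLASS_ATTR.findall(html)]


def histogram(html: str) -> Counter:
    """Count how many attributes have 0, 1, 2, ... class names."""
    return Counter(class_counts(html))


# Hypothetical markup, for illustration only.
sample = (
    '<div class="container"><a CLASS=\'btn btn-primary\'>x</a>'
    '<p class=" "></p><span class=""></span></div>'
)
print(class_counts(sample))  # [1, 2, 0]
```

In this sketch the whitespace-only `class=" "` yields a count of 0, while the strictly empty `class=""` is never captured, because the attribute pattern requires at least one character between the quotes.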

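According to the results, attributes with 1 or 2 class names make up about 84% of all class attribute values, and including those with 3 class names brings the cumulative share to about 92%, so a p90 of 3 makes sense.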
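As a sanity check, the percentiles can be re-derived from the desktop column of the histogram. The sketch below assumes a simple nearest-rank definition (the smallest class count whose cumulative share reaches the target); the original 02_45 query may compute percentiles differently, and the `percentile` helper is hypothetical.

```python
# Desktop shares (in percent) copied from the histogram above, keyed by class count.
desktop_pct = {
    0: 0.11, 1: 64.27, 2: 20.18, 3: 7.44, 4: 4.31, 5: 1.71,
    6: 0.74, 7: 0.38, 8: 0.22, 9: 0.19, 10: 0.10, 11: 0.06,
    12: 0.05, 13: 0.04, 14: 0.03, 15: 0.02,
}


def percentile(hist: dict[int, float], p: float) -> int:
    """Smallest class count whose cumulative share (in percent) reaches p."""
    cumulative = 0.0
    for classes in sorted(hist):
        cumulative += hist[classes]
        if cumulative >= p:
            return classes
    raise ValueError("p exceeds the share covered by the listed rows")


print([percentile(desktop_pct, p) for p in (10, 25, 50, 75, 90)])  # [1, 1, 1, 2, 3]
```

Under that assumption the output matches the desktop row of the 02_45 results (1, 1, 1, 2, 3).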

@argyleink do you think OOCSS libraries are prevalent enough to skew the distribution? In 02_10 we see very few Tailwind pages, a result that is probably full of false negatives from Wappalyzer, but other than Bootstrap and animate.css, websites don't seem to be using CSS libraries that much.

Also, I think this theme of having our assumptions gut-checked by the data is a perfect thing to talk about in the Almanac chapters. @argyleink @una you can talk about how and why the results surprised you and what that says about the state of CSS. cc @HTTPArchive/authors

argyleink commented 5 years ago

interesting.. i brought up OOCSS libs because they're a primo example of AMPLE usage of classes, like enough that the high end could/should be quite a few classes per node. I could be convinced that OOCSS libs like tachyons, tailwind, etc aren't popular enough to influence the data, but even other seemingly very very popular libraries like bootstrap or strategies like BEM have more than 2 classes on them almost always.

so sure, we could use this as a talking point, because it's counter to our assumptions. but something still feels off, like, # of classes on an element shouldn't be the most surprising result from scrubbing the entire web and comparing it to our assumptions. yet it is right now. 🤷‍♂

rviscomi commented 5 years ago

The total percent of all class name lengths greater than 10 is only 0.5%. We're talking about a small percent of a sample of ~1.6B class attributes though, so that's still ~8M instances of having 10+ classes. That perspective might make this more digestible.

Fun fact: the largest number of class names in a single attribute is 21,504! It only occurs once, and I assume that's a parsing bug 😁

Here's the sheet with the full results if you want to explore.

I'll continue looking into this. For example, maybe a different strategy of counting BEM-style classes would be helpful. Also let me know if you think there's another approach that might help. One other thing we could do for the upcoming October crawl is add a custom metric to count classList lengths so we're actually querying the DOM rather than parsing HTML with regexes, for better confidence.

argyleink commented 5 years ago

great ideas. that sheet is fascinating!

una commented 5 years ago

I think this data is super fascinating! I don't think OOCSS libraries are as widely used on the web as they seem from some circles. Also, I wonder if this speaks to the majority of the web not using reusable/global styles as frequently. The data on 0 classes being used surprised me, as I assume many people are still styling based on the base element. How would this account for nested elements like '.list li'?

Una

rviscomi commented 5 years ago

The data on 0 classes being used surprised me, as I assume many people are still styling based on base element. How would this account for nested elements like '.list li'?

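Do you mean in the CSS? This query doesn't take the selectors into account, only the class attributes in the HTML. So the 0 values in this case are attributes that contain no class names. Note that the attribute regex requires at least one character between the quotes, so these are whitespace-only values like class=" " rather than a strictly empty class="".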

foxdavidj commented 5 years ago

This data looks pretty accurate to me.

At first I did a double take, much like you all are describing. It took me a bit to remember that the majority of sites aren't "techy" like Medium, but instead use basic WordPress themes or have a rudimentary static site like Hacker News.

After I started looking at sites like these instead, the data began making sense when I saw tons of classes like: email, button, container, four columns, footer, latest_post_image clearfix, woocommerce single-product
