HTTPArchive / legacy.httparchive.org

<<THIS REPOSITORY IS DEPRECATED>> The HTTP Archive provides information about website performance such as # of HTTP requests, use of gzip, and amount of JavaScript. This information is recorded over time revealing trends in how the Internet is performing. Built using Open Source software, the code and data are available to everyone allowing researchers large and small to work from a common base.
https://legacy.httparchive.org

Detect publishing platforms #90

Closed rviscomi closed 6 years ago

rviscomi commented 7 years ago

Similar to https://github.com/HTTPArchive/httparchive/issues/77, detect the presence of publishing platforms like WordPress and Drupal. A secondary goal would be to detect themes and plugins.

Unlike #77, the key metric here would just be a single string/enum value representing the detected platform - as opposed to a list.

For accurate detection, need to come up with a list of signals for each platform. This may be hard to achieve through custom metrics alone. For example, if it requires introspection of a script file's comments, that wouldn't be possible with client-side JS alone. We may need to do post-processing on response bodies.
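To make the limitation concrete: a custom metric runs as client-side JS, so the best it can do is pattern-match on the serialized DOM. A minimal sketch (hypothetical helper, not the actual custom metric):

```javascript
// Hypothetical sketch of a markup-only platform check -- the kind of signal a
// client-side custom metric could compute. It cannot see response headers or
// comments inside external script files; those need post-processing.
function detectPlatformFromMarkup(html) {
  if (/<meta[^>]*content=["'][^"']*WordPress/i.test(html)) return 'WordPress';
  if (/<meta[^>]*content=["'][^"']*Drupal/i.test(html)) return 'Drupal';
  return null;
}

console.log(detectPlatformFromMarkup(
  '<meta name="generator" content="WordPress 4.7">')); // WordPress
```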

igrigorik commented 7 years ago

Ideally, we want a breakdown similar to https://trends.builtwith.com/cms/. However, as a starting point, I think we can focus our conversation on WordPress and figure out the requirements and pipeline for that. With that in mind, a few thoughts...

There are two (complementary) ways we can attempt to detect these platforms:

  1. At crawl runtime by looking for some platform-specific JS objects or signatures
  2. Post-crawl by analyzing response headers and bodies

My hunch is that we'll get the most mileage from focusing on (2). In the context of WordPress:

There may also be runtime-specific signals we can extract, but I propose we focus on (2) as a starting point and see how far that gets us. Another benefit of (2) is that we can update the logic and rerun the analysis on past crawls, giving us access to trending data and all the rest. Last but not least, we shouldn't restrict ourselves to a single label. In some cases we may be able to extract the version number and other metadata, so I think we should think of the output as a bag of values: {platform: x, version: y, theme: z, plugins: [...]}.
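The bag-of-values shape above can be sketched as a small extraction function. The regexes here are illustrative assumptions (a WordPress generator meta tag and a wp-content theme path), not a vetted signal list:

```javascript
// Hypothetical sketch of the {platform, version, theme, plugins} output shape,
// extracted from a response body with assumed (untuned) regex signals.
function classify(body) {
  const result = { platform: null, version: null, theme: null, plugins: [] };
  const m = body.match(/<meta[^>]*content=["']WordPress ?([\d.]*)/i);
  if (m) {
    result.platform = 'WordPress';
    result.version = m[1] || null;
  }
  const theme = body.match(/wp-content\/themes\/([\w-]+)/i);
  if (theme) result.theme = theme[1];
  return result;
}
```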

Concretely, we can extend the current DataFlow pipeline with an extra step and start encoding these rules there. For prototyping, we can also run queries directly in BigQuery.

rviscomi commented 7 years ago

Working on a proof of concept: https://github.com/rviscomi/httparchive/blob/pub-cm/custom_metrics/publishing-platform.js. Example: https://www.webpagetest.org/custom_metrics.php?test=170412_6R_12GY&run=1&cached=0

There are two (complementary) ways we can attempt to detect these platforms

@igrigorik do you see (1) as the low-hanging first pass for well-formed pages, with (2) taking a closer look at everything else not already detected?

I think we can get real data more quickly with (1), albeit with more false negatives. In any case, I'll also look into extending the DataFlow pipeline as you mentioned.

rviscomi commented 7 years ago

Generated a httparchive:scratchspace.response_headers table with the response headers of 100k pages:

SELECT
  page,
  JSON_EXTRACT(payload, '$.response.headers') AS response_headers
FROM
  [httparchive:har.2017_03_15_chrome_requests]
LIMIT
  100000

Then ran this query on it:

SELECT
  page,
  response_headers
FROM (
  SELECT
    page,
    response_headers,
    REGEXP_MATCH(response_headers, 'X-Hacker') AS wordpress
  FROM
    [httparchive:scratchspace.response_headers]
)
WHERE
  wordpress = true

No results.

Changed the regexp pattern to 'wp\.me[^}]+rel=shortlink' and got 20 results: https://bigquery.cloud.google.com/savedquery/226352634162:91fdb59df0cd4af2ade704c45cf70d66

This doesn't seem like a strong signal. WDYT?
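For reference, the same shortlink pattern expressed in plain JS, so it can be unit-tested outside BigQuery. The header string here is an assumed stand-in for the JSON-serialized headers stored in the scratchspace table:

```javascript
// The rel=shortlink signal from the query above, as a plain JS regex.
const wpShortlink = /wp\.me[^}]+rel=shortlink/;

// Assumed sample: a JSON-serialized Link header, as stored in the table.
const headers = '{"Link": "<http://wp.me/6uzoc>; rel=shortlink"}';
console.log(wpShortlink.test(headers)); // true
```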

igrigorik commented 7 years ago

Hmm. We could sanity check against: https://vip.wordpress.com/clients/

curl -L -vv http://motori.virgilio.it/
< X-hacker: If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.
< Link: <http://wp.me/6uzoc>; rel=shortlink

On the other hand, lots of sites on that client list don't deliver the above header either:

curl -L -vv http://www.nationalpost.com/
li data-src-fullsize="http://nationalpostcom.files.wordpress.com/2013/01/

^ perhaps we should also look for files.wordpress.com, although I'm not sure if that's VIP only or true for any wordpress.com hosted site.
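The files.wordpress.com idea could be checked as an extra signal over a page's request URLs; a hypothetical helper:

```javascript
// Sketch of the additional signal proposed above: scan request URLs for
// files.wordpress.com, which WordPress.com-hosted sites use for media.
// (Hypothetical helper; whether this is VIP-only is the open question above.)
function hasWordPressComMedia(urls) {
  return urls.some(u => /\bfiles\.wordpress\.com\//i.test(u));
}
```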

rviscomi commented 7 years ago

Oh, I didn't do a case-insensitive search. I'll try that to include X-hacker.

rviscomi commented 7 years ago

OK, I updated the query to be case-insensitive and to match both the X-Hacker and rel=shortlink patterns.

I stuffed those 27 results into httparchive:scratchspace.wordpress_headers and used that to generate another table, httparchive:scratchspace.wordpress_response_bodies:

SELECT
  page,
  url,
  body
FROM
  [httparchive:har.2017_03_15_chrome_requests_bodies]
WHERE
  page IN (SELECT page FROM [httparchive:scratchspace.wordpress_headers])

It joins the pages with WP headers with corresponding response bodies. Finally, I queried this table with the same signals in the custom metric POC:

SELECT
  COUNT(0),
  page
FROM
  [httparchive:scratchspace.wordpress_response_bodies]
WHERE
  REGEXP_MATCH(body, r'(?i)(<meta[^>]*WordPress|<link[^>]*wlwmanifest|src=[\'"]?[^\'"]*wp-includes)')
GROUP BY
  page

See https://bigquery.cloud.google.com/savedquery/226352634162:6f88d370ccec4fe59d45dc28040f9982

Of the 100,000 pages sampled, 27 were detected with WP headers. 25 of those also had corresponding WP signals in the response body. The discrepancy seems to be due to a conflicting, non-WordPress use of the X-hacker header.

That said, it seems like markup analysis is no worse a signal than header analysis. So I ran a related query to see how much better markup analysis is:

SELECT
  COUNT(0),
  page
FROM
  [httparchive:har.2017_03_15_chrome_requests_bodies]
WHERE
  REGEXP_MATCH(body, r'(?i)(<meta[^>]*WordPress|<link[^>]*wlwmanifest|src=[\'"]?[^\'"]*wp-includes)') AND
  page IN (SELECT page FROM [httparchive:scratchspace.response_headers])
GROUP BY
  page

See https://bigquery.cloud.google.com/savedquery/226352634162:e34f729043dc4ca18719ca716d3a4642

This looks only at the 100,000 pages sampled by the header analysis and runs the body analysis on them. There are 9,677 results, or about 10%. That's still only about half the WordPress share reported elsewhere (e.g. BuiltWith's CMS trends), so it seems like there are other strong signals we're missing.

To recap: header analysis detected 27 of the 100,000 sampled pages, while markup analysis detected 9,677 of them (~10%).

igrigorik commented 7 years ago

As a meta thing, it'd be nice to start building a list of test cases and explanations for each pattern.

Otherwise, based on past experience, you quickly end up with unwieldy regexes that break easily and are impossible to maintain long-term.

rviscomi commented 7 years ago

It'd be nice to start building a list of test cases and explanations for each pattern

Definitely. I'd first like to figure out which signals are weak/strong/redundant and narrow it down to a minimal list of strong signals. The /docs would be a good place to explain what each signal in that list is measuring and its efficacy.

"Weak" can mean that the signal has a high number of false positives or a low number of true positives. E.g. X-Hacker seems to be a weak signal for the latter reason. There may still be some value in these types of weak signals, for example if many of them combined produce a significant number of detections.
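One way to combine several weak signals is a simple weighted score. The signal names and weights below are hypothetical placeholders; tuning them is exactly the open question here:

```javascript
// Hypothetical sketch of combining weak and strong signals into one detection.
// signals: { xHackerHeader, shortlinkHeader, generatorMeta, wpIncludes }
function detectWordPress(signals) {
  // Assumed weights: header signals are weak (1), markup signals strong (2).
  const weights = { xHackerHeader: 1, shortlinkHeader: 1, generatorMeta: 2, wpIncludes: 2 };
  let score = 0;
  for (const [name, present] of Object.entries(signals)) {
    if (present) score += weights[name] || 0;
  }
  return score >= 2; // threshold: one strong signal or two weak ones
}
```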

rviscomi commented 7 years ago

Good news! Someone has already thought about this 😄

See AliasIO/Wappalyzer

We could do something similar to the library detector and generate a custom metric script based on the Wappalyzer apps.json file. Being JSON, it'd be easy to filter it down to only the platforms and properties we're interested in.
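A sketch of that filtering step over a Wappalyzer-style apps.json. The object shape here is a simplified assumption of the real file's format (apps keyed by name, each with a "cats" array of category IDs):

```javascript
// Hypothetical sketch: keep only apps in one Wappalyzer category.
function filterByCategory(apps, categoryId) {
  const out = {};
  for (const [name, app] of Object.entries(apps)) {
    if ((app.cats || []).includes(categoryId)) out[name] = app;
  }
  return out;
}

// Simplified, assumed sample data (not the real apps.json contents).
const apps = {
  WordPress: { cats: [1], meta: { generator: 'WordPress' } },
  jQuery:    { cats: [12], script: 'jquery.*\\.js' }
};
console.log(Object.keys(filterByCategory(apps, 1))); // [ 'WordPress' ]
```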