GoogleChrome / lighthouse-stack-packs

Lighthouse Stack Packs

Reliable WordPress detection #8

Open paulirish opened 5 years ago

paulirish commented 5 years ago

One requirement for adding a stack pack is reliably detecting that the stack/library/platform is being used by the page. We want this detection to be as reliable and bulletproof as possible.

Wappalyzer uses a few approaches which seem overkill and not something we can reuse. We'd like something much more lightweight.

Primary question: Can we detect WordPress via client-side JS running in the page? (Naturally, it has full access to window and the DOM.)

Secondary question: Is there another reliable detection based on network request metadata? We'd like to avoid parsing the response of any network resources (so no looking for patterns in HTML, JS, or CSS files), but considering response headers or paths in URLs (like wp-content, etc.) is fine.

Could some WordPress experts chime in?

Shelob9 commented 5 years ago

I saw this tweet - https://twitter.com/hdjirdeh/status/1092875246309265408 happy to help.

Since WordPress 4.4, there should be a link element in the header with a rel attribute equal to https://api.w.org/ and a href attribute equal to the site's URL. That's pretty good for recent versions.

https://github.com/WordPress/WordPress/blob/fe73f310d4502c978650da998fe985c9c6f9dba0/wp-includes/rest-api.php#L747
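In client-side JS, a rough check for that element could look something like this (untested sketch; the rel value is the one the core code above outputs):

// looks for the REST API discovery link that WordPress 4.4+ adds to <head>
!!document.querySelector('link[rel="https://api.w.org/"]')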

There are some endpoints that have to exist: wp-admin.php, wp-cron.php, etc. I wonder if checking for those is enough? If site.com/wp-admin.php returns a status code for unauthorized access, it's probably a WordPress site -- or a security feature blocked it.

machour commented 5 years ago

Not a WordPress expert by any means, but here's my two cents from experience:

You will need to check the HTML code for wp-(content|include) occurrences, or test for the Link header.

These two points are what Wappalyzer does; not sure why that's considered overkill.

But please, pretty please, don't poke around URLs like /wp-admin.php, /wp-cron.php, etc. -- that would be really naughty.

machour commented 5 years ago

PS: the Link header may not be available if the REST API is not enabled.

Definitely go with wp-content/wp-includes. I haven't come across any WordPress site that didn't have at least one occurrence of those in more than a decade of poking around.

paulirish commented 5 years ago

there should be a link element in the header with a rel attribute equal to api.w.org and a href attribute equal to the site's URL.

That's a great option. Thanks.

There are some endpoints that have to exist

Making new network requests is out of scope for us, but good thinkin'.

You will need to check the HTML code for wp-(content|include) occurrences

We are considering all network requests, so if the page makes requests to wp-content|include URLs, we could use that. Less attractive than via JS, though. :)

or test for the Link header.

[screenshot of response headers showing the Link header]

On my WP sites, in addition to the <link> tag in the HTML, I also see a Link response header with similar values. Does anyone know if that Link response header is reliable or sometimes stripped out? (Cloudflare keeps it, at least)

paulirish commented 5 years ago

PS: the Link header may not be available if the REST api is not enabled.

@machour https://developer.wordpress.org/rest-api/using-the-rest-api/frequently-asked-questions/#can-i-disable-the-rest-api recommends against disabling and doesn't even document how to anyway.. But have you seen sites that do disable it for security reasons or something?

machour commented 5 years ago

@paulirish here are some insights

The Link header is generated by this function:

function rest_output_link_header() {
    if ( headers_sent() ) {
        return;
    }
    $api_root = get_rest_url(); // <-- return value is filterable (see below)
    if ( empty( $api_root ) ) {
        return;
    }
    header( 'Link: <' . esc_url_raw( $api_root ) . '>; rel="https://api.w.org/"', false );
}
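With the default REST API root, that emits a header along these lines (example.com is just a placeholder):

Link: <https://example.com/wp-json/>; rel="https://api.w.org/"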

get_rest_url() allows developers to apply their own filters to its return value: https://github.com/WordPress/WordPress/blob/fe73f310d4502c978650da998fe985c9c6f9dba0/wp-includes/rest-api.php#L403

So any developer could write a filter that simply returns false and the header won't get emitted. There are also a few extensions out there for that: https://wordpress.org/plugins/disable-json-api/#description

Shelob9 commented 5 years ago

@paulirish ->

But have you seen sites that do disable it for security reasons or something?

Yes, some plugins do this for "security" reasons. That's a) a bad idea with little upside, b) probably going to be combined with slight security-through-obscurity measures like changing the location of wp-login.php or the wp-content dir, and c) not super common.

paulirish commented 5 years ago

@machour and @Shelob9 very useful, thank you.

For anyone else visiting this thread.. I'm still interested in ideas regarding detection with JS.

For example:

// passes if any favicons, stylesheets are provided by the theme
!!document.querySelector('link[href*="wp-content"]')

// same but including scripts, too..
!!document.querySelectorAll('link[href*="wp-content"], script[src*="wp-content"]').length

It's probably possible for a WP site to not trigger the second detect, though IMO it'd be very uncommon.

westonruter commented 5 years ago

Note that both of those may fail for WordPress sites that are using the AMP plugin, since AMP disallows external stylesheets and custom scripts. If they haven't set a favicon, then there won't be any such links for icons.

On the other hand, the majority of WordPress sites keep the generator meta tag intact:

<meta name="generator" content="WordPress 5.0.3">

So the very first thing to check for is whether it exists in the page:

!!document.querySelector('meta[name=generator][content^="WordPress"]')

machour commented 5 years ago

@paulirish images uploaded through WordPress are usually available under "/wp-content/uploads/", so you may want to extend the selector to check img[src*="wp-content"] as well.
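Something like this, roughly, extending the earlier selectors (untested sketch):

// any link, script, or img pointing at wp-content suggests WordPress
!!document.querySelector('link[href*="wp-content"], script[src*="wp-content"], img[src*="wp-content"]')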

Shelob9 commented 5 years ago

wp-content is the default; it can be changed by setting a constant in wp-config. wp-includes (effectively) can't be changed.

housseindjirdeh commented 5 years ago

So it looks like we can do something like this:

if (document.querySelector('meta[name="generator"][content^="WordPress"]')) {
  // get WordPress version from the generator meta tag
} else if (document.querySelector('link[href*="wp-includes"], script[src*="wp-includes"]')) {
  // unknown version, but WordPress is being used
}

I think that should work, especially since wp-includes can't be changed according to @Shelob9. Let me know what you all think.

housseindjirdeh commented 5 years ago

A PR with this in place: https://github.com/johnmichel/Library-Detector-for-Chrome/pull/131/files

Lefaux commented 5 years ago

I think going for file requests is the only (halfway) reliable way here. Not showing the meta generator in the source is a pretty good idea from a security POV, because you don't want to hand the "bad guys" scraping info on a silver platter. I also wonder how to handle CMSes that give the user 100% control over their source code.

Maybe something like the way web policies work could be an option here: Lighthouse provides a list of whitelisted hosts that request the site, and the CMS returns a header to identify itself. CMS vendors could then update the list of trusted hosts on their end to make sure the header is only sent to trusted sources.

igrigorik commented 5 years ago

@Lefaux I agree that the current logic might miss some sites, but we're not shooting for 100%.

If you have a prod policy that strips any and all identifiers, a tool auditing the public site may not detect it as such — WAI. However, LH can also be run locally and against development environments; the prod policy can be configured to provide the identifier to your IP or based on logged in state, etc. Which is to say, it's possible to configure the environment to emit the necessary signals, if you're motivated to do so..

The whitelist solution won't scale to a distributed & self-hosted ecosystem, and it opens yet another can of worms about out-of-band requests to a 3P.

Lefaux commented 5 years ago

Yeah, I see the problem with scaling, too. It's not about having a strict policy on our end; it's just that quite a few users of TYPO3 simply disable the generator tag because of their policies :) Another concern I have is about bloating the source for every request, even though only a fraction of requests actually come from Lighthouse. I mean... Google asks us to fight for every byte (and I completely agree here). If you think I'm taking this request too seriously, let me know (fighting for every byte is hard, and naturally I'm lazy :).

Since the whitelisting stuff is not really an option, what's your thought on having an XMLHttpRequest against an endpoint on the CMS side of things that LH could make to determine which system it's talking to?

midzer commented 5 years ago

Since the whitelisting stuff is not really an option, what's your thought on having an XMLHttpRequest against an endpoint on the CMS side of things that LH could make to determine which system it's talking to?

Yeah, why not check whether /wp-admin/ does not deliver a 404?
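For illustration, such a probe could look roughly like this (hypothetical sketch, not something LH does today):

// hypothetical probe: a 404 for /wp-admin/ suggests the site is not WordPress
fetch('/wp-admin/', { method: 'HEAD' })
  .then(response => console.log('probably WordPress:', response.status !== 404));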

igrigorik commented 5 years ago

Since the whitelisting stuff is not really an option, what's your thought on having an XMLHttpRequest against an endpoint on the CMS side of things that LH could make to determine which system it's talking to?

How would one protect such an endpoint? As in, exposing such a mechanism would work against the very reason why you were stripping the platform+version information. Further, I don't think we can or should rely on new endpoints..

Yeah, why not check whether /wp-admin/ does not deliver a 404.

That requires an additional out of band request which, while not impossible, is something we'd like to avoid.

Stepping back, I'll come back to what I said earlier: if your site is designed to hide all platform information, then the fact that LH is not able to detect it is not a bug, it's WAI. Despite that, developers that want to see stack specific advice can still get access to it.. by, for example, configuring the environment to expose those signals under certain conditions. Alternatively, one could also imagine a UI where you can manually pick which stack pack strings LH shows, even if LH is not able to detect that platform itself.

Does that seem reasonable? :)