Open paulirish opened 5 years ago
I saw this tweet - https://twitter.com/hdjirdeh/status/1092875246309265408 happy to help.
Since WordPress 4.4, there should be a link element in the header with a rel attribute equal to https://api.w.org/ and a href attribute equal to the site's URL. That's pretty good for recent versions.
There are some endpoints that have to exist. wp-admin.php, wp-cron.php, etc. I wonder if checking for those is enough? If site.com/wp-admin.php returns a status code for unauthorized access, it's probably a WordPress site -- or a security feature blocked it.
Not a WordPress expert by any means, but here's my two cents from experience:
You will need to check the HTML code for wp-(content|include) occurrences, or test for the Link header, as suggested by @Shelob9. These two points are what Wappalyzer does; not sure why this is considered overkill.
But please, pretty please, don't poke around URLs like /wp-admin.php, /wp-cron.php, .. this would be really naughty.
PS: the Link header may not be available if the REST API is not enabled.
Definitely go with wp-content/wp-include. Haven't come across any WordPress site which didn't have at least one occurrence of those in more than a decade of poking around.
there should be a link element in the header with a rel attribute equal to api.w.org and a href attribute equal to the site's URL.
That's a great option. Thanks.
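For anyone following along, here is a minimal sketch of that `<link>` check. The function name `findWpApiLink` and the `{ rel, href }` input shape are assumptions made for illustration, not anything Lighthouse ships. In practice the href points at the REST API root (for example a /wp-json/ URL), so the sketch matches on rel only.

```javascript
// Hypothetical helper: `links` is a list of { rel, href } pairs,
// e.g. collected from the <link> elements of the page.
function findWpApiLink(links) {
  const WP_API_REL = 'https://api.w.org/';
  // rel may hold several space-separated tokens, so compare token by token.
  const match = links.find(({ rel }) =>
    String(rel || '').split(/\s+/).includes(WP_API_REL));
  // Return the advertised href, or null when no such link exists.
  return match ? match.href : null;
}
```

In the page, the input could be built with `Array.from(document.querySelectorAll('link'), (l) => ({ rel: l.getAttribute('rel'), href: l.href }))`.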
There are some endpoints that have to exist
Making new network requests is out of scope for us, but good thinkin'.
You will need to check the HTML code for wp-(content|include) occurrences
We are considering all network requests, so if the page makes requests to wp-content|include URLs, we could use that. Less attractive than via JS, though. :)
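A rough sketch of that network-request check, assuming the tool already has the list of request URLs as strings. `isWordPressAssetUrl` and `pageRequestsWordPressAssets` are hypothetical names. Only the URL path is inspected, so no response body gets parsed.

```javascript
// Hypothetical helper: true when the URL path contains a
// /wp-content/ or /wp-includes/ segment.
function isWordPressAssetUrl(url) {
  let pathname;
  try {
    pathname = new URL(url).pathname;
  } catch (e) {
    return false; // not an absolute URL, ignore it
  }
  return /\/wp-(content|includes)\//.test(pathname);
}

// True when at least one request made by the page looks like a WordPress asset.
function pageRequestsWordPressAssets(requestUrls) {
  return requestUrls.some(isWordPressAssetUrl);
}
```

Matching on the pathname rather than the full URL avoids false positives from query strings that merely mention wp-content.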
or test for the Link header.
On my WP sites, in addition to the <link> tag in the HTML, I also see a Link response header with similar values. Does anyone know if that Link response header is reliable or sometimes stripped out? (Cloudflare keeps it, at least)
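If the header route turns out to be reliable, the check could look roughly like this. It is a sketch only: `hasWpLinkHeader` is a hypothetical name, and the parsing is a simplified reading of the Link header syntax (comma-separated `<target>; rel=...` entries), not a full parser.

```javascript
// Hypothetical helper: takes the raw value of a Link response header.
function hasWpLinkHeader(headerValue) {
  const WP_API_REL = 'https://api.w.org/';
  if (!headerValue) return false;
  // Split on commas that start a new "<target>" entry, e.g.
  //   <https://example.com/wp-json/>; rel="https://api.w.org/", <...>; rel=shortlink
  return headerValue.split(/,\s*(?=<)/).some((entry) => {
    // rel may be quoted or bare, and may hold several space-separated tokens.
    const rel = /;\s*rel\s*=\s*"?([^";]*)"?/i.exec(entry);
    return rel !== null && rel[1].trim().split(/\s+/).includes(WP_API_REL);
  });
}
```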
PS: the Link header may not be available if the REST API is not enabled.
@machour https://developer.wordpress.org/rest-api/using-the-rest-api/frequently-asked-questions/#can-i-disable-the-rest-api recommends against disabling and doesn't even document how to anyway.. But have you seen sites that do disable it for security reasons or something?
@paulirish here are some insights
Link is generated by this function
function rest_output_link_header() {
    if ( headers_sent() ) {
        return;
    }
    $api_root = get_rest_url(); // <--
    if ( empty( $api_root ) ) {
        return;
    }
    header( 'Link: <' . esc_url_raw( $api_root ) . '>; rel="https://api.w.org/"', false );
}
get_rest_url() allows developers to apply their own filters to its return value: https://github.com/WordPress/WordPress/blob/fe73f310d4502c978650da998fe985c9c6f9dba0/wp-includes/rest-api.php#L403
So any developer could write a filter that simply returns false, and the header won't get emitted.
There are also a few extensions out there for that: https://wordpress.org/plugins/disable-json-api/#description
@paulirish ->
But have you seen sites that do disable it for security reasons or something?
Yes, some plugins do this for "security" reasons. That's a) a bad idea with little upside b) probably going to be combined with security through slight obscurity measures like changing the location of wp-login.php or wp-content dir. c) Not super common.
@machour and @Shelob9 very useful, thank you.
For anyone else visiting this thread.. I'm still interested in ideas regarding detection with JS.
For example:
// passes if any favicons, stylesheets are provided by the theme
!!document.querySelector('link[href*="wp-content"]')
// same but including scripts, too..
!!document.querySelectorAll('link[href*="wp-content"], script[src*="wp-content"]').length
It's probably possible for a WP site to not trigger the second detect, though IMO it'd be very uncommon.
Note that both may fail for WordPress sites that are using the AMP plugin, since AMP disallows external stylesheets and custom scripts. If they haven't set a favicon, then there won't be any such links for icons.
On the other hand, the majority of WordPress sites keep the generator meta tag intact:
<meta name="generator" content="WordPress 5.0.3">
So the very first thing to check for is whether it exists in the page:
!!document.querySelector('meta[name=generator][content^="WordPress"]')
@paulirish images uploaded through WordPress are usually available under "/wp-content/uploads/", so you may want to extend the selector to check img[src*="wp-content"] as well.
wp-content is the default; it can be changed by setting a constant in wp-config. wp-includes (effectively) can't be changed.
So it looks like we can do something like this:
if (`<meta name="generator" content="WordPress 5.0.3">`) {
// get WordPress version
} else if (`wp-includes` tags are present) {
// unknown version, but WordPress is being used
}
I think that should work, especially since wp-includes can't be changed according to @Shelob9. Let me know what you all think.
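To make the two-step idea concrete, here is a minimal runnable sketch. `detectWordPress` and its input shape are assumptions for illustration (not the code in the linked PR): `generatorContent` is the content attribute of the generator meta tag, if any, and `resourceUrls` is the list of link/script URLs found on the page.

```javascript
// Hypothetical helper implementing "generator tag first, wp-includes second".
function detectWordPress({ generatorContent = '', resourceUrls = [] } = {}) {
  // Step 1: generator meta tag, optionally followed by a dotted version number.
  const match = /^WordPress(?:\s+(\d+(?:\.\d+)*))?/.exec(generatorContent || '');
  if (match) {
    return { detected: true, version: match[1] || null };
  }
  // Step 2: unknown version, but a wp-includes resource implies WordPress.
  if (resourceUrls.some((url) => url.includes('/wp-includes/'))) {
    return { detected: true, version: null };
  }
  return { detected: false, version: null };
}
```

With the generator tag shown earlier, `detectWordPress({ generatorContent: 'WordPress 5.0.3' })` yields version 5.0.3; a generator value without a version number still counts as detected, with a null version.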
A PR with this in place: https://github.com/johnmichel/Library-Detector-for-Chrome/pull/131/files
I think going for file requests is the only (halfway) reliable way here. Not showing the meta generator in the source is a pretty good idea from a security POV, because you don't want to hand the "bad guys" scraping info on a silver platter. I also wonder how to handle CMSs that give the user 100% control over their source code.
Maybe the way web-policies work could be an option here. So Lighthouse provides a list of whitelisted hosts that request the site and the CMS returns a header to identify itself. CMS vendors could then update the list of trusted hosts on their end to make sure the header is only sent to trusted sources.
@Lefaux I agree that current logic might miss some sites, but we're not shooting for 100%.
If you have a prod policy that strips any and all identifiers, a tool auditing the public site may not detect it as such — WAI. However, LH can also be run locally and against development environments; the prod policy can be configured to provide the identifier to your IP or based on logged in state, etc. Which is to say, it's possible to configure the environment to emit the necessary signals, if you're motivated to do so..
The whitelist solution won't scale to a distributed & self-hosted ecosystem, and it opens yet another can of worms about out of band requests to a 3P.
Yeah, I see the problem with scaling, too. It's not about having a strict policy on our end, it's just that quite a few users of TYPO3 simply disable the generator tag because of their policies :) Another concern I have is about bloating the source for every request, even though only a fraction of requests actually come from Lighthouse. I mean... Google asks us to fight for every byte (and I completely agree here). If you think I'm taking this request too seriously, let me know (fighting for every byte is hard, and naturally I'm lazy :).
Since the whitelisting stuff is not really an option what's your thought on having a XMLHttpRequest against an endpoint on the CMS side of things that LH could do to determine which system it's talking to.
Since the whitelisting stuff is not really an option what's your thought on having a XMLHttpRequest against an endpoint on the CMS side of things that LH could do to determine which system it's talking to.
Yeah, why not check whether /wp-admin/ does not deliver a 404.
Since the whitelisting stuff is not really an option what's your thought on having a XMLHttpRequest against an endpoint on the CMS side of things that LH could do to determine which system it's talking to.
How would one protect such an endpoint? As in, exposing such a mechanism would work against the very reason why you were stripping the platform+version information. Further, I don't think we can or should rely on new endpoints..
Yeah, why not check whether /wp-admin/ does not deliver a 404.
That requires an additional out of band request which, while not impossible, is something we'd like to avoid.
Stepping back, I'll come back to what I said earlier: if your site is designed to hide all platform information, then the fact that LH is not able to detect it is not a bug, it's WAI. Despite that, developers that want to see stack specific advice can still get access to it.. by, for example, configuring the environment to expose those signals under certain conditions. Alternatively, one could also imagine a UI where you can manually pick which stack pack strings LH shows, even if LH is not able to detect that platform itself.
Does that seem reasonable? :)
One requirement for adding a stack pack is reliably detecting that the stack/library/platform is being used by the page. We want this detection to be as reliable and bulletproof as possible.
Wappalyzer uses a few approaches which seem overkill and not something we can reuse. We'd like something much more lightweight.
Primary question: Can we detect WordPress via client-side JS running in the page? (Naturally, it has full access to window and the DOM.)
Secondary question: Is there another reliable detect based on the network request metadata? We'd like to avoid parsing the response of any network resources (so no looking for patterns in HTML, JS or CSS files). But considering response headers or paths in URLs (like wp-content, etc) is fine.
Could some WordPress experts chime in?