chromium / hstspreload

🔒🔍 A Go package to scan sites against the requirements for the Chromium-maintained HSTS preload list.
https://hstspreload.org
BSD 3-Clause "New" or "Revised" License

Define unique user-agent #107

Open · danDanV1 opened this issue 6 years ago

danDanV1 commented 6 years ago

The user agent for hstspreload requests is the generic Go default: User-Agent: Go-http-client/2.0

Can this be set to something specific to identify the bot? This would enable server admins to whitelist the bot if necessary and distinguish it from any other bot using the Go http library.

Can I suggest hstspreload client/2.0?
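For context, Go only sends Go-http-client/2.0 when no User-Agent header has been set, so overriding it is straightforward. Here is a minimal sketch of one way to do it; the uaTransport type is illustrative, not the package's actual implementation, and the UA value is just the suggestion from this comment:

```go
package main

import (
	"fmt"
	"net/http"
)

// Illustrative value; the exact string is what this issue is debating.
const customUA = "hstspreload client/2.0 (+https://hstspreload.org)"

// uaTransport wraps a RoundTripper and stamps every outgoing request
// with a fixed User-Agent header.
type uaTransport struct {
	base http.RoundTripper
}

func (t *uaTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	// Clone before mutating: RoundTrippers must not modify the caller's request.
	r := req.Clone(req.Context())
	r.Header.Set("User-Agent", customUA)
	return t.base.RoundTrip(r)
}

func main() {
	client := &http.Client{Transport: &uaTransport{base: http.DefaultTransport}}
	resp, err := client.Get("https://example.com")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}
```

Wrapping the transport (rather than setting the header on each request) means every request made through the client gets the UA, including redirects.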

lgarron commented 6 years ago

So far, we've tried to encourage configurations that did not depend on any feature of the client, especially things like the user agent or source IP. The HSTS preload list website has no promises (only requirements), and there are no guarantees any particular part of the system will remain the same in the future.

This would enable server admins to whitelist the bot if necessary and distinguish it from any other bot using the Go http library.

It is safest if every (modern) browser and everyone connecting using other libraries (e.g. via the hstspreload command-line tool) gets the same responses with the same headers. In particular, a site will not end up on Mozilla's HSTS preload list unless their scanner is able to observe the header.

This also brings to mind the issue that people copy-paste recommendations from others on the internet. If these "recommended" configurations start sniffing for a specific user agent, preloading issues could become very difficult to debug. (I try to track bad recommendations on the web and ask their owners to improve them, but this is a manual, imperfect process.)

For these reasons, I'm against tweaking the user agent. However, I'll let @nharper have final say about it.

lightswitch05 commented 6 years ago

In my case, having a custom user agent would have prevented the bot from being blacklisted. Using the default Go-http-client/1.1 and Go-http-client/2.0, it was flagged as someone scraping the site. Blocking based on user agent is a quick fix, although not a really great one. It probably was a bot scraping the site, but by blacklisting it, this tool was thrown under the bus as well.

lgarron commented 6 years ago

@nharper, do you have an opinion about this either way?

nharper commented 6 years ago

I don't think the hstspreload tool needs to specify its own UA. In general, I don't like servers sniffing UA strings.

devonobrien commented 3 years ago

Reopening this issue as we now have a more compelling reason to reconsider custom UA strings.

The outbound scans used by hstspreload.org and the bulk updates to check for preloading eligibility have started to be blocked by several CDNs' spam/fraud detection. These CDNs only offer allowlisting by (User-Agent, ASN) tuples, and they are understandably not a fan of allowlisting the default Go UA string. While I agree that we do not want to encourage UA sniffing generally speaking, I don't think we have many other options here. Once we get a custom string, we still need to reach out to the affected CDNs to start the process of unblocking.

I'd not seen this discussion before filing #118 and tweaking the UA string to "user-agent: hsts-preload-bot" on a new branch, but I'm happy to settle on a more amenable custom string, if folks have a strong opinion on what it should be.

lgarron commented 3 years ago

Reopening this issue as we now have a more compelling reason to reconsider custom UA strings.

I think now's a good time!

This has turned into something people ask for more regularly.

We could still discourage UA detection by scanning using the default user agent first, and e.g. redoing the whole scan using the custom UA if there is a relevant failure.
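Something like the following control flow, where scanWithUA and scanResult are hypothetical stand-ins for the package's real scanning code; only the fallback ordering is the point:

```go
package scanner

import "net/http"

// scanResult is a hypothetical stand-in for the package's real result type.
type scanResult struct {
	Issues []string
}

func (r scanResult) Failed() bool { return len(r.Issues) > 0 }

// scanWithUA fetches the domain with the given User-Agent (empty means
// Go's default) and records whether the HSTS header was observed.
func scanWithUA(domain, ua string) (scanResult, error) {
	req, err := http.NewRequest("GET", "https://"+domain, nil)
	if err != nil {
		return scanResult{}, err
	}
	if ua != "" {
		req.Header.Set("User-Agent", ua)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return scanResult{}, err
	}
	defer resp.Body.Close()
	if resp.Header.Get("Strict-Transport-Security") == "" {
		return scanResult{Issues: []string{"missing HSTS header"}}, nil
	}
	return scanResult{}, nil
}

// scanWithFallback scans with the default UA first, to keep discouraging
// UA detection, and redoes the scan with the custom UA only on failure.
func scanWithFallback(domain string) (scanResult, error) {
	result, err := scanWithUA(domain, "")
	if err == nil && !result.Failed() {
		return result, nil
	}
	return scanWithUA(domain, "hsts-preload-bot")
}
```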

nharper commented 3 years ago

If we want to change the UA string from the default golang string, we could also consider scanning using a few common UA strings from browsers. This way the behavior observed by the probe would more closely match what browsers would see in the real world.
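As a sketch of what that could look like (the UA sample below is illustrative only and would need to be kept current; nothing here is the package's actual code):

```go
package scanner

import "net/http"

// browserUAs is an illustrative sample of common client UA strings.
var browserUAs = []string{
	"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
	"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:95.0) Gecko/20100101 Firefox/95.0",
	"curl/7.79.1",
}

// consistentHSTS reports whether the domain serves the same
// Strict-Transport-Security header to every UA in the sample,
// i.e. whether the probe sees what real-world clients would see.
func consistentHSTS(domain string) (bool, error) {
	var first string
	for i, ua := range browserUAs {
		req, err := http.NewRequest("GET", "https://"+domain, nil)
		if err != nil {
			return false, err
		}
		req.Header.Set("User-Agent", ua)
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return false, err
		}
		resp.Body.Close()
		hsts := resp.Header.Get("Strict-Transport-Security")
		if i == 0 {
			first = hsts
		} else if hsts != first {
			return false, nil
		}
	}
	return true, nil
}
```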

lgarron commented 3 years ago

If we want to change the UA string from the default golang string, we could also consider scanning using a few common UA strings from browsers. This way the behavior observed by the probe would more closely match what browsers would see in the real world.

Sounds like a good idea! Maybe also include things like curl, which benefits more from dynamic HSTS than actual browsers (which have preload lists)? :-D

jdeblasio commented 3 years ago

From my perspective, scanning multiple times with multiple UAs feels a little silly. The number one reason to scan at all is to ensure that the site has authorized preloading. As long as that works with any UA, I'm comfortable saying that we're authorized. Conversely, scanning n times adds a bunch of additional complexity. Besides the obvious code complexity, there's also more legwork for maintainers when the emails with harder-to-debug failures start rolling in.

That's not to say that I'm 100% opposed to this, but I'm not totally clear on what problem scanning with multiple UAs would actually solve.

lgarron commented 3 years ago

@jdeblasio hstspreload.org has always scanned for more than the super-basic requirements, and issued errors or warnings for practices that could leave users unprotected.

If a site is dynamically calculating whether to send an HSTS header, then users with a client that doesn't have preloaded HSTS are more likely to be unprotected because they're not getting dynamic HSTS. (This is getting less and less of a concern, but it certainly has had its value.)

Also, dynamic HSTS configuration means that the header may change or get dropped by accident. We had to specifically add a guard for the removal criteria because this was happening too often, and I think it would be good to encourage sending the header as unconditionally as possible.

nharper commented 3 years ago

The outbound scans used by hstspreload.org and the bulk updates to check for preloading eligibility have started to be blocked by several CDNs' spam/fraud detection. These CDNs only offer allowlisting by (User-Agent, ASN) tuples, and they are understandably not a fan of allowlisting the default Go UA string.

I'm assuming (perhaps incorrectly) that CDNs are adding the STS header based on a configuration option. Could you work with CDNs so that the HTTP response they send in spam/fraud cases still includes the STS header, i.e. apply the STS header before the spam/fraud check? (This is also assuming that the CDN's response to such a request is an HTTP response vs closing a connection or similar.) That would be more in line with the philosophy that an STS header should be set unconditionally on a domain.
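In generic middleware terms, the suggested ordering looks like the sketch below. This is an illustration of the philosophy, not any CDN's actual configuration; all names and the header value are made up for the example:

```go
package main

import (
	"log"
	"net/http"
)

// withHSTS sets the STS header before any later handler runs, so even
// a block/challenge response still carries it.
func withHSTS(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Strict-Transport-Security", "max-age=31536000; includeSubDomains; preload")
		next.ServeHTTP(w, r)
	})
}

// fraudCheck stands in for a CDN's bot detection; because it runs
// after withHSTS, its 403 response still includes the header.
func fraudCheck(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if looksLikeBot(r) {
			http.Error(w, "blocked", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// looksLikeBot is a placeholder heuristic for illustration only.
func looksLikeBot(r *http.Request) bool {
	return r.Header.Get("User-Agent") == ""
}

func main() {
	app := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	// A real deployment would serve this over TLS; plain HTTP keeps the
	// sketch short (browsers ignore HSTS received over HTTP).
	log.Fatal(http.ListenAndServe(":8443", withHSTS(fraudCheck(app))))
}
```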

jdeblasio commented 3 years ago

I think I'd like to argue that we should spin off the "fetch with multiple UAs" idea into a separate feature request.

I 100% agree that scanning for more than super-basic requirements is great, and that this could help solve a real issue that occurs in some cases. There are just also some additional risks. One thing I'm worried about, for instance, is that we'll run into CDNs who aren't enthusiastic about allowlisting fetches that look like they're from a bot but are using a browser-like UA string. If we encounter that, then we've obligated ourselves to either bake in ways to account for those CDNs (more complexity), or remove the check (wasted effort).

Separate from that improvement is the present buggy reality that some folks behind CDNs can't preload their domains without manual intervention because those requests are getting blocked.

The former is a cool nice-to-have. The latter needs addressing pretty urgently.

lgarron commented 3 years ago

The former is a cool nice-to-have. The latter needs addressing pretty urgently.

Could I ask what makes it urgent? I think it's worth looking at solutions, but we've successfully asked sites to handle this on their end for over half a decade.

Do we know what CDNs are causing most of the issues? Is it e.g. mostly Cloudflare?

We could consider asking them if they would apply the domain's HSTS setting to their interception page. This would mean we don't see the correct response and redirect chain, but it's another option.

lgarron commented 3 years ago

In any case, I offer this strawperson:

jdeblasio commented 3 years ago

There's another reality here: we don't have a ton of cycles for HSTS preload stuff right now. We (the Chromium-based maintainers) are definitely committed to supporting the list for as long as it's valuable, and we might be able to give it more cycles in the future, but presently we're looking to get the most value per (very little) time spent.

Setting a single UA header is a trivial change that meets the present need. We'd also be delighted to receive PRs for more comprehensive solutions.

devonobrien commented 3 years ago

From my perspective, setting an hstspreload-specific UA string gets us an immediate win with virtually no downside; whether sites selectively serve headers based on UA string is an issue orthogonal to which UA hstspreload uses. We can consider fancier approaches later if we can articulate benefits that are worth the implementation effort.

Site operators are already responsible for the consequences of "bad" HSTS behavior like ignoring the deployment recommendations when submitting their domain for preloading, regardless of whether they selectively serve headers based on UA. The immediate need we have now is for hstspreload.org and our bulk update infrastructure to be identifiable so their header checks can be unblocked at the CDN level. We've so far identified two CDNs (including Cloudflare) that are known to be blocking requests, and after discussing it with them, the established way to circumvent this for bots is to allowlist based on ASN and UA string.

If there are no objections to this immediate path forward, I suggest we: