HTTPArchive / almanac.httparchive.org

HTTP Archive's annual "State of the Web" report made by the web community
https://almanac.httparchive.org
Apache License 2.0

Security 2020 #906

Closed foxdavidj closed 3 years ago

foxdavidj commented 4 years ago

Part II Chapter 11: Security

Content team

Authors Reviewers Analysts Draft Queries Results
| Authors | Reviewers | Analysts | Draft | Queries | Results |
|---|---|---|---|---|---|
| @nrllh @tomvangoethem | @cqueern @bazzadp @edmondwwchan | @AAgar @tomvangoethem | Doc | *.sql | Sheet |

Content team lead: @nrllh

Welcome chapter contributors! You'll be using this issue throughout the chapter lifecycle to coordinate on the content planning, analysis, and writing stages.

The content team is made up of the following contributors:

New contributors: If you're interested in joining the content team for this chapter, just leave a comment below and the content team lead will loop you in.

Note: To ensure that you get notifications when tagged, you must be "watching" this repository.

Milestones

0. Form the content team

1. Plan content

2. Gather data

3. Validate results

4. Draft content

5. Publication

tomvangoethem commented 4 years ago

I'd like to volunteer as an analyst. I've used HTTPArchive in some of my (academic) research, so I have some familiarity with the datasets.

nrllh commented 4 years ago

@rviscomi as discussed recently, I would also like to join in on this chapter.

foxdavidj commented 4 years ago

@tomvangoethem added you as an analyst :)

cqueern commented 4 years ago

Hello Team. I would like to participate as a Reviewer please. :grinning:

rviscomi commented 4 years ago

@nrllh thank you for agreeing to be the lead author for the Security chapter! As the lead, you'll be responsible for driving the content planning and writing phases in collaboration with your content team, which will consist of yourself as lead, any coauthors you choose as needed, peer reviewers, and data analysts.

The immediate next steps for this chapter are:

  1. Establish the rest of your content team. The larger the scope of the chapter, the more people you'll want to have on board.
  2. Start sketching out ideas in your draft doc.
  3. Catch up on last year's chapter and the project methodology to get a sense for what's possible.

There's a ton of info in the top comment, so check that out and feel free to ping myself or @obto with any questions!

tunetheweb commented 4 years ago

I'm happy to review this chapter again this year btw. Added myself to first comment.

tunetheweb commented 4 years ago

@ivanr @april would you have any interest in helping out with this chapter this year? Last year's chapter for context: https://almanac.httparchive.org/en/2019/security

nrllh commented 4 years ago

@tomvangoethem, also assigned you as an author ;)

foxdavidj commented 4 years ago

Hey @nrllh, just checking in:

  1. How is the chapter coming along? We're trying to have the outline and metrics settled on by the end of the week so we have time to configure the Web Crawler to track everything you need.
  2. Can you remind your team to properly add and credit themselves in your chapter's Google Doc?
  3. Anything you need from me to keep things moving forward?
nrllh commented 4 years ago

@tomvangoethem @cqueern @bazzadp can you please request edit access and then credit yourselves in the Google Doc?

edmondwwchan commented 4 years ago

Hello team, may I contribute as a reviewer too?

nrllh commented 4 years ago

@edmondwwchan welcome to the club! Please request edit access and credit yourself in the doc.

foxdavidj commented 4 years ago

@nrllh How's the chapter outline coming along? We want to have that wrapped up by the end of the week so we have time to set up our Web Crawler :)

tunetheweb commented 4 years ago

As discussed on Slack I think we should ask for custom metrics for vulnCount, Library (including version), Library (excluding version) and highestSeverity from the no-vulnerable-libraries lighthouse metric:

"no-vulnerable-libraries": {
      "description": "Some third-party scripts may contain known security vulnerabilities that are easily identified and exploited by attackers. [Learn more](https://developers.google.com/web/tools/lighthouse/audits/vulnerabilities).",
      "title": "Includes front-end JavaScript libraries with known security vulnerabilities",
      "score": 0,
      "details": {
        "items": [
          {
            "vulnCount": 4,
            "detectedLib": {
              "url": "https://snyk.io/vuln/npm:jquery?lh=1.4.4&utm_source=lighthouse&utm_medium=ref&utm_campaign=audit",
              "text": "jQuery@1.4.4",
              "type": "link"
            },
            "highestSeverity": "Medium"
          }
        ],
        "type": "table",
        "headings": [
          {
            "text": "Library Version",
            "itemType": "link",
            "key": "detectedLib"
          },
          {
            "text": "Vulnerability Count",
            "itemType": "text",
            "key": "vulnCount"
          },
          {
            "text": "Highest Severity",
            "itemType": "text",
            "key": "highestSeverity"
          }
        ],
        "summary": []
      },
      "scoreDisplayMode": "binary",
      "displayValue": "4 vulnerabilities detected",
      "id": "no-vulnerable-libraries"
    },

Related to https://github.com/HTTPArchive/almanac.httparchive.org/blob/main/sql/2019/08_Security/08_40.sql and https://github.com/HTTPArchive/almanac.httparchive.org/blob/main/sql/2019/08_Security/08_40b.sql, but we want more detail that's more easily queryable.
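If it helps, here's a rough sketch (plain JavaScript, with helper names of our own invention) of how the audit payload above could be flattened into one row per detected library, splitting `detectedLib.text` into library and version so both the "including version" and "excluding version" groupings fall out easily:

```javascript
// Hypothetical post-processing of the Lighthouse "no-vulnerable-libraries"
// audit shape quoted above. `flattenVulnAudit` is our own name, not part of
// Lighthouse or the HTTP Archive pipeline.
function flattenVulnAudit(audit) {
  const items = (audit.details && audit.details.items) || [];
  return items.map((item) => {
    const libText = item.detectedLib ? item.detectedLib.text : ''; // e.g. "jQuery@1.4.4"
    const atIndex = libText.lastIndexOf('@');
    return {
      library: atIndex > 0 ? libText.slice(0, atIndex) : libText, // excluding version
      version: atIndex > 0 ? libText.slice(atIndex + 1) : null,
      vulnCount: item.vulnCount,
      highestSeverity: item.highestSeverity,
    };
  });
}

// Example using the audit payload quoted above (trimmed to the used fields):
const audit = {
  details: {
    items: [
      {
        vulnCount: 4,
        detectedLib: { text: 'jQuery@1.4.4', type: 'link' },
        highestSeverity: 'Medium',
      },
    ],
  },
};
console.log(flattenVulnAudit(audit));
// → [ { library: 'jQuery', version: '1.4.4', vulnCount: 4, highestSeverity: 'Medium' } ]
```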

tunetheweb commented 4 years ago

And also the password-inputs-can-be-pasted-into score from Lighthouse, though it's maybe less relevant on the home page, which HTTP Archive is restricted to. Any way to only include it when a site has a login form field?

"password-inputs-can-be-pasted-into": {
      "description": "Preventing password pasting undermines good security policy. [Learn more](https://developers.google.com/web/tools/lighthouse/audits/password-pasting).",
      "title": "Allows users to paste into password fields",
      "score": 1,
      "details": {
        "items": [],
        "type": "table",
        "headings": []
      },
      "scoreDisplayMode": "binary",
      "id": "password-inputs-can-be-pasted-into"
    }
tunetheweb commented 4 years ago

And external-anchors-use-rel-noopener score and count of items also from Lighthouse

    "external-anchors-use-rel-noopener": {
      "description": "Add `rel=\"noopener\"` or `rel=\"noreferrer\"` to any external links to improve performance and prevent security vulnerabilities. [Learn more](https://developers.google.com/web/tools/lighthouse/audits/noopener).",
      "warnings": [],
      "title": "Links to cross-origin destinations are safe",
      "score": 1,
      "details": {
        "items": [],
        "type": "table",
        "headings": []
      },
      "scoreDisplayMode": "binary",
      "id": "external-anchors-use-rel-noopener"
    }
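For context, a simplified illustration of what this audit checks (not Lighthouse's actual implementation): flag `target="_blank"` links whose `rel` lacks `noopener`/`noreferrer`. This version is regex-over-HTML so it runs outside a browser; an in-page check would walk `document.querySelectorAll('a[target=_blank]')` instead:

```javascript
// Hedged sketch, not Lighthouse's real audit logic: find anchor tags opened
// with target="_blank" that don't carry rel="noopener" or rel="noreferrer".
function findUnsafeBlankLinks(html) {
  const anchorRe = /<a\b[^>]*>/gi;
  const unsafe = [];
  for (const [tag] of html.matchAll(anchorRe)) {
    const targetBlank = /target\s*=\s*["']?_blank/i.test(tag);
    const safeRel = /rel\s*=\s*["'][^"']*\b(noopener|noreferrer)\b/i.test(tag);
    if (targetBlank && !safeRel) unsafe.push(tag);
  }
  return unsafe;
}

const sample =
  '<a href="https://example.com" target="_blank">bad</a>' +
  '<a href="https://example.com" target="_blank" rel="noopener">ok</a>';
console.log(findUnsafeBlankLinks(sample).length); // → 1
```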
tunetheweb commented 4 years ago

@pmeenan not sure if you saw this Slack conversation, but is it possible to configure the run to crawl additional metadata URLs like /.well-known/change-password and /security.txt?

pmeenan commented 4 years ago

Probably not as part of the crawl since those aren't actually loaded by the page itself. A custom metric might be able to pull them with fetch since they are same-origin but it kind of feels like something that a separate curl script run once against the URL list might be better for.
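A minimal sketch of that separate-script idea, assuming Node 18+ for the global `fetch`. The paths come from the thread above; the helper names are ours. `probeWellKnown` needs real network access, so only the pure URL-building helper is demonstrated:

```javascript
// Hypothetical once-off probe script run against the URL list, outside the
// crawl itself. Paths are the two discussed above.
const WELL_KNOWN_PATHS = ['/.well-known/change-password', '/security.txt'];

function buildWellKnownUrls(origin) {
  return WELL_KNOWN_PATHS.map((path) => new URL(path, origin).href);
}

async function probeWellKnown(origin) {
  const results = {};
  for (const url of buildWellKnownUrls(origin)) {
    try {
      // HEAD keeps it cheap; note a redirect to a soft-404 page can still
      // return 200, so real analysis would need extra checks.
      const res = await fetch(url, { method: 'HEAD', redirect: 'follow' });
      results[url] = res.status;
    } catch (err) {
      results[url] = null; // DNS/network error, connection refused, etc.
    }
  }
  return results;
}

console.log(buildWellKnownUrls('https://www.apple.com'));
// → ['https://www.apple.com/.well-known/change-password',
//    'https://www.apple.com/security.txt']
```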

AAgar commented 4 years ago

I can volunteer as an additional analyst if you need it

nrllh commented 4 years ago

@aagar you are welcome!

nrllh commented 4 years ago

Current list of metrics:

TLS πŸ”’

Security Headers πŸ“‹

Cookies πŸͺ

WebAssembly πŸš€

Lighthouse πŸ’‘

Cross-Site-Request-Forgery 🎟️

Information Leakage ℹ️

Other ❓

nrllh commented 4 years ago

@tomvangoethem @cqueern @bazzadp @edmondwwchan @AAgar

I just migrated the metrics from last year's Almanac and added some new points. Please review it and make some recommendations ^^

cqueern commented 4 years ago

@nrllh looks pretty great. This is exciting. I am for sure interested in SRI on subresources so look forward to seeing how we can address it.

tunetheweb commented 4 years ago

SRI is so over-rated and pointless IMHO. If you can self-host, you should do so. If you can't because the resource changes frequently, then you can't use SRI. So what's the point? Anyway, I digress...

rockeynebhwani commented 4 years ago

@bazzadp - I think it's worth including SRI usage in the chapter, and if usage is low, maybe the reasons are as stated in your comment above, which we could add as food for thought in the chapter (maybe).

tunetheweb commented 4 years ago

Oh not saying don't include it, just interjecting personal opinion to the conversation πŸ˜€ As I said a digression. Back to chapter planning!

edmondwwchan commented 4 years ago

@nrllh your list looks pretty good. Particularly interested in the following metrics but uncertain if it's easy/possible to measure from the HTTP archive dataset:

rockeynebhwani commented 4 years ago

@nrllh @tomvangoethem - Any thoughts on including Bot detection solutions (e.g. PerimeterX / CyberFend) in scope of this chapter? We can do a quick PR in Wappalyzer for top vendors. I raised a feature request to add 'Security' as category in Wappalyzer - https://github.com/AliasIO/wappalyzer/issues/3226

nrllh commented 4 years ago

@edmondwwchan I'm also not sure if we can measure them; there are now some limitations on table requests so I couldn't check it either, but I'll add these to the list.

@rockeynebhwani It's a good idea, but I don't know if we have enough data from Wappalyzer. The following query returns five rows, and the distribution doesn't look very relevant:

SELECT app, COUNT(app) AS count FROM `httparchive.technologies.2020_06_01_desktop` WHERE category = 'Captchas' GROUP BY app


Even if Wappalyzer recognizes more products for this category in the coming weeks (or months), I don't know if the HTTP Archive crawler will have the chance to support it (@bazzadp?).

Few other thoughts:

rockeynebhwani commented 4 years ago

@nrllh - As I understand from @rviscomi, if we're able to work on the issue I raised on the Wappalyzer GitHub and submit a PR next week, the HTTP Archive crawl on 1st Aug will be able to provide us this insight.

rockeynebhwani commented 4 years ago

@nrllh - WebAuthn adoption will be an interesting one. I remember reading that eBay implemented it (https://tech.ebayinc.com/product/ebay-makes-mobile-web-login-easier/), and Safari web support is coming in iOS 14, so adoption should increase in the coming days.

I don't know how to detect this. I can see WebAuthn references on eBay (https://www.ebay.com/signin/).


@senthilp any idea?

nrllh commented 4 years ago

@rockeynebhwani there are two main calls for WebAuthn (navigator.credentials.create and navigator.credentials.get). In total, I found 2,442 URLs containing JavaScript files with these calls. But of course, this doesn't mean all these websites provide this functionality to their users.
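For illustration, the kind of string match described above might look like this (a hedged sketch, not the actual query used; note that `navigator.credentials.get` is also used by the plain Credential Management API for passwords, so this over-counts WebAuthn specifically):

```javascript
// Rough body-scan sketch: flag script bodies referencing the two WebAuthn
// entry points. A match only shows the code ships, not that the site
// actually offers WebAuthn to its users.
const WEBAUTHN_CALLS = [
  /navigator\.credentials\.create\s*\(/,
  /navigator\.credentials\.get\s*\(/, // also matches password-credential use
];

function referencesWebAuthn(scriptBody) {
  return WEBAUTHN_CALLS.some((re) => re.test(scriptBody));
}

console.log(referencesWebAuthn('navigator.credentials.get({ publicKey: opts })')); // → true
console.log(referencesWebAuthn('document.cookie = "a=1"')); // → false
```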

rviscomi commented 4 years ago

Ideally we shouldn't be scanning the response bodies for patterns. It's expensive and flaky. In this case I wonder if there are existing Blink feature counters that we can query. For example maybe CredentialManagerGetReturnedCredential?

nrllh commented 4 years ago

@rviscomi it seems there are some security-related feature counters, but it's hard to get the context of these features; they're really not well documented.

@tomvangoethem @cqueern @bazzadp @edmondwwchan @AAgar

I updated the list of metrics. Let's try to get the final version of it by Tuesday, please. The core team needs some time to configure the crawler. I also introduced sections for the outline (based on our metrics) in our Doc. Please feel free to edit it.

foxdavidj commented 4 years ago

@nrllh @rockeynebhwani The problem with looking at tech like CAPTCHAs is that many of these run on Contact Us forms and the like, many of which are not found on a homepage but instead on a subpage like /contact-us/.

Since we only look at the homepage of every site we crawl, the data we'd gather and report could be wildly different from the real usage numbers.

rockeynebhwani commented 4 years ago

@obto - Agree. Very few sites deploy bot protection site-wide. The only question we will be able to answer is what % of sites have bot protection deployed on the homepage.

Or, wishful thinking for the future, if the .well-known/change-password standard picks up:

An example WPT run where I tried to fetch www.apple.com/.well-known/change-password using a WPT custom script, as Apple has adopted this - https://www.webpagetest.org/result/200719_DR_36cb98ba5bc7d2b3dfba19511bdf26b3/1/details/#step1_request20

A site with such a page is bound to have bot protection on it, but we will still miss sites where there is no login function.

tunetheweb commented 4 years ago

@nrllh comments on your list

Cookies - I think we can do a bit more here, especially as the Cookies chapter was closed. How many cookies are set? What size are they? How many are first party, how many third party? How many third-party cookies does the average site set? What are the differences between the attributes for third-party versus first-party cookies? I'd imagine most third-party analytics don't set the Secure flag, for example. Also, we'll need to clarify in this section that the crawl runs from US servers. That applies to the whole chapter (and the whole Web Almanac), but especially to cookies, given some regions (like the EU) have much stricter rules on setting cookies without consent and sites are starting to respect those laws.
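To make the attribute questions concrete, here's an illustrative (and deliberately simplified) Set-Cookie parser; real cookie parsing has more edge cases (e.g. commas inside an Expires date), so treat this as a starting point, not a spec-complete implementation:

```javascript
// Simplified sketch: pull out the attributes the questions above ask about
// (size, Secure, HttpOnly, SameSite) from a single Set-Cookie header.
function parseSetCookie(header) {
  const [pair, ...attrs] = header.split(';').map((s) => s.trim());
  const eq = pair.indexOf('=');
  const flags = attrs.map((a) => a.toLowerCase());
  const sameSite = attrs.find((a) => a.toLowerCase().startsWith('samesite='));
  return {
    name: pair.slice(0, eq),
    valueSize: pair.length - eq - 1, // bytes in the cookie value
    secure: flags.includes('secure'),
    httpOnly: flags.includes('httponly'),
    sameSite: sameSite ? sameSite.slice('samesite='.length) : null,
  };
}

console.log(parseSetCookie('_ga=GA1.2.123; Path=/; SameSite=Lax'));
// → { name: '_ga', valueSize: 9, secure: false, httpOnly: false, sameSite: 'Lax' }
```

First- versus third-party classification would then just compare the cookie's host against the page's host, which is outside the scope of this sketch.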

Crypto-miner - like this, but why under WASM category? It's happening under JavaScript too.

Information Leakage - I presume you're talking about the Server and X-Powered-By type headers? Do we have a list of them? I have:

Also, as mentioned previously, that's a long list. I think it's good to run it all, especially as most queries are already available from last year, but personally I would be more selective about what we actually publish, to avoid just being a listing of settings and instead allow you to give more commentary on a shorter list. I think there's more value in that.

tunetheweb commented 4 years ago

@AAgar @tomvangoethem we should look to convert the expensive and unreliable body scanning metrics to custom metrics. The security ones are here and the media analysts have opened a pull request for their custom metrics which will serve as a good template of what to do. This needs to be in place by August 1st so can one or both of you work on it this week? Once you've got used to how to convert the existing ones, that should stand you in good stead to look at the new metrics and see if they need any of the same.
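As a hypothetical template for what such a custom metric could return (custom metrics return a string, typically JSON), here's an SRI counter. It's written regex-over-raw-HTML so it runs anywhere; inside the crawler you would query the live DOM instead, e.g. `document.querySelectorAll('script[integrity], link[integrity]')`:

```javascript
// Sketch of a custom-metric-style value: count external scripts and how
// many carry an integrity attribute (SRI). Regex-based stand-in for the
// DOM queries a real in-page custom metric would use.
function sriStats(html) {
  const scripts = html.match(/<script\b[^>]*src=[^>]*>/gi) || [];
  const withIntegrity = scripts.filter((tag) => /\bintegrity\s*=/.test(tag));
  return JSON.stringify({
    externalScripts: scripts.length,
    withIntegrity: withIntegrity.length,
  });
}

console.log(
  sriStats('<script src="a.js" integrity="sha384-xyz"></script><script src="b.js"></script>')
);
// → {"externalScripts":2,"withIntegrity":1}
```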

nrllh commented 4 years ago

@bazzadp

Cookies...

I'll update the list.

Crypto-miner - like this, but why under WASM category? It's happening under JavaScript too.

The WASM chapter is closed and I think it's better to have it as a category here; we also want to share some stats about WASM usage (vs. usage by crypto-miners).

Information Leakage

We have them in the table technologies as:

We may also include the categories CMS, Blog etc...

Think there's more value in that.

πŸ‘Œ

tomvangoethem commented 4 years ago

For cryptominers, I assume that we will base this on URL patterns, i.e. only consider known cryptominers? Are there any metrics on CPU usage when the site is visited? That could be useful to figure out whether the cryptominer starts right away on page load.

I would suggest changing the "Information Leakage" section to "Outdatedness". Showing which version of a web application you're running doesn't really leak that much information (e.g. compared to having a private key in an HTML comment). Perhaps we could also run a query on the latter by running a regex on all response bodies; there are some useful regexes in the gitleaks repo.
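As a hedged illustration of that idea, here are two simplified stand-in patterns (not gitleaks' actual rules) applied to a response body:

```javascript
// Gitleaks-style scan sketch: check a response body against a couple of
// well-known secret patterns. These regexes are simplified stand-ins for
// illustration only.
const LEAK_PATTERNS = {
  privateKey: /-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----/,
  awsAccessKeyId: /\bAKIA[0-9A-Z]{16}\b/, // classic AWS access key ID shape
};

function findLeaks(body) {
  return Object.keys(LEAK_PATTERNS).filter((name) => LEAK_PATTERNS[name].test(body));
}

console.log(findLeaks('<!-- -----BEGIN RSA PRIVATE KEY----- ... -->'));
// → ['privateKey']
console.log(findLeaks('<p>nothing to see here</p>'));
// → []
```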

rviscomi commented 4 years ago

FYI we have access to the cryptominers technology detection from Wappalyzer.

tomvangoethem commented 4 years ago

I tried to add a bit more structure to the outline to make things more generic; I think it makes it easier to reason about things. Let me know what you think; should I add it to the doc?

tomvangoethem commented 4 years ago

For custom metrics, maybe we also want to extract all <meta http-equiv="..."> elements? From the ones that @bazzadp linked, only the integrity attribute is needed
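A rough sketch of that extraction (regex-based so it's testable outside a browser; an in-page custom metric could instead map over `document.querySelectorAll('meta[http-equiv]')`):

```javascript
// Sketch: pull out <meta http-equiv="..."> name/content pairs from raw HTML.
// Regex parsing of HTML is fragile in general; this stands in for the DOM
// query a real custom metric would use.
function extractHttpEquiv(html) {
  const metas = html.match(/<meta\b[^>]*http-equiv\s*=[^>]*>/gi) || [];
  return metas.map((tag) => {
    const name = (tag.match(/http-equiv\s*=\s*["']?([^"'\s>]+)/i) || [])[1] || '';
    const contentMatch = tag.match(/content\s*=\s*(["'])(.*?)\1/i);
    return { name, content: contentMatch ? contentMatch[2] : '' };
  });
}

console.log(
  extractHttpEquiv('<meta http-equiv="Content-Security-Policy" content="default-src \'self\'">')
);
// → [ { name: 'Content-Security-Policy', content: "default-src 'self'" } ]
```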

rviscomi commented 4 years ago

@agektmr had some ideas for detecting WebAuthn via feature counters. He suggests using CredentialManagerGetPublicKeyCredential. Thank you Eiji!

rviscomi commented 4 years ago

@tomvangoethem this looks like a great list! I think it's a good idea to add it to the doc to make it easier to comment/iterate on specific parts.

cqueern commented 4 years ago

@rviscomi is there such thing as too much content? Or is that a question for later in the process?

rviscomi commented 4 years ago

Good question @cqueern. I think it's ok to have a lot of content, as long as the content team has the bandwidth to support writing/reviewing/analyzing it. For reference, here's how the chapters looked last year in terms of length:


edmondwwchan commented 4 years ago

I like @tomvangoethem's idea on how we can organize the materials. From the 2019 introduction we have the goals for this chapter:

From my point of view, "Drivers of security mechanism adoption" is more like a methodology to explain why a security feature is adopted. Perhaps the measurement results can be useful to conclude this chapter and to give some pointers in subsections to explain some of the observations.

Also, I would suggest adding the discussion of "Bad security practices on the web" as the goal for this year.

nrllh commented 4 years ago

@tomvangoethem well done. I added it to the doc for you, with some additional points and comments.

cc @cqueern @bazzadp @edmondwwchan @AAgar

AAgar commented 4 years ago

@edmondwwchan As for bad security practices, a few of the metrics might overlap with the current status of security on the web and available features. E.g., RSA vs. ECDSA: ECDSA could be considered better security, but it'd fall under current status (a percentage-type result), yet we could also classify using primarily RSA as a bad security practice.