foxdavidj closed this issue 3 years ago
I'd like to volunteer as an analyst. I've used HTTPArchive in some of my (academic) research, so I have some familiarity with the datasets.
@rviscomi as discussed recently, I would also like to join in on this chapter.
@tomvangoethem added you as an analyst :)
Hello Team. I would like to participate as a Reviewer please. :grinning:
@nrllh thank you for agreeing to be the lead author for the Security chapter! As the lead, you'll be responsible for driving the content planning and writing phases in collaboration with your content team, which will consist of yourself as lead, any coauthors you choose as needed, peer reviewers, and data analysts.
The immediate next steps for this chapter are:
There's a ton of info in the top comment, so check that out and feel free to ping myself or @obto with any questions!
I'm happy to review this chapter again this year btw. Added myself to first comment.
@ivanr @april would you have any interest in helping out with this chapter this year? Last year's chapter for context: https://almanac.httparchive.org/en/2019/security
@tomvangoethem, also assigned you as an author ;)
Hey @nrllh, just checking in:
@tomvangoethem @cqueern @bazzadp can you please request edit access and then credit yourself in the Google Doc?
Hello team, may I contribute as a reviewer too?
@edmondwwchan welcome to the club! Please request edit access and credit yourself in the doc.
@nrllh How's the chapter outline coming along? We want to have that wrapped up by the end of the week so we have time to set up our Web Crawler :)
As discussed on Slack, I think we should ask for custom metrics for `vulnCount`, Library (including version), Library (excluding version), and `highestSeverity` from the `no-vulnerable-libraries` Lighthouse metric:
"no-vulnerable-libraries": {
"description": "Some third-party scripts may contain known security vulnerabilities that are easily identified and exploited by attackers. [Learn more](https://developers.google.com/web/tools/lighthouse/audits/vulnerabilities).",
"title": "Includes front-end JavaScript libraries with known security vulnerabilities",
"score": 0,
"details": {
"items": [
{
"vulnCount": 4,
"detectedLib": {
"url": "https://snyk.io/vuln/npm:jquery?lh=1.4.4&utm_source=lighthouse&utm_medium=ref&utm_campaign=audit",
"text": "jQuery@1.4.4",
"type": "link"
},
"highestSeverity": "Medium"
}
],
"type": "table",
"headings": [
{
"text": "Library Version",
"itemType": "link",
"key": "detectedLib"
},
{
"text": "Vulnerability Count",
"itemType": "text",
"key": "vulnCount"
},
{
"text": "Highest Severity",
"itemType": "text",
"key": "highestSeverity"
}
],
"summary": []
},
"scoreDisplayMode": "binary",
"displayValue": "4 vulnerabilities detected",
"id": "no-vulnerable-libraries"
},
Related to https://github.com/HTTPArchive/almanac.httparchive.org/blob/main/sql/2019/08_Security/08_40.sql and https://github.com/HTTPArchive/almanac.httparchive.org/blob/main/sql/2019/08_Security/08_40b.sql, but we want more detail that is more easily queryable.
And also the `password-inputs-can-be-pasted-into` score from Lighthouse, though it's maybe less relevant for home pages, which HTTP Archive is restricted to. Any way to only include it when the site has a login form field?
"password-inputs-can-be-pasted-into": {
"description": "Preventing password pasting undermines good security policy. [Learn more](https://developers.google.com/web/tools/lighthouse/audits/password-pasting).",
"title": "Allows users to paste into password fields",
"score": 1,
"details": {
"items": [],
"type": "table",
"headings": []
},
"scoreDisplayMode": "binary",
"id": "password-inputs-can-be-pasted-into"
}
And the `external-anchors-use-rel-noopener` score and count of items, also from Lighthouse:
"external-anchors-use-rel-noopener": {
"description": "Add `rel=\"noopener\"` or `rel=\"noreferrer\"` to any external links to improve performance and prevent security vulnerabilities. [Learn more](https://developers.google.com/web/tools/lighthouse/audits/noopener).",
"warnings": [],
"title": "Links to cross-origin destinations are safe",
"score": 1,
"details": {
"items": [],
"type": "table",
"headings": []
},
"scoreDisplayMode": "binary",
"id": "external-anchors-use-rel-noopener"
}
@pmeenan not sure if you saw this Slack conversation, but is it possible to configure the run to crawl additional metadata URLs like `/.well-known/change-password` and `/security.txt`?
Probably not as part of the crawl since those aren't actually loaded by the page itself. A custom metric might be able to pull them with fetch since they are same-origin but it kind of feels like something that a separate curl script run once against the URL list might be better for.
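A minimal sketch of that "separate script against the URL list" idea, assuming the origins come from the crawl's URL list (the actual fetch is commented out here to stay offline):

```javascript
// Sketch: build the security-metadata URLs to probe for each origin. The
// origins below are illustrative stand-ins; a real run would read the HTTP
// Archive URL list and fetch each URL with curl/fetch, recording the status.
const origins = ['https://www.apple.com', 'https://example.com'];
const paths = ['/.well-known/change-password', '/security.txt'];

const probeUrls = origins.flatMap((origin) => paths.map((path) => origin + path));
console.log(probeUrls);
// for (const url of probeUrls) { /* fetch(url) and record the status code */ }
```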
I can volunteer as an additional analyst if you need it
@aagar you are welcome!
TLS
Security Headers
Cookies
WebAssembly
Lighthouse
Cross-Site Request Forgery
Information Leakage
Other
@tomvangoethem @cqueern @bazzadp @edmondwwchan @AAgar
I just migrated the metrics from last year's Almanac and added some new points. Please review it and make some recommendations ^^
@nrllh looks pretty great. This is exciting. I am for sure interested in SRI on subresources so look forward to seeing how we can address it.
SRI is so over-rated and pointless IMHO. If you can self-host, you should. If you can't because it changes frequently, then you can't use SRI. So what's the point? Anyway, I digress...
@bazzadp - I think it's worth including SRI usage in the chapter, and if usage is low, maybe the reasons are as you stated in the comment above, and we could add that as food for thought in the chapter (maybe).
Oh, not saying don't include it, just interjecting a personal opinion into the conversation. As I said, a digression. Back to chapter planning!
@nrllh your list looks pretty good. Particularly interested in the following metrics but uncertain if it's easy/possible to measure from the HTTP archive dataset:
@nrllh @tomvangoethem - Any thoughts on including Bot detection solutions (e.g. PerimeterX / CyberFend) in scope of this chapter? We can do a quick PR in Wappalyzer for top vendors. I raised a feature request to add 'Security' as category in Wappalyzer - https://github.com/AliasIO/wappalyzer/issues/3226
@edmondwwchan I'm also not sure if we can measure them; there are now some limitations on table requests, so I couldn't check it either, but I'll add these to the list.
@rockeynebhwani It's a good idea, but I don't know if we have enough data delivered by Wappalyzer. The following query returns five rows, and the distribution doesn't look very significant:

```sql
SELECT app, COUNT(app) AS count
FROM httparchive.technologies.2020_06_01_desktop
WHERE category = 'Captchas'
GROUP BY app
```
Even if Wappalyzer recognizes more products in this category in the coming weeks (or months), I don't know if the HTTPArchive crawler will have a chance to support it (@bazzadp?)
Few other thoughts:
@nrllh - as I understand from @rviscomi, if we are able to work on the issue I raised on the Wappalyzer GitHub and submit a PR next week, the HTTPArchive crawl on 1st Aug will be able to provide us this insight.
@nrllh - WebAuthn adoption will be an interesting one. I remember reading about eBay implementing it (https://tech.ebayinc.com/product/ebay-makes-mobile-web-login-easier/), and support in Safari is coming in iOS 14, so adoption should increase in the coming days.
I don't know how to detect this. I can see webAuthn references on eBay (https://www.ebay.com/signin/)
@senthilp any idea?
@rockeynebhwani there are two main calls for WebAuthn (`navigator.credentials.create` and `navigator.credentials.get`). I found 2442 URLs that contain JavaScript files with these calls. But of course, this doesn't mean all these websites provide this functionality for their users.
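The body scan described here could be as simple as a regex check (a sketch; a match only shows the calls are present in the code, not that the site actually offers WebAuthn login):

```javascript
// Sketch: flag response bodies that reference the two main WebAuthn entry
// points, navigator.credentials.create() and navigator.credentials.get().
const WEBAUTHN_RE = /navigator\.credentials\.(create|get)\s*\(/;

function referencesWebAuthn(body) {
  return WEBAUTHN_RE.test(body);
}

console.log(referencesWebAuthn('navigator.credentials.create({ publicKey: opts })'));
console.log(referencesWebAuthn('document.createElement("div")'));
```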
Ideally we shouldn't be scanning the response bodies for patterns. It's expensive and flaky. In this case I wonder if there are existing Blink feature counters that we can query. For example maybe CredentialManagerGetReturnedCredential?
@rviscomi it seems there are some security-related feature counters, but it's hard to get the context of these features; they're really not well documented.
@tomvangoethem @cqueern @bazzadp @edmondwwchan @AAgar
I updated the list of metrics. Let's try to get the final version for that please by Tuesday. The core team needs some time to configure the crawler. I also introduced sections for outline (based on our metrics) in our Doc. Please feel free to edit it.
@nrllh @rockeynebhwani The problem with looking at tech like Captchas is that many of these run on Contact Us forms and the like. Many of which are not found on a homepage, but instead on a subpage like /contact-us/.
Since we only look at the homepage of every site we crawl, the data we'd gather and report could be wildly different than what the real usage numbers are.
@obto - Agree. Very few sites deploy bot protection site wide. Only question we will be able to answer what % of sites have bot protection deployed on HomePage.
Or, wishful thinking for the future, if the `.well-known/change-password` standard picks up:
An example WPT where I tried to fetch www.apple.com/.well-known/change-password using WPT custom script as apple has adopted this - https://www.webpagetest.org/result/200719_DR_36cb98ba5bc7d2b3dfba19511bdf26b3/1/details/#step1_request20
A site is bound to have bot protection on such pages, but we will still miss sites where there is no login function.
@nrllh comments on your list
Cookies - I think we can do a bit more here, especially as the Cookies chapter was closed. How many cookies are set? What size are they? How many are 1st party, how many 3rd party? How many 3rd party cookies does the average site set? What are the differences between the attributes for 3rd party versus 1st party cookies? I'd imagine most 3rd party analytics don't set the `Secure` flag, for example. Also, we'll need to clarify in this section that the crawl is run from US servers. That applies to the whole chapter (and the whole Web Almanac), but especially for cookies, given some regions (like the EU) have much stricter rules on setting cookies without consent, and sites are starting to respect those laws.
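A sketch of how the per-cookie attributes could be tallied from a `Set-Cookie` header value (1st- vs 3rd-party classification would additionally need the page origin, which is omitted here):

```javascript
// Sketch: parse a Set-Cookie header value and report the security-relevant
// attributes discussed above (Secure, HttpOnly, SameSite) plus the raw size.
// Note: the SameSite value is returned lowercased for easy aggregation.
function parseSetCookie(header) {
  const [nameValue, ...attrs] = header.split(';').map((s) => s.trim());
  const lower = attrs.map((a) => a.toLowerCase());
  const sameSite = lower.find((a) => a.startsWith('samesite='));
  return {
    name: nameValue.split('=')[0],
    size: header.length,
    secure: lower.includes('secure'),
    httpOnly: lower.includes('httponly'),
    sameSite: sameSite ? sameSite.split('=')[1] : null,
  };
}

console.log(parseSetCookie('id=abc123; Secure; HttpOnly; SameSite=Lax'));
```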
Crypto-miner - like this, but why under the WASM category? It's happening under JavaScript too.
Information Leakage - presume you're talking about the `Server` and `X-Powered-By` type headers? Do we have a list of them? I have:

`Server`
`X-Powered-By`
`X-AspNetMvc-Version`
`X-AspNet-Version`

Any others?

Also, as mentioned previously, that's a long list. I think it's good to run it all, especially as most queries are already available from last year, but personally I would be more selective about what we actually publish, to avoid it being just a listing of settings, and instead allow us to give more commentary on a shorter list. I think there's more value in that.
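A sketch of filtering a response's header map down to the version-leaking ones, using the candidate list above (header names are lowercased for comparison; extend the list as more candidates are found):

```javascript
// Sketch: pick out version-leaking headers from a response header map.
const LEAKY_HEADERS = ['server', 'x-powered-by', 'x-aspnet-version', 'x-aspnetmvc-version'];

function leakyHeaders(headers) {
  return Object.entries(headers)
    .filter(([name]) => LEAKY_HEADERS.includes(name.toLowerCase()))
    .map(([name, value]) => `${name}: ${value}`);
}

console.log(leakyHeaders({
  Server: 'Apache/2.4.29',
  'X-Powered-By': 'PHP/5.6.40',
  'Content-Type': 'text/html',
}));
```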
@AAgar @tomvangoethem we should look to convert the expensive and unreliable body scanning metrics to custom metrics. The security ones are here and the media analysts have opened a pull request for their custom metrics which will serve as a good template of what to do. This needs to be in place by August 1st so can one or both of you work on it this week? Once you've got used to how to convert the existing ones, that should stand you in good stead to look at the new metrics and see if they need any of the same.
@bazzadp

> Cookies...

I'll update the list.

> Crypto-miner - like this, but why under WASM category? It's happening under JavaScript too.

The WASM chapter is closed, and I think it's better to have it as a category here; we also want to share some stats about WASM's usage (vs. usage by crypto-miners).

> Information Leakage

We have them in the `technologies` table as:

We may also include the categories `CMS`, `Blog`, etc...

> Think there's more value in that.

👍
For cryptominers, I assume that we will base this on URL patterns, i.e. only consider known cryptominers? Are there any metrics on CPU usage when the site is visited? That could be useful to figure out whether the cryptominer was started right from page load.
I would suggest changing the "Information Leakage" section to "Outdatedness". Showing which version of a web application you're running is not really leaking that much information (e.g. compared to having a private key in an HTML comment). Perhaps we could also run a query on the latter by running a regex on all response bodies; there are some useful regexes in the gitleaks repo.
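A sketch of that body scan with two illustrative patterns (a PEM private key block and an AWS access key ID); the real gitleaks rule set is much larger:

```javascript
// Sketch: gitleaks-style scan of a response body for leaked secrets.
const SECRET_PATTERNS = [
  { name: 'private-key', re: /-----BEGIN (?:RSA |EC |DSA |OPENSSH )?PRIVATE KEY-----/ },
  { name: 'aws-access-key-id', re: /\bAKIA[0-9A-Z]{16}\b/ },
];

function findSecrets(body) {
  return SECRET_PATTERNS.filter(({ re }) => re.test(body)).map(({ name }) => name);
}

console.log(findSecrets('<!-- -----BEGIN RSA PRIVATE KEY----- ... -->'));
console.log(findSecrets('<p>nothing to see here</p>'));
```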
FYI we have access to the cryptominers technology detection from Wappalyzer.
I tried to add a bit more structure to the outline to make things more generic; I think it makes it easier to reason about things. Let me know what you think; should I add it to the doc?
- `Secure` attribute on cookies
- `__Secure-` prefix on cookies
- `__Host-` prefix on cookies
- Secure cookie?
- `*-src` directives
- `<iframe>` `sandbox`
- `frame-ancestors`, trusted types, ...

For custom metrics, maybe we also want to extract all `<meta http-equiv="...">` elements? From the ones that @bazzadp linked, only the `integrity` attribute is needed.
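A regex-based sketch of that extraction, runnable outside a browser (it assumes the attribute order `http-equiv` then `content`; a real custom metric running in the page would just use `document.querySelectorAll('meta[http-equiv]')`):

```javascript
// Sketch: extract <meta http-equiv="..."> name/content pairs from an HTML
// string. Only handles the common attribute order; good enough for a sketch.
function metaHttpEquiv(html) {
  const re = /<meta\s+http-equiv=["']([^"']+)["']\s+content=["']([^"']*)["']/gi;
  return [...html.matchAll(re)].map(([, name, content]) => ({ name, content }));
}

console.log(metaHttpEquiv('<meta http-equiv="Content-Security-Policy" content="default-src https:">'));
```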
@agektmr had some ideas for detecting WebAuthn via feature counters. He suggests using `CredentialManagerGetPublicKeyCredential`. Thank you Eiji!
@tomvangoethem this looks like a great list! I think it's a good idea to add it to the doc to make it easier to comment/iterate on specific parts.
@rviscomi is there such thing as too much content? Or is that a question for later in the process?
Good question @cqueern. I think it's ok to have a lot of content, as long as the content team has the bandwidth to support writing/reviewing/analyzing it. For reference, here's how the chapters looked last year in terms of length:
I like @tomvangoethem's idea on how we can organize the material. From the 2019 introduction we have the goals for this chapter:
From my point of view, "Drivers of security mechanism adoption" is more like a methodology to explain why a security feature is adopted. Perhaps the measurement results can be useful to conclude this chapter and to give some pointers in subsections to explain some of the observations.
Also, I would suggest adding the discussion of "Bad security practices on the web" as the goal for this year.
@tomvangoethem well done. I added it to the doc for you, with some additional points and comments.
cc @cqueern @bazzadp @edmondwwchan @AAgar
@edmondwwchan As for bad security practices, a few of the metrics might have overlap with current status of security on the web and available features, e.g. RSA vs ECDSA could be considered better security but it'd fall under current status (percentage type result), yet we could also classify using primarily RSA as a bad security practice.
Part II Chapter 11: Security
Content team
Content team lead: @nrllh
Welcome chapter contributors! You'll be using this issue throughout the chapter lifecycle to coordinate on the content planning, analysis, and writing stages.
The content team is made up of the following contributors:
New contributors: If you're interested in joining the content team for this chapter, just leave a comment below and the content team lead will loop you in.
Note: To ensure that you get notifications when tagged, you must be "watching" this repository.
Milestones
0. Form the content team
1. Plan content
2. Gather data
3. Validate results
4. Draft content
5. Publication