freedomofpress / securethenews

An automated scanner and web dashboard for tracking TLS deployment across news organizations
https://securethe.news
GNU Affero General Public License v3.0
100 stars, 25 forks

Feed results into HTTPS Everywhere #140

Open Hainish opened 6 years ago

Hainish commented 6 years ago

Since you're already maintaining a list of news sites with HTTPS support, you could easily auto-generate rulesets for HTTPS Everywhere.

Any site that is fully available over HTTPS (i.e. no content becomes unavailable when the domain is loaded only over HTTPS) but is not HSTS preloaded is eligible for inclusion in HTTPS Everywhere.

To create a new HTTPS Everywhere ruleset, you can clone the repo and run a simple ruleset generation script:

git clone git@github.com:EFForg/https-everywhere.git
cd https-everywhere/rules
./make-trivial-rule ExampleNewsSite.com

You can follow the common format generated from this example to create rules for other sites and subdomains. Refer to https://github.com/EFForg/https-everywhere/blob/master/CONTRIBUTING.md for full contribution documentation.
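For batch generation from a list of domains, the same kind of rule could be produced programmatically. The following is a minimal sketch, assuming the usual trivial-ruleset layout (a `ruleset` element, one `target` per host, and a single `http:` to `https:` rewrite). The function name `trivial_ruleset` and the choice to add a `www.` target are assumptions made for illustration; the output of `make-trivial-rule` itself and CONTRIBUTING.md remain the authority on the format.

```python
from xml.sax.saxutils import quoteattr


def trivial_ruleset(domain: str) -> str:
    """Return a minimal ruleset that rewrites http: to https: for one domain.

    Hypothetical helper: the layout is assumed, not copied from the
    HTTPS Everywhere tooling.
    """
    host = domain.lower()
    # Assumption: cover both the bare host and its www. variant.
    targets = [host] if host.startswith("www.") else [host, "www." + host]

    lines = [f"<ruleset name={quoteattr(domain)}>"]
    lines += [f"\t<target host={quoteattr(t)} />" for t in targets]
    lines.append('\t<rule from="^http:" to="https:" />')
    lines.append("</ruleset>")
    return "\n".join(lines) + "\n"


print(trivial_ruleset("ExampleNewsSite.com"))
```

A generated file would still need the checks described in the contribution documentation before being submitted.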

thisisparker commented 6 years ago

Hi Bill! Excited to get a chance to work on this :) It's pretty close to an opportunity I identified back in February so really I feel like I'm already six months behind on delivering.

I hope that identifying which domains are eligible for new rules is as easy as you suggest, but I'm worried about how we could pick out which sites are "fully available over HTTPS." Before we figure out how to automate this, I'd like to walk through what I'd do to generate a one-time dump of new rules.

The scorecard has a field called "Available over HTTPS" which is actually a combination of the scraper properties "valid_https" (which must be true) and "downgrades_https" (which must be false). That's certainly a start — of the 131 sites on the scorecard, fully half (66) are in the Goldilocks zone of being available over HTTPS but not HSTS preloaded.
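As a rough illustration of that filter, here is a minimal sketch. `valid_https` and `downgrades_https` are the scraper properties named above; the `ScanResult` class, the `hsts_preloaded` field name, and the sample domains are assumptions invented for the example, not the scorecard's actual schema or data.

```python
from dataclasses import dataclass


@dataclass
class ScanResult:
    """Hypothetical stand-in for one site's latest scan."""
    domain: str
    valid_https: bool
    downgrades_https: bool
    hsts_preloaded: bool  # assumed field name


def available_over_https(scan: ScanResult) -> bool:
    # The scorecard's "Available over HTTPS" combination.
    return scan.valid_https and not scan.downgrades_https


def rule_candidates(scans: list[ScanResult]) -> list[str]:
    # Available over HTTPS but not yet HSTS preloaded.
    return [
        s.domain
        for s in scans
        if available_over_https(s) and not s.hsts_preloaded
    ]


# Made-up sample data, one row per case.
sample = [
    ScanResult("a.example", True, False, False),   # candidate
    ScanResult("b.example", True, False, True),    # already preloaded
    ScanResult("c.example", True, True, False),    # downgrades to HTTP
    ScanResult("d.example", False, False, False),  # no valid HTTPS
]
print(rule_candidates(sample))  # ['a.example']
```

Note that this only captures what the scanner can see; it does not establish that a site is "fully available over HTTPS" in the sense the ruleset requires.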

Of those 66, about 44 look like they already have rules: without delving into the contents of the XML, each of these domains, or its slug, already appears in the name of a rule. I've listed them at the bottom of this issue.
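The name check can be sketched roughly as follows. This is an approximation of the manual pass, not the exact procedure used: the helper names, the normalization, the "first label" definition of a slug, and the rule file names in the example are all assumptions.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def slug(domain: str) -> str:
    """Assumed slug: the first label of the domain, ignoring a leading www."""
    host = domain.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return host.split(".")[0]


def has_rule_by_name(domain: str, rule_names: list[str]) -> bool:
    """True if the domain or its slug appears in any rule name."""
    needles = {normalize(domain), normalize(slug(domain))}
    return any(
        needle in normalize(name)
        for name in rule_names
        for needle in needles
    )


# Hypothetical rule file names, for illustration only.
names = ["Ars-Technica.xml", "Example.org.xml"]
print(has_rule_by_name("arstechnica.com", names))  # True
print(has_rule_by_name("axios.com", names))        # False
```

Substring matching is deliberately loose, so very short slugs such as `ft` or `wp` will match unrelated rule names. A match here is only a hint that a rule exists, and confirming it means looking at the targets inside the XML.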

That leaves about 20 sites that might be rule-eligible. That's likely the ceiling, too: poking around will certainly turn up rules that are slightly irregularly named and, in some cases, sites that our scanner identifies as "available over HTTPS" but that aren't "fully available over HTTPS," as you specify.

This is probably a small enough number that it makes sense to check those 20 or so domains by hand to confirm that (a) they actually don't have a rule and (b) everything works over HTTPS. Unless that number changes dramatically, say because STN starts tracking many more sites, my inclination would be to just repeat this process manually every once in a while.

How does that work?

Eligible domains that may not have existing rules

axios.com
cbsnews.com
cnet.com
indiatimes.com
infobae.com
oglobo.globo.com
gazzetta.it
lastampa.it
mic.com
nzz.ch
onet.pl
scroll.in
techcrunch.com
theatlantic.com
theguardian.com
nytimes.com
thetimes.co.uk
thestar.com
theundefeated.com
univision.com
usnews.com
washingtonpost.com

Eligible domains with existing rules found

abcnews.go.com
alarabiya.net
arstechnica.com
ap.org
bloomberg.com
bostonglobe.com
buzzfeed.com
cnbc.com
cnn.com
welt.de
elpais.com
ft.com
forbes.com
foxnews.com
gizmodo.com
golem.de
heise.de
hongkongfp.com
lemonde.fr
nbcnews.com
nypost.com
nj.com
nrk.no
politico.com
propublica.org
qz.com
reuters.com
salon.com
taz.de
thedailybeast.com
theglobeandmail.com
independent.co.uk
themoscowtimes.com
newyorker.com
theverge.com
wsj.com
weather.com
usatoday.com
vanityfair.com
vice.com
vox.com
washingtontimes.com
wired.com
wp.pl

thisisparker commented 6 years ago

Sorry, didn't mean to close!

brainwane commented 6 years ago

@thisisparker We spoke a couple weeks ago about your PRs and whether any of the ruleset generation scripts helped you get further -- any progress?