EFForg / https-everywhere

A browser extension that encrypts your communications with many websites that offer HTTPS but still allow unencrypted connections.
https://eff.org/https-everywhere

how diverse is https-everywhere #8900

Closed 00h-i-r-a00 closed 7 years ago

00h-i-r-a00 commented 7 years ago

Hello,

I hope this is the right place to ask such a question, but I was wondering about the diversity of the websites in HTTPS Everywhere. Are websites from other countries and languages well represented? Also, I was wondering how many of the top Alexa websites are in the list. Can I say with reasonable certainty that, for example, Alexa's top websites (let's say the top 10K), by virtue of being ranked highly, must be in the HTTPS Everywhere list? I just wanted to know if anyone has any idea about this, has ever thought about it, or knows of any study on it. It would be great to get some comments on this. Thanks!

Folant commented 7 years ago

grep 'target host="' | wc -l
120315 hosts, 1334 of them are commented out.

grep 'ruleset name="' | wc -l
22376 rulesets

ls-files | grep -i '\.[domain]\.xml' | wc -l
6552 rules have .com domains. (Maybe more, since several rules don't have the domain in their name.)
1802 - .org
934 - .net
790 - .uk
587 - .de
406 - .ru
206 - .se
164 - .nl
96 - .fr
and so on.

After grep -i -h 'target host="' > targets.txt, some trimming, and comm -1 -2 targets.txt alexa10k.txt > alexa10kcovered.txt, I got 2002 Top 10K Alexa domains covered by HTTPS Everywhere (1250 of them are .com).
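For anyone who wants to repeat this kind of count, here is a minimal Python sketch of the same idea as the grep and comm steps above. It is not the commands actually run for the numbers in this comment. The function names, the sample ruleset, and the sample top-sites list are all made up for illustration. It assumes targets are written as `<target host="..." />`, it drops targets inside XML comments, and it only finds exact host matches (just as comm does), so wildcard targets and subdomain-only rules would be undercounted.

```python
import re
from collections import Counter

# Matches XML comments so that commented-out targets are not counted.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# Matches the host attribute of a <target> element.
TARGET_RE = re.compile(r'<target\s+host="([^"]+)"')


def extract_targets(xml_text):
    """Return the set of active (not commented-out) target hosts in ruleset XML."""
    active = COMMENT_RE.sub("", xml_text)
    return {host.lower() for host in TARGET_RE.findall(active)}


def covered(targets, top_sites):
    """Return the top-list domains that appear verbatim among the targets."""
    return sorted(targets & {site.lower() for site in top_sites})


def tld_counts(hosts):
    """Count hosts by their last DNS label (for example 'com' or 'de')."""
    return Counter(host.rsplit(".", 1)[-1] for host in hosts)


# Hypothetical sample data, not taken from the real rulesets or the Alexa list.
sample = """<ruleset name="Example">
  <target host="example.com" />
  <target host="www.example.com" />
  <!-- <target host="old.example.org" /> -->
  <target host="example.de" />
</ruleset>"""
top = ["example.com", "example.de", "missing.net"]

targets = extract_targets(sample)
print(covered(targets, top))
print(tld_counts(targets))
```

To run it against a real checkout, you would read every ruleset file, union the results of `extract_targets`, and load the top-sites list from a file; those paths depend on your local setup and are left out here.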

jeremyn commented 7 years ago

(I'm a volunteer ruleset maintainer for this project, and not part of EFF, and this is just my own point of view.)

@00h-i-r-a00 This is a great question and an appropriate place to ask it.

There is an automated, recurring process that tags pull requests based on their place in the Alexa top million rankings. You can see these tags in the pull request list. This is meant to encourage attention toward higher-ranked sites. There was a good discussion about this in https://github.com/EFForg/https-everywhere/issues/6424. I made a comment there about how focusing on global rankings might bias work toward more technologically developed communities, because they contribute more internet traffic per person, even though less developed communities have the same or perhaps greater need for privacy and security, at least when viewing certain sites. However, I still think the visibility we get from the current Alexa tagging is good to have.

In my opinion there is a big problem with lack of diverse coverage, and not only across national lines. About a month ago (9bc9eb401ad0635b3899e60cac2a09aeaf830c8a as an example), if you ran git grep -i for various terms related to Islam, you'd get very few results, fewer than 20 out of 20000+ rulesets, including results like Islamabad in usembassy.gov.xml. I submitted some dozens of rulesets and also mentioned this specific lack of coverage to the EFF, and they said they would look into it (and I believe that they would/did). However, I'm not part of the Muslim community and relied mostly on Google searches and the Alexa list to find domains. I may not be able to tell the difference between an important site and an unimportant site for that group, and that's part of the problem too.

Submitting an issue or pull request as a contributor requires English skill, technical knowledge, a working Internet connection, some minimal amount of computer hardware, free time, and a stable enough life to be around when a reviewer contacts you about your issue. Then as far as reviewers go, there is a small group of EFF staff, who don't normally review rulesets and have a lot of other demands on their time, and a small number of unpaid volunteers like myself who can work on what we like. Volunteers have all the same soft requirements that I described before. Both contributors and volunteers tend to prefer working on things that are personally interesting to them. Put all of that together and you can start to see why some areas are heavily represented and others are almost entirely absent.

I don't know what the fix for this is, probably a combination of more reviewers, more diverse reviewers, and more diverse contributors. At the same time we absolutely do not want to turn away existing contributors or make them feel like their rulesets are not wanted or are crowding out more "worthy" issues. Making even a small contribution shows support for the project and we want that support.

Tagging @gloomy-ghost @Hainish @J0WI for their thoughts.

J0WI commented 7 years ago

I think we cannot draw any conclusion about diversity just by counting domain names. For example, .com is the most common TLD, so it makes sense that it has the most rules. Also, TLDs are mostly not limited to a specific region or community, and there are companies and organizations that operate worldwide from the same domain. On the other hand, some communities or countries care more about HTTPS than others.

I recommend having a look at Google's metrics on this topic: https://www.google.com/transparencyreport/https/metrics/#country But note that those statistics also don't show whether users were just browsing social media and shopping sites (which are generally on HTTPS) or whether they really visited many different pages related to their country.

I hope that we are diverse, but I have no idea how to measure it.

jeremyn commented 7 years ago

I'm closing this since the discussion is no longer active.