Closed — pgerber closed this issue 7 years ago.
@pgerber Thanks very much for stepping forward to do this. Our documentation could really use some improvement.
This has turned into an info-dump, I hope it's not too messy.
The style guide should be merged with `CONTRIBUTING.md`, which has the great advantage of being linked whenever someone makes a new pull request. There's also documentation linked in the sidebar at https://www.eff.org/https-everywhere that needs to be merged or otherwise reconciled with this documentation.
When I review a list of domains, it almost always involves me manually loading the domains in a browser in the order they are listed. That's what the sort order optimizes for.
Grouping by top-level domain, reading right to left, groups subdomains together. This is helpful for testing because subdomains are usually similar to one another.
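As an illustration of that ordering (using made-up domains, and reflecting my understanding of the preferred order rather than any agreed spec), the `target`s in a ruleset would end up looking something like this, with each registered domain's subdomains kept adjacent:

```xml
<!-- Hypothetical hosts, grouped by registered domain, read right to left: -->
<target host="example.com" />
<target host="www.example.com" />
<target host="api.example.com" />
<target host="static.example.com" />
<target host="example.net" />
<target host="www.example.net" />
```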
Grouping `^` with `www` is less compelling, but they are very often aliases of one another, so it's nice for them to be together. There's also the argument I gave in https://github.com/EFForg/https-everywhere/pull/7643#issuecomment-262670369 about them both being especially important for HSTS preloading, but that's not a strong argument.
Wildcards are discouraged but not forbidden. They are handy sometimes, for example:
- `Google.tld_Subdomains.xml`
- `360.cn.xml`
The problem with wildcards arises when inexperienced contributors use a wildcard to cover only a few simple domains, because they don't know better and the documentation is confusing.
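For illustration (hypothetical domains), the misuse being described looks like this:

```xml
<!-- Discouraged: a wildcard used when only a couple of hosts actually exist -->
<target host="*.example.com" />

<!-- Preferred: explicitly list the hosts that were actually tested -->
<target host="example.com" />
<target host="www.example.com" />
```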
In a perfect world, we would list non-working hosts in tags so they could be visible to automated tools. But in the actual world, yeah we should just list them manually in a comment.
PR https://github.com/EFForg/https-everywhere/pull/6868 has some history on this.
My own preference is what you see here: https://github.com/EFForg/https-everywhere/pull/5390#issuecomment-262837348. Here's an edited short version for easy reference:
```xml
<!--
	Invalid certificate:
		8marta.glavbukh.ru
		forum2.glavbukh.ru (incomplete certificate chain)
	Redirect to HTTP:
		8marta2013.glavbukh.ru
		den.glavbukh.ru
	Refused:
		e.glavbukh.ru
		www.e.glavbukh.ru
	Time out:
		psd.glavbukh.ru
		str.glavbukh.ru
-->
```
Again, I'm optimizing for me clicking down a list. I want each category grouped together because I might test each group differently. For example, I can test `Invalid certificate` by just opening the HTTPS URL, but `Redirect to HTTP` requires me to watch the network pane in the browser dev tools, and for `Time out` I should open both HTTP and HTTPS to make sure HTTP doesn't also time out. `(incomplete certificate chain)` is easy to miss in testing, so it's nice to point that out. I don't care why a certificate is invalid, other than for `(incomplete certificate chain)`, and it can be invalid for multiple reasons anyway.
The more testing I do, the more I feel the footnote approach is just wrong. Superscript footnotes are worse on the eyes and, thus, more wrong.
The other thing to optimize for is: will this confuse a future reviewer? Will the reviewer be able to tell unambiguously when the problem is fixed? Something like `Receives 403 with HTTPS only` is easy to check, but `unexpected http status code` is mysterious and scary.
I like to organize the error categories alphabetically -- `Invalid`, `Redirect`, `Refused` -- just so they are deterministically ordered.
I've recently, sometimes, started kicking back `target`s that don't have a `200`/`3xx` return code, or insisting on `test`s. See the discussion in https://github.com/EFForg/https-everywhere/issues/7662. This is common for `static.` or `api.` domains that don't serve working content at the domain root, such as `static.example.com`. I've labeled these `No working URL known`.
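A sketch of what I mean, with a hypothetical asset path: since the root of `static.example.com` serves nothing useful, a `test` tag points the ruleset checker at a URL that does work:

```xml
<!-- No working URL known at the root; test a real asset instead
     (the path below is made up for illustration): -->
<target host="static.example.com" />
	<test url="http://static.example.com/css/main.css" />
```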
This sort of comment: `For other Migros coverage, see Migros.xml.` is definitely appropriate, in both directions.
For simple sites, either a site description or the domain itself is fine for the ruleset `name` attribute. For example, this ruleset could have a `name` of `Seattle Aquarium`, `SeattleAquarium.org`, or `seattleaquarium.org`. It's nice for the ruleset creator to have some ownership over the name of the thing they just made. Also, they might be familiar with the site and know how to label it for other users to recognize.
This can get tricky for domains in languages I don't know. For example, I don't speak Hungarian, and the file `NAV.xml` has a `name` of `Nemzeti Adó- és Vámhivatal` for a ruleset around a set of domains `*.nav.gov.hu`. I could push back and say "Change it to `nav.gov.hu` because I don't understand the Hungarian name", but that would be rude of me. If I go to https://nav.gov.hu, I see that name shown on the page, so I trust it.
If a ruleset covers multiple domains, then the ruleset `name` shouldn't be one domain. For example, in pull request https://github.com/EFForg/https-everywhere/pull/6230 I changed the `name` from `hoccer.com` to `Hoccer` when I asked for coverage of `hoccer-project.com`.
Really complicated rulesets or groups of rulesets like `Microsoft.xml` are case-by-case.
Filenames should vaguely resemble the `name` so that someone looking for the file based on the `name` can find it without grepping. I personally prefer that filenames start with a capital letter.
There is a set of tests, which can be run independently and which run as part of Travis, that ensure things basically work. This is documented in `README.md`.
There is no prepared tool for testing, for example, mixed content blocking. It should be possible to make such a tool, but no one has done that yet. The latest discussion for creating rulesets is in issue https://github.com/EFForg/https-everywhere/issues/7691 but there have been other discussions before.
Regarding `make-trivial-rule`: @J0WI helpfully improved `make-trivial-rule` in https://github.com/EFForg/https-everywhere/pull/6868, but in practice I don't like it and I never use it. I copy an existing rule that I trust and modify it. We should make an `Example.com.xml` rule and tell people to copy that.
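Something like the following could serve as that `Example.com.xml` starting point. This is only a sketch of the usual ruleset shape, not an agreed-upon template:

```xml
<ruleset name="Example.com">
	<target host="example.com" />
	<target host="www.example.com" />

	<!-- Mark cookies from covered hosts as Secure: -->
	<securecookie host=".+" name=".+" />

	<rule from="^http:" to="https:" />
</ruleset>
```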
We need to tell people what to expect when they submit a pull request: it goes in the queue, they wait a while, a reviewer makes some recommendations, they address those recommendations, and it's merged. In particular, I personally feel people should be warned that we will squash-and-merge their commits; I currently ask people ahead of time, and I would like to not have to do that explicitly anymore. I think I'm the only one who feels that's important though; see the discussion in https://github.com/EFForg/https-everywhere/issues/5838.
I want to minimize the frustration caused when either a contributor or a reviewer does work and the work isn't acknowledged in a timely way. We can't always improve the speed of response, but we can improve expectations. I use the assign function as I wrote in https://github.com/EFForg/https-everywhere/issues/7242#issuecomment-251006584:
I can only speak for myself here, not the other reviewers. If I ask you to do some work on a PR and I assign the PR to myself, that means I'm giving you some assurance -- but not a firm promise -- that I will eventually review your changes and work with you until the PR is merged or closed. If I comment on a PR but don't assign it to myself, then I'm just making a comment and you shouldn't assume I'm going to review your work in more detail or even answer questions. Again, this is just my own, current approach. I'm not speaking for any other reviewer and this might change at any time.
Related to the above, we should let contributors know that there is not a huge group of reviewers working on these pull requests. Rather, there's a small staff with a lot of responsibilities and a small number of volunteers who have lots of discretion in when and how much they work and what they work on.
We should discourage people, especially new contributors, from submitting a ton of pull requests at once. Everyone loses when a contributor spends a ton of time making a large number of bad pull requests.
We should give tips for submitting good issues: describe your operating system and browser and the version of each, give a specific URL, describe the behavior, include a screenshot if appropriate. Here is a comment where I describe some problems with not-so-well-written issues.
Each pull request should be as atomic as possible. At the moment I'm not sure how to word that in a less technical way but I hope my meaning is clear.
It would be nice if contributors checked the HSTS preload list before creating new pull requests. There's a related discussion in issue https://github.com/EFForg/https-everywhere/issues/7126.
Users who want to report a security problem should be advised to do what is described in https://www.eff.org/security .
We should describe the difference between good and bad mixed content warnings. There's a related discussion in issue https://github.com/EFForg/https-everywhere/issues/6629.
Rulesets that behave differently inside the GFW have been especially tricky to test. See pull requests https://github.com/EFForg/https-everywhere/pull/6889 and https://github.com/EFForg/https-everywhere/pull/7586. It would be nice if we could document this somehow.
`securecookie` tags can be very confusing and should be clarified. There's a related discussion in issue https://github.com/EFForg/https-everywhere/issues/2907.
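For context, a typical `securecookie` tag looks like the following; `host` and `name` are regular expressions matched against a cookie's domain and name, and matching cookies get the `Secure` flag added:

```xml
<!-- Flag every cookie set by hosts this ruleset covers as Secure: -->
<securecookie host=".+" name=".+" />
```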
When to disable and when to delete? There's a related discussion in issue https://github.com/EFForg/https-everywhere/issues/7338.
When should reviewers enable the `Allow edits from maintainers` option, and what does it mean? There's a related discussion in https://github.com/EFForg/https-everywhere/issues/7242.
How should we discuss Sublist3r? What options should contributors use? There's a related discussion in https://github.com/EFForg/https-everywhere/issues/7277.
Pinging @fuglede @gloomy-ghost @Hainish @J0WI for their input.
I think #7717 should have been made a part of this discussion, but I guess it's too late.
@jeremyn Don't forget China is not the only country that practices internet censorship. In my opinion, internet censorship is out of scope for HTTPS Everywhere. It's unpredictable, it's difficult to fix issues related to it, and it's very unlikely that censors would change their methods of blocking at HTTPS Everywhere's request. Users in such countries should look for ways to bypass internet censorship if sites they visit are broken because of it.
@jeremyn I live in a country that practices internet censorship and HTTPS Everywhere prevented me from accessing multiple websites, for example Archive.org. I guess that's the price for security. I'm not going to turn HTTPS Everywhere off because of that.
I strongly agree that we should not modify our policy for inclusion based on whether a country censors a site, even if it is a country with a large set of potential or current users. This may hurt the adoption of the tool from within censorship regimes, but modifying our rules based on the vicissitudes of censors contributes to a technical ecosystem which legitimates and normalizes censorship.
The tension between accessibility for users inside censorship regimes and the desire to make the internet more secure is something that many sites have had to deal with recently. The English-language Wikipedia has had to deal with this tension very directly. It is my opinion that they made the right choice of forcing encryption for all users, even if this increases the chances that they will be censored, as they have in the past. It puts the onus on the censorship regime to be enacting the censorious behavior, rather than the sites exhibiting self-censorship. They chose not to contribute to that technical infrastructure of censorship, and I think they were right in doing so, whether or not it hurts their adoption from within these regimes.
Also, as @smw-koops points out, censorship can be spurious and hard to model for, and incorporating this into an ever-growing list of concerns for ongoing and new rulesets adds additional process for maintainers.
@smw-koops I'm fine with mentioning downgrading in the documentation based on whatever the consensus from #7717 is.
Regarding censorship, I don't mean that HTTPS Everywhere should accommodate censors. What I mean is that there are pull requests that say that certain sites are only available inside the GFW, or that they behave differently if they're inside the GFW. For example, a site may use a CDN with different HTTPS behavior if the request comes from inside the GFW than if it comes from outside. Pull request https://github.com/EFForg/https-everywhere/pull/7586 is a messy example. I don't want to sign off on a pull request that I can't fully test. Right now I think we only have one reviewer who can reliably test these inside-the-GFW pull requests, @gloomy-ghost, so if @gloomy-ghost is the one who submits the pull request, they're stuck because no one else can review it.
So, the documentation should mention this problem and give advice to help people submitting these pull requests. For example, in https://github.com/EFForg/https-everywhere/pull/7586 I asked @gloomy-ghost to remove as much of the inside-the-GFW stuff as possible, which let me test enough of it that I felt comfortable merging it. (There was one domain in a comment that I couldn't check.)
Minor point: I like `test`s directly under `target`s with an indent, like this:

```xml
<target host="example.com" />
	<test url="http://example.com/myurl/" />
<target host="www.example.com" />
```

If everyone agrees, we should mention this in the documentation either explicitly or by example.
I'm fine with this. This irks my typical indentation-following-xml-scope convention, but that's not in the xml standard and I understand the utility of this indentation formatting.
Another thing that we might want to document is the discussion in https://github.com/EFForg/https-everywhere/issues/7243#issuecomment-266476451, answering the question of why we will or won't disable rulesets that have problems until those problems are fixed.
My note in https://github.com/EFForg/https-everywhere/pull/5025#issuecomment-271467037 about when to add comments for nonworking domains might be interesting for the documentation.
This should be resolved with https://github.com/EFForg/https-everywhere/issues/8193 -- we should break out any separate concerns into separate issues.
I've come to realize that the Ruleset Style Guide is out of date and lacks some crucial information.
I believe it would be best to completely rewrite it. I don't mind writing it myself. However, before actually doing so, there are a few questions that I'd like to have answered. I also added a preliminary outline for the new guide. Feedback is welcome!
Open Questions
Sort order
I've been made aware of a de facto standard for sorting hosts in rulesets. I'm not sure we want the ordering this way, but we should formally agree on a standard. Also, the style guide should mention and use that order.
Apparently, the preferred ordering (#7643):
I'd prefer not treating `www` as a special case. It still confuses me and is harder to sort when generating a ruleset. Is there a reason for the `www` being out of order?
Remove wildcards from the examples
I believe they have been deprecated; we should remove them from the main examples and just add a separate example clearly indicating that wildcards should be avoided. What are valid use cases for wildcards? Is there a good example?
Listing non-working hosts in comments
There appear to be different ways of adding hosts that are not covered by HTTPS Everywhere as comments (e.g. hosts that don't have a valid certificate). Later, we might want to add the hosts to the XML to allow processing the information more easily. For now, it's probably best to agree on one style and add examples to the guide.
Is there a list of the categories that are used for non-functional hosts?
These categories appear to be used commonly:
I also use `unexpected http status code` (#7706).
Did I miss anything? I personally prefer my style over the two others listed below. Using ¹ or a letter (m, r) only works if there is a short list of hosts. In a long list, I always need to scroll up and down to read the key. Is there a style that's superior, or do we just pick one?
I use this style:
others use:
or:
… and many more variations
Naming of Rulesets
There is a name used as the file name and one within the file itself. Easiest would probably be to use `Example.com.xml` as the file name and `Example.com` as the ruleset name. How do we deal with companies that have multiple domains? Do we put all domains in one file, or do we use one file per domain? The former is probably better if you want to make edits manually, the latter if you want to create and update rules automatically.
Tools
People appear to use a whole bunch of scripts and other aids. I, for instance, wrote my own script for generating rulesets (#7706). Is there a list of tools and methods for creating, updating, and verifying rulesets? What is a good way to verify that HTTPS Everywhere doesn't break a site?
Structure
The current structure, if you even want to call it that, is a bit of a mess. I'd like to see it a bit more structured.
I was thinking about something like:
Let me know what I missed or what you'd like to have changed.