pgerber commented 7 years ago

I've become to realize that the Ruleset Style Guide is out-of-date and lacking some crucial information.

I believe it would be best to completely rewrite it. I don't mind writing it myself. However, before actually doing so, there are a few questions that I'd like to have answered. I also added a preliminary outline for the new guide. Feedback is welcome!

Open Questions

Sort order

I've been made aware of a de facto standard for sorting host in rulesets. I'm not sure we want the ordering this way but we should formally agree on a standard. Also, the style guide should mention and use that order.

Apparently, the preferred ordering (#7643):

In all cases where there is a list of domains, sort them in alphabetical order starting from the top level domain at the right reading left, moving ^ and www to the top of their group. For example:
example.com
www.example.com
a.example.com
www.a.example.com
b.a.example.com
b.example.com
example.net
www.example.net
a.example.net

I'd prefer not treating www as a special case. It still confuses me and is harder to sort when generating a ruleset. Is there a reason for the www being out-of-order?

Remove wildcards from the examples

I believe they have been deprecated, we should remove them from the main examples and just add a separate example clearly indicating that wildcards should be avoided. What are valid use cases for wildcards? I there a good example?

Listing non-working hosts in comments

There appear to be different ways of adding hosts that are not covered by HTTPSEverywhere, as comments (e.g. hosts that don't have a valid certificate). Later, we might want to add the hosts to the XML to allow processing the information more easily. For now, it's probably best to agree on one style and add examples to the guide.

Is there a list of the categories that are used for non-functional hosts?

These categories appear to be used commonly:

http redirect
certificate mismatch
connection refused
self-signed certificate
timeout on https
connection refused

I also use unexpected http status code, #7706.

Did I miss anything? I personally prefer my style over the two others listed below. Using ¹ or letter (m,r) only works if there is short list of hosts. In long list, I always need to scroll up and down to read the key. Is there a style that's superior or do we just pick one?

I use this style:

<!--
        Non-functional hosts:
                Certificate mismatch:
                        - may.arbitrary.ch
                        - mx1.arbitrary.ch
                Self-signed certificate:
                        - tor-relay1.arbitrary.ch
                Timeout on https:
                        - ns1.arbitrary.ch
                        - zoe.arbitrary.ch
-->

others use:

<!--
www.topwar.ru ¹
forum.topwar.ru ⁶
wf.topwar.ru ¹
mt.topwar.ru ¹

¹ mismatch
⁶ redirects
-->

or:

<!--
        For other Migros coverage, see Migros.xml.

        Nonfunctional hosts in doit-garden-migros.ch:
                - doitgarden.ch (m,r)
                - www.doitgarden.ch (r)

                - doit-garden-migros.ch (m)
                        - www.doit-garden-migros.ch (m)
                        - nl.doit-garden-migros.ch (m)

        m: mismatch
        r: HTTP redirect
-->

… and many more variations

Naming of Rulesets

There is a name used as file name and one within the file itself. Easiest would probably be to use Example.com.xml as file name and Example.com ruleset name. How do we deal with companies that have multiple domains? Do we put all domains in one file or do we use one file per domain? The former is probably be better if you want to make edits manually. The latter if you want to create and update rules automatically.

Tools

People appear to use a whole bunch of scripts and other aids. I for instance wrote my own script for generating rulesets, #7706. Is there a list of tools and methods how to create, update and verify rulesets? What is a good way to verify HTTPSEverywhere doesn't break a site?

Structure

The current structure, if you even want to call it that, is a bit of a mess. I'd like to see it a bit more structured.

I was thinking about something like:

Beginners Guide (creating a trivial ruleset manually, step by step, with real world examples)
- ruleset (what is a ruleset, naming convention (reference to dos and don'ts))
- terminology (ruleset, label, wildcard, …)
- adding hosts as targets or comments (verifying a host is appropriate for use as target, sorting of entries (reference to dos and don'ts))
- \<rule from="^http:" to="https:" /> (without any advanced details)
Advanced (with a complete real word example)
- wildcards (marked as deprecated, when to use wildcards)
- \<rule> advanced (rewriting url, regex with external references, capturing groups, redirecting to http://)
- \<securecookie> (how to use, when to use)
- custom \<test>s (when are they needed, how they work)
- disabling rulesets
- platform="mixedcontent"
Dos and Don'ts
- file and ruleset naming
- ordering of \<target>s
- properly adding broken subdomains as comments
- \<rule>
Tools
- finding subdomains/hostnames with sublist3r
- generating a ruleset with make-trivial-rule
- using generate-ruleset.py (#7706)
- …

Let me know what I missed or what you'd like to have changed.

jeremyn commented 7 years ago

@pgerber Thanks very much for stepping forward to do this. Our documentation could really use some improvement.

This has turned into an info-dump, I hope it's not too messy.

Where to put documentation

The style guide should be merged with CONTRIBUTING.md, which has the great advantage of being linked whenever someone makes a new pull request. There's also documentation linked in the sidebar https://www.eff.org/https-everywhere that needs to be merged or otherwise reconciled with this documentation.

Sort order

When I review a list of domains, it is almost always involves me manually loading domains in a browser in the order they are listed. That's what the sort order optimizes for.

Grouping by top level domain reading right to left groups subdomains together. This is helpful for testing because subdomains are usually similar to one another.

Grouping ^ with www is less compelling but they are very often aliases of one another so it's nice for them to be together. There's also the argument I gave in https://github.com/EFForg/https-everywhere/pull/7643#issuecomment-262670369 about them both being specially important for HSTS preloading, but that's not a strong argument.

Wildcards

Wildcards are discouraged but not forbidden. They are handy sometimes, for example:

Super complicated rulesets like Google.tld_Subdomains.xml
Rulesets with a lot of domains that we can catch with a simple regex that would be tedious and error-prone to list individually, like 360.cn.xml
CDNs with an arbitrarily large number of subdomains, like https://github.com/EFForg/https-everywhere/pull/7484#issuecomment-262852427 .

The problem with wildcards is with inexperienced contributors who use a wildcard to cover only a few simple domains because they don't know better and the documentation is confusing.

Non-working hosts

In a perfect world, we would list non-working hosts in tags so they could be visible to automated tools. But in the actual world, yeah we should just list them manually in a comment.

PR https://github.com/EFForg/https-everywhere/pull/6868 has some history on this.

My own preference is what you see here: https://github.com/EFForg/https-everywhere/pull/5390#issuecomment-262837348. Here's an edited short version for easy reference:

<!--
    Invalid certificate:
        8marta.glavbukh.ru
        forum2.glavbukh.ru (incomplete certificate chain)

    Redirect to HTTP:
        8marta2013.glavbukh.ru
        den.glavbukh.ru

    Refused:
        e.glavbukh.ru
        www.e.glavbukh.ru

    Time out:
        psd.glavbukh.ru
        str.glavbukh.ru

-->

Again, I'm optimizing for me clicking down a list. I want each category grouped together because I might test each group differently. For example, I can test Invalid certificate by just opening the HTTPS URL, but Redirect to HTTP requires me to watch the network pane in the browser dev tools, and for Time out I should open both HTTP and HTTPS to make sure HTTP doesn't also time out. (incomplete certificate chain) is easy to miss in testing so it's nice to point that out. I don't care why a certificate is invalid, other than for (incomplete certificate chain), and it can be for multiple reasons anyway.

The more testing I do, the more I feel the footnote approach is just wrong. Superscript footnotes are worse on the eyes and, thus, more wrong.

The other thing to optimize for is, will this confuse a future reviewer? Will the reviewer be able to tell unambiguously when the problem is fixed? Something like Receives 403 with HTTPS only is easy to check, but unexpected http status code is mysterious and scary.

I like to organize the error categories alphabetically -- Invalid, Redirect, Refused -- just so they are deterministically ordered.

I've recently, sometimes, started kicking back targets that don't have a 200/3xx return code, or insisting on tests. See the discussion in https://github.com/EFForg/https-everywhere/issues/7662 . This is common for static. or api. domains that don't serve working content from the root subdomain such as static.example.com. I've labeled these No working URL known.

Pointers between rulesets

This sort of comment: For other Migros coverage, see Migros.xml. is definitely appropriate, in both directions.

Ruleset names

For simple sites, for the ruleset name attribute, either a site description for the domain itself is fine. For example, this ruleset could have a name of Seattle Aquarium, SeattleAquarium.org, or seattleaquarium.org. It's nice for the ruleset creator to have some ownership over the name of the thing they just made. Also they might be familiar with the site and know how to label it for other users to recognize.

This can get tricky for domains in languages I don't know. For example I don't speak Hungarian, and this file NAV.xml has a name of Nemzeti Adó- és Vámhivatal for a ruleset around a set of domains *.nav.gov.hu. I could push back and say "Change it to nav.gov.hu because I don't understand the Hungarian name", but that would be rude of me. If I go to https://nav.gov.hu , I see that name shown on the page, so I trust it.

If a ruleset covers multiple domains, then the ruleset name shouldn't be one domain. For example in pull request https://github.com/EFForg/https-everywhere/pull/6230 I changed the name from hoccer.com to Hoccer when I asked for coverage on hoccer-project.com.

Really complicated rulesets or groups of rulesets like Microsoft.xml are case-by-case.

Filenames should vaguely resemble the name so that someone looking for the file based on the name can find it without grepping. I personally prefer that filenames start with a capital letter.

Testing

There is a set of tests, which can be run independently and which run as part of Travis, that ensure things basically work. This is documented in README.md.

There is no prepared tool for testing, for example, mixed content blocking. It should be possible to make such a tool, but no one has done that yet. The latest discussion for creating rulesets is in issue https://github.com/EFForg/https-everywhere/issues/7691 but there have been other discussions before.

`make-trivial-rule`

@J0WI helpfully improved make-trivial-rule in https://github.com/EFForg/https-everywhere/pull/6868 but in practice I don't like it and I never use it. I copy an existing rule that I trust and modify it. We should make an Example.com.xml rule and tell people to copy that.

Review process

We need to tell people what to expect when they submit a pull request. It goes in the queue, they wait a while, a reviewer makes some recommendations, they do those recommendations, it's merged. In particular I personally feel people should be warned that we will squash-and-merge their commits, I ask people ahead of time, and I would like to not have to do that explicitly anymore. I think I'm the only one who feels that's important though, see the discussion in https://github.com/EFForg/https-everywhere/issues/5838.

I want to minimize the frustration caused by either a contributor or a reviewer doing work and the work isn't acknowledged in a timely way. We can't always improve the speed of response but we can improve expectations. I use the assign function as I wrote in https://github.com/EFForg/https-everywhere/issues/7242#issuecomment-251006584:

I can only speak for myself here, not the other reviewers. If I ask you to do some work on a PR and I assign the PR to myself, that means I'm giving you some assurance -- but not a firm promise -- that I will eventually review your changes and work with you until the PR is merged or closed. If I comment on a PR but don't assign it to myself, then I'm just making a comment and you shouldn't assume I'm going to review your work in more detail or even answer questions. Again, this is just my own, current approach. I'm not speaking for any other reviewer and this might change at any time.

Volunteers

Related to the above, we should let contributors know that there is not a huge group of reviewers working on these pull requests. Rather, there's a small staff with a lot of responsibilities and a small number of volunteers who have lots of discretion in when and how much they work and what they work on.

Multiple pull requests

We should discourage people, especially new contributors, from submitting a ton of pull requests at once. Everyone loses when a contributor spends a ton of time making a large number of bad pull requests.

Submitting issues

We should give tips for submitting good issues: describe your operating system and browser and the version of each, give a specific URL, describe the behavior, include a screenshot if appropriate. Here is a comment where I describe some problems with not-so-well-written issues.

Small pull requests

Each pull request should be as atomic as possible. At the moment I'm not sure how to word that in a less technical way but I hope my meaning is clear.

HSTS preloading

It would be nice if contributors checked the HSTS preload list before creating new pull requests. There's a related discussion in issue https://github.com/EFForg/https-everywhere/issues/7126.

Security problems

Users who want to report a security problem should be advised to do what is described in https://www.eff.org/security .

Mixed content

We should describe the difference between good and bad mixed content warnings. There's a related discussion in issue https://github.com/EFForg/https-everywhere/issues/6629.

Great Firewall (GFW) of China

Rulesets that behave differently inside the GFW have been especially tricky to test. See pull requests https://github.com/EFForg/https-everywhere/pull/6889 and https://github.com/EFForg/https-everywhere/pull/7586. It would be nice if we could document this somehow.

`securecookie`

securecookie tags can be very confusing and should be clarified. There's a related discussion in issue https://github.com/EFForg/https-everywhere/issues/2907.

Disabling rather than deleting rulesets

When to disable and when to delete? There's a related discussion in issue https://github.com/EFForg/https-everywhere/issues/7338.

`Allow edits from maintainers` option

When should reviewers enable this option and what does it mean? There's a related discussion in https://github.com/EFForg/https-everywhere/issues/7242.

Sublist3r

How should we discuss Sublist3r? What options should contributors use? There's a related discussion in https://github.com/EFForg/https-everywhere/issues/7277.

jeremyn commented 7 years ago

Pinging @fuglede @gloomy-ghost @Hainish @J0WI for their input.

ghost commented 7 years ago

I think #7717 should have been made a part of this discussion, but I guess it's too late.

ghost commented 7 years ago

@jeremyn Don't forget China is not the only country that practices internet censorship. In my opinion internet censorship is out of scope of HTTPS Everywhere. It's unpredictable, it's difficult to fix issues related to it and very unlikely that censors would change their methods of blocking on HTTPS Everywhere's request. Users in such countries should look for way to bypass internet censorship if sites they visit are broken because of it.

ghost commented 7 years ago

@jeremyn I live in a country that practices internet censorship and HTTPS Everywhere prevented me from accessing multiple websites, for example Archive.org. I guess that's the price for security. I'm not going to turn HTTPS Everywhere off because of that.

Hainish commented 7 years ago

I strongly agree that we should not modify our policy for inclusion based on whether a country censors a site, even if it is a country with a large set of potential or current users. This may hurt the adoption of the tool from within censorship regimes, but modifying our rules based on the vicissitudes censors contributes to a technical ecosystem which legitimates and normalizes censorship.

The tension between accessibility for users inside censorship regimes and the desire to make the internet more secure is something that many sites have had to deal with recently. The English-language Wikipedia has had to deal with this tension very directly. It is my opinion that they made the right choice of forcing encryption for all users, even if this increases the chances that they will be censored, as they have in the past. It puts the onus on the censorship regime to be enacting the censorious behavior, rather than the sites exhibiting self-censorship. They chose not to contribute to that technical infrastructure of censorship, and I think they were right in doing so, whether or not it hurts their adoption from within these regimes.

Also, as @smw-koops points out, censorship can be spurious and hard to model for, and incorporating this into an ever-growing list of concerns for ongoing and new rulesets adds additional process for maintainers.

jeremyn commented 7 years ago

@smw-koops I'm fine with mentioning downgrading in the documentation based on whatever the consensus from #7717 is.

Regarding censorship, I don't mean that HTTPS Everywhere should accommodate censors. What I mean is that there are pull requests that say that certain sites are only available in the GFW, or that they behave differently if they're inside the GFW. For example a site may use a CDN with different HTTPS behavior if the request comes from inside the GFW than if it comes from outside. Pull request https://github.com/EFForg/https-everywhere/pull/7586 is a messy example. I don't want to sign off on a pull request that I can't fully test. Right now I think we only have one reviewer who can reliable test these inside-the-GFW pull requests, @gloomy-ghost, so if @gloomy-ghost is the one who submits the pull request, they're stuck because no one else can review it.

So, the documentation should mention this problem and give advice to help people submitting these pull requests. For example, in https://github.com/EFForg/https-everywhere/pull/7586 I asked @gloomy-ghost to remove as much of the inside-the-GFW stuff as possible, which let me test enough of it that I felt comfortable merging it. (There was one domain in a comment that I couldn't check.)

jeremyn commented 7 years ago

Minor point: I like tests directly under targets with an indent, like this:

    <target host="example.com" />
        <test url="http://example.com/myurl/" />
    <target host="www.example.com" />

If everyone agrees, we should mention this in the documentation either explicitly or by example.

Hainish commented 7 years ago

I'm fine with this. This irks my typical indentation-following-xml-scope convention, but that's not in the xml standard and I understand the utility of this indentation formatting.

jeremyn commented 7 years ago

Another thing that we might want to document is the discussion in https://github.com/EFForg/https-everywhere/issues/7243#issuecomment-266476451, answering the question of why we will or won't disable rulesets that have problems until those problems are fixed.

jeremyn commented 7 years ago

My note in https://github.com/EFForg/https-everywhere/pull/5025#issuecomment-271467037 about when to add comments for nonworking domains might be interesting for the documentation.

Hainish commented 7 years ago

This shoudl be resolved with https://github.com/EFForg/https-everywhere/issues/8193 - we should break out any separate concerns into separate issues.

EFForg / https-everywhere

rewriting the Ruleset Style Guide #7707

Open Questions

Sort order

Remove wildcards from the examples

Listing non-working hosts in comments

Naming of Rulesets

Tools

Structure

Where to put documentation

Sort order

Wildcards

Non-working hosts

Pointers between rulesets

Ruleset names

Testing

`make-trivial-rule`

Review process

Volunteers

Multiple pull requests

Submitting issues

Small pull requests

HSTS preloading

Security problems

Mixed content

Great Firewall (GFW) of China

`securecookie`

Disabling rather than deleting rulesets

`Allow edits from maintainers` option

Sublist3r

EFForg / https-everywhere

rewriting the Ruleset Style Guide #7707

Open Questions

Sort order

Remove wildcards from the examples

Listing non-working hosts in comments

Naming of Rulesets

Tools

Structure

Where to put documentation

Sort order

Wildcards

Non-working hosts

Pointers between rulesets

Ruleset names

Testing

make-trivial-rule

Review process

Volunteers

Multiple pull requests

Submitting issues

Small pull requests

HSTS preloading

Security problems

Mixed content

Great Firewall (GFW) of China

securecookie

Disabling rather than deleting rulesets

Allow edits from maintainers option

Sublist3r

`make-trivial-rule`

`securecookie`

`Allow edits from maintainers` option