EFForg / https-everywhere

A browser extension that encrypts your communications with many websites that offer HTTPS but still allow unencrypted connections.
https://eff.org/https-everywhere

Allow easier input of domains requested to be reviewed #6322

Closed Foorack closed 7 years ago

Foorack commented 8 years ago

Credit for this idea goes to @jeremyn.

The idea was originally posted in https://github.com/EFForg/https-everywhere/issues/6307 Moving it here to keep the other issue clean from off-topic discussion.

The discussion left off with me saying:

users could tweet domains using some hashtag:

I like the idea as I am a Twitter user; however, not everyone has a Twitter account, and I think it will be difficult to list and mark which domains are done.

What about using Google Forms and having submissions saved in a spreadsheet? Contributors could mark domains with their status, and the spreadsheet would be public for read access. Thoughts?

jeremyn commented 8 years ago

Thanks for making this issue @Foorack.

Important questions are, how many sites do we expect people to enter? Who are the different categories of users and what are their levels of trustworthiness and technical skill? Who is going to officially own and be responsible for this tool?

Anything that allows unrestricted write access to the general internet is a challenge, and anything that allows the general internet to write something and have it automatically appear somewhere else like a status page is even worse. At a minimum the tool needs one free-form text field to submit a domain, and another field to submit optional notes about that domain ("This site behaves differently if you access it from outside the U.S....", that sort of thing.) So, the back-end needs to:

I don't think there should be a status page, but if we do have one, then we also need to put the entries into a hidden queue for a trusted person to approve before they appear on that page.

So most of the work is back-end data processing. I'm not experienced with Google Forms but my guess is it would save us a little front-end work in creating a form but would not help at all on the back-end and might even complicate things.

With the Twitter approach, the problem becomes almost entirely back-end since there is no form.

My feeling is this tool should be a small but complete and standalone web application.

rugk commented 8 years ago

Anything that allows unrestricted write access to the general internet is a challenge, and anything that allows the general internet to write something and have it automatically appear somewhere else like a status page is even worse.

Eh, we do have an unrestricted internet. Fortunately. At least in most places. At least if we exclude net neutrality issues... but I'm getting too far off track. So what do you mean by "internet"? You should specify this term a bit more. Also, we can't write to the "general internet". I don't know what a "specific internet" would be, but, well... these sentences confuse me.

At a minimum the tool needs one free-form text field to submit a domain, and another field to submit optional notes about that domain ("This site behaves differently if you access it from outside the U.S....", that sort of thing.)

Okay, getting back to sentences I understand. So yes, that's it. Although check-boxes for common issues (mixed content, ...) would be useful. In that case, however, we would have to explain to users what mixed content is. That's quite difficult... This could be circumvented by letting HTTPSE check by itself whether there is mixed content, or by creating a temporary ruleset and asking the user again after, say, one week: "Could you use XY.com successfully?" Basically this would be an interactive way to submit domains, usable even by non-tech-savvy users.

make sure the submitted domain is a valid domain in the RFC meaning of "valid"

I'm sure some kind of RegExp will do this. However, we could also solve this on the client side (using the "interactive submission" idea again): users can only submit the domain they are currently using. And if browsers can connect to it, and HTTPSE could e.g. somehow check the IP to make sure it is not an internal LAN IP, they are allowed to submit the domain.
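A rough sketch of the two client-side checks mentioned here, for illustration only. The hostname regex is a simplification rather than a full RFC 1035 validator, and the private-range list (RFC 1918 plus loopback and link-local) is illustrative:

```javascript
// Sketch of the client-side checks described above. The hostname regex is a
// simplification, not a full RFC 1035 validator, and the private-range list
// is illustrative (RFC 1918 plus loopback/link-local).

// Each label: 1-63 chars, alphanumeric or hyphen, no leading/trailing hyphen.
const LABEL = /^[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?$/;

function looksLikeValidHostname(domain) {
  if (typeof domain !== 'string' || domain.length > 253) return false;
  const labels = domain.split('.');
  // Require at least two labels (i.e. "example.com", not "localhost").
  if (labels.length < 2) return false;
  return labels.every((label) => LABEL.test(label));
}

function isPrivateIPv4(ip) {
  const parts = ip.split('.').map(Number);
  if (parts.length !== 4 || parts.some((n) => Number.isNaN(n) || n < 0 || n > 255)) {
    return false;
  }
  const [a, b] = parts;
  return (
    a === 10 ||                          // 10.0.0.0/8
    (a === 172 && b >= 16 && b <= 31) || // 172.16.0.0/12
    (a === 192 && b === 168) ||          // 192.168.0.0/16
    a === 127 ||                         // loopback
    (a === 169 && b === 254)             // link-local
  );
}
```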

I think all the other things you mentioned can also be checked on the client side with the interactive submission.

I'm not experienced with Google Forms

I am sure we don't need this. It's a simple UI... I am also sure EFF does not want to "expose" their users to Google. :smile:

My feeling is this tool should be a small but complete and standalone web application.

My feeling is it should be included in the HTTPSE add-on, and maybe it could ask the user when they visit an HTTPS-capable site: "Hey, you've visited example.com more than 100 times. We have not included it in our HTTPS list yet. Do you want to help us by submitting this domain?"

rugk commented 8 years ago

BTW there is already a backend, which does some mixed content testing and so on: https://github.com/EFForg/https-everywhere/issues/1192

rugk commented 8 years ago

Also https://github.com/EFForg/httpse-ruleset-tests could be somehow integrated into HTTPSE. And I thought there was an issue for this task, but I cannot find it right now.

Foorack commented 8 years ago

I like the idea of integrating it into the extension and allowing users to report/submit domains.

and ask the user again after - say - one week: "Could you use XY.com successfully?"

Though I'm not that big a fan of HTTPSE breaking websites, especially for non-techy users, as they will just remove the extension completely. Even if we implement it in the extension instead of having a web UI, there is still the need for a backend system collecting and testing the domains. It would be really good if the backend created issues automatically. :)

rugk commented 8 years ago

HTTPSE breaking websites, especially for non-techy users as they will just remove the extension completely.

Yeah, but in my idea the user has to opt in explicitly, and there could even be a "This site is broken!" button or something shown, which reminds the user that something is not working. Also, in an ideal case, HTTPSE could detect mixed content issues by itself and stop the experiment, or at least ask the user whether the site is okay.

Even if we implement it into the extension instead of having a web UI there is still the need for a backend system collecting and testing the domains.

Yeah, at least for the bad guys using bots to submit invalid domains and so on...

It would be really good if the backend created issues automatically.

No, it would be good if it opens PRs automatically. :smiley:

As suggested by @jeremyn in https://github.com/EFForg/https-everywhere/issues/3069#issuecomment-239867644 the bot could also parse the Certificate Transparency list, but in this case the domain needs to be checked on the server side again.

jeremyn commented 8 years ago

I'm sorry my general/specific internet phrasing was confusing. "General" means anonymous users, "specific" means people with some EFF trust.

I meant that we want to protect EFF from being DOS'd or embarrassed by malicious domain submitters. We need to ask how a hostile government or criminal organization with a large botnet could use this tool to overwhelm EFF's systems or volunteer resources. If it allows people to submit endless variations of dc38cea7-cff6-40d0-85a0-2876e08d9259.com then we need to plan for that. If public users can browse the status of all submitted domains in a list, then we want to prevent users from submitting endless variations of fake "www.eff-sucksssssss.com" domains or fake domains with curse words, slurs and so on. Unfortunately I think some sort of captcha would be required.

Client-side validation is fine in addition to server-side validation but shouldn't replace server-side validation.

We should assume non-malicious users of this tool have no idea how to express what's wrong with the site other than "both http and https work, help!" The optional note field is for the user to tell us anything weird about the site they happen to know about.

I like the idea of putting this tool directly into the add-on and/or allowing people to submit the site they're currently browsing.

My hope for this issue is to let users easily report a site so some technical volunteer can look into it as time permits. Automating creating rulesets, issues or pull requests is a whole extra layer of work and maintenance that I think most people involved would eventually regret doing.

rugk commented 8 years ago

Unfortunately I think some sort of captcha would be required.

That's a good idea. This would also make it harder for people wanting to spam and would certainly discourage bots.

Also the HTTPSE server could require a submission of a current state for seven days. This may be the time the user has to test the domain (& the temporary ruleset) by themself before the request is even published in the status page/add-on page.

dc38cea7-cff6-40d0-85a0-2876e08d9259.com

A server could just test whether it is pingable or whether curl succeeds. And again: users might only be able to submit a domain while they are currently browsing it. This also means that "www.eff-sucksssssss.com" is impossible unless they actually register that domain (in which case it could, technically, again be valid for HTTPSE inclusion).

Client-side validation is fine in addition to server-side validation but shouldn't replace server-side validation.

Yeah, of course. I just think the client-side "validation" (with the "test period" mentioned) could filter out many broken websites and so on. It only applies to non-malicious users and prevents them from submitting bad entries.

Automating creating rulesets, issues or pull requests is a whole extra layer of work and maintenance that I think most people involved would eventually regret doing.

I think that, compared with what we have already proposed here, it is not that difficult. It just needs a bot user interacting with GitHub's API.

jeremyn commented 8 years ago

Sorry, what does this mean? "Also the HTTPSE server could require a submission of a current state for seven days."

I don't want this to require, encourage, or even allow users to provide a point of contact. People might feel uncomfortable if they think it's not anonymous. They might be nervous about some programmer contacting them with technical questions. They might be nervous about talking with someone in English. They might just prefer not to be bothered. On the other end of it, they might expect personal follow-ups. They might send emails to EFF asking for a list of domains they submitted, like they have some kind of account with EFF. Law enforcement might contact EFF to get the email address of someone who submitted a controversial domain.

I agree that client-side validation to reduce user error and ease work on EFF server-side is fine.

I'm inclined to leave automated issues/pull requests as a possible "phase 2" for this project, after the basic infrastructure of collecting domains and making them visible to volunteers is figured out.

rugk commented 8 years ago

Sorry, what does this mean? "Also the HTTPSE server could require a submission of a current state for seven days."

Okay, imagine this:

  1. User starts submission of a site in the client (add-on).
  2. Client informs the user that a test phase begins and provides some information on how to cancel, etc.
  3. For each day of the test phase that passes, the client sends the current state (domain.example: 0 mixed content, 0 times user clicked on "this does not work", 12 sites visited), or it just sends the domain to the HTTPSE server (domain.example: test phase day 1 of 7). The server keeps track of the days and will only allow a final submission once 7 days have passed. Additionally, it could verify whether the data sent each day is plausible; if not, it rejects the request. The server could also require additional tests and tell the client to extend the test phase, e.g. fewer than 100 sites visited --> user needs to test more sites.
  4. When everything works, the site is finally submitted (possibly after asking the user again to confirm the site worked during the last 7 days), with a lot of test stats already collected, to this repo, and (e.g.) a PR is made automatically. The user could be informed of the PR URL, so they can track the case further and may be able to respond to further questions. (The client could also allow users to enter their GitHub name to be @mentioned, but this is another detail, which does not matter for the general concept here.)
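A minimal sketch of the server-side bookkeeping from step 3. All names, thresholds, and the in-memory store are invented for illustration; a real deployment would persist state and authenticate clients:

```javascript
// Hypothetical sketch of the server-side test-phase tracking from step 3.
// Names, thresholds, and the in-memory store are all invented; a real
// deployment would persist state and authenticate clients.

const TEST_PHASE_DAYS = 7;
const MIN_SITES_VISITED = 100;

const phases = new Map(); // domain -> { daysReported, sitesVisited, broken }

function reportDay(domain, { sitesVisited = 0, brokenClicks = 0 } = {}) {
  const state = phases.get(domain) || { daysReported: 0, sitesVisited: 0, broken: 0 };
  state.daysReported += 1;
  state.sitesVisited += sitesVisited;
  state.broken += brokenClicks;
  phases.set(domain, state);
  return state;
}

// Final submission is only accepted after the full test phase, with enough
// usage and no "this does not work" reports.
function canFinalize(domain) {
  const state = phases.get(domain);
  if (!state) return { ok: false, reason: 'unknown domain' };
  if (state.daysReported < TEST_PHASE_DAYS) return { ok: false, reason: 'test phase not finished' };
  if (state.sitesVisited < MIN_SITES_VISITED) return { ok: false, reason: 'needs more testing' };
  if (state.broken > 0) return { ok: false, reason: 'breakage reported' };
  return { ok: true };
}
```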

The requirement to submit the data each day, to make the data look realistic, and to react to different replies from the server could make it harder to fake these submissions.

Of course this whole thing is only an idea. I have not considered how exactly it could be done or how difficult it would be to implement. I am just throwing ideas into this issue. :laughing:


As for the contact thing: as explained above, I think this is a good thing, but of course it should be optional. With all these tests involved, there should (hopefully) be no need to contact the author of a domain submission, so it can be anonymous for those who want to stay anonymous. In that case, users fearing technical questions can simply not enter their GitHub username. (They would not even need to be registered on GitHub, so that's fine.)

On the other end of it, they might expect personal follow-ups. They might send emails to EFF asking for a list of domains they submitted

In my example an optional GitHub username submission would make this obsolete.

Law enforcement might contact EFF to get the email address of someone who submitted a controversial domain.

This would be obsolete too as all information is public anyway. It's on GitHub. (They might contact GitHub to get IPs, but this does not matter for HTTPSE here.)

I'm inclined to leave automated issues/pull requests as a possible "phase 2" for this project, after the basic infrastructure of collecting domains and making them visible to volunteers is figured out.

Structure it as you want. Split the project as you want. Implement it or don't. As I said, I am just submitting some ideas here, which would be great both from a user and a developer (here: ruleset maintainer) perspective. I'll just say one thing: if this already existed, I think it would be an awesome thing.

jeremyn commented 8 years ago

I'm just throwing out ideas too. I don't know if EFF even likes this idea or would sponsor it. I may be willing to do some coding for it if the EFF says they will use it and can provide specifications.

Imagine your least technical friend. This person notices that sometimes there's a lock when they visit their (e.g.) hometown bank's website and sometimes there isn't. Their genius friend @rugk put this "S" thing into their internet program that's supposed to make sure the lock is always there. In my view, ideally this user can somehow submit their bank's website for review through this tool we're discussing without being intimidated by the entry form. If it mentions GitHub credentials or there's any hint that some stranger from the internet might contact them, they won't do it.

Also, allowing for the possibility of unexpected contact from EFF opens the door to phishing attacks, for example "You recently submitted $POPULAR_BANK.com to the EFF for review. We're having trouble with it. Please provide your account id and password." etc.

Yet another danger is maliciously submitting a website in someone else's name. I submit $DISGUSTING_SITE.com under your name and then you get an email asking about it. Basically, if we take email addresses then we need to verify that the person submitting one actually controls it.

Basically, I don't think we should accept user contact info. In fact we probably shouldn't accept anything but the domain field, not even optional notes. Maybe we only let them submit the domain they are currently browsing with no typing involved.

I really don't like the idea of collecting browsing history in the add-on for any reason. I think we can verify the domain is legitimate by a few server-side automated requests and maybe a DNS lookup, in addition to any client-side regex-type checking we think is worth doing.

rugk commented 8 years ago

I really don't like the idea of collecting browsing history in the add-on for any reason.

It is not collected. It is only processed locally, of course.

Hainish commented 8 years ago

Our problem is not currently a lack of coverage (although clearly the more coverage the better), but rather scalability. Unfortunately, the desire for better coverage is at odds with the scalability issue.

The number of domains that have deployed HTTPS has increased, and as a result we're seeing a greater number of ruleset PRs. These PRs generally need a human to review them. We can generally assume that for trivial rulesets (ones for which the rule is just to redirect from http to https) there is no malicious intent. Even PRs without malicious intent, though, have to be spot-checked by ruleset maintainers to make sure there is no subtle way that functionality is lost. So there's a bottleneck between the point of PR submission and inclusion.

There is also the problem of scalability in terms of memory consumption and download size. We did some memory profiling and saw that a long-running HTTPSE instance on Chrome takes up about 53 MB, and the extension download is 2.7 MB for Firefox and 1.7 MB for Chrome. These numbers will also get higher as more rulesets come rolling in. This probably won't be a huge issue unless we have some automated submission & inclusion system, but we do want to keep both the memory profile and the download size low.

rugk commented 8 years ago

So in short: too many domains would also not be good for HTTPSE.

Hmm, maybe HTTPS by default is coming nearer. :smiley:

numismatika commented 8 years ago

Regarding scalability, one could combine it with an "Alexa ranking test" so we don't secure every private blog. That would keep memory usage low.

Maybe not entirely automated, and with a small entry barrier so it only reaches slightly more advanced people: one could do a default-off #2718, with no automated submission. We wouldn't be overwhelmed with "hey, check this", but people who are interested and notice something could still help. If a site really is broken and they care (my missing $ on heise.de, for example), people will come over here. I do not think we need automated complaint management ;)

More important, I would say, is growing the test bench here: finding rules that can be purged or are broken, or ones that are disabled and could be deleted or reinstated because the cert is valid again, and so on. Maintenance is difficult already.

jeremyn commented 8 years ago

I've been doing ruleset maintenance for a little while now. My updated opinion on this issue is that I really want at least semi-automated ruleset creation and testing tools before we open the gates up to a flood of new requests. Pull request #6857 is a good example of the sort of effort potentially needed to properly handle even a simple "please add this school" issue. At the moment even the Good Volunteer Task-labeled issues aren't getting done, let alone the hundreds of other issues. I'm not particularly eager to invite thousands of additional requests through Twitter etc.

strugee commented 7 years ago

Hey! So as discussed over email with @Hainish I'm planning to work on a tool to help with this. Here's a rough MVP proposal:

Once that MVP is done (which I really don't think will take very long - a couple full work days, maybe a week, including time to write the test suite, etc.) I can imagine some additional features:

I prefer to work in Node.js/Express and test things with Perjury, which is a better implementation of Vows, but if people really seriously hate either of those things I could do something else.

jeremyn commented 7 years ago

I'm not sure what the proposed web application is for. This issue was to discuss making it easier for users to submit domains that need a ruleset, essentially solving the problem that the catch-all issue #3069 addresses. But

  • Builds a prioritized queue of ruleset submissions to review from open PRs and ordered based on Alexa rankings
  • Basically all the UI does is present "this is the next ruleset you should review" based on the queue ordering/prioritization

sounds like the proposed web app is supposed to help me, as a reviewer, find pull requests to review. Is that right? If so, I don't want it, thanks anyway. I can use GitHub to find stuff to work on, if I want to.

For

  • Semiautomated ruleset testing/ tag creation based on crawling the site

we already have five separate projects working toward automated ruleset creation. I think thoughts on that topic should go in that issue.

Hainish commented 7 years ago

Looking through this proposed web application again, it seems like this differs from what @strugee and I discussed in the way that @jeremyn points out.

The intention of this webapp is to automate prioritization of requested rulesets, not ones that already have a PR and have been submitted.

This means that users (perhaps logged in through github, as you have in your post-MVP) should be able to submit new sites they wish to have coverage for. The application would automatically prioritize these. If a site already has coverage and needs additional work or needs to be reviewed because it's become stale, we should incorporate that into the logic, too.

Sorry for the confusion!

Hainish commented 7 years ago

@strugee as far as the language you wish to implement it in, that's up to you. Node is a perfectly good choice and many of the helper applications in utils/ are written in Node. Eventually it would be nice to port all the Python code over to Node as well, so we can have a codebase that requires knowledge of only one language, which will help with long-term maintainability.

jeremyn commented 7 years ago

@Hainish Who is this prioritization for, meaning, who is consuming this prioritization information? People who want to work on high-priority rulesets? With the existing tags, they can just make bookmarks like this:

https://github.com/EFForg/https-everywhere/pulls?utf8=%E2%9C%93&q=is%3Aopen%20is%3Apr%20label%3Atop-1k%20sort%3Acreated-asc%20comments%3A0

Hainish commented 7 years ago

@jeremyn there's absolutely no standardization in the way people submit domains they wish to see coverage for. One person opens an issue: "Please add coverage for example.com" and another person's issue says "Coverage for Example domain." This is a problem: it means we can't write a script to auto-label ruleset requests, as we have for PRs. The only reason it works for PRs is that we have code we can parse to determine prioritization. So just using the existing tags on GitHub isn't enough; we need an automation system where users solicit coverage in a standardized way. You yourself have complained in the past about having a separate issue for every coverage request. It gets unwieldy quickly.

One idea comes to mind. Having a separate repository for rulesets is a good long-term goal, and perhaps we can move requests for coverage into such a repo before the actual rulesets are separated out. This will neatly separate core codebase issues from ruleset issues. Another thing that may help is creating an issue template which explains that if you're requesting a new ruleset, enter the domains requested in an easily parsable manner (the formatting of which we can explain). How does this sound @strugee, @jeremyn?

jeremyn commented 7 years ago

@Hainish I understand better what you're looking for now: a way to auto-prioritize domains that people enter, so people can tell us "Please add coverage for example.com" and then example.com shows up on a list somewhere with its Alexa ranking attached. Now that I understand it, I agree with that goal.

This is really just a survey with one question: what domain do you want to let us know about? We can probably use Google Forms for this: take intake on a form, do basic cleanup like removing leading and trailing whitespace, and add it to a Google spreadsheet, joining it to the Alexa data stored in another spreadsheet.
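The cleanup-and-join step could look something like this. A hedged sketch only: the normalization rules and the in-memory "spreadsheets" are stand-ins for whatever storage is actually used:

```javascript
// Sketch of the intake cleanup and Alexa join described above. The
// normalization rules and the in-memory "spreadsheets" are illustrative.

function normalizeSubmission(raw) {
  let domain = raw.trim().toLowerCase();
  domain = domain.replace(/^https?:\/\//, ''); // strip a pasted scheme
  domain = domain.replace(/\/.*$/, '');        // strip any path
  return domain;
}

// alexa: Map of domain -> rank, standing in for the second spreadsheet.
function recordSubmission(sheet, alexa, raw) {
  const domain = normalizeSubmission(raw);
  const row = { domain, alexaRank: alexa.has(domain) ? alexa.get(domain) : null };
  sheet.push(row);
  return row;
}
```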

I'm not sure whether splitting the rulesets off into another repository is overall good or bad, but since I rarely deal with the non-ruleset code my opinion may not be relevant. I don't think it matters much for the specific problem of streamlining receiving requests from users, though.

rugk commented 7 years ago

All right, just don't use a proprietary service like Google Forms. You already noticed it is easy to do, and this is an EFF project, after all…

jeremyn commented 7 years ago

A basic unanswered question here is how much time and effort the EFF is willing to put toward maintaining any solution day-to-day. For example if the people the EFF is willing to allocate aren't technical enough to maintain a SQL database, then we have to rule out any solution that involves a SQL database regardless of the technical merits. If they aren't willing to plan for or react to being DDOS'd then we have to rule out self-hosting. If they aren't willing to periodically review and prune user input for offensive or illegal values, then we can't make user input public. Etc.

It's like we collectively are consultants and the EFF is our client. If the EFF can't provide a maintenance budget then there's no point in planning to do the work, other than as an intellectual exercise.

strugee commented 7 years ago

Hmm, I must have misread your email, @Hainish.

Here's a revised MVP flow (which again I don't think will take very long to implement):

  1. User signs in with GitHub
  2. User is presented with a screen allowing them to submit a domain
  3. Domain submission results in a new issue being opened in a separate repo - as @Hainish suggested - from the user's GitHub account by the webapp (with a note at the bottom about it being automatically generated by the app, obviously)

Already there we have the ability to take in submission requests in a structured way that splits codebase issues from ruleset issues. And as a bonus, to @jeremyn's point - from an implementation perspective it's completely stateless so there isn't a lot of maintenance burden.
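The domain-to-issue step of this flow could be sketched as below. The issue title, body format, and label are invented for illustration; the actual repo and template would be whatever gets decided:

```javascript
// Hypothetical sketch of step 3: turning a submitted domain into the title
// and body of an automatically opened GitHub issue. The exact wording and
// label names are invented; only the structure matters here.

function buildIssue(domain, githubUser) {
  return {
    title: `Ruleset request: ${domain}`,
    body: [
      `Hosts: ${domain}`,
      '',
      `Submitted by @${githubUser} via the ruleset-request webapp.`,
      '_This issue was generated automatically._',
    ].join('\n'),
    labels: ['ruleset-request'],
  };
}
```

The resulting object maps directly onto the parameters of GitHub's "create an issue" API endpoint.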

Some possible additional features after the MVP:

Obviously this is closely related to having a separate repo for rulesets; I've put some thoughts on that here: https://github.com/EFForg/https-everywhere/issues/2697#issuecomment-303036469

I'm really leaning towards the third option discussed in that comment.

jeremyn commented 7 years ago

If we are restricting the ability to submit new domains just to people who have GitHub accounts, then we can accomplish the goal by making the separate ruleset repository and adding an issue template. We don't need an application for that. @Hainish's existing Alexa script can handle Alexa autolabeling.

It would be more useful if anonymous, less-technical people without GitHub access could submit domains, perhaps through a single text input field form on or under https://eff.org/https-everywhere. However that introduces the various concerns I described in my previous comments.

strugee commented 7 years ago

@Hainish, I wonder what you think of https://github.com/EFForg/https-everywhere/issues/6322#issuecomment-303191656? @jeremyn has some good points; the less code we write the better - and I definitely keep falling into the trap of overengineering things in this thread.

Seems like maybe the best way to approach this is to start with an issue template, and then use a bot to implement the features I've listed above that don't fit into the template; the bot would basically just add some additional automatically determined information to each issue.

We could use the same bot to allow anonymous submission, as Jeremy suggested. Probably with some sanity-checking to make sure the submission isn't spam.

strugee commented 7 years ago

@Hainish ping?

Hainish commented 7 years ago

@strugee I'm honestly worried that even with an issue template, people will incorrectly format the issues they submit. For instance, if we have a template like this:

Please enter the hosts you wish to see coverage for in the following format: Hosts: www.example.com, example.com

I could easily see this submitted:

Hosts www.example.com & example.com

This is because people might not understand that strict formatting is necessary for their issue to be prioritized properly.
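The strict-format check could be as simple as the sketch below. The exact template line ("Hosts: …") is an assumption taken from the example above, and the hostname regex is deliberately loose:

```javascript
// Sketch of parsing the "Hosts:" line from an issue body. Returns the list
// of hosts, or null if the strict format isn't followed. The expected line
// format is assumed from the template example: "Hosts: www.example.com, example.com"

function parseHostsLine(issueBody) {
  const match = issueBody.match(/^Hosts:\s*(.+)$/m);
  if (!match) return null; // e.g. "Hosts www.example.com & example.com"
  const hosts = match[1].split(',').map((h) => h.trim());
  // Reject entries that aren't plausible hostnames.
  if (!hosts.every((h) => /^[a-z0-9.-]+\.[a-z]{2,}$/i.test(h))) return null;
  return hosts;
}
```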

There are two ways to handle this.

One is to create the webapp as you originally intended, ensuring the domains are submitted with the correct formatting by enforcing this when the user is actually submitting the request. This then is made into an issue on GitHub. The benefit of this is immediate feedback. The drawbacks are that you have to write more code to provide this interface, and think about user logins and perhaps anonymous submissions.

The second way is to have a bot that just looks over the latest open issues (in a similar manner that I look over just the latest open PRs in hsts-prune) and if an issue is improperly formatted, add a comment that states this, and close the issue.

I kind of think a submission portal for domains is nicer, because of the immediate feedback it provides, and it also seems way more intuitive from the perspective of a submitter. I'd suggest doing away with user logins and just having a single, new user that we create submitting issues once the details are gathered. Many people (including myself) are not keen on giving a third party application permission to post with our GitHub account. Without logins, it's simpler, and we can at the end of the workflow provide a link to the GitHub issue so if someone wants, they can log in themselves and provide additional comments or see how the progress on this is going. We'd have to have some spam-prevention mechanism that isn't using 3rd party includes such as recaptcha, but there are privacy-friendly alternatives out there.

To summarize, my preferred app looks like the following:

  1. A webapp that allows anonymous submission of coverage requests, which are then submitted to a new repo by a new user that we create.
  2. It has fields like domain (which is regex-checked for formatting errors), relationship to domain (which could be a drop-down like owner, webmaster, other), and perhaps something like notes.
  3. Upon submission, it checks what position the domain holds in the Alexa rankings and auto-tags it.
  4. It has some anti-spam mechanism.
  5. It provides a link at the end of the workflow to the new GitHub issue we've created.
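The auto-tagging in step 3 could be sketched like this. The label names mirror the "top-1k"-style labels mentioned earlier in the thread, but the exact tiers are invented, and the rank source is a stand-in:

```javascript
// Sketch of Alexa-based auto-tagging (step 3). The tier cutoffs and label
// names are illustrative, modeled on the "top-1k"-style labels this repo
// already uses; the rank itself would come from the Alexa data.

function alexaLabel(rank) {
  if (rank == null) return null;        // not in the rankings: no label
  if (rank <= 100) return 'top-100';
  if (rank <= 1000) return 'top-1k';
  if (rank <= 100000) return 'top-100k';
  return 'top-1m';
}
```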

How does this all sound?

jeremyn commented 7 years ago

I think it would be worth making the new repository, adding the issue template, and running the autolabeler for a while to see what happens, before anyone makes a web app. After some time we can estimate how much reviewer time was spent labeling poorly written issues. My guess is that a web app will cost 50-100 hours of dev/maintenance time over the next year to save about 15 minutes a month of manual work.

Hainish commented 7 years ago

@jeremyn the problem there is that you don't know which submitters you've lost due to lack of easy/anonymous access (e.g. no GitHub account).

If it's coded as a regular issue-scanning script rather than a webapp, a lot of the code from the hsts-labeller can be reused. For instance, the scanning part, and also the labelling part. The only part that would have to be coded is the format-parsing, commenting, and closing if it's of a non-matching format.

Hainish commented 7 years ago

@jeremyn you've convinced me, I think it's fine to code as an issue-scanning app. This also wouldn't require standing up a web server, which requires greater resources and more deployment time from our internal TechOps team at EFF.

@strugee do you have a clear idea of how this might be implemented, given the discussion?

jeremyn commented 7 years ago

Please don't autoclose misformatted issues. Some projects do that and it is so annoying. We don't have the volume to justify it.

People do make throwaway GitHub accounts to anonymously contact us. You could also gauge interest in anonymous submission by setting up a special @eff.org email address where you take domain requests, and encourage people to send the requests from an anonymous account. You can then manually create issues for those domains. If there's big interest then that argues in favor of a web app or some other heavy duty approach.

Hainish commented 7 years ago

@jeremyn the problem I see is that if a malformatted issue is not autoclosed, it will linger in the repo issues without ever getting labelled. We could, for instance, label all malformatted issues as malformatted, but this causes extra work for maintainers (either to close that issue or to open a new one with correct formatting).

Hainish commented 7 years ago

I think it's appropriate to close with a polite comment, especially if we have well defined guidelines in the issue template.

jeremyn commented 7 years ago

Autoclosing an issue with a bot is about the harshest thing you can do to an issue, regardless of how nice the bot's form comment is. It is just not worth doing that here.

Ansible is a good example of a repository that uses a bot (@ansibot). Keep in mind the Ansible repository probably gets ten times more traffic than we do and their issues are much more complex. Here are closed issues where ansibot has commented. In https://github.com/ansible/ansible/issues/24982#issuecomment-303737725 you can see ansibot adding a comment asking for more info but it does not close the issue. When an issue needs to be closed, a person gets involved, for example see https://github.com/ansible/ansible/issues/24956#issuecomment-303555936.

By the way, the ansibot code is publicly available here.

Hainish commented 7 years ago

If this is in a separate repo which has formatting guidelines on opening issues, I see absolutely no problem with closing issues which don't follow these guidelines. This avoids cluttering our issue queue with unparsable issues which will never be addressed.

jeremyn commented 7 years ago

Unparseable doesn't mean unreadable to a human. Just give it a needs_info tag like ansibot does and move on. You can filter these issues with -label:needs_info when searching if you want.

To be clear, both of our positions are defensible. The thing is that, in my opinion, the most serious need HTTPS Everywhere has, by far, is getting more contributors and especially more reviewers. So when a choice has to be made, I prefer approaches that are friendlier to new contributors in the hopes they'll stick around. Some projects are flooded with contributors and bad issues, and for those projects reducing the noise is more important than attracting new contributors. I don't think that's HTTPS Everywhere, though.

strugee commented 7 years ago

Hey, sorry I haven't replied to this yet. I have a pretty good idea of how this should look.

Re: closing, let's punt on it for now. It's super easy to change later anyway. My guess is that if we add a note in the issue template explicitly saying, "don't change the formatting, this is read by a bot," people will do a better job. We can also improve the bot so it gets better at dealing with common formatting problems. I can also write in functionality where you can ping the bot to reparse an issue once formatting is fixed.
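The ping-to-reparse idea could be as simple as matching a mention in new comments. The bot account name and trigger word below are made up for illustration:

```python
# Hypothetical: let anyone mention the bot in a comment to trigger a
# reparse once the issue's formatting has been fixed. The account name
# and "reparse" command are placeholders.
BOT_NAME = 'https-everywhere-bot'  # assumed account name

def is_reparse_request(comment_body):
    """True if a comment asks the bot to re-check an issue's formatting."""
    return f'@{BOT_NAME} reparse' in comment_body.lower()
```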

So, I think the next action item is to create a repo, right? @Hainish could you make that happen, and give me admin (or just write) access to it? Also, do we want the bot to have a separate repo? I vote yes; I think one of the biggest benefits to splitting out the rules is that you can clone the repo and get just the rules with no code. (Unless we wanted the bot to sit in this repo.)

jeremyn commented 7 years ago

@strugee You can get started developing the bot code in a separate repository that you own. When it's ready or at least near completion, the code can be transferred to the EFF if they think that's appropriate.

For development you might want to automate creating a small issue repository with each issue as a single test case. Also, test mocks for the GitHub API seem to be a well-traveled path, though I can't give personal recommendations. In any case, running the bot against an EFF-owned repository should come at the end of development, and the bot should probably be owned either by @Hainish, or @Hainish can create a new bot-only user and run the bot under that.
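For the mocking side, the standard library's `unittest.mock` is enough to exercise the bot's logic without touching the real GitHub API. The handler and label name below are illustrative only:

```python
from unittest import mock

# Hypothetical handler: label malformatted issues instead of closing
# them. "handle_issue" and the "needs_info" label are illustrative
# names, not code from any existing bot.
def handle_issue(issue):
    if 'Domain:' not in (issue.body or ''):
        issue.add_to_labels('needs_info')

def test_malformatted_issue_gets_labelled():
    issue = mock.Mock()
    issue.body = 'please add my site'
    handle_issue(issue)
    issue.add_to_labels.assert_called_once_with('needs_info')

def test_wellformed_issue_left_alone():
    issue = mock.Mock()
    issue.body = 'Domain: eff.org'
    handle_issue(issue)
    issue.add_to_labels.assert_not_called()
```

Because the mock records every call, the tests can assert exactly which labels and comments the bot would have sent, with no network access at all.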

Hainish commented 7 years ago

I was going to respond with basically exactly what @jeremyn said. When it's ready to be handed off, we'll simply have to create a new GitHub API key, repo, and issue template for that repo.

@strugee one thing I've been doing recently for these standalone tools is putting them in the utils/ folder within HTTPS Everywhere. This keeps all the associated utilities for the project in one place, which I consider a bit neater. Also, I've been creating Dockerfiles for easy deployment. If you're familiar with docker and are so inclined, by all means follow suit. Otherwise I'll probably just dockerize it after you hand it off.
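For reference, a minimal Dockerfile for a Python-based bot could be as simple as the following. The base image, requirements file, and entrypoint name are placeholders, not taken from any existing labeller deployment:

```dockerfile
# Hypothetical sketch of a Dockerfile for the bot; all filenames here
# are assumptions.
FROM python:3-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "bot.py"]
```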

As far as deployment goes, I'll most likely just stand this up on the same server as we host the labeller.

strugee commented 7 years ago

In any case running the bot against an EFF-owned repository should come at the end of the development

Oh yeah, that was always the plan. Wouldn't want to spam the tracker :)

@strugee one thing I've been doing recently for these standalone tools is putting them in the utils/ folder within HTTPS Everywhere. This keeps all the associated utilities for the project in one place, which I consider a bit neater.

Cool. So I'll develop this in utils/ in a branch and send a PR then?

Also, I've been creating Dockerfiles for easy deployment. If you're familiar with docker and are so inclined, by all means follow suit.

I'm not but I've been meaning to learn. So I might try my hand at writing a Dockerfile anyway.

strugee commented 7 years ago

Ping @Hainish

Hainish commented 7 years ago

@strugee yes, develop in utils/ and send a PR please

strugee commented 7 years ago

Just as an FYI, I'm still actively working on this :)

Taking a while because of a bunch of mocks and stuff that need to be written for the tests. But after all that's done it should be really easy to develop and test without setting up GitHub and stuff.

Hainish commented 7 years ago

Thanks @strugee

ghost commented 7 years ago

@Hainish Can we auto-close the issues if they are not corrected for a week after they were first posted?

Hainish commented 7 years ago

I'm in favor of this approach, but @jeremyn had strong opinions against auto-closing so I'll let him argue the point.

jeremyn commented 7 years ago

I've already argued that position in this issue.