EFForg / https-everywhere

A browser extension that encrypts your communications with many websites that offer HTTPS but still allow unencrypted connections.
https://eff.org/https-everywhere

Would you provide one tool to generate rules? #6912

Closed ivysrono closed 7 years ago

ivysrono commented 7 years ago

For example, a tool to search for subdomains, get certificate status, run the checks locally automatically, and so on.

jeremyn commented 7 years ago

That would be nice. I mentioned wanting those sorts of tools here.

fuglede commented 7 years ago

I'm not sure what you mean by cert status and auto-checking in local exactly, but it sounds like what our tests achieve? Those you can (and probably should) run locally.

It would be nice to have a subdomain finder utility as well though (cf. #4279).

jeremyn commented 7 years ago

It should be straightforward for one script, given some wildcard domains, to generate a ruleset that pre-resolves the to-do list in the various pull requests I link in issue #6863, including the one for udn.com that @renyouguo was involved with.

Creating a script to use Google to filter domains, run curl to note basic failures, and create an XML file to note the results should be within the skill-set of a beginner programmer. It would be slightly more complicated to identify mixed content and to crawl sites looking for static.-type domains, but not too much harder. Here's a simple Google search discussing solutions to the mixed content problem. Humans should only be needed to write exclusions and custom rules.

The real challenge for anyone wanting to make this happen is being willing to focus on a medium-sized task like this, make it work within the existing ecosystem, and endure the inevitable bikeshedding from myself and others.

Foorack commented 7 years ago

@jeremyn This idea has had my interest for a while. I've already made an attempt at it, but it's not finished yet. It can be optimized a lot, but this is only a proof of concept. To test it I have been running it against a portion of the europa.eu domain, as it has 788 functional subdomains. :+1: My personal test approach has been as follows:

  1. First use Sublist3r (without the bruteforce option!), which uses Google, Bing, dnsdumpster and a few other sources to find all subdomains. This does generate some false positives, which I sort out in the next step. (If you try to run the script, the headers and the colour characters in the output from Sublist3r need to be removed.)
  2. Probe each subdomain with a curl HEAD request to see whether it responds to basic HTTP. If not, it should be safe to assume it was a false positive.
  3. Compile the ruleset with a Python script.
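Step 2 above can be sketched in Python instead of curl. This is only an illustration, not the code from the gist: the function names `responds_to_http` and `filter_live_hosts` are made up here, and the probe uses the standard library's `urllib` rather than curl. The probe is passed in as a parameter so the filtering logic can be exercised without touching the network.

```python
import http.client
import urllib.error
import urllib.request


def responds_to_http(host, timeout=5.0):
    """Return True if a HEAD request to http://host/ gets any HTTP response.

    Hypothetical helper: urllib follows redirects on its own, so this only
    answers "does something speak HTTP here", nothing more.
    """
    request = urllib.request.Request(f"http://{host}/", method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered with a 4xx/5xx status, so the host exists.
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # DNS failure, refused connection, timeout, or a malformed reply/host.
        return False


def filter_live_hosts(candidates, probe=responds_to_http):
    """Drop blanks, duplicates and hosts the probe rejects, keeping input order."""
    seen = set()
    live = []
    for raw in candidates:
        host = raw.strip().lower()
        if not host or host in seen:
            continue
        seen.add(host)
        if probe(host):
            live.append(host)
    return live
```

Calling `filter_live_hosts(lines)` on the cleaned-up Sublist3r output would then give the list of hosts to feed into step 3.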

I am still planning on implementing automatic mixed-content checking, maybe with bramus's mixed-content-scan (https://github.com/bramus/mixed-content-scan), but most features, such as detecting different content, https->http redirects, timeouts, certificate-chain errors and so on, are already implemented and working.

Proof-of-concept source code: https://gist.github.com/Foorack/51893f7d0f8b16d6d8be25073131d4b6

Example output of running it on a subset of 50 subdomains on europa.eu: ( domains are auto-sorted :) ) https://gist.github.com/Foorack/7be99f802b0943932f06e5c1f28973bd

Update: Updated bash script to fix the bug mentioned in step 1.

jeremyn commented 7 years ago

@Foorack That's awesome that you're already looking at that! Thank you. Please feel free to submit it as a pull request to get more feedback when you're ready. A basic tool that is ready to use will be more useful than something with big plans that we never see, you know what I mean?

I haven't reviewed your code in super detail, and not to make huge criticisms at an early stage, but two big structural changes I'd like to see are:

ivysrono commented 7 years ago

lijiejie/subDomainsBrute: A simple and fast sub domain brute tool for pentesters. Maybe better than Sublist3r.

Foorack commented 7 years ago

@jeremyn Thanks! A few questions, however. First, as this has external dependencies, would it be best to make it a sub-project or just push it as a utility script? Secondly, I am a beginner Python developer and I am still shocked the code actually works as I intended. Therefore I assume there are lots of improvements which can be made to the code. This is only a proof of concept, hence the use of bash; my plan is to port it over to Python once I get the fundamentals working. Regarding the second point, for what reasons would it be beneficial to switch to an XML writer instead of the current system of printing to an output stream? Unless it offers clear improvements, I don't see the point in writing the comment section to an external file.

@renyouguo I have seen that project before. I have started experimenting with this idea, but I am in absolutely no position to decide which approach should be taken! That said, I am strongly against the idea of bruteforcing subdomains. It increases the probability of finding internal or debug DNS addresses, whereas searches on Baidu and Google would in most cases only return addresses used by the public.

jeremyn commented 7 years ago

@Foorack In the end this tool should go into the main HTTPS Everywhere repository like https_everywhere_checker, but for development you can do whatever you want. You might find it easier to make a separate project for development rather than try to manage changes and feedback in a pull request or Gist.

XML writing is a solved problem and you should use an existing solution rather than reinvent the wheel. I recommend lxml since it's already a dev dependency.
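As a rough sketch of what "use an existing XML writer" could look like: the example below uses the standard library's `xml.etree.ElementTree` rather than lxml so that it runs without extra dependencies; `lxml.etree` offers a largely compatible element API. The function name `build_ruleset`, the tab indentation, and placing the notes as comments inside the root element are all assumptions made for illustration, not the project's actual tooling or style rules. `ET.indent` needs Python 3.9 or later.

```python
import xml.etree.ElementTree as ET


def build_ruleset(name, hosts, notes=()):
    """Return a minimal ruleset as an XML string.

    Hypothetical helper. The library escapes attribute values for us; note
    that ElementTree does not validate comment text, so notes must not
    contain "--".
    """
    ruleset = ET.Element("ruleset", {"name": name})
    for note in notes:
        # Notes (e.g. hosts that failed a check) become XML comments.
        ruleset.append(ET.Comment(f" {note} "))
    for host in hosts:
        ET.SubElement(ruleset, "target", {"host": host})
    # The trivial "upgrade everything" rule.
    ET.SubElement(ruleset, "rule", {"from": "^http:", "to": "https:"})
    ET.indent(ruleset, space="\t")
    return ET.tostring(ruleset, encoding="unicode") + "\n"


if __name__ == "__main__":
    print(build_ruleset(
        "Example",
        ["example.com", "www.example.com"],
        notes=["Timeout on https: old.example.com"],
    ), end="")
```

The practical gain over printing strings is that quoting and escaping of names like `AT&T` are handled by the library, and the output can be re-parsed to check it before it is written to a file.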

ivysrono commented 7 years ago

@Foorack Sublist3r only works with Python 2.7. If #6749 is fixed someday, will this have to change again?

Foorack commented 7 years ago

I just realised this idea cannot be efficiently realised. Let's say this program is "finished", so it efficiently covers every possible situation: how would the reviewers possibly be able to keep up and review the rulesets? If this program is created, what would stop someone from just hooking it up to the Alexa top 100-million list? I think questions like these need to be sorted out before the actual implementation of this program.

Example: https://github.com/EFForg/https-everywhere/pull/6991

jeremyn commented 7 years ago

@Foorack The idea is that this tool could be used by a contributor who wants to make or update a ruleset for a site they're interested in. It would produce such a high quality ruleset that the reviewer only needs to make a few tweaks, if any.

I would rather have the problem of having too many high quality rulesets to review than have to go through the tedious testing and style corrections needed for most rulesets, such as what you and I did in pull request #6857. I know you mean well, but you are proposing that we slow down the ruleset creation process by not creating helpful tools, so both contributors and I have to spend more time on tedium.

I agree that in this hypothetical post-scarcity world where we are flooded with millions of high quality rulesets, we'll have new concerns, for example see the discussion here and following. But we could just make new rules, such as "everything in the Alexa top-10k is okay, anything below must pass some sort of Wikipedia-style notability test".

galeksandrp commented 7 years ago

https://github.com/galeksandrp/https-everywhere/tree/check

git clone -b check https://github.com/galeksandrp/https-everywhere.git ~/workspace
wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2 -O - | tar xj -C ~/workspace
export CLIENT_ID= #GitHub OAuth Application client id
export CLIENT_SECRET= #GitHub OAuth Application client secret
export CSE_CX= #Google Custom Search CX
export CSE_KEY= #Google Custom Search key
chmod +x ~/workspace/*.sh
cd src/chrome/content/rules
~/workspace/generate.sh eff.org

Update 1: https://github.com/galeksandrp/https-everywhere/tree/check-sublist3r is a version with Sublist3r. The problem is that Google rate-limits the requests. That limit does not apply to Google CSE, but Sublist3r does not support Google CSE. theHarvester does, but it retrieves far fewer domains than Sublist3r.

jeremyn commented 7 years ago

@galeksandrp For most of your pull requests, do you just find a domain that needs HTTPS support and then run your script against it?

galeksandrp commented 7 years ago

@jeremyn Yes. I run it on every domain I see on the internet, even if it already uses HTTPS by default. HTTPSE is like HSTS preload for webmasters who are too lazy to add a single header to their web server config, isn't it? However, Google Custom Search is limited to the first 100 results (that's why I can't close the famous aist.go.jp and fnal.gov rulesets). It also can't check mixed content yet, though there are https://www.jitbit.com/sslcheck/ and other phantomjs checkers, and even curl | grep 'http://'. It looks like in the end a similar checker by @Hainish will be attached to a GitHub Issues webhook and there will be no more reviewing at all.
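A slightly more selective version of the `curl | grep 'http://'` idea can be sketched with Python's standard `html.parser`. It only looks at a fixed set of tag attributes that load subresources and ignores ordinary `<a href>` links, which are navigation rather than mixed content. The names `InsecureResourceFinder` and `find_insecure_resources` are made up for this example; it is not what the scripts in this thread do, and it cannot see resources added by JavaScript or referenced from CSS, which is where the phantomjs-based checkers come in.

```python
from html.parser import HTMLParser

# Tag -> attribute that makes the browser fetch a subresource.
# <link href> covers stylesheets but also over-reports things like
# rel="canonical", so treat its hits as candidates to check by hand.
SUBRESOURCE_ATTRS = {
    "script": "src",
    "img": "src",
    "iframe": "src",
    "audio": "src",
    "video": "src",
    "source": "src",
    "embed": "src",
    "link": "href",
}


class InsecureResourceFinder(HTMLParser):
    """Collect (tag, url) pairs for subresources referenced over plain http."""

    def __init__(self):
        super().__init__()
        self.insecure = []

    def handle_starttag(self, tag, attrs):
        wanted = SUBRESOURCE_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                url = value.strip()
                if url.lower().startswith("http://"):
                    self.insecure.append((tag, url))


def find_insecure_resources(html):
    """Return the list of plain-http subresources found in an HTML string."""
    finder = InsecureResourceFinder()
    finder.feed(html)
    finder.close()
    return finder.insecure
```

Feeding it the body of a page fetched over HTTPS gives a list of candidate mixed-content URLs that a contributor could note in the ruleset.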

And I'm also bored of adding Cloudflare-protected sites; we should force HTTPS for their IPs and maintain a list of exclusions.

jeremyn commented 7 years ago

@galeksandrp It's cool that your process is far enough along to generate human-quality rulesets. It might be a good base for the tools we're discussing in this issue.

The thing is that the PRs generated by your process sometimes have simple problems. They leave out domains that can be found with an easy search, or they have simple mis-sorting. I can find these problems in your PRs from last week, a month ago, two months ago. You've submitted and we've merged enough of your PRs that I'm sure you can find these problems yourself. I think automation is great and I hope you keep working on it, but please review and correct the output before submitting it as a PR, and also go through your existing open PRs and do the same thing.

I don't understand your point about fnal.gov (PR #5265) and aist.go.jp (PR #5222). You can just go in and make the changes yourself; in fact, for fnal.gov I gave you a long, specific list.

jeremyn commented 7 years ago

I created a pull request for Sublist3r to modify the output sort order to match what we like, see https://github.com/aboul3la/Sublist3r/pull/38 .

jeremyn commented 7 years ago

@aboul3la has merged my sorting change into Sublist3r (thanks!), so the subdomains produced by Sublist3r should now be sorted the way we want it.
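For readers wondering what such a sort might look like, here is a minimal sketch. It assumes the goal is to group hosts by parent domain, reading labels from the top-level domain leftwards, with the bare domain first and its `www` variant right after it. The function name `subdomain_sort_key` is illustrative, and the exact rule in the merged Sublist3r change may differ.

```python
def subdomain_sort_key(hostname):
    """Sort key: compare labels right to left, with 'www.' next to its parent.

    Illustrative only. 'www.a.example.com' gets the same label list as
    'a.example.com' plus a tie-breaker of 1, so it sorts directly after it
    and before any deeper subdomain such as 'b.a.example.com'.
    """
    labels = hostname.lower().split(".")[::-1]
    if labels[-1] == "www":
        return labels[:-1], 1
    return labels, 0


if __name__ == "__main__":
    hosts = [
        "b.example.com",
        "www.example.net",
        "www.a.example.com",
        "example.com",
        "a.example.com",
        "www.example.com",
        "example.net",
        "b.a.example.com",
    ]
    for host in sorted(hosts, key=subdomain_sort_key):
        print(host)
```

Because the key is a plain function, the same ordering can be reused when writing `<target>` lines into a ruleset.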