18F / domain-scan

A lightweight pipeline, locally or in Lambda, for scanning things like HTTPS, third party service use, and web accessibility.

Improving sslyze caching for SMTP scans #254

Closed jsf9k closed 6 years ago

jsf9k commented 6 years ago

This pull request adds code that, when running an sslyze scan against an SMTP server, checks the sslyze cache to see if we already have results for the SMTP server, possibly from a different domain.

I have found that SMTP servers (as compared to HTTP/HTTPS servers) are much more sensitive to having multiple connections made to them. When testing the various ciphers, sslyze makes a lot of connections, and multiple domains often use the same SMTP servers, so if the user specified the --cache flag it makes sense to check whether we have already scanned a given mail server while testing a different domain.

Note that as the sslyze cache grows in size, sslyze SMTP scans run slower, since reading all the JSON files in the cache takes time. The scans produce considerably fewer errors, though.
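A minimal sketch of the lookup described above, assuming a hypothetical cache layout in which each scanned domain's sslyze results are stored as one JSON file containing a "hostname" field (the function name and cache format are illustrative, not domain-scan's actual API):

```python
import glob
import json
import os


def find_cached_smtp_result(cache_dir: str, smtp_hostname: str):
    """Search existing sslyze cache entries for a previous scan of the
    same SMTP server, possibly recorded under a different domain.

    Returns the cached result dict, or None if this server has not
    been scanned yet.
    """
    # Walks every JSON file in the cache, which is why lookups slow
    # down as the cache grows.
    for path in glob.glob(os.path.join(cache_dir, "*.json")):
        try:
            with open(path) as f:
                cached = json.load(f)
        except (OSError, ValueError):
            continue  # skip unreadable or partially written entries
        if cached.get("hostname") == smtp_hostname:
            return cached
    return None
```

The linear scan over the whole cache directory is exactly the cost konklone raises later in the thread.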

jsf9k commented 6 years ago

CircleCI seems to fail when checking out the repository:

Cloning into '.'...
Warning: Permanently added the RSA host key for IP address '192.30.253.112' to the list of known hosts.

Permission denied (publickey).

fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
Exited with code 128

IanLee1521 commented 6 years ago

That seems odd, given that it is a public repository...

jsf9k commented 6 years ago

@IanLee1521 it looks like CircleCI is checking out the repo with an ssh URL.

IanLee1521 commented 6 years ago

I guess I would have assumed that they (CircleCI) would have an SSH key in GitHub that would allow that to work...

konklone commented 6 years ago

> I guess I would have assumed that they (CircleCI) would have an SSH key in GitHub that would allow that to work...

Yeah, I think that's my bad and that I busted the integration, when I revoked what I thought was an individual-level permission but which actually had org-level implications. I'll fix this up.

konklone commented 6 years ago

I completely get the problem here, but I feel like a more efficient/elegant solution would be to keep a single in-memory (or maybe on-disk) cache of hostnames as they've been seen during the scan, rather than searching the whole file system each time for each scan. That seems pretty time/CPU-expensive.

The environment dicts sent to init_domain are dupe'd each time to prevent individual scans from affecting each other, so it'd have to either be a new dict, or stored elsewhere. The main complexity I see is that there could be race conditions depending on the approach. Any thoughts?
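One way to sketch the in-memory approach suggested here, with a lock to address the race conditions mentioned (the names and structure are hypothetical, not code from domain-scan):

```python
import threading

# Hypothetical process-wide cache of SMTP scan results, keyed by mail
# server hostname.  Kept outside the per-domain environment dicts,
# which are duplicated for each scan.
_smtp_results: dict = {}
_smtp_results_lock = threading.Lock()


def get_or_scan(hostname: str, scan_func):
    """Return the cached result for hostname, otherwise run
    scan_func(hostname) and cache its result."""
    with _smtp_results_lock:
        if hostname in _smtp_results:
            return _smtp_results[hostname]
    # Run the (slow) scan outside the lock so other hostnames proceed.
    result = scan_func(hostname)
    with _smtp_results_lock:
        # Another worker may have finished first; keep the earliest
        # stored result so all callers see the same data.
        return _smtp_results.setdefault(hostname, result)
```

Releasing the lock during the scan means two workers can still scan the same hostname concurrently the first time it appears; `setdefault` just ensures they converge on one cached result afterward.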

I also think that --cache may not be what you want to rely on here -- that's typically used to resume scans. In our case we need to not use --cache so that we definitely run a fresh scan of the outside world overall, and you wouldn't want an intra-scan caching mechanism to cause the entire scan to rely on older data. However, your use case may differ.

jsf9k commented 6 years ago

@konklone, I agree with what you're saying. I was trying to support my use case without touching anything outside of domain-scan/scanners/sslyze.py. If you'd be OK with me modifying domain-scan/scan I can add support for a separate CLI flag (something like --use-in-memory-cache) and an extra argument in_memory_cache that gets passed into init_domain() and is updated in perform_scan().

In other words, if --use-in-memory-cache is present then the disk cache will be duplicated in in_memory_cache by perform_scan() and passed to init_domain().
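A rough sketch of how that plumbing could look, under the stated proposal; the flag name and `in_memory_cache` argument come from the comment above, but the argument parsing, cache format, and function bodies here are illustrative assumptions, not domain-scan's real interfaces:

```python
import argparse
import json
import os
from glob import glob


def parse_args(argv):
    # Only the proposed flag is shown; the real domain-scan CLI
    # (domain-scan/scan) has many more options.
    parser = argparse.ArgumentParser()
    parser.add_argument("--use-in-memory-cache", action="store_true")
    return parser.parse_args(argv)


def load_disk_cache(cache_dir):
    """Duplicate the on-disk sslyze cache into a dict keyed by
    hostname, so later lookups avoid re-reading every JSON file."""
    cache = {}
    for path in glob(os.path.join(cache_dir, "*.json")):
        try:
            with open(path) as f:
                entry = json.load(f)
        except (OSError, ValueError):
            continue
        hostname = entry.get("hostname")
        if hostname:
            cache[hostname] = entry
    return cache


def init_domain(domain, environment, options, in_memory_cache=None):
    # Hypothetical: consult the shared cache before scanning.
    if in_memory_cache is not None and domain in in_memory_cache:
        return in_memory_cache[domain]
    return None  # fall through to a fresh scan
```

With this shape, perform_scan() would call load_disk_cache() once when --use-in-memory-cache is present and pass the resulting dict into each init_domain() call.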

Let me know if you're OK with that, or if you have better ideas.

jsf9k commented 6 years ago

I'm going to close this PR and open a new one.