18F / domain-scan

A lightweight pipeline, locally or in Lambda, for scanning things like HTTPS, third party service use, and web accessibility.
Other
370 stars 137 forks source link

Add fast caching #255

Closed jsf9k closed 5 years ago

jsf9k commented 6 years ago

This pull request adds support for fast (currently in-memory) caching for scanners. The only scanner that currently supports this is sslyze.

I have found that SMTP servers (as compared to HTTP/HTTPS servers) are much more sensitive to having multiple connections made to them. In testing the various ciphers sslyze makes a lot of connections, and multiple domains often use the same SMTP servers, so it makes sense to check if we have already scanned this mail server when testing a different domain.

This is a complete redo of #254.

jsf9k commented 6 years ago

I'll get back to this, probably next week.

jsf9k commented 5 years ago

@konklone I know it's been forever since this PR was created, but I was revisiting it today. I'm not understanding your proposed solution, and I had the same difficulty before. I think I'm missing some key point.

The problem I see is that the only place in scanners/sslyze.py where I can modify the fast cache is inside init_domain(). I can't update the fast cache when I get results back in scan() because the fast cache has been removed from the environment at that point in order to avoid transferring it to Lambda.

Furthermore, my intent was to use this fast cache solution for trustymail as well since it has similar problems in that the mail servers that are actual scanned are sometimes repeated across domains with completely different names. I think it is useful for this feature to be (like Lambda) available for opt-in by other scanners if folks choose to support it.

Anyway, I'm hoping to reopen this pull request for discussion since I really need this feature in trustymail and sslyze. :grin:

Also, happy thanksgiving to everyone here! :turkey:

konklone commented 5 years ago

@konklone I know it's been forever since this PR was created, but I was revisiting it today. I'm not understanding your proposed solution, and I had the same difficulty before. I think I'm missing some key point.

Well, reviewing my comments and the code, I don't totally get all of my own past comments either. But the main one that still makes sense to me is the inefficiency of iterating through the cache list each time, as opposed to using a dict. Something like environment['fastcache']['mail.example.gov'].

jsf9k commented 5 years ago

I will fix the failing tests after lunch. The tests that fail appear to be those that test the scan command line options, so I suspect they are failing because I added an option.

jsf9k commented 5 years ago

I think this pull request is finally ready for review and merging.

jsf9k commented 5 years ago

@konklone, have a quick look at my most recent two commits when you have time. If you're happy with them then I'll merge.

konklone commented 5 years ago

Awesome, looks great! :shipit: