18F / domain-scan

A lightweight pipeline, locally or in Lambda, for scanning things like HTTPS, third party service use, and web accessibility.
Other
371 stars 137 forks source link

Scanner dependency problem #235

Open jsf9k opened 6 years ago

jsf9k commented 6 years ago

In #224 we separated requirements.txt into several different requirements.txt files, which I think is a good thing. In addition to one minor bug that I fixed in commit cc141091188b060a144a4dd9307262a96d06daf7 in #234, there is a larger issue with the scanners.

In pshtt.py and in sslyze.py we import pshtt and sslyze, respectively, at the top of the file. This means that when running ./scan even in Lambda one must have pshtt and trustymail installed locally, on the host where you are running ./scan. This goes against the intent of the splitting of requirements.txt in #224.

Note that trustymail.py does not have this problem because we only import trustymail in the scan() function. Hence it is only imported inside the Lambda function and not by the host where ./scan is being run.

I'm not sure it's worth fixing this now, since there is an obvious workaround, but I wanted to make sure this gets taken into account in the work @tadhg-ohiggins is doing in #232. If the scanner classes can work in such a way that the dependencies specific to that particular scanner only get imported inside the Lambda function (when running in Lambda) then we can keep the local dependencies to a minimum.

I hope all this makes sense. Please ask me for clarification if it's not. :smiley:

konklone commented 6 years ago

This makes perfect sense. Though:

This goes against the intent of the splitting of requirements.txt in #224.

Splitting the requirements that way was meant (for me, anyway) to make packaging Lambda functions lighter by eschewing local-only reqs, but not necessarily to make local dependencies lighter by eschewing Lambda-only reqs.

I don't think there's as much of a need to do the latter, since there usually aren't the same kinds of constraints on network/size/etc locally as in Lambda. But maybe I'm not thinking about broader use cases. Would it help your use case to split this out further?

jsf9k commented 6 years ago

I don't think we need to split the requirements files out further. Ideally I'd like it if you only needed to install requirements.txt and the scanners you want to use locally. Right now that doesn't work because of the imports at the top of the files in scanners/, but I suspect we could get there if we moved some imports around in the scanner classes that @tadhg-ohiggins is working on (like scanners/trustymail.py is doing, although for a different reason).

Since my scanning kicks off the Lambda processes from Docker it's nice if that container doesn't have to actually install the actual scanning libraries. It's not a huge deal, but it does make the Docker images smaller. And it removes what are really unnecessary dependencies.

konklone commented 6 years ago

That makes sense, and is something I wasn't intuiting because I'm not currently using Docker in my environment.

The ideal is probably to have a standard way of conditionally importing dependencies.

One complication, though - for the pshtt scanner, pshtt is needed even locally anyway, because it uses pshtt.load_suffix_list() in the init method to manage caching the PSL locally. You're currently handling all of that in scan, using a locally packaged PSL, and don't bother with that optimization.

It seems very possible that other scanners may make use of third party deps in their init() or init_domain() or to_rows() functions, not just their scan() functions. That may make it more complicated/annoying to do that kind of separation.