18F / domain-scan

A lightweight pipeline, locally or in Lambda, for scanning things like HTTPS, third party service use, and web accessibility.

Make Lambda packaging aware of third party dependencies, incorporate them in repackaging #219

Open konklone opened 6 years ago

konklone commented 6 years ago

The pshtt and trustymail scanners each use the Public Suffix List (PSL), and the pshtt scanner also uses the Chrome preload list and the HSTS preload pending list.

The latter case (the Chrome preload and preload pending lists) is handled without repackaging, through a somewhat hacky workaround: I slice each list down to just the domain being passed in, and send that slice up dynamically as part of the payload to the function. That does work (the function then just has to ask "is 18f.gov in [18f.gov]?"), but it obviously isn't a general-case solution.
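To make that workaround concrete, here's a minimal sketch of the idea. The function names and the payload shape are made up for illustration and are not domain-scan's actual code:

```python
def slice_list_for_domain(domain, full_list):
    """Keep only the entries of full_list that equal the domain being scanned."""
    wanted = domain.lower()
    return [entry for entry in full_list if entry.lower() == wanted]


def build_payload(domain, preload_list, preload_pending):
    # Hypothetical payload shape; the real payload sent to Lambda may differ.
    return {
        "domain": domain,
        "preload_list": slice_list_for_domain(domain, preload_list),
        "preload_pending": slice_list_for_domain(domain, preload_pending),
    }


def is_preloaded(payload):
    # Inside the function, the check reduces to "is 18f.gov in [18f.gov]?"
    sliced = [entry.lower() for entry in payload["preload_list"]]
    return payload["domain"].lower() in sliced
```

The full lists stay on the machine that invokes the function, so they can be refreshed there without touching the deployed package.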

And in fact, the PSL can't work that way, because code uses it in more ways during execution than a simple "is in list" check. So right now the PSL is packaged in the function, which means it gets stale. That's not a super big deal for US government (USG) purposes, but it's a much bigger deal with a general internet dataset.
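As a rough illustration of the kind of lookup the PSL involves, here's a simplified sketch of finding a hostname's registrable domain. It ignores the PSL's wildcard and exception rules, `registrable_domain` is a hypothetical helper, and this isn't meant to represent how pshtt or trustymail actually call into the PSL:

```python
def registrable_domain(hostname, suffixes):
    """Return the public suffix plus one label, or None if there isn't one.

    `suffixes` is a set of plain PSL rules such as {"gov", "uk", "co.uk"}.
    Wildcard ("*.ck") and exception ("!www.ck") rules are not handled.
    """
    labels = hostname.lower().rstrip(".").split(".")
    # Walk from the longest candidate suffix to the shortest, so that
    # "co.uk" wins over "uk" when both rules are present.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                return None  # the hostname is itself a public suffix
            return ".".join(labels[i - 1:])
    # No rule matched: fall back to treating the last label as the suffix.
    return ".".join(labels[-2:]) if len(labels) >= 2 else None
```

The answer depends on which rules are in the list, not just on whether the hostname itself appears in it, which is why the "is in list" shortcut doesn't carry over directly.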

Some thoughts I put down in https://github.com/dhs-ncats/trustymail/pull/74#discussion_r176947548:

Perhaps it's also worth domain-scan having a generalized solution to packaging Lambda functions with third party data sources. For example, a scanner could specify the source of third party data that is needed for that scanner to run, and the Lambda deploy process could automatically fetch and re-package them during packaging and deployment.

I think there is a fundamental tension between "don't have every Lambda instance make a network request to get this data" and "don't ever have to repackage Lambda functions to stay fresh with this data". I'm comfortable pushing some burden on the repackaging process (especially given how easy you've made it with Docker), and suggesting that staying fresh with the PSL and other sources means setting up (perhaps automated) repackaging of functions on a regular basis. Having repackaging be "aware" of third party dependencies per-scanner could make this easier.
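One way the "aware of third party dependencies" idea could look, as a sketch only: each scanner declares the data files it needs and where they come from, and the packaging step fetches fresh copies into the build directory before zipping. `THIRD_PARTY_DATA`, `fetch_url`, `refresh_third_party_data`, and the directory layout are all hypothetical; only the PSL URL is the real published location.

```python
import os
import urllib.request

# Hypothetical per-scanner declaration: filename to bundle -> source URL.
PSL_URL = "https://publicsuffix.org/list/public_suffix_list.dat"
THIRD_PARTY_DATA = {
    "pshtt": {"public_suffix_list.dat": PSL_URL},
    "trustymail": {"public_suffix_list.dat": PSL_URL},
}


def fetch_url(url):
    """Download a URL and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def refresh_third_party_data(scanner, build_dir,
                             sources=THIRD_PARTY_DATA, fetch=fetch_url):
    """Fetch every data file a scanner declares into build_dir/<scanner>/.

    Returns the list of paths written. Scanners with no declared data
    sources are a no-op.
    """
    written = []
    for filename, url in sources.get(scanner, {}).items():
        data = fetch(url)
        target_dir = os.path.join(build_dir, scanner)
        os.makedirs(target_dir, exist_ok=True)
        path = os.path.join(target_dir, filename)
        with open(path, "wb") as handle:
            handle.write(data)
        written.append(path)
    return written
```

Running something like this at the start of each (perhaps scheduled) repackaging run would keep the bundled copies fresh at deploy time, without any Lambda instance making its own network request for them.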