Restructure scanner further

harrislapiroff commented 6 years ago

Following up on #411, I think it might make sense to restructure the scanner module in this repo further and maybe even make it sort of configurable.

Right now the scanner 1. performs a domain scan using pshtt and 2. analyzes the HTML of the site with BeautifulSoup and our own custom criteria. I could imagine us wrapping more functionality into the scanner down the road or customizing it for different uses.

For now, I'd like to decouple the scanner.py module from repo-specific models like SecuredropPage and Result and provide some utility methods that form the glue between them. The scanner itself should just take a URL or list of URLs and return the results.
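To make the shape concrete, here is a rough sketch of the interface I have in mind. Everything in it is hypothetical: `scan_urls`, `fake_scan_one`, and the field names are placeholders, and the real pshtt and HTML analysis is replaced by a stub so the example runs offline.

```python
from typing import Callable, Dict, Iterable, List, Union


def scan_urls(
    urls: Union[str, Iterable[str]],
    scan_one: Callable[[str], Dict[str, object]],
) -> List[Dict[str, object]]:
    """Scan one URL or many and return plain dictionaries, no Django models."""
    if isinstance(urls, str):
        # Accept a single URL as a convenience.
        urls = [urls]
    # Tag each result with the URL it came from.
    return [dict(scan_one(url), landing_page_url=url) for url in urls]


def fake_scan_one(url: str) -> Dict[str, object]:
    # Stand-in for the real scan (assumption): returns canned values only.
    return {"live": True, "forces_https": url.startswith("https://")}


results = scan_urls("https://example.com", fake_scan_one)
```

The glue utilities would then be the only code that turns these dictionaries into SecuredropPage or Result objects.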

harrislapiroff commented 5 years ago

Seeking thoughts from @chigby on this!

chigby commented 5 years ago

We are actually pretty close to this already. Right now the scan and bulk_scan methods are fairly thin wrappers around a perform_scan function that actually does most of the work. I think we could fairly straightforwardly:

1. Move scan and bulk_scan and their associated model-logic into the directory app.
2. Update perform_scan to return a dictionary or other generic object containing the same fields it does now.
3. Excise a few stray references to Django within the scanner to have a fully independent scanner app that could potentially be completely lifted out.
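As a rough illustration of the "generic object" option, perform_scan could return something like the dataclass below, with the directory app converting it into model fields. The class name `ScanResult`, the helper `to_model_kwargs`, and the field names are assumptions for the sketch, not the actual fields on our Result model.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ScanResult:
    # Hypothetical field names, chosen only for illustration.
    landing_page_url: str
    live: bool
    forces_https: bool
    no_analytics: bool


def to_model_kwargs(result: ScanResult) -> dict:
    """Directory-side glue: turn the generic object into keyword arguments."""
    return asdict(result)


kwargs = to_model_kwargs(ScanResult("https://example.com", True, True, False))
```

With this split, the scanner never imports Django, and only the directory app knows how the keyword arguments map onto a model.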

One thing I've been thinking about for future restructuring is having the scanner return more detailed results that pertain strictly to the landing page being scanned, while giving the directory the responsibility of processing those results into a pass/fail/moderate/severe grading scale. One immediate advantage of doing this is the potential for more informative error messages, either to be displayed on the entry page with the warning or for us to use when working with the landing page provider to bring their page up to our standards. For example, right now the only warning text that contains specific information is the subdomain warning, which we are only able to show because the subdomain is part of the landing page address. Other warnings, like the presence of a CDN or analytics, are just True/False and don't give any more details to the user of the site or to us when we try to investigate. I could see many of the True/False fields returned by the scanner being changed to report specifically what data was detected, with judgment of that data left to whoever is using the scanner.
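A small sketch of what that split might look like: the scanner reports what it found, and the directory decides how bad it is. The keys in `findings`, the `grade` function, and the thresholds are all made up for illustration; the real grading rules would need to be worked out separately.

```python
from typing import Any, Dict, List

# Hypothetical scanner output: what was detected, with no judgment applied.
findings: Dict[str, Any] = {
    "analytics_scripts": ["https://analytics.example/tracker.js"],
    "cdn_hosts": [],
    "subdomain": "tips",
}


def grade(findings: Dict[str, Any]) -> Dict[str, Any]:
    """Directory-side judgment: turn raw findings into a grade plus messages."""
    warnings: List[str] = []
    for src in findings.get("analytics_scripts", []):
        warnings.append(f"Analytics script loaded from {src}")
    for host in findings.get("cdn_hosts", []):
        warnings.append(f"Assets served from CDN host {host}")
    if findings.get("subdomain"):
        warnings.append(f"Landing page is on the subdomain '{findings['subdomain']}'")

    # Illustrative thresholds only (assumption), not our actual criteria.
    if not warnings:
        level = "pass"
    elif len(warnings) == 1:
        level = "moderate"
    else:
        level = "severe"
    return {"grade": level, "warnings": warnings}


report = grade(findings)
```

Because the warnings are built from the detected values, each message can name the specific script or host involved instead of just reporting True/False.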

I think having this as part of a future restructuring plan would provide some concrete benefits to us for how we're using the scanner now, and would also make it a more flexible tool for use in different projects and contexts.

freedomofpress / securedrop.org

Restructure scanner further #418