[scanner integration] Enable landing page scanner and collect baseline scan data

eloquence commented 6 years ago

As one of the preparatory steps for #488, we should enable the existing landing page scanner and collect baseline data for all current SecureDrop directory entries. This will also take the scanner through its paces and help us discover operational issues that need to prioritized prior to a full integration.

Provided the scanner is stable, as part of this task, we should generate a CSV of the scan results for all landing pages.

Note that all code that displays information on SecureDrop.org based on scan results must be/remain disabled. Scan results should only be visible in the Wagtail admin interface, and be otherwise without consequence for the site.

eloquence commented 6 years ago

Running manage.py scan in production once would be a good start, but to close this task, we should also generate the aforementioned CSV dump. We don't have to enable a cron-job yet as we'll likely have to make a few fixes to the scanner logic base don the initial data.

redshiftzero commented 6 years ago

Heads up the cron job should not generate a ton of data because the scanning should only add new rows when a scan result changes (see https://github.com/freedomofpress/securethenews/issues/84#issuecomment-302789555).

conorsch commented 6 years ago

Running now, will report results.

conorsch commented 6 years ago

:red_circle: Negative:

(securedrop-alpha)gcorn@www-sd:/var/www/django-alpha$ ./manage.py scan
Traceback (most recent call last):
  File "./manage.py", line 12, in <module>
    execute_from_command_line(sys.argv)
  File "/home/gcorn/securedrop-alpha/lib/python3.4/site-packages/django/core/management/__init__.py", line 364, in execute_from_command_line
    utility.execute()
  File "/home/gcorn/securedrop-alpha/lib/python3.4/site-packages/django/core/management/__init__.py", line 356, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/gcorn/securedrop-alpha/lib/python3.4/site-packages/django/core/management/base.py", line 283, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/home/gcorn/securedrop-alpha/lib/python3.4/site-packages/django/core/management/base.py", line 330, in execute
    output = self.handle(*args, **options)
  File "/var/www/django-alpha/directory/management/commands/scan.py", line 46, in handle
    bulk_scan(securedrop_pages)
  File "/var/www/django-alpha/scanner/scanner.py", line 114, in bulk_scan
    securedrop = securedrops.get(domain=result_data['Domain'])
  File "/home/gcorn/securedrop-alpha/lib/python3.4/site-packages/django/db/models/query.py", line 384, in get
    (self.model._meta.object_name, num)
directory.models.entry.MultipleObjectsReturned: get() returned more than one DirectoryEntry -- it returned 2!

From that output it's not obvious which entry triggered the additional result. Re-running with --verbosity 3 didn't change the output, so perhaps we can add support for verbosity levels to aid in debugging.

harrislapiroff commented 6 years ago

Just encountered this bug myself. We started allowing directory entries that have the same domain, but didn't update the scanner code to account for it.

I see two plausible solutions here:

Update the scanner to allow multiple scans on the same domain. This will result in some redundant scans, but would be pretty quick to do.
Rewrite the scanner to associate scan results with a domain rather than a directory entry. This would eliminate redundant scans but would take some effort.

My preference is 1, right now. I could imagine scenarios down the road where we might want the scanner to run separately on instances with the same domain. For example, if the scanner someday scans onion services as well as landing pages, we might have entries that share a domain but have different onion services and therefore would want to be scanned separately.

eloquence commented 6 years ago

I'm confused by your use of "domain" here. We have entries that share the same onion URL ([org example redacted]) but AFAICT we have no entries that share the same landing page URL. So if the scanner currently dies in the [redacted] case, it most definitely should scan both, since they're different landing pages.

harrislapiroff commented 6 years ago

@eloquence You're right—we have shared onion URLs, not shared domains. Now I'm not totally sure again what's going on. I'll investigate more. 🕵🏻‍♀️

harrislapiroff commented 6 years ago

I've identified the cause of this error: [org example redacted] This still leaves a few resolutions for a general solution to this problem. We could disallow instances sharing a domain, but I maintain that there may be multiple securedrop instances on the same domain (this issue would crop up even if they had separate landing pages, e.g., https://organization.tld/subgroup1/ and https://organization.tld/subgroup2/) because pshtt scans per domain, not landing page. (We have separate, internal, code for scanning for landing page specific things.)

My favored solution right now, actually, is to tell pshtt to scan a deduped list of domains and then write the results of each domain scan to all entries that have that domain. This resolves both the uniqueness problem and the redundant scan problem. In pseudocode:

domains = set(securedrops.values_list('domain', flat=True))  # set to dedupe
results = inspect_domains(domains, {'timeout': 10})
for result in results:
    securedrops.filter(domain=result['Domain']).update(result=result)  # could update multiple
    # actual code will be more complex in this part but you get the idea

eloquence commented 6 years ago

Thanks, @harrislapiroff . Since we currently don't have the problem you describe (at least as far as we know), now that the offending entry has been fixed, I'm assuming the scan command should run to completion. Flagging to @conorsch that we should attempt another test run here, and if that runs to completion, will file a separate issue for the problem you describe.

conorsch commented 6 years ago

Re-running ./manage.py scan in prod to evaluate results.

conorsch commented 6 years ago

Success!

(securedrop-beta)gcorn@www-sd:/var/www/django-beta$ ./manage.py scan
Scanning complete! Results added to database.

Please inspect results on the live site now, @eloquence!

eloquence commented 6 years ago

The results on the live site are mostly what I'd expect:

I see correctly dated results for all sites I've examined.
The grading is completely off, but that's to be expected (it grades extremely harshly on some criteria, e.g., cookies, while ignoring other stuff we think is important) and we don't anticipate using those grades. Ignore them.
Redirects are causing a fair number of problems -- it's fairly common for landing pages without a trailing slash to redirect to the version with a trailing slash [example redacted] . #494 will fix this; in the meantime, we should probably update landing page URLs to the canonical version.
In evaluating the results I discovered a bug in the cache control validation, #524.

It would still be very useful to generate a CSV from the latest results, for more systematically assessing how common certain issues are (e.g., use of subdomains) across the entire set of sites in the directory. @harrislapiroff could you give an initial estimate of how much work that would be?

eloquence commented 5 years ago

Just making a note that the relevant PR that we're committed to finishing as part of the 8/8-8/22 sprint is https://github.com/freedomofpress/securedrop.org/pull/527. Would also be great to then do a prod run of the scancsv command. Once that's done, we can close this issue, though we may intermittently need to re-run the CSV dump to verify improvements to the scanner.

eloquence commented 5 years ago

@conorsch ran a fresh scan and provided baseline scan results (below, CSV). As such this task is complete for the current sprint, closing. In principle we're ready to enable regular runs; filing a separate infra ticket for that. For now, will add observations to relevant tickets.

scan-2018-08-09.txt

freedomofpress / securedrop.org

[scanner integration] Enable landing page scanner and collect baseline scan data #518