Closed eloquence closed 5 years ago
Running manage.py scan
in production once would be a good start, but to close this task, we should also generate the aforementioned CSV dump. We don't have to enable a cron-job yet as we'll likely have to make a few fixes to the scanner logic base don the initial data.
Heads up the cron job should not generate a ton of data because the scanning should only add new rows when a scan result changes (see https://github.com/freedomofpress/securethenews/issues/84#issuecomment-302789555).
Running now, will report results.
:red_circle: Negative:
(securedrop-alpha)gcorn@www-sd:/var/www/django-alpha$ ./manage.py scan
Traceback (most recent call last):
File "./manage.py", line 12, in <module>
execute_from_command_line(sys.argv)
File "/home/gcorn/securedrop-alpha/lib/python3.4/site-packages/django/core/management/__init__.py", line 364, in execute_from_command_line
utility.execute()
File "/home/gcorn/securedrop-alpha/lib/python3.4/site-packages/django/core/management/__init__.py", line 356, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/home/gcorn/securedrop-alpha/lib/python3.4/site-packages/django/core/management/base.py", line 283, in run_from_argv
self.execute(*args, **cmd_options)
File "/home/gcorn/securedrop-alpha/lib/python3.4/site-packages/django/core/management/base.py", line 330, in execute
output = self.handle(*args, **options)
File "/var/www/django-alpha/directory/management/commands/scan.py", line 46, in handle
bulk_scan(securedrop_pages)
File "/var/www/django-alpha/scanner/scanner.py", line 114, in bulk_scan
securedrop = securedrops.get(domain=result_data['Domain'])
File "/home/gcorn/securedrop-alpha/lib/python3.4/site-packages/django/db/models/query.py", line 384, in get
(self.model._meta.object_name, num)
directory.models.entry.MultipleObjectsReturned: get() returned more than one DirectoryEntry -- it returned 2!
From that output it's not obvious which entry triggered the additional result. Re-running with --verbosity 3
didn't change the output, so perhaps we can add support for verbosity levels to aid in debugging.
Just encountered this bug myself. We started allowing directory entries that have the same domain, but didn't update the scanner code to account for it.
I see two plausible solutions here:
My preference is 1, right now. I could imagine scenarios down the road where we might want the scanner to run separately on instances with the same domain. For example, if the scanner someday scans onion services as well as landing pages, we might have entries that share a domain but have different onion services and therefore would want to be scanned separately.
I'm confused by your use of "domain" here. We have entries that share the same onion URL ([org example redacted]) but AFAICT we have no entries that share the same landing page URL. So if the scanner currently dies in the [redacted] case, it most definitely should scan both, since they're different landing pages.
@eloquence You're right—we have shared onion URLs, not shared domains. Now I'm not totally sure again what's going on. I'll investigate more. 🕵🏻♀️
I've identified the cause of this error: [org example redacted]
This still leaves a few resolutions for a general solution to this problem. We could disallow instances sharing a domain, but I maintain that there may be multiple securedrop instances on the same domain (this issue would crop up even if they had separate landing pages, e.g., https://organization.tld/subgroup1/
and https://organization.tld/subgroup2/
) because pshtt
scans per domain, not landing page. (We have separate, internal, code for scanning for landing page specific things.)
My favored solution right now, actually, is to tell pshtt
to scan a deduped list of domains and then write the results of each domain scan to all entries that have that domain. This resolves both the uniqueness problem and the redundant scan problem. In pseudocode:
domains = set(securedrops.values_list('domain', flat=True)) # set to dedupe
results = inspect_domains(domains, {'timeout': 10})
for result in results:
securedrops.filter(domain=result['Domain']).update(result=result) # could update multiple
# actual code will be more complex in this part but you get the idea
Thanks, @harrislapiroff . Since we currently don't have the problem you describe (at least as far as we know), now that the offending entry has been fixed, I'm assuming the scan
command should run to completion. Flagging to @conorsch that we should attempt another test run here, and if that runs to completion, will file a separate issue for the problem you describe.
Re-running ./manage.py scan
in prod to evaluate results.
Success!
(securedrop-beta)gcorn@www-sd:/var/www/django-beta$ ./manage.py scan
Scanning complete! Results added to database.
Please inspect results on the live site now, @eloquence!
The results on the live site are mostly what I'd expect:
It would still be very useful to generate a CSV from the latest results, for more systematically assessing how common certain issues are (e.g., use of subdomains) across the entire set of sites in the directory. @harrislapiroff could you give an initial estimate of how much work that would be?
Just making a note that the relevant PR that we're committed to finishing as part of the 8/8-8/22 sprint is https://github.com/freedomofpress/securedrop.org/pull/527. Would also be great to then do a prod run of the scancsv
command. Once that's done, we can close this issue, though we may intermittently need to re-run the CSV dump to verify improvements to the scanner.
@conorsch ran a fresh scan and provided baseline scan results (below, CSV). As such this task is complete for the current sprint, closing. In principle we're ready to enable regular runs; filing a separate infra ticket for that. For now, will add observations to relevant tickets.
As one of the preparatory steps for #488, we should enable the existing landing page scanner and collect baseline data for all current SecureDrop directory entries. This will also take the scanner through its paces and help us discover operational issues that need to prioritized prior to a full integration.
Provided the scanner is stable, as part of this task, we should generate a CSV of the scan results for all landing pages.
Note that all code that displays information on SecureDrop.org based on scan results must be/remain disabled. Scan results should only be visible in the Wagtail admin interface, and be otherwise without consequence for the site.