18F / site-scanning

The code base for the first Site Scanning engine
https://digital.gov/site-scanning

Investigate n > 1 scans issue #672

Closed: alexbielen closed this issue 4 years ago

alexbielen commented 4 years ago

I added logging to the scans API.

There are two issues I've discovered so far:

1) There is more than one scan for a given domain, scantype, and date combination.
2) There are zero scans for a given domain, scantype, and date combination.

>1 scans

When I call the API URL for 18f.gov, it returns a 500, and I now have this convenient log message, which shows that there are two scans for 18f.gov.

18f.gov

16:43:45.526: [APP/PROC/WEB.0] 2020-08-18 20:43:45,525 scanner_ui.api.views ERROR    Scan length was 2. Expected 1 scan. {'query': {'bool': {'filter': [{'term': {'domain': '18f.gov'}}]}}, 'sort': ['domain']}
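
For context, the check that emits this log line presumably looks something like the sketch below. This is a minimal reconstruction assuming the view queries Elasticsearch via elasticsearch-dsl; the function and variable names are illustrative, not the repo's actual code.

```python
# Minimal sketch of the kind of check behind "Scan length was N. Expected
# 1 scan." Assumes elasticsearch-dsl; get_scan and its arguments are
# illustrative, not the actual scanner_ui code.
import logging

from elasticsearch_dsl import Search

logger = logging.getLogger(__name__)


def get_scan(es_client, domain):
    s = Search(using=es_client).filter("term", domain=domain).sort("domain")
    response = s.execute()
    if len(response.hits) != 1:
        # s.to_dict() is the query payload seen in the log line above.
        logger.error(
            "Scan length was %d. Expected 1 scan. %s",
            len(response.hits), s.to_dict(),
        )
        raise ValueError("expected exactly one scan")  # surfaces as a 500
    return response.hits[0]
```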

fbi.gov hits the same "Scan length was 2" error.

I haven't confirmed the root cause of the multiple scans yet, but it's likely an issue with how we're dealing with dates in the API.
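
Note that the logged query filters on domain alone, with no date or scantype constraint, which would explain duplicate hits whenever documents for more than one date get searched. Below is a hedged sketch of one possible fix direction; the per-date, per-scantype index naming is an assumption, not something confirmed from this repo.

```python
# Hypothetical fix direction: scope the search to a single date so that a
# domain/scantype/date combination can only match one document. The
# "<date>-<scantype>" index layout here is an assumption for illustration.
from elasticsearch_dsl import Search


def get_scan_for_date(es_client, domain, scantype, date):
    index = f"{date}-{scantype}"  # e.g. "2020-08-18-pagedata" (assumed)
    s = (
        Search(using=es_client, index=index)
        .filter("term", domain=domain)
        .sort("domain")
    )
    return s.execute().hits
```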

0 scans

ojjdp.ojp.gov

16:47:39.578: [APP/PROC/WEB.0] 2020-08-18 20:47:39,578 scanner_ui.api.views ERROR    Scan length was 0. Expected 1 scan. {'query': {'bool': {'filter': [{'term': {'domain': 'ojjdp.ojp.gov'}}]}}, 'sort': ['domain']}

healthreach.wip.nlm.nih.gov

18:31:35.878: [APP/PROC/WEB.0] 2020-08-18 22:31:35,877 scanner_ui.api.views ERROR    Scan length was 0. Expected 1 scan. {'query': {'bool': {'filter': [{'term': {'domain': 'healthreach.wip.nlm.nih.gov'}}]}}, 'sort': ['domain']}

I have discovered the cause of the zero scans issue: it happens when the scan for a given domain, scantype, and date combination never actually runs.

This requires a bit of explanation. The getdomains.sh script gets all of the domains that we care about, from https://github.com/GSA/data/raw/master/dotgov-domains/current-federal.csv and https://github.com/GSA/data/raw/master/dotgov-websites/pulse-subdomains-snapshot-06-08-2020-https.csv, and merges them together. It then splits them into separate CSV files, ordered alphabetically by domain (a rough Python sketch of this step follows the file list below):

xaa.csv
xab.csv
xac.csv
xad.csv
...
xai.csv
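
For illustration, here is a rough Python equivalent of that merge-and-split step. The real getdomains.sh uses shell tooling; the chunk size and the header-row handling below are assumptions.

```python
# Illustrative Python equivalent of getdomains.sh's merge-and-split step.
import csv
import urllib.request

SOURCES = [
    "https://github.com/GSA/data/raw/master/dotgov-domains/current-federal.csv",
    "https://github.com/GSA/data/raw/master/dotgov-websites/pulse-subdomains-snapshot-06-08-2020-https.csv",
]


def fetch_domains(url):
    with urllib.request.urlopen(url) as resp:
        rows = csv.reader(resp.read().decode("utf-8-sig").splitlines())
        next(rows)  # assumes a header row
        return [row[0].lower() for row in rows if row]


domains = sorted(set(d for url in SOURCES for d in fetch_domains(url)))

# Write alphabetical chunks named like split(1) output: xaa.csv, xab.csv, ...
CHUNK = 5000  # assumed; whatever value yields xaa.csv through xai.csv
for i in range(0, len(domains), CHUNK):
    suffix = "a" + chr(ord("a") + i // CHUNK)  # aa, ab, ac, ...
    with open(f"x{suffix}.csv", "w") as f:
        f.write("\n".join(domains[i:i + CHUNK]) + "\n")
```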

Each of these chunk files is then run as a separate task on Cloud.gov when CircleCI kicks off the nightly scan job. The issue is that if there isn't enough memory quota, we can't spin up a new task, so the domains contained in xae.csv through xai.csv never actually run.

The domain ojjdp.ojp.gov is in file xag.csv and the domain healthreach.wip.nlm.nih.gov is in xaf.csv.
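
One way to make this failure mode loud instead of silent is to check the exit status of each task launch. The sketch below is hypothetical: the app name, scan command, and memory size are made up, and `cf run-task` is invoked via subprocess rather than whatever the actual CircleCI job does.

```python
# Hypothetical sketch: launch one Cloud.gov task per chunk file and fail
# loudly if a task cannot be created (e.g., memory quota exhausted).
# App name, command, and memory limit are made up for illustration.
import glob
import subprocess

for chunk in sorted(glob.glob("x??.csv")):
    result = subprocess.run(
        ["cf", "run-task", "scanner", f"./scan.sh {chunk}", "-m", "2048M"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        # Without a check like this, chunks after the quota is exhausted
        # are skipped silently, which is the zero-scans symptom above.
        raise RuntimeError(f"could not start task for {chunk}: {result.stderr}")
```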

Possible Fixes

For now we want to go with option 1 (the full list of options is in the original comment, linked below).

Originally posted by @alexbielen in https://github.com/18F/Spotlight/issues/628#issuecomment-675715728

alexbielen commented 4 years ago

This is looking good in the logs. Going to close!