18F / site-scanning

The code base for the first Site Scanning engine
https://digital.gov/site-scanning

Investigate n > 1 scans issue #672

Closed: alexbielen closed this issue 4 years ago

alexbielen commented 4 years ago

I added logging to the scans API.

There are two issues I've discovered so far:

1) There is more than one scan for a given domain, scantype, and date combination.
2) There are zero scans for a given domain, scantype, and date combination.

>1 scans

When I call the API URL for 18f.gov, it returns a 500, and I now have this convenient log message, which shows that there are two scans for 18f.gov.

18f.gov

16:43:45.526: [APP/PROC/WEB.0] 2020-08-18 20:43:45,525 scanner_ui.api.views ERROR    Scan length was 2. Expected 1 scan. {'query': {'bool': {'filter': [{'term': {'domain': '18f.gov'}}]}}, 'sort': ['domain']}
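
For context, the check that emits this log line presumably looks something like the sketch below. This is a minimal reconstruction assuming the view queries Elasticsearch via elasticsearch-dsl; the function and variable names are illustrative, not the repo's actual code.

```python
# Minimal sketch of the kind of check behind "Scan length was N. Expected
# 1 scan." Assumes elasticsearch-dsl; get_scan and its arguments are
# illustrative, not the actual scanner_ui code.
import logging

from elasticsearch_dsl import Search

logger = logging.getLogger(__name__)


def get_scan(es_client, domain):
    s = Search(using=es_client).filter("term", domain=domain).sort("domain")
    response = s.execute()
    if len(response.hits) != 1:
        # s.to_dict() is the query payload seen in the log line above.
        logger.error(
            "Scan length was %d. Expected 1 scan. %s",
            len(response.hits), s.to_dict(),
        )
        raise ValueError("expected exactly one scan")  # surfaces as a 500
    return response.hits[0]
```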

fbi.gov hits the same "Scan length was 2" error.

I haven't confirmed the root cause of the multiple scans yet, but it's likely an issue with how we're dealing with dates in the API.
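
Note that the logged query filters on domain alone, with no date or scantype constraint, which would explain duplicate hits whenever documents for more than one date get searched. Below is a hedged sketch of one possible fix direction; the per-date, per-scantype index naming is an assumption, not something confirmed from this repo.

```python
# Hypothetical fix direction: scope the search to a single date so that a
# domain/scantype/date combination can only match one document. The
# "<date>-<scantype>" index layout here is an assumption for illustration.
from elasticsearch_dsl import Search


def get_scan_for_date(es_client, domain, scantype, date):
    index = f"{date}-{scantype}"  # e.g. "2020-08-18-pagedata" (assumed)
    s = (
        Search(using=es_client, index=index)
        .filter("term", domain=domain)
        .sort("domain")
    )
    return s.execute().hits
```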

0 scans

ojjdp.ojp.gov

16:47:39.578: [APP/PROC/WEB.0] 2020-08-18 20:47:39,578 scanner_ui.api.views ERROR    Scan length was 0. Expected 1 scan. {'query': {'bool': {'filter': [{'term': {'domain': 'ojjdp.ojp.gov'}}]}}, 'sort': ['domain']}

healthreach.wip.nlm.nih.gov

18:31:35.878: [APP/PROC/WEB.0] 2020-08-18 22:31:35,877 scanner_ui.api.views ERROR    Scan length was 0. Expected 1 scan. {'query': {'bool': {'filter': [{'term': {'domain': 'healthreach.wip.nlm.nih.gov'}}]}}, 'sort': ['domain']}

I have discovered the cause of the zero scans issue: it happens when the scan for a given domain, scantype, and date combination never actually runs.

This requires a bit of explanation. The getdomains.sh script gets all of the domains that we care about, from https://github.com/GSA/data/raw/master/dotgov-domains/current-federal.csv and https://github.com/GSA/data/raw/master/dotgov-websites/pulse-subdomains-snapshot-06-08-2020-https.csv, and merges them together. It then splits them into separate CSV files, ordered alphabetically by domain (a rough Python sketch of this step follows the file list below):

xaa.csv
xab.csv
xac.csv
xad.csv
...
xai.csv
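
For illustration, here is a rough Python equivalent of that merge-and-split step. The real getdomains.sh uses shell tooling; the chunk size and the header-row handling below are assumptions.

```python
# Illustrative Python equivalent of getdomains.sh's merge-and-split step.
import csv
import urllib.request

SOURCES = [
    "https://github.com/GSA/data/raw/master/dotgov-domains/current-federal.csv",
    "https://github.com/GSA/data/raw/master/dotgov-websites/pulse-subdomains-snapshot-06-08-2020-https.csv",
]


def fetch_domains(url):
    with urllib.request.urlopen(url) as resp:
        rows = csv.reader(resp.read().decode("utf-8-sig").splitlines())
        next(rows)  # assumes a header row
        return [row[0].lower() for row in rows if row]


domains = sorted(set(d for url in SOURCES for d in fetch_domains(url)))

# Write alphabetical chunks named like split(1) output: xaa.csv, xab.csv, ...
CHUNK = 5000  # assumed; whatever value yields xaa.csv through xai.csv
for i in range(0, len(domains), CHUNK):
    suffix = "a" + chr(ord("a") + i // CHUNK)  # aa, ab, ac, ...
    with open(f"x{suffix}.csv", "w") as f:
        f.write("\n".join(domains[i:i + CHUNK]) + "\n")
```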

Each of these chunk files is then run as a separate task on Cloud.gov when CircleCI kicks off the nightly scan job. The issue is that if there isn't enough memory quota, we can't spin up a new task, so the domains contained in xae.csv through xai.csv never actually run.

The domain ojjdp.ojp.gov is in file xag.csv and the domain healthreach.wip.nlm.nih.gov is in xaf.csv.
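
One way to make this failure mode loud instead of silent is to check the exit status of each task launch. The sketch below is hypothetical: the app name, scan command, and memory size are made up, and `cf run-task` is invoked via subprocess rather than whatever the actual CircleCI job does.

```python
# Hypothetical sketch: launch one Cloud.gov task per chunk file and fail
# loudly if a task cannot be created (e.g., memory quota exhausted).
# App name, command, and memory limit are made up for illustration.
import glob
import subprocess

for chunk in sorted(glob.glob("x??.csv")):
    result = subprocess.run(
        ["cf", "run-task", "scanner", f"./scan.sh {chunk}", "-m", "2048M"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        # Without a check like this, chunks after the quota is exhausted
        # are skipped silently, which is the zero-scans symptom above.
        raise RuntimeError(f"could not start task for {chunk}: {result.stderr}")
```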

Possible Fixes

For now we want to go with option 1 (the full list of options is in the original comment, linked below).

Originally posted by @alexbielen in https://github.com/18F/Spotlight/issues/628#issuecomment-675715728

alexbielen commented 4 years ago

This is looking good in the logs. Going to close!