18F / site-scanning

The code base for the first Site Scanning engine
https://digital.gov/site-scanning

Some API endpoints are breaking #628

Closed gbinal closed 4 years ago

gbinal commented 4 years ago

https://site-scanning.app.cloud.gov/api/v1/scans/dap/18f.gov/ breaks (server error - 500), though https://site-scanning.app.cloud.gov/api/v1/scans/dap/ works. Not sure what's up or how many other endpoints are broken, but we should look into it.

alexbielen commented 4 years ago

@gbinal Quick question: What is the expected behavior for redirects?

18f.gov redirects in the following way to 18f.gsa.gov.

$ wget 18f.gov
URL transformed to HTTPS due to an HSTS policy
--2020-08-06 16:48:08--  https://18f.gov/
Resolving 18f.gov (18f.gov)... 13.225.214.23, 13.225.214.95, 13.225.214.97, ...
Connecting to 18f.gov (18f.gov)|13.225.214.23|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://18f.gsa.gov/ [following]
--2020-08-06 16:48:08--  https://18f.gsa.gov/
Resolving 18f.gsa.gov (18f.gsa.gov)... 99.84.114.65, 99.84.114.103, 99.84.114.89, ...
Connecting to 18f.gsa.gov (18f.gsa.gov)|99.84.114.65|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

18f.gsa.gov appears to work as expected.

gbinal commented 4 years ago

sooooo, I don't think that's the problem here, at least. Can you get any website-specific API call to work for the dap scan? e.g. https://site-scanning.app.cloud.gov/api/v1/scans/dap/18f.gsa.gov/ and https://site-scanning.app.cloud.gov/api/v1/scans/dap/fbi.gov/ don't work for me, either.

alexbielen commented 4 years ago

Gotcha, and yeah it looks like https://site-scanning.app.cloud.gov/api/v1/scans/dap/18f.gov/ (the redirect site) works for me now.

All of the URLs below give me a 200, but https://site-scanning.app.cloud.gov/api/v1/scans/dap/18f.gov/ did not work four days ago when I looked at it. So it looks like an intermittent issue, unfortunately, which is harder to debug. I'll see if we have any kind of logs set up and take a look there. If we don't, I'll have to set those up in order to take a closer look at what might be happening.

18f.gov

https://site-scanning.app.cloud.gov/api/v1/scans/dap/18f.gov/ gives me:

{
    "domain": "18f.gov",
    "scantype": "dap",
    "domaintype": "Federal Agency - Executive",
    "organization": "18F",
    "agency": "General Services Administration",
    "data": {
        "dap_detected": true,
        "dap_parameters": "agency=GSA&subagency=18F",
        "domain": "18f.gov",
        "status_code": 200
    },
    "scan_data_url": "https://s3-us-gov-west-1.amazonaws.com/cg-852a6196-0fdb-4a01-a16f-6c24379722cb/dap/18f.gov.json",
    "lastmodified": "2020-08-09T22:57:08Z"
}

18f.gsa.gov

https://site-scanning.app.cloud.gov/api/v1/scans/dap/18f.gsa.gov/ gives me:

{
    "domain": "18f.gsa.gov",
    "scantype": "dap",
    "domaintype": "",
    "organization": "",
    "agency": "General Services Administration",
    "data": {
        "dap_detected": true,
        "dap_parameters": "agency=GSA&subagency=18F",
        "domain": "18f.gsa.gov",
        "status_code": 200
    },
    "scan_data_url": "https://s3-us-gov-west-1.amazonaws.com/cg-852a6196-0fdb-4a01-a16f-6c24379722cb/dap/18f.gsa.gov.json",
    "lastmodified": "2020-08-09T12:46:40Z"
}

FBI.gov

https://site-scanning.app.cloud.gov/api/v1/scans/dap/fbi.gov/ gives me:

{
    "domain": "fbi.gov",
    "scantype": "dap",
    "domaintype": "Federal Agency - Executive",
    "organization": "FBI",
    "agency": "Department of Justice",
    "data": {
        "dap_detected": false,
        "dap_parameters": "",
        "domain": "fbi.gov",
        "status_code": 200
    },
    "scan_data_url": "https://s3-us-gov-west-1.amazonaws.com/cg-852a6196-0fdb-4a01-a16f-6c24379722cb/dap/fbi.gov.json",
    "lastmodified": "2020-08-09T22:57:08Z"
}

alexbielen commented 4 years ago

@gbinal can you confirm that I have access to this https://site-scanning.app.cloud.gov/?

In cloud.gov I only have access to this endpoint https://scanner-ui-chipper-tiger-do.app.cloud.gov/ so I can only see logs for this instance.

alexbielen commented 4 years ago

@gbinal blocked on this until I have access to the https://site-scanning.app.cloud.gov/ environment

gbinal commented 4 years ago

roger that - sorry. I'll poke the team we're waiting on.

alexbielen commented 4 years ago

Unblocked on this.

alexbielen commented 4 years ago

To close this one we will need to add application logs to investigate the cause of the issue. I captured that work in #656.

alexbielen commented 4 years ago

I added logging to the scans API.

There are two issues I've discovered so far: 1) There is more than one scan for a domain, scantype, and date combination. 2) There are 0 scans for a domain, scantype, and date combination.

>1 scans

When I call the 18f.gov URL it returns a 500 and I now have this convenient log message, which shows that there are two scans for 18F.gov.

18f.gov

16:43:45.526: [APP/PROC/WEB.0] 2020-08-18 20:43:45,525 scanner_ui.api.views ERROR    Scan length was 2. Expected 1 scan. {'query': {'bool': {'filter': [{'term': {'domain': '18f.gov'}}]}}, 'sort': ['domain']}

fbi.gov

16:43:45.526: [APP/PROC/WEB.0] 2020-08-18 20:43:45,525 scanner_ui.api.views ERROR    Scan length was 2. Expected 1 scan. {'query': {'bool': {'filter': [{'term': {'domain': '18f.gov'}}]}}, 'sort': ['domain']}

I haven't confirmed the root cause of the multiple scans yet but it's likely an issue with how we're dealing with dates in the API.
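To make the two failure modes concrete, here is a minimal sketch of how a lookup could handle them instead of returning a 500. This is not the code in scanner_ui.api.views: `select_scan` and `ScanNotFound` are hypothetical names, and keeping the most recent scan by `lastmodified` is an assumption about what a reasonable fallback would be, not a confirmed fix.

```python
from datetime import datetime


class ScanNotFound(Exception):
    """Hypothetical error for the zero-scan case."""


def select_scan(scans):
    """Pick one scan from a list of result dicts.

    Assumes each dict has a 'lastmodified' timestamp formatted like the
    API responses earlier in this thread (e.g. 2020-08-09T22:57:08Z).
    """
    if not scans:
        # Zero results: report it explicitly rather than failing later.
        raise ScanNotFound("Scan length was 0. Expected 1 scan.")
    # One or more results: keep the most recently modified scan.
    return max(
        scans,
        key=lambda scan: datetime.strptime(
            scan["lastmodified"], "%Y-%m-%dT%H:%M:%SZ"
        ),
    )
```

A view could catch `ScanNotFound` and answer with a 404, which would at least separate "no data for this domain" from a real server error.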

0 scans

ojjdp.ojp.gov

16:47:39.578: [APP/PROC/WEB.0] 2020-08-18 20:47:39,578 scanner_ui.api.views ERROR    Scan length was 0. Expected 1 scan. {'query': {'bool': {'filter': [{'term': {'domain': 'ojjdp.ojp.gov'}}]}}, 'sort': ['domain']}

healthreach.wip.nlm.nih.gov

18:31:35.878: [APP/PROC/WEB.0] 2020-08-18 22:31:35,877 scanner_ui.api.views ERROR    Scan length was 0. Expected 1 scan. {'query': {'bool': {'filter': [{'term': {'domain': 'healthreach.wip.nlm.nih.gov'}}]}}, 'sort': ['domain']}
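Since these log lines follow a fixed pattern, they can be tallied by domain when there are many of them. The sketch below assumes the message format stays exactly as shown above; `parse_scan_error` is a hypothetical helper, not part of the project.

```python
import ast
import re

# Matches the message emitted by the scans API logging shown above and
# captures the scan count plus the trailing query dict.
LOG_PATTERN = re.compile(
    r"Scan length was (\d+)\. Expected 1 scan\. (\{.*\})\s*$"
)


def parse_scan_error(line):
    """Return (domain, scan_count) for a matching log line, else None."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    count = int(match.group(1))
    # The query is logged as a Python dict repr, so literal_eval can read it.
    query = ast.literal_eval(match.group(2))
    domain = query["query"]["bool"]["filter"][0]["term"]["domain"]
    return domain, count
```

Feeding it the ojjdp.ojp.gov line above yields the domain together with a count of 0.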

I have discovered the cause of the zero scans issue. This happens when the scan for a domain, scantype, and date combination never runs.

This requires a bit of an explanation: the getdomains.sh script gets all of the domains that we care about from two sources (https://github.com/GSA/data/raw/master/dotgov-domains/current-federal.csv and https://github.com/GSA/data/raw/master/dotgov-websites/pulse-subdomains-snapshot-06-08-2020-https.csv) and merges them together. It then splits the merged list into separate CSV files, ordered alphabetically by domain.

xaa.csv
xab.csv
xac.csv
xad.csv
...
xai.csv

Each of these files is then spun up as a separate task on cloud.gov when CircleCI runs the nightly scan job. The issue is that if there isn't enough memory quota, we can't spin up a new task, so the domains contained in xae.csv through xai.csv never actually run.

The domain ojjdp.ojp.gov is in file xag.csv and the domain healthreach.wip.nlm.nih.gov is in xaf.csv.
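A rough model of the behavior described above, to show why domains late in the alphabet are the ones that go missing. This is a sketch and not a port of getdomains.sh: the function names, `chunk_size`, and `max_tasks` (standing in for however many tasks the memory quota allows) are all assumptions for illustration.

```python
from itertools import product
from string import ascii_lowercase


def split_suffixes():
    """Yield names in the xaa, xab, xac, ... style seen above."""
    for first, second in product(ascii_lowercase, repeat=2):
        yield f"x{first}{second}"


def chunk_domains(domains, chunk_size):
    """Merge, de-duplicate, and sort domains, then split them into files.

    Supports at most 26 * 26 chunks, which is plenty for a sketch.
    """
    ordered = sorted(set(domains))
    names = split_suffixes()
    chunks = {}
    for start in range(0, len(ordered), chunk_size):
        chunks[f"{next(names)}.csv"] = ordered[start:start + chunk_size]
    return chunks


def skipped_chunks(chunks, max_tasks):
    """Return the chunk files that never run if only max_tasks tasks start."""
    return sorted(chunks)[max_tasks:]
```

With nine chunks and room for only four tasks, `skipped_chunks` returns xae.csv through xai.csv, which matches the files that never ran here.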

Possible Fixes

For now we want to go with option 1.

gbinal commented 4 years ago

This is great detecting. Thank you!!!

For a next step, let's turn off the lighthouse and third-party scans.

alexbielen commented 4 years ago

Closing this. I created a more specific issue in #672.