GSA / site-scanning

The central repository for the Site Scanning program
https://digital.gov/site-scanning

Note long tail of potential scan issues to investigate #202

Open gbinal opened 2 years ago

gbinal commented 2 years ago

These issues really aren't pressing, so they can wait for another time...

akuny commented 2 years ago

Here are some notes:

Some URLs have unknown MIMEtypes. The "unknown" value means that there was not a content-type header included with the HTTP response.
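
To illustrate the note above, here is a minimal sketch of how an "unknown" value could come out of a response that has no content-type header. The helper name `mime_type_from_headers` and the plain-dict header representation are assumptions for illustration, not the scanner's actual code.

```python
def mime_type_from_headers(headers: dict[str, str]) -> str:
    """Return the bare MIME type from response headers, or "unknown".

    Hypothetical helper: the real scanner may derive this value differently.
    """
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    content_type = normalized.get("content-type")
    if not content_type:
        # No content-type header came back with the response.
        return "unknown"
    # Drop parameters such as "; charset=utf-8" and normalize the case.
    return content_type.split(";", 1)[0].strip().lower() or "unknown"
```

With this sketch, a JSON response maps to `application/json`, while a response with no content-type header maps to `unknown`.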

Some URLs have multiple, conflicting MIMEtypes. There are no duplicate MIMEtypes in the recent scan.

Some URLs have a final_url_live status of False, but still have MIMEtypes other than HTML. These are responses with JSON, XML, or plain text bodies, or with no content-type header (see above).

Some URLs have a 2xx or 3xx status code but still say final_url_live = false. Same with robots.txt and sitemap.xml. There are 23 URLs that have a final_url_status_code value of 202 and a final_url_live value of FALSE. Currently, the final_url_live value is set to TRUE only if the website's response code is exactly 200.
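
The rule described above can be sketched as follows. Both function names are hypothetical. `is_live_exact_200` mirrors the behavior as described in this comment, and `is_live_any_2xx` is only one possible alternative, not a decided change.

```python
def is_live_exact_200(status_code: int) -> bool:
    """Current rule as described above: live only when the code is exactly 200."""
    return status_code == 200


def is_live_any_2xx(status_code: int) -> bool:
    """Hypothetical alternative: treat every 2xx response as live."""
    return 200 <= status_code <= 299


# A final_url_status_code of 202 is the case noted above.
for code in (200, 202, 404):
    print(code, is_live_exact_200(code), is_live_any_2xx(code))
```

Under the exact-match rule a 202 response is reported as not live, which would account for the 23 URLs noted above.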

Some URLs have dap_detected=false, but values in dap_parameters_final_url_agency. There are 39 URLs that do not make outbound requests to dap.digitalgov.gov/Universal-Federated-Analytics-Min.js, but which do reference tag UA-33523145-1 in particular (https://www.usda.gov/digital-strategy/analytics/plays).

There are 7037 URLs that fail the 404 test. I haven’t been able to find an example where that was wrong, so it really might be a sign of just how many sites do this, but wow. For some of these cases, the bogus URL requested during the 404 test is automatically redirected to a valid final URL by the server.
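
For anyone picking this up later, here is a rough sketch of how the redirect case above could be told apart from other 404-test failures. It assumes the test requests a random path on the site and checks the final response. The function names and result labels are hypothetical, and the real scan may build its bogus URL differently.

```python
import uuid
from urllib.parse import urljoin


def bogus_url(base_url: str) -> str:
    """Build a URL on the same host that almost certainly does not exist."""
    # A random hex path stands in for whatever the real 404 test requests.
    return urljoin(base_url, f"/{uuid.uuid4().hex}")


def classify_404_probe(requested_url: str, final_url: str, final_status_code: int) -> str:
    """Label the outcome of requesting a bogus URL (labels are illustrative)."""
    if final_status_code == 404:
        return "pass"
    if final_url != requested_url:
        # The server redirected the bogus URL somewhere else, e.g. the home page.
        return "fail: redirected"
    return "fail: no 404 returned"
```

For example, a bogus URL that the server redirects to the home page with a 200 would be labeled `fail: redirected`, which makes those cases easy to count separately.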