Investigate no results for required links scans

GSA / site-scanning

The central repository for the Site Scanning program

https://digital.gov/site-scanning

Investigate no results for required links scans #942

Open gbinal opened 2 months ago

gbinal commented 2 months ago

Note current in-use snippets here...

E.g. on:

accessibility - va.gov
accessibility and privacy policy - drivethru.gsa.gov

akuny commented 2 months ago

The required links scan has been rebuilt to use a Puppeteer Page instance and DOM queries, instead of running a regex search over the response body as raw text.

Testing indicates that the change works for the two cases above. It's deployed in this PR and data will be available if we create a new snapshot after tonight's scans run: https://github.com/GSA/site-scanning-engine/pull/325
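
To illustrate the DOM-based approach described above, here is a minimal sketch, not the engine's actual code. It assumes the anchors have already been collected from the rendered page (with Puppeteer, something like `page.$$eval("a", ...)` could do that). The `Anchor` shape, the function names, and both keyword lists are hypothetical; the keywords are inferred only from the field values quoted in this thread.

```typescript
// Hypothetical shape for one <a> element read from the rendered DOM.
interface Anchor {
  href: string;
  text: string;
}

// Assumed keyword lists, inferred from values quoted in this thread.
const URL_KEYWORDS = ["about", "fear", "foia", "privacy", "usa.gov"];
const TEXT_KEYWORDS = [
  "accessibility",
  "budget and performance",
  "no fear act",
  "foia",
  "inspector general",
  "privacy policy",
  "usa.gov",
];

// Return the keywords that occur in at least one value, comma-joined,
// in the same order as the keyword list.
function matchKeywords(values: string[], keywords: string[]): string {
  const haystack = values.map((v) => v.toLowerCase());
  return keywords.filter((k) => haystack.some((v) => v.includes(k))).join(",");
}

// Build the two required-links fields from the collected anchors.
function scanRequiredLinks(anchors: Anchor[]): {
  requiredLinksUrl: string;
  requiredLinksText: string;
} {
  return {
    requiredLinksUrl: matchKeywords(anchors.map((a) => a.href), URL_KEYWORDS),
    requiredLinksText: matchKeywords(anchors.map((a) => a.text), TEXT_KEYWORDS),
  };
}
```

Because the match runs on anchor elements rather than on the whole response body, text that merely mentions "privacy" outside a link would not count, which is presumably the point of moving away from the regex search.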

gbinal commented 2 months ago

This is better, but it is still occurring in some cases.

Examples:

several - developers.login.gov
accessibility - deeoic.dol.gov
about - calm.gsa.gov
several - bea.gov
...
accessibility - va.gov (though I get why)
about, usa.gov - studentaid.gov (though I get why)
accessibility, privacy policy - conexus.gsa.gov (though I get why)
accessibility, Espanol (and more) - faq.ssa.gov (though I get why)
about - https://ecc.nist.gov/ (though I get why)
several - answers.hud.gov (though I get why)
Privacy Policy - https://appointment.treasury.gov (though I get why)

akuny commented 1 month ago

bea.gov: most recent snapshot has no data for the required links fields, but local scans turn up the following:

"requiredLinksScan":
    {
        "requiredLinksUrl": "about,fear,foia,privacy,usa.gov",
        "requiredLinksText": "budget and performance,no fear act,foia,usa.gov"
    }

developers.login.gov: includes the data below in local scans and the most recent snapshot

required_links_url: fear,foia,usa.gov
required_links_text: accessibility,no fear act,foia,inspector general,privacy policy,usa.gov

deeoic.dol.gov: required_links_text field includes "accessibility" in local scans and the most recent snapshot

calm.gsa.gov: try loading this page with Chrome DevTools open. The "About" link isn't there in the 200 response body. It may be added by client-side scripting after Puppeteer has evaluated the page (see below).

Screenshot 2024-05-10 at 9 11 17 AM
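
The DevTools check described above can be approximated in code. The sketch below uses a hypothetical helper, `linksAddedByScript`, which is not part of the scanning engine. Given the raw response body and the link texts found in the rendered DOM, it lists the links that only appear after scripts have run. It uses a crude substring comparison, so treat it as a rough diagnostic rather than a reliable detector.

```typescript
// Hypothetical helper: report rendered link texts that do not occur
// anywhere in the raw response body (case-insensitive substring check).
function linksAddedByScript(rawBody: string, renderedLinkTexts: string[]): string[] {
  const body = rawBody.toLowerCase();
  return renderedLinkTexts
    .map((t) => t.trim())
    .filter((t) => t.length > 0 && !body.includes(t.toLowerCase()));
}
```

A link reported here is one that a scan evaluating the page too early could miss.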

akuny commented 1 month ago

The most recent prod scans for bea.gov get an HTTP 403 Forbidden response, which is likely why the required links aren't showing up in the scan data even though they appear when the site is loaded manually in a browser.
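
One way to make this case easier to spot in the data is to separate "the page was scanned and no links were found" from "the scanner never saw the real page". The sketch below is only an illustration under that assumption. The `ScanOutcome` type and `interpretScan` function are hypothetical and do not describe how the engine currently records results. With Puppeteer, the status code would typically come from the response returned by `page.goto()`.

```typescript
interface RequiredLinksFields {
  requiredLinksUrl: string;
  requiredLinksText: string;
}

// Hypothetical result type: either real scan data or a reason it is missing.
type ScanOutcome =
  | { kind: "scanned"; fields: RequiredLinksFields }
  | { kind: "not_scanned"; reason: string };

// Treat any non-2xx response as "not scanned", so empty fields from a
// blocked request are not mistaken for a page without the required links.
function interpretScan(httpStatus: number, fields: RequiredLinksFields): ScanOutcome {
  if (httpStatus < 200 || httpStatus >= 300) {
    return { kind: "not_scanned", reason: `HTTP ${httpStatus}` };
  }
  return { kind: "scanned", fields };
}
```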

gbinal commented 1 month ago

fair enough - thank you!!

You've researched every example I have found so far. I need to update our documentation to reflect these lessons learned, and I'll also see if I can find more examples to test. As best I can tell, in every case so far, the problem has not been on our end.