18F / site-scanning

The code base for the first Site Scanning engine
https://digital.gov/site-scanning

Stand up the pilot USWDS code and run a trial scan #26

Closed gbinal closed 5 years ago

gbinal commented 5 years ago

As a project team member, I'm interested in researching the code that we already have access to for USWDS scanning in order to evaluate its potential and flaws.

There's one (possibly two) pilot code projects to see:

Top-level domains that are using USWDS:

CitizenScience.gov ClinicalTrials.gov code.mil cloud.gov cbp.gov dds.mil dnfsb.gov commerce.gov dhs.gov dietaryguidelines.gov dotgov.gov epa.gov fca.gov fcsic.gov fec.gov ffb.gov fpc.gov fedramp.gov foia.gov gsa.gov healthcare.gov imls.gov iawg.gov irs.gov itdashboard.gov login.gov manufacturing.gov medicaid.gov move.mil mymedicare.gov nih.gov floodsmart.gov opioids.gov performance.gov plainlanguage.gov pclob.gov search.gov sba.gov stopbullying.gov supremecourt.gov tsa.gov usagm.gov usaid.gov usda.gov dol.gov treasury.gov va.gov usds.gov flra.gov uscis.gov uscourts.gov usich.gov unlocktalent.gov usa.gov usaid.gov usajobs.gov usaspending.gov usgs.gov vote.gov whitehouse.gov worker.gov

vickimcfadden commented 5 years ago

From Eric Mill - The USWDS scanner just checks for the existence of a class that starts with “usa-”.

From Dan Williams (USWDS Product Owner) - Can you tell me a little bit about some of the other scanners that have been built in the past (e.g. USWDS, accessibility, third-party services)? Eric built the USWDS scanner and showed it to the USWDS team, and said that the results weren’t great (lots of false positives) and they probably shouldn’t use it. The USWDS team was too busy with production to spend time on that, and Eric didn’t have more time for it, so that fell through the cracks. They ran it once, to check the output, and never used it again.
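For reference, the kind of check Eric describes can be sketched in a few lines of shell. This is only a guess at the idea, not the domain-scan code: the function name `has_usa_class` is made up here, and it works line by line, so a class attribute that wraps across lines would be missed.

```shell
#!/bin/sh
# has_usa_class (hypothetical name): read HTML on stdin and succeed if some
# class attribute contains a token that starts with "usa-".
# Assumption: the class attribute sits on a single line and is quoted.
has_usa_class() {
    grep -Eq "class=[\"']([^\"']*[[:space:]])?usa-"
}

# Example use (not run here):
#   curl -s https://www.usa.gov/ | has_usa_class && echo "possible USWDS site"
```

A check this loose fires on any page that borrows a single `usa-` class, which fits the false positives mentioned above.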

timothy-spencer commented 5 years ago

@vickimcfadden, do you have a list of .gov sites which are not USWDS compliant as well? I'd love to have some good negatives to test against too.

vickimcfadden commented 5 years ago

I don't think any of these are using USWDS (at least, if they are, we don't know about it):

state.gov energy.gov transportation.gov doi.gov ssa.gov nsf.gov nasa.gov fema.gov eeoc.gov nps.gov ftc.gov dea.gov cpsc.gov gao.gov osha.gov fdic.gov faa.gov secretservice.gov fws.gov census.gov federalreserve.gov fcc.gov fbi.gov cdc.gov fda.gov

timothy-spencer commented 5 years ago

OK. I have created a first stab at a USWDS scanner. It currently lives here: https://scanner-ui-exhausted-swan.app.cloud.gov/api/v1/scans/uswds2/

It scrapes the top-level page of each domain and counts particular elements that may indicate that USWDS is in use, then adds the counts up into a final score. This is roughly the same approach as the original USWDS scanner experiment in the domain-scan repo, with a few more elements that I came up with after looking at some of the sites. You can get all the scores with a script like this, which takes about 8 minutes to run:

#!/bin/sh
#
# This script will output a JSON document that has all of the USWDS scores for
# all of the domains that the scanner knows about.
#

echo '['
# Fetch the list of scans, then fetch each scan and keep only its score and domain.
curl -s https://scanner-ui-exhausted-swan.app.cloud.gov/api/v1/scans/uswds2/ | jq -r '.[] | .scan_data_url' | while read -r line ; do
    curl -s "$line" | jq -c '{total_score: .total_score, domain: .domain}'
# Add a comma after every object except the last so the array is valid JSON.
done | sed '$!s/$/,/'
echo ']'
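To make the scoring idea concrete, here is a minimal sketch of a count-and-sum scorer. It is not the scanner's code: the function names, the four signals, and the weights are all assumptions chosen for illustration (the font and banner signals are ones that come up later in this thread).

```shell
#!/bin/sh
# count_matches PATTERN FILE: count occurrences (not lines) of an extended
# regex, case-insensitively. Uses grep -o, available in GNU and BSD grep.
count_matches() {
    grep -Eio -- "$1" "$2" | wc -l | tr -d ' '
}

# uswds_score (hypothetical): read HTML on stdin and print a total score.
# The signals and weights below are illustrative guesses only.
uswds_score() {
    tmp=$(mktemp) || return 1
    cat > "$tmp"
    usa_classes=$(count_matches "[\"'[:space:]]usa-[a-z]" "$tmp")
    uswds_refs=$(count_matches 'uswds' "$tmp")
    font_refs=$(count_matches 'source sans pro' "$tmp")
    banner=$(count_matches 'official website of the united states government' "$tmp")
    rm -f "$tmp"
    echo $(( usa_classes + 5 * uswds_refs + 2 * font_refs + 20 * banner ))
}

# Example use (not run here):
#   curl -s https://www.gsa.gov/ | uswds_score
```

Keeping each signal as a separate count makes it easy to reweight or drop one if it turns out to produce false positives.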

Evaluated against the list of good domains, it seems to have a false negative rate of roughly 50% (33 zero scores against 61 listed domains, about 54%):

laptop:scanners$ fgrep -f /tmp/gooddomains /tmp/scores.json | grep score..0, | wc -l
      33
laptop:scanners$ wc -l /tmp/gooddomains 
      61 /tmp/gooddomains
laptop:scanners$ 

Evaluated against the list of bad domains, it looks good, with a 0% false positive rate:

laptop:scanners$ fgrep -f /tmp/baddomains /tmp/scores.json | grep -v score..0, | wc -l
       0
laptop:scanners$ 
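One caveat with the `fgrep -f` counts above: `fgrep` matches fixed strings anywhere in a line, so `usa.gov` also matches the score line for a domain such as `gobiernousa.gov`. A sketch of the same tally with exact domain matching follows. The function name `zero_rate` is made up, and it assumes one `{"total_score":N,"domain":"..."}` object per line (the `jq -c` output above) with integer scores.

```shell
#!/bin/sh
# zero_rate DOMAINS SCORES (hypothetical helper)
#   DOMAINS: one domain per line
#   SCORES:  one compact JSON object per line, as produced by jq -c above
# Prints "ZERO SEEN PERCENT": how many listed domains scored 0, how many
# listed domains appear in SCORES at all, and the rounded percentage.
zero_rate() {
    awk '
        NR == FNR { want[tolower($1)] = 1; next }       # first file: domain list
        match($0, /"domain":"[^"]*"/) {
            # Strip the 10-character "domain":" prefix and the closing quote.
            d = tolower(substr($0, RSTART + 10, RLENGTH - 11))
            if (!(d in want)) next                      # exact match only
            seen++
            if ($0 ~ /"total_score":0[,}]/) zero++
        }
        END { printf "%d %d %.0f\n", zero, seen, seen ? 100 * zero / seen : 0 }
    ' "$1" "$2"
}

# Example use (not run here):
#   zero_rate /tmp/gooddomains /tmp/scores.json   # ZERO/SEEN = false negatives
#   zero_rate /tmp/baddomains  /tmp/scores.json   # SEEN - ZERO = false positives
```

Reporting SEEN separately also shows how many listed domains the scanner never scored, which a plain `wc -l` on the domain list hides.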

More research will be needed to make the signal stronger, but this completes this issue.

vickimcfadden commented 5 years ago

I found that these websites scored high (over 50) and are not on the list of USWDS sites that the USWDS team tracks: code.gov consumeraction.gov floodsmart.gov foodsafety.gov forestsandrangelands.gov gobiernousa.gov nel.gov osac.gov trumanlibrary.gov challenge.gov / challenges.gov forms.gov everify.gov nixonlibrary.gov presidentialserviceawards.gov businessusa.gov digitalgov.gov fmcs.gov hhsoig.gov aoc.gov doleta.gov cio.gov sigtarp.gov insurekidsnow.gov employer.gov dea.gov kids.gov doleta.gov
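A list like this can be pulled from the scores file with a short filter. The sketch below is one possible way to do it, not how the list above was produced. The function name `candidates` is hypothetical, and it assumes the same one-object-per-line scores format with integer scores.

```shell
#!/bin/sh
# candidates MIN KNOWN SCORES (hypothetical helper)
#   MIN:    score threshold, e.g. 50
#   KNOWN:  file of already-tracked domains, one per line
#   SCORES: one compact JSON object per line ({"total_score":N,"domain":"..."})
# Prints "SCORE<TAB>DOMAIN" for untracked domains scoring above MIN, highest first.
candidates() {
    awk -v min="$1" '
        NR == FNR { known[tolower($1)] = 1; next }      # first file: tracked domains
        match($0, /"domain":"[^"]*"/) {
            d = tolower(substr($0, RSTART + 10, RLENGTH - 11))
            if (d in known) next
            if (match($0, /"total_score":[0-9]+/)) {
                # Skip the 14-character "total_score": prefix to get the number.
                s = substr($0, RSTART + 14, RLENGTH - 14) + 0
                if (s > min) print s "\t" d
            }
        }
    ' "$2" "$3" | sort -rn
}

# Example use (not run here):
#   candidates 50 /tmp/gooddomains /tmp/scores.json
```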

vickimcfadden commented 5 years ago

USWDS feedback on scans ☝️ https://docs.google.com/spreadsheets/d/11rbvSc2JKfRw1B75xHkNWPFCCFhjY8gLWu2WUv3NHrc/edit#gid=0

thisisdano commented 5 years ago

From that list, I'd consider the following to be USWDS sites:

floodsmart.gov employer.gov foodsafety.gov forestsandrangelands.gov trumanlibrary.gov challenge.gov presidentialserviceawards.gov doleta.gov insurekidsnow.gov sigtarp.gov cio.gov nixonlibrary.gov code.gov

There are a few redirects:

businessusa.gov → usa.gov consumeraction.gov → usa.gov forms.gov → usa.gov gobiernousa.gov → usa.gov kids.gov → usa.gov nel.gov → nesr.usda.gov challenges.gov → challenge.gov digitalgov.gov → digital.gov hhsoig.gov → oig.hhs.gov
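Redirects like these could be detected before scoring, so that a domain is not credited with the score of the site it forwards to. A rough sketch, assuming `curl` is available; the function names `host_of` and `final_host` are made up for illustration, and only HTTP-level redirects are followed (not JavaScript or meta-refresh ones).

```shell
#!/bin/sh
# host_of URL (hypothetical): print the lowercased host part of a URL.
host_of() {
    printf '%s\n' "$1" |
        sed -e 's|^[a-zA-Z][a-zA-Z0-9+.-]*://||' -e 's|[/:?#].*$||' |
        tr 'A-Z' 'a-z'
}

# final_host DOMAIN (hypothetical): follow HTTP redirects starting from
# http://DOMAIN and print the host that finally answers.
final_host() {
    url=$(curl -s -L -o /dev/null --max-time 20 -w '%{url_effective}' "http://$1") || return 1
    host_of "$url"
}

# Example use (not run here):
#   final_host kids.gov    # compare the result with the domain you asked for
```

The final host may carry a `www.` prefix, so a comparison against the original domain should allow for that.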

And a few sites that have some surface characteristics but may not warrant inclusion. Typically, these show use of Source Sans Pro, and possibly the gov banner.

everify.gov dea.gov osac.gov fmcs.gov aoc.gov

vickimcfadden commented 5 years ago

This is fantastic feedback @thisisdano! Hopefully you found a few new sites to add to your list and we'll work to make some improvements for the next version we show you.

gbinal commented 5 years ago

google doc with ^^^ notes, for convenience

vickimcfadden commented 5 years ago

@thisisdano when we searched for subdomains, we found 24 more sites that may be USWDS implementations (scores over 100)

(also captured in your google sheet - https://docs.google.com/spreadsheets/d/11rbvSc2JKfRw1B75xHkNWPFCCFhjY8gLWu2WUv3NHrc/edit#gid=0)

agile-bpa.18f.gov beta.trade.gov eng-hiring.18f.gov emerging.digital.gov maps.certify.sba.gov magazine.medlineplus.gov methods.18f.gov msigateway.larc.nasa.gov nyw.cap.gov openopps.digitalgov.gov openopps.usajobs.gov product-guide.18f.gov pra.digital.gov public-sans.digital.gov release.nass.usda.gov resources.data.gov salesforce.trade.gov summit.digitalgov.gov tailored.fedramp.gov strategy.data.gov tech.gsa.gov vec.gsa.gov wdolhome.sam.gov ww3.fca.gov