8435 omb_idea sites in the scan - same number as in the index (why fewer than in the source file?)
First off, there are only 9557 unique values in the source file, which leaves a delta of 1122 (9557 - 8435). I actually see 1152 as the count missing from the index, but the extra 30 are a result of duplicates.
880 end in .mil, so there's a filtering problem somewhere that we need to investigate.
All but roughly 75 of the rest end in .com, .org, etc.
Looking through those, it's apparent that they are being properly filtered out.
The .mil issue needs addressing, but the rest seems legit.
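The delta and the TLD breakdown above can be rechecked with a few lines of Python. This is only a minimal sketch: it assumes the source file and the index are each available as a plain list of hostnames, and `missing_by_tld`, `source_hosts`, and `index_hosts` are hypothetical names, not anything from the scanner itself.

```python
from collections import Counter

def missing_by_tld(source_hosts, index_hosts):
    """Return (unique source count, missing count, Counter of TLDs among the missing)."""
    # Normalise case and whitespace so near-identical rows count as duplicates.
    source = {h.strip().lower() for h in source_hosts if h.strip()}
    index = {h.strip().lower() for h in index_hosts if h.strip()}
    missing = source - index
    # Group the missing hosts by their last label (.mil, .com, .org, ...).
    tlds = Counter(h.rsplit(".", 1)[-1] for h in missing)
    return len(source), len(missing), tlds

# Toy data only; the real lists would come from the source file and the index.
source_hosts = ["a.gov", "A.gov", "b.mil", "c.com", "d.gov"]
index_hosts = ["a.gov", "d.gov"]
unique, missing, tlds = missing_by_tld(source_hosts, index_hosts)
print(unique, missing, dict(tlds))
```

Deduplicating before taking the difference is what separates the 1122 figure from the 1152 raw count.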
Failed scans (why for each?)
primary - 979
accessibility - 1306
DNS - 0
Not Found - 595 (142 despite the site being live, so why would it fail differently?)
performance - 672
robots.txt - 484
security - 720 (investigate in particular)
sitemap.xml - 598
1190 aren't live (why?)
7072 don't report a CMS (why?) - [note - recent improvement now makes it 6903]
Requests for agencies to improve:
44 have odd or incorrect media types
5 are XML or JSON files (likely APIs) and perhaps shouldn't be included in the website list
1905 do not return 404 errors correctly (site=live, target_url_404_test=false)
For the above analysis, putting this in a different issue.
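For context on the 404 item, one way such a check could work is to request a path that almost certainly doesn't exist and see whether the server answers with a 404. The sketch below is an assumption about the general technique, not the scanner's actual implementation; `http_status` and `returns_404_for_missing_page` are hypothetical helpers.

```python
import uuid
import urllib.error
import urllib.request

def http_status(url, timeout=10):
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the code is still what we want.
        return err.code

def returns_404_for_missing_page(base_url, get_status=http_status):
    """True if a random, nonexistent path under base_url yields a 404."""
    probe = f"{base_url.rstrip('/')}/{uuid.uuid4().hex}"
    return get_status(probe) == 404
```

A site that answers the probe with a 200 (a "soft 404") or a redirect to the home page would come back as `False`, which is how the `target_url_404_test=false` rows are read here.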
23 sites refuse a connection (are they blocking us?).
accpiv.treasury.gov
accsso.treasury.gov
aom.giss.nasa.gov
bqs.usgs.gov
d9.qa.jimmycarterlibrary.gov
d9.qa.obamalibrary.gov
data.giss.nasa.gov
gacp.giss.nasa.gov
gcss-dime.giss.nasa.gov
gipsyx.jpl.nasa.gov
glory.giss.nasa.gov
icp.giss.nasa.gov
leadpaint.sc.egov.usda.gov
probes.pw.usda.gov
pubs.giss.nasa.gov
rsupport1.fws.gov
sftp1.phmsa.dot.gov
sso.treasury.gov
support.ntsb.gov
thredds1.pfeg.noaa.gov
train.empowhr.gov
tsrauat.ofac.treas.gov
ursdapp.urs.eosdis.nasa.gov
11 have the connection reset. It appears that nearly all of these are not live sites.
cra.cdc.gov
eroc.ssologin1.fms.treas.gov
mobileguard.usmarshals.gov
npp.cdc.gov
npp.glb.cdc.gov
podassistonprem.cdc.gov
preprod.vta.va.gov
stopfakes.gov
survey.ole.justice.gov
vcgate.video.srs.gov
vcgate02.video.srs.gov
2 have invalid SSL certificates.
ifw7asm-orcldb.fws.gov
itcontacts.kcnsc.doe.gov
207 have a DNS resolution error. It appears that nearly all of these aren't live sites (at the exact target URL; in some cases, it's because www. is required or the like).
accesstocare.va.gov
aci.nichd.nih.gov
aff.gov
aidscapeuat.usaid.gov
ambismobile.niaid.nih.gov
ambismobileqa.niaid.nih.gov
ameslab.gov
amoc-css.cbp.dhs.gov
anl.gov
aphis.usda.gov
archive.usgs.gov
aspprox1.epa.gov
atlashep.anl.gov
aw.nrel.gov
bcgcfm5.ncifcrf.gov
benefits-tool-beta.usa.gov
benefits.vba.va.gov
beoc.ccs.esmo.nasa.gov
beta.onrr.gov
bids.state.gov
bitool.ed.gov
blog.ninds.nih.gov
bop.gov
bptwai.fms.treas.gov
broadbandsearch.cert.sc.egov.usda.gov
broadbandsearch.sc.egov.usda.gov
bscdev.nidcd.nih.gov
bts.gov
businessdefense.gov
caregiverfinanciallegal.va.gov
cbmp.nichd.nih.gov
cbrfc.noaa.gov
ccdor.research.va.gov
cdlisws.cdlis.dot.gov
cdscc.nasa.gov
cert.eauth.usda.gov
cf.gsfc.nasa.gov
cloudfront.sba.gov
clubs.larc.nasa.gov
cms8.fhwa.dot.gov
cnrfc.noaa.gov
communicationstrackingradar.jpl.nasa.gov
compliance-viewer.18f.gov
compservices.tva.gov
coop.vpn.cttso.gov
cops.fas.gsa.gov
correlogo.ncifcrf.gov
cpc.omao.noaa.gov
cpsearch.fas.gsa.gov
crfs.cr.usgs.gov
crtpfm1.ncifcrf.gov
crtpfm2.ncifcrf.gov
cryosparc.cancer.gov
csat.dhs.gov
ctp.lbl.gov
ctpat.cbp.dhs.gov
data.exim.gov
data.fra.dot.gov
dd.pppl.gov
docwebta.eas.commerce.gov
dotcms.fra.dot.gov
drupal-prod.ntp.niehs.nih.gov
e.arsnet.usda.gov
eauth.usda.gov
ecc-project.sandia.gov
edgarcompany.sec.gov
eform1.ferc.gov
ehrincentives.cms.gov
ellis.tva.gov
emsl.pnnl.gov
eshelp.opm.gov
faa.gov
fdicconnect.gov
feedback.usa.gov
filermanagement.edgarfiling.sec.gov
fleetd.gsa.gov
foiaonline.gov
foiarequest.epa.gov
fsgb.gov
ftajira.ad.dot.gov
gaponline.epa.gov
gdscc.nasa.gov
geo.arc.nasa.gov
gis.boemre.gov
gis.nlm.nih.gov
gisc-washington-cprk.ncep.noaa.gov
giss.nasa.gov
globaldossier.uspto.gov
grc.nasa.gov
helaacd.nih.gov
historydms.hq.nasa.gov
hlwpi-csprdmz.nhlbi.nih.gov
howard.nichd.nih.gov
iat.gov
icrc.nci.nih.gov
idn.earthdata.nasa.gov
idp.cancer.gov
idp.sujana09.identitysandbox.gov
idp.vivek.identitysandbox.gov
imagej.cit.nih.gov
imagingtherapy.nibib.nih.gov
inflammatory.nhlbi.nih.gov
intelligencecareers.gov
jccs.gov
jlevinlab.nichd.nih.gov
katana.sba.gov
landlook.usgs.gov
latinawomen.larc.nasa.gov
lforms-formbuilder.lhcaws.nlm.nih.gov
lgdfm5.ncifcrf.gov
lgrd.nichd.nih.gov
lhc-formbuilder.lhc.lhcaws.nlm.nih.gov
lippincottschwartzlab.nichd.nih.gov
lps.gov
mas.nasa.gov
mastercalendar.ncirc.gov
mdscc.nasa.gov
mecfs.ctss.nih.gov
minorityinternships.energy.gov
mishoe.nhlbi.nih.gov
mobilemi.ent.usda.gov
move.va.gov
mslabs.sefsc.noaa.gov
multihazards.sciencebase.gov
myapps-val.fda.gov
natweb-r53.usgs.gov
ncc-gtt-node9.epa.gov
ncrc.gov
ndc.sciencebase.gov
ned.usgs.gov
nepassist.epa.gov
nhales.ctss.nih.gov
niaid.nih.gov
nidcddev.nidcd.nih.gov
nidcdstg.nidcd.nih.gov
nidcdtest.nidcd.nih.gov
nomads.weather.gov
npdev.nidcd.nih.gov
npstg.nidcd.nih.gov
nro.gov
ns.cms.gov
ntc.blm.gov
nwrfc.noaa.gov
oauth.alcf.anl.gov
oigdr.hq.nasa.gov
olga.er.usgs.gov
onlineforms.edgarfiling.sec.gov
opensearch-ui.earthdata.nasa.gov
osac.gov
parkinsontrial.ninds.nih.gov
pave.hud.gov
pay.va.gov
pdev.grants.gov
pedmatch-int.nci.nih.gov
pedmatch.nci.nih.gov
phasespace-explorer.niaid.nih.gov
physics-prod-acsf.cancer.gov
portal.edgarfiling.sec.gov
portal.eos.gsa.gov
portal.nasa.gov
protectyourmove.gov
psp.fmcsa.dot.gov
public-repo.ci.history.state.gov
qabot.usgs.gov
rnajunction.ncifcrf.gov
rockyags.cr.usgs.gov
rpif.jpl.nasa.gov
sbageotask.larc.nasa.gov
sbrsfa.velo.pnnl.gov
sdms.cr.usgs.gov
seaway.dot.gov
sems.epa.gov
sgisnidillr.acl.gov
sierrafire.cr.usgs.gov
smm.nichd.nih.gov
solardecathlon.gov
sp.arsnet.usda.gov
spacestem.nasa.gov
sparc.usda.gov
sparq.doleta.gov
spdf1.sci.gsfc.nasa.gov
spsrch.cit.nih.gov
srs.gov
sso-east.csp.noaa.gov
sso-north.csp.noaa.gov
sso-west.csp.noaa.gov
stg-asprportal.hhs.gov
stg-asprwg.hhs.gov
stg-asprwgpublic.hhs.gov
stg-mysitendms.hhs.gov
stg-ndms.hhs.gov
tenure.nichd.nih.gov
textrous.irp.nia.nih.gov
train.hris.va.gov
ttb.gov
ugo.nichd.nih.gov
uis.doleta.gov
urban.wr.usgs.gov
visn2.va.gov
visn23.va.gov
volpe.dot.gov
vtsave.nlm.nih.gov
wa1.vpn.oig.treas.gov
www-sdss.fnal.gov
www-web-search-alx.uspto.gov
y4y.ed.gov
zfig.nichd.nih.gov
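To separate hosts that are truly gone from those that only resolve with a www. prefix, a quick pass like the following might help. It is a rough sketch using only the standard library; `dns_status` and its labels are hypothetical, and the `resolver` parameter exists just so the logic can be exercised without network access.

```python
import socket

def resolves(hostname, resolver=socket.getaddrinfo):
    """True if hostname resolves to at least one address."""
    try:
        resolver(hostname, 443)
        return True
    except socket.gaierror:
        return False

def dns_status(hostname, resolver=socket.getaddrinfo):
    """Classify a host as resolving as-is, only with www., or not at all."""
    if resolves(hostname, resolver):
        return "resolves"
    if not hostname.startswith("www.") and resolves("www." + hostname, resolver):
        return "resolves with www."
    return "does not resolve"
```

Running `dns_status` over the list above would show how many of the 207 fall into the "www. is required" case.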
107 have unknown errors.
afadvantage.gov
alert.nih.gov
animalresearch.nih.gov
answers.usgs.gov
apply.fbijobs.gov
apps-beta.nationalmap.gov
apps.fs.usda.gov
ardf.wr.usgs.gov
arrtmc.er.usgs.gov
asap.gsa.gov
atms.fleta.gov
backupcare.ors.nih.gov
beta.fpds.gov
cdp.dhs.gov
childcare.ors.nih.gov
clinicianportal.cc.nih.gov
cmdp.epa.gov
csi-rt.cbp.dhs.gov
csi-rt2.cbp.dhs.gov
dats.ors.od.nih.gov
dems.ors.od.nih.gov
denvervpn.fmshrc.gov
dhsadvantage.gsa.gov
dmcs.ors.od.nih.gov
dmcseddebt.ed.gov
dmms.ors.od.nih.gov
does.ors.od.nih.gov
dohs.ors.od.nih.gov
dseis.od.nih.gov
dsid.od.nih.gov
dtts.ors.od.nih.gov
dvr.ors.od.nih.gov
e-verify.uscis.gov
ecos-beta.fws.gov
ecos-training.fws.gov
ecos.fws.gov
eeo.oar.noaa.gov
effectivehealthcare.ahrq.gov
efileqa.fara.gov
emaps.ed.gov
entptest.hud.gov
esbl.nhlbi.nih.gov
everify.uscis.gov
ezaudit.ed.gov
faf.ornl.gov
finance.ocfo.gsa.gov
foiltheflu.nih.gov
fpd.gsfc.nasa.gov
fs.usda.gov
fsa-fms.ed.gov
fsa-fmstest2.ed.gov
giitest.dhs.gov
gsaadvantage.gov
gsaelibrary.gsa.gov
hqvpn.fmshrc.gov
idbadge.nih.gov
iee.tva.gov
igt.fiscal.treasury.gov
inventions.nih.gov
ita.data.commerce.gov
its.gov
jpl.nasa.gov
lakeinfo.tva.gov
livelink.nida.nih.gov
lymphochip.nih.gov
marketplace.fedramp.gov
medarts.nih.gov
mpai.ksc.nasa.gov
nasatoms.gsfc.nasa.gov
nccd.cdc.gov
niehs.nih.gov
nlecatalog.ed.gov
noaa.data.commerce.gov
nomercury.nih.gov
npdes-ereporting.epa.gov
ntrl.ntis.gov
oerstaff.od.nih.gov
ohsrp.nih.gov
onhir.gov
orauportal.fda.gov
orautest.fda.gov
parking.nih.gov
paulsimonchicago.jobcorps.gov
phgkb.cdc.gov
pub-lib.jpl.nasa.gov
qoca.jpl.nasa.gov
recoveryswapshop.ird.appdat.jsc.nasa.gov
recoveryswapshop.jsc.nasa.gov
secure.ssa.gov
shemesh.larc.nasa.gov
shm.cc.nih.gov
shuttle.nih.gov
shuttle.od.nih.gov
simon.er.usgs.gov
step.nih.gov
tcr.sec.gov
topsorder.ftsbilling.gsa.gov
topsordercert.ftsbilling.gsa.gov
userfees.fda.gov
usmcservmart.gsa.gov
vipssp.dmcseddebt.ed.gov
water.weather.gov
wellnessnews.ors.nih.gov
wheat.pw.usda.gov
wildhorsesonline.blm.gov
wise.er.usgs.gov
workfamilymonth.ors.nih.gov
543 time out. A substantial number of these are live sites; in our experience, meta or client-side redirects (e.g. HTML code that redirects the page, instead of a server-side redirect) are often what our headless browser fails on.
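One way to test the meta-redirect theory on the timed-out sites is to fetch the raw HTML and look for a `<meta http-equiv="refresh">` tag. The sketch below uses only the standard-library HTML parser; `MetaRefreshFinder` and `meta_refresh_targets` are hypothetical names, and JavaScript-driven redirects are not covered.

```python
from html.parser import HTMLParser

class MetaRefreshFinder(HTMLParser):
    """Collect redirect targets declared via <meta http-equiv="refresh">."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {k.lower(): (v or "") for k, v in attrs}
        if attr.get("http-equiv", "").lower() != "refresh":
            return
        # The content attribute looks like "0; url=https://example.gov/new".
        _, _, rest = attr.get("content", "").partition(";")
        key, _, value = rest.strip().partition("=")
        if key.strip().lower() == "url":
            self.targets.append(value.strip().strip("'\""))

def meta_refresh_targets(html):
    """Return the list of meta-refresh redirect URLs found in html."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    return finder.targets
```

Any site in the timeout list whose home page returns a non-empty list here is a candidate for the redirect explanation rather than a genuine outage.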
Take 4:
Specific sites for the below...
Take 3:
Requests for agencies to improve:
Take 2:
Areas sites could improve:
========
Along the lines of #838. Working here.
Note the original CISA data here.