GSA / govt-urls

Most government websites end in .gov or .mil, but many do not. This repo contains USA.gov's list of public government domains and URLs that don't end in .gov or .mil.
https://search.gov/developer/govt-urls.html

A few potential data-quality improvements #11

Closed benbalter closed 8 years ago

benbalter commented 8 years ago

Chatting with @ErikSArnold a few weeks back, I mentioned that I have a script to validate the domains listed here before I vendor them into GMan.

I rely on this data, so I wrote a quick script to reconcile the two lists (this one and GMan's), in hopes of contributing some of those improvements back upstream. You can see the full output below.

Note that I suspect some of the differences are intentional. I deliberately exclude educational domains and commercial hosting services, and I look only at domains, not subpaths, even when two government entities share a server.
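
The reconciliation described above could be sketched roughly as follows. This is a minimal Python illustration, not the actual `script/reconcile-us` (whose source is not shown here): the names `normalize`, `is_excluded`, `reconcile`, and the `EXCLUDED_SUFFIXES` list are all hypothetical, and the real exclusion rules are certainly longer.

```python
from urllib.parse import urlsplit

# Hypothetical exclusion rules, loosely modeled on the categories
# mentioned above (educational domains, commercial hosting services).
EXCLUDED_SUFFIXES = (".edu", ".govoffice.com", ".wordpress.com")


def normalize(entry: str) -> str:
    """Reduce a URL or bare host name to a lowercase domain, dropping any path."""
    entry = entry.strip().lower()
    if "://" not in entry:
        entry = "//" + entry  # lets urlsplit treat the text as a host
    host = urlsplit(entry).hostname or ""
    host = host.rstrip(".")  # tolerate a trailing dot
    return host[4:] if host.startswith("www.") else host


def is_excluded(domain: str) -> bool:
    """True if the domain falls under one of the assumed exclusion rules."""
    return domain.endswith(EXCLUDED_SUFFIXES)


def reconcile(source, reference):
    """Return normalized, non-excluded domains in `source` that `reference` lacks."""
    candidates = {normalize(e) for e in source}
    candidates = {d for d in candidates if d and not is_excluded(d)}
    known = {normalize(e) for e in reference}
    return sorted(candidates - known)
```

Because both lists are normalized before the set difference, two entries that differ only by a path or a `www.` prefix collapse into one domain, which is why the normalized count can be smaller than the starting count.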

Glad to answer any questions, and hope the information helps.

$ script/reconcile-us
I, [2015-10-14T14:43:28.396630 #8665]  INFO -- : Starting with 11374 domains
I, [2015-10-14T14:43:32.422899 #8665]  INFO -- : Filtered down to 11343 normalized domains
I, [2015-10-14T14:43:32.447826 #8665]  INFO -- : Found 49 domains not on the USA.gov list
Here's the list of missing domains:

---
usagovAK:
- cityofsherwood.net
usagovAL:
- shelbycountyalabama.com
usagovAR:
- aragriculture.org
- arcommunities.org
- arfamilies.org
- arhomeandgarden.org
- arnatural.org
- educationinarkansas.com
- kidsarus.org
usagovAS:
- amsamoatourism.com
usagovAZ:
- mcldaz.org
usagovCA:
- countyofventura.org
- fortbragg.com
- portofsandiego.org
- sbceoportal.org
- sesd.org
- sgch.org
- solvangusa.com
- sonoma-county.org
- srcs.org
usagovCO:
- centennialcolorado.com
- cityofgolden.net
- greeleygov.com
- mountain-village.co.us
- townofsuperior.com
usagovCT:
- clintonct.com
- connquest.com
- oxford-ct.com
- swrpa.org
usagovDC:
- dccouncil.us
- wmata.com
usagovDE:
- odessadelaware.com
usagovFL:
- fgdl.org
- floridasterling.com
- myfloridahistory.org
- volunteerflorida.org
- workforceflorida.com
usagovGA:
- atlantaregional.com
usagovHI:
- hawaii.sdp.sirsi.net
usagovIA:
- loganiowa.com
- shellrockiowa.org
usagovIL:
- bradleyil.org
- chicagoheights.net
- cumtd.com
- historyillinois.org
- mattoonillinois.org
- mundelein.org
- murphysboro.com
- sirpdc.org
- toi.org
- transitchicago.com
- volz.org
- watsekacity.com
usagovIN:
- townofdyer.com
usagovKS:
- greeleycountygovernment.org
- nemaha.kansasgov.com
- tongie.org
usagovKY:
- kfcyumcenter.com
usagovLA:
- angolamuseum.org
- atchafalaya.org
- crawfish.org
- ebrso.org
- groupbenefits.org
- la-kidmed.com
- labenfa.com
- lachiefs.org
- lacourtreporterboard.com
- lacpra.org
- laddc.org
- laeggs.com
- lalb.org
- lasc.org
- laspc.com
- laworks.net
- lma.org
- loni.org
- louisianacda.com
- louisianaeconomicdevelopment.com
- louisianaseafood.com
- louisianataxfree.com
- lpb.org
- lsba.org
- lsbes.org
- lsbid.org
- lsbmt.org
- lsli.org
- lsp.org
- lus.org
- lusfiber.com
- portgbr.com
- trsl.org
usagovMA:
- publiccounsel.net
usagovMD:
- bism.org
- mtnlakepark.us
- salisburyfd.com
- sudlersville.org
- westernmarylandcfc.org
usagovME:
- brewerme.org
- scarborough.me.us
usagovMI:
- stclairshores.net
usagovMN:
- boreal.org
- cityofrogers.org
- cityofsebeka.com
- hancockmn.org
- lakewoodmn.com
- maplelakemn.org
- metrotransit.org
- mywabana.com
- oronocotownship.com
- prairieagcomm.com
- swrdc.org
- threeriversparkdistrict.org
- townofhassan.com
- waterfordtownship.wikifoundry.com
usagovMO:
- bonneterre.net.
- claycogov.com
- kcmo.org
- straffordmissouri.org
- villageofclaycomo.com
usagovMS:
- cityofboonevillems.com
- clarksdalewebinfo.com
- lelandms.org
- masnetwork.org
- mmlonline.com
- oceansprings.org
- thecityofcolumbusms.org
- wavelandcity.com
- wessonms.org
usagovMT:
- froidmt.com
- hotsprgs.net
- hotspringsmt.net
- lakecounty-mt.org
usagovNC:
- albemarlecommission.org
- berrytownecrafts.com
- bikesafenc.com
- bmcnc.org
- driving95.com
- emspic.org
- encsd.net
- everywomannc.com
- healthycarolinians.org
- i-85yadkinriver.com
- jennettespier.net
- jfkadatc.net
- jirdc.org
- mattamuskeetlodge.com
- mountainfair.org
- murdochcenter.org
- museumofthealbemarle.com
- naturalsciences.org
- ncadfp.org
- ncagfairs.org
- ncair.org
- ncatlasrevisited.org
- ncatp.org
- nccancer.com
- nccivilwar150.com
- nccoastalreserve.net
- ncdmf.net
- ncdrought.org
- ncdsca.org
- ncecho.org
- ncfacilitymanagement.net
- ncfhp.org
- ncforeclosurehelp.org
- ncforestassessment.com
- ncfriendsofagriculture.org
- nchealthystart.org
- nchistoryday.org
- ncicu.org
- ncknows.org
- nclifetimeincome.org
- ncmarkers.com
- ncnewbornhearing.org
- ncnhtf.org
- nconemap.net
- ncpanbranch.com
- ncsicklecellprogram.org
- ncstatesurplus.com
- ncstrokeregistry.com
- ncveterans.com
- ncwaterquality.org
- ncwelldriller.org
- nczoo.org
- newhirereporting.com
- onencnaturally.org
- savewaternc.org
- sehsr.org
- startwithyourheart.com
- tryonpalace.org
- volunteernc.org
usagovND:
- finleynd.com
- granvillend.com
- ndowlicensing.com
usagovNH:
- haverhill-nh.com
usagovNJ:
- absecon-newjersey.org
- avon-by-the-sea.com
- brooklawn.us
- njtransit.com
- rutherfordems.org
- waterfordtwp.com
usagovNM:
- sjunitedway.org
usagovNV:
- walknevada.com
usagovNY:
- southbristol.org
- townofdover.us
- townofgalway.org
- tullyny.org
- villageflowerhill.com
usagovOH:
- shawneehillsoh.com
usagovOR:
- fallscity.org
- nwsds.org
- osbar.org
- sdao.com
usagovPA:
- abingtontownship.us
- doorkickers.org
- dushore.com
- elizabethtownship.org
- hsp.org
- mckeesrocks.com
- pacouncilonthearts.org
- padmv.org
- psats.org
- psp-hemc.org
- sapdc.org
- telfordborough.com
- trainerborough.org
- wcaln.org
- westviewborough.com
usagovRI:
- rhodeislandhousing.org
usagovSC:
- bonneausc.com
- scattorneygeneral.com
- scattorneygeneral.org
- scstatehouse.net
- townofgraycourt.net
usagovSD:
- beadlecounty.org
- cityofhotspringssd.org
- sdonecall.com
usagovTN:
- washingtoncountytn.com
usagovTX:
- bunkerhill.net
- cedarhilltxgov.org
- ci.donna.lib.tx.us
- cityofbeaumont.com
- sanangelotexas.us
- socorrotexas.org
usagovUT:
- marysvale.org
- saratoga-springs.net
usagovVA:
- culpeper.to
- motorcycleva.com
- townofkilmarnockva.com
- vaports.com
usagovVT:
- danvillevt.com
- guilfordvt.org
- jamaicavermont.org
- peacham.net
- rutlandcity.com
- shorehamvt.org
- uvlsrpc.org
usagovWA:
- aberdeeninfo.com
- harringtonbiz.com
- othellowashington.us
- starbuckwa.com
- townofrosalia.org
- townofwinthrop.com
- wsctc.com
usagovWI:
- lacrossecounty.org
- scottwi.com
- sisterbay.com
- townofnorway.org
usagovWV:
- cityoffairmontwv.com
- wvdob.org
usagovWY:
- house.mn
- senate.mn
I, [2015-10-14T15:18:25.532592 #8665]  INFO -- : Calling out 395 rejected domains
Here are the rejected domains and why they were rejected (excluding locality regexs):

---
unresolvable:
- abilityone.fed.us
- access-board-members.gov
- anitaiowa.com
- apps.gov
- aqi.gov
- arcticgas.gov
- atkinson-me.org
- b-ville.com
- biosecurityboard.gov
- bto.gov
- buenavistatownship.org
- c3.gov
- carboncyclescience.gov
- choctawnationflorida.org
- citizencosponsors.gov
- cityoffallon.org
- cityofpointarena.com
- climatechange.gov
- conservation.gov
- consumerfinancialbureau.gov
- counterwmd.gov
- dcpsa.gov
- disastercontractingassistance.gov
- doleta.gov
- dottrcc.gov
- e3.gov
- easton.me.us
- efaca.gov
- eforms.gov
- ellsburgtownship.org
- epa-echo.gov
- epa-otis.gov
- erdc.gov
- espanol.gov
- ets.prod.carlson.com
- fcsm.gov
- fdr.gov
- fedcentennial.gov
- fedcir.gov
- federalreservecentennialcelebration.gov
- federaltransparency.gov
- fedforms.gov
- fedrealestate.gov
- fedstats.gov
- fero.gov
- fightingmalaria.gov
- fmip.gov
- frc.gov
- frcc.gov
- frs.gov
- frtibtest.gov
- fswg.gov
- galenamd.com
- gopconference.gov
- governordejongh.com
- gpoaccess.gov
- hartsvillesc.com
- hawaiipublicschools.org
- heuveltonny.us
- history.gov
- homermich.org
- huntsdalemo.com
- iscience.gov
- jacksonvillage.net
- jmc.gov
- kb.cert.org
- kewaskumsausage.com
- kyehealth.org
- kyheritage.org
- lasegundacosa.gov
- makotind.com
- manchester-ga.com
- mapstats.gov
- marlboro.vt.us
- marview.gov
- mayesvillesc.com
- mckeesport.org
- milfordnj.org
- modoccounty.us
- mojavedata.gov
- mypay.gov
- nara-at-work.gov
- nasa.asee.org
- nepa.gov
- niftt.gov
- nmic.gov
- nmsc.gov
- noaawatch.gov
- nonprofit.gov
- norfolkny.us
- northfranklintownship.com
- opportunity.gov
- osagetribe.com
- oti.gov
- paperworkreduction.gov
- pci.gov
- peacecorpsoig.gov
- peakcfc.com
- richburgsc.net
- riverheadli.com
- rumseyrancheria.org
- safexchange.gov
- sandbox.gov
- sangervillemaine.org
- scrdc.org
- seniors.gov
- shepherdstown.us
- smallbusiness.gov
- socialsecurityadvisoryboard.gov
- sourisnd.com
- springfieldnh.net
- sprucepineonline.com
- tannersvilleny.org
- tda.gov
- thesecondthing.gov
- townofheathsprings.org
- townofhornbeck.com
- townofticonderoga.com
- townofwindham.com
- township.clinton.nj.us
- transportationresearch.gov
- trs.gov
- tsptest.gov
- uscapitolvisitorcenter.gov
- uscavc.gov
- uscva.gov
- uscvc.gov
- utemountainute.com
- verifypayment.gov
- vet-biz.gov
- vetapp.gov
- vetsuccess.gov
- villageofbrocton.com
- villageofjohnstown.org
- volunteersforprosperity.gov
- waterfordny.org
- wingnd.com
academic:
- aces.edu
- aces.nmsu.edu
- ag.umass.edu
- ag.unr.edu
- agrilifeextension.tamu.edu
- alsde.edu
- bie.edu
- bushlibrary.tamu.edu
- cce.cornell.edu
- cdse.edu
- ces.ca.uky.edu
- cga.edu
- clemson.edu
- consensus.fsu.edu
- ctg.albany.edu
- dodea.edu
- earthkam.ucsd.edu
- edis.ifas.ufl.edu
- energy.wsu.edu
- ext.vt.edu
- ext.wsu.edu
- ext.wvu.edu
- extension.iastate.edu
- extension.ifas.ufl.edu
- extension.illinois.edu
- extension.missouri.edu
- extension.nmsu.edu
- extension.oregonstate.edu
- extension.psu.edu
- extension.purdue.edu
- extension.udel.edu
- extension.umaine.edu
- extension.umn.edu
- extension.unh.edu
- extension.unl.edu
- extension.usu.edu
- external.oneonta.edu
- fbiacademy.edu
- fcs.okstate.edu
- fdrlibrary.marist.edu
- forest.moscowfsl.wsu.edu
- geoinfo.nmt.edu
- georgewbushlibrary.smu.edu
- ianrpubs.unl.edu
- mbmg.mtech.edu
- msue.anr.msu.edu
- ncsu.edu
- ndu.edu
- nfs.unl.edu
- nps.edu
- origins.ou.edu
- otscweb.tamu.edu
- passhe.edu
- rd.okstate.edu
- reagan.utexas.edu
- rmrs.nau.edu
- sahp.vcu.edu
- schev.edu
- sled.alaska.edu
- tais.tamu.edu
- teexweb.tamu.edu
- texasextension.tamu.edu
- tsbvi.edu
- tti.tamu.edu
- tvmdl.tamu.edu
- txforestservice.tamu.edu
- uaex.edu
- uaf.edu
- ucop.edu
- uidaho.edu
- usafa.edu
- usda.mannlib.cornell.edu
- usma.edu
- usmcu.edu
- usmma.edu
- usna.edu
- usu.edu
- usuhs.edu
- uvm.edu
- uwex.edu
- uwyo.edu
- web.uri.edu
- westpoint.edu
- wvnet.edu
govoffice:
- adellwi.govoffice2.com
- anderson.govoffice.com
- baldwintownship.govoffice.com
- baltic.govoffice.com
- bigfalls.govoffice.com
- bladennc.govoffice3.com
- bolton.govoffice.com
- bottineau.govoffice.com
- brandon.govoffice.com
- brockwaytownship.govoffice.com
- brooten.govoffice.com
- browerville.govoffice.com
- campbellsport.govoffice.com
- canby.govoffice.com
- chester.govoffice.com
- chilton.govoffice.com
- china.govoffice.com
- clio.govoffice.com
- coldspring.govoffice.com
- collegetownship.govoffice.com
- conrad.govoffice.com
- corinna.govoffice.com
- crookedlake.govoffice2.com
- custer.govoffice.com
- denaliborough.govoffice.com
- desmet.govoffice2.com
- driggs.govoffice.com
- dunncountywi.govoffice2.com
- eastgulllake.govoffice.com
- edenvalley.govoffice.com
- edgewaterco.govoffice3.com
- epping.govoffice.com
- evansdale.govoffice.com
- eyota.govoffice.com
- faith.govoffice.com
- fordvillecitynd.govoffice2.com
- gatescounty.govoffice2.com
- goodview.govoffice.com
- granvillenc.govoffice2.com
- griswoldia.govoffice2.com
- hallowell.govoffice.com
- harrisburg.govoffice.com
- hartland.govoffice.com
- haverhillnh.govoffice3.com
- hawley.govoffice.com
- hector.govoffice.com
- henriettatownship.govoffice2.com
- highmoresd.govoffice3.com
- hinsdale.govoffice.com
- houston.govoffice.com
- humboldt.govoffice.com
- independence.govoffice.com
- janesville.govoffice.com
- jordan.govoffice.com
- kalmar.govoffice.com
- keewatin.govoffice.com
- kelliher.govoffice.com
- lakelandshores.govoffice.com
- lakelillian.govoffice.com
- lakenorden.govoffice.com
- lilydale.govoffice.com
- lincolnpark.govoffice.com
- lismore.govoffice2.com
- lonsdale.govoffice.com
- lscb.govoffice.com
- lubecme.govoffice2.com
- manchester.govoffice2.com
- marine.govoffice.com
- marion.govoffice2.com
- mechanicfalls.govoffice.com
- midwaytwpmn.govoffice2.com
- monmouthme.govoffice2.com
- myersvillemd.govoffice2.com
- nashwauk.govoffice.com
- nevis.govoffice.com
- newyorkmills.govoffice2.com
- oakfield.govoffice.com
- palmyra.govoffice.com
- parkrivernd.govoffice2.com
- pembina.govoffice.com
- pinecity.govoffice.com
- plainwi.govoffice2.com
- randall.govoffice2.com
- randolphvt.govoffice2.com
- ravennatwpmn.govoffice2.com
- readfield.govoffice.com
- readingvt.govoffice.com
- richardsontownship.govoffice2.com
- roosevelttownship.govoffice.com
- roxbury.govoffice2.com
- rushford.govoffice.com
- rushfordvillage.govoffice.com
- sandstone.govoffice.com
- saratoga.govoffice2.com
- sherburn.govoffice.com
- slayton.govoffice.com
- solon.govoffice.com
- sor.govoffice3.com
- springfieldvt.govoffice2.com
- stclair.govoffice2.com
- stpaulpark.govoffice.com
- sunset.govoffice2.com
- sunvalley.govoffice.com
- surfcity.govoffice.com
- sylvanc.govoffice3.com
- tamacity.govoffice2.com
- thomson.govoffice.com
- timnathco.govoffice2.com
- trimont.govoffice.com
- twinvalley.govoffice.com
- tyler.govoffice.com
- wabedo.govoffice.com
- wahpetonia.govoffice.com
- westbath.govoffice.com
- westerly.govoffice.com
- westlakeland.govoffice2.com
- weston.govoffice.com
- westwindsorvt.govoffice2.com
- whitewood.govoffice.com
- winlockwa.govoffice2.com
- woodrowtwpmn.govoffice2.com
- wt.govoffice.com
- yakutatak.govoffice2.com
- zimmerman.govoffice.com
homestead:
- biltmoreforesttownhall.homestead.com
- botco.homestead.com
- cityofsumas.homestead.com
- echotacherokeetribe.homestead.com
- townofbyromville.homestead.com
blogspot.com:
- brantleycountyga.blogspot.com
- peachamblog.blogspot.com
blacklist:
- business.centurytel.net
- chesnee.net
- citlink.net
- egovlink.com
- emainehosting.com
- fantasyspringsresort.com
- frontiernet.net
- hartford-hwp.com
- homepages.sover.net
- htc.net
- koasekabenaki.org
- kstrom.net
- laworkforce.net
- mississippistateparks.reserveamerica.com
- mylocalgov.com
- myweb.cebridge.net
- ncstars.org
- neagrelations.org
- qis.net
- rootsweb.com
- showcase.netins.net
- valuworld.com
- wctc.net
- webconnections.net
- webpages.charter.net
wordpress:
- cityofortonville.wordpress.com
- halcottcenter.wordpress.com
- hallsvillemissouri.wordpress.com
- hudsonmaine.wordpress.com
- thetownofsixmile.wordpress.com
- usresponserestoration.wordpress.com
tripod.com:
- emmonscounty.tripod.com
- lincolnboro.tripod.com
home. regex:
- home.myfairpoint.net
- home.webryders.net
- home.windstream.net
squarespace.com:
- nobleco.squarespace.com
github.io:
- project-open-data.github.io
sites. regex:
- sites.google.com
weebly:
- towncenterville.weebly.com
user. regex:
- user.camtel.net
- users.bestweb.net
wix.com:
- washtwp.wix.com
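
For readers who want to reproduce a report grouped by rejection reason like the one above, here is a hedged Python sketch. The labels mirror a few of the headings in the output, but the patterns in `RULES` are guesses rather than the script's real rules, and `resolves`, `rejection_reason`, and `group_rejections` are hypothetical names.

```python
import re
import socket

# Hypothetical rules: labels copied from the report headings above,
# patterns assumed for illustration only.
RULES = [
    ("academic", re.compile(r"\.edu$")),
    ("govoffice", re.compile(r"\.govoffice\d*\.com$")),
    ("wordpress", re.compile(r"\.wordpress\.com$")),
    ("home. regex", re.compile(r"^home\.")),
]


def resolves(domain: str) -> bool:
    """True if the system resolver can look the domain up."""
    try:
        socket.getaddrinfo(domain, None)
        return True
    except socket.gaierror:
        return False


def rejection_reason(domain, resolver=resolves):
    """Return the first matching rejection label, or None to keep the domain."""
    for label, pattern in RULES:
        if pattern.search(domain):
            return label
    return None if resolver(domain) else "unresolvable"


def group_rejections(domains, resolver=resolves):
    """Group rejected domains by reason, preserving input order within a group."""
    groups = {}
    for domain in domains:
        reason = rejection_reason(domain, resolver)
        if reason:
            groups.setdefault(reason, []).append(domain)
    return groups
```

The pattern rules run before the DNS lookup because they are cheap and need no network access. The `resolver` parameter exists so the lookup can be stubbed out when testing offline.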
afeijoo commented 8 years ago

Thanks, @benbalter. You're right that some differences are intentional. For example, we include military academies, cooperative extensions, and a few other .edu domains. We'll work our way through the rest of your list and update our records as needed.

afeijoo commented 8 years ago

@benbalter Several of the missing domains redirect to new domains. You can find the new, "preferred" domains by querying our Non-.gov URLs API. For example, http://govt-urls.api.usa.gov/government_urls/search?q=sgch.org shows the new domain is sangabrielcity.com.
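
A lookup like the one in the example URL could be scripted along these lines. This Python sketch only builds the query URL and returns the raw response body; the response schema is not described in this thread, so nothing is parsed. The function names `search_url` and `fetch_raw` are hypothetical, and the endpoint is copied from the example above and may no longer be live.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the example URL in this thread; availability is not guaranteed.
SEARCH_ENDPOINT = "http://govt-urls.api.usa.gov/government_urls/search"


def search_url(domain: str) -> str:
    """Build the search URL for a single domain query."""
    return f"{SEARCH_ENDPOINT}?{urlencode({'q': domain})}"


def fetch_raw(domain: str, timeout: float = 10.0) -> str:
    """Return the response body as text, without assuming its structure."""
    with urlopen(search_url(domain), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `search_url("sgch.org")` reproduces the URL quoted above; `fetch_raw` needs network access and will raise if the service is unreachable.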