GSA / code-gov-harvester

Stand alone metadata harvester for Code.gov
Creative Commons Zero v1.0 Universal

Improving data coverage #27

Open Nosferican opened 4 years ago

Nosferican commented 4 years ago
| acronym | status | source | other links |
| --- | --- | --- | --- |
| DHS | fallback | https://raw.githubusercontent.com/GSA/code-gov-harvester/master/data/fallback/DHS.json | https://www.dhs.gov/code/json https://github.com/usdhs |
| DOJ | fallback | https://raw.githubusercontent.com/GSA/code-gov-harvester/master/data/fallback/DOJ.json | https://www.justice.gov/digitalstrategy https://github.com/usdoj |
| EPA | fallback | https://raw.githubusercontent.com/GSA/code-gov-harvester/master/data/fallback/EPA.json | https://edg.epa.gov/data.json https://github.com/USEPA |
| NSF | fallback | https://raw.githubusercontent.com/GSA/code-gov-harvester/master/data/fallback/NSF.json | https://www.nsf.gov/digitalstrategy/ https://github.com/nsf-open |
| NARA | fallback | https://raw.githubusercontent.com/GSA/code-gov-harvester/master/data/fallback/NARA.json | https://www.archives.gov/digitalstrategy https://github.com/usnationalarchives https://www.archives.gov/developer |
| DOI | NULL | NULL | https://github.com/usinterior |
| DOC | NULL | NULL | https://github.com/CommerceGov https://github.com/usnistgov https://github.com/NOAA-GFDL |
| DOS | NULL | NULL | https://www.state.gov/digital-government-strategy/ |
| USAID | NULL | NULL | https://www.usaid.gov/usaid-digital-strategy https://github.com/USAID |
| NRC | NULL | NULL | https://www.nrc.gov/public-involve/open/digital-government.html https://www.nrc.gov/developer.html |
| OPM | NULL | NULL | https://www.opm.gov/blogs/OpenOPM/digital-government-strategy/ |
| USGS | NULL | NULL | https://github.com/usgs/code-json-generator |
| NSA | NULL | NULL | https://code.nsa.gov/ https://github.com/nationalsecurityagency |
| EOP | NULL | NULL | https://raw.githubusercontent.com/EOP-OMB/code_json/master/code.json |
| HUD | OK | https://www.hud.gov/sites/documents/CODE_INVENTORY.JSON | |
| USDA | OK | https://www.usda.gov/sites/default/files/documents/code.json | |
| DOL | OK | https://www.dol.gov/code.json | |
| DOT | OK | https://www.transportation.gov/sites/dot.gov/files/docs/code.json | |
| TREASURY | OK | https://s3.amazonaws.com/static.treasury.gov/jsonfiles/code.json | |
| VA | OK | https://www.va.gov/code.json | |
| NASA | OK | https://raw.githubusercontent.com/nasa/Open-Source-Catalog/master/code.json | |
| GSA | OK | https://open.gsa.gov/code.json | |
| SBA | OK | https://www.sba.gov/code.json | |
| SSA | OK | https://www.ssa.gov/code.json | |
| CFPB | OK | https://www.consumerfinance.gov/code.json | |
| DOD | OK | https://code.mil/code.json | |
| ED | OK | https://www2.ed.gov/code.json | |
| DOE | OK | https://www.energy.gov/sites/prod/files/2019/07/f64/code-07-17-2019_0.json | |
| HHS | OK | https://www.hhs.gov/code.json | |
| FEC | OK | https://www.fec.gov/code.json | |

Nosferican commented 4 years ago

@IanLee1521 have y'all looked into collecting the users/organizations for US federal agencies? @CalvinIsch, could you look into finding users/organizations for US federal entities on GitHub? @saracope What's the process for touching base with agencies about updating or reviving their code.json files?

A few special cases

IanLee1521 commented 4 years ago

Hi @Nosferican -- I haven't specifically, but mostly because there is already https://government.github.com/community/ which has that data.

I did run github.com/llnl/scraper against the U.S. entries on that site (at the time) and posted those results on the pull request: https://github.com/LLNL/scraper/pull/3

Is that what you were thinking, or something else?

Nosferican commented 4 years ago

The GitHub Government Community collection is a crowd-sourced initiative, but it isn't curated per se. For example, it lists some organizations that are definitely not U.S. federal depts/agencies (e.g., @radiofreeasia is under U.S. Federal, but is an NGO). We were thinking of querying the names of U.S. departments and agencies and some combinations (e.g., US $name, acronym) against the GHTorrent organization names or the GitHub GraphQL API.
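
The name combinations mentioned above could be generated along these lines. This is only a minimal sketch: the function name `candidate_queries` and the particular prefix set are assumptions made for illustration, not something already in the harvester.

```python
from typing import List


def candidate_queries(name: str, acronym: str) -> List[str]:
    """Build search strings for one agency: plain name, prefixed names, acronym."""
    prefixes = ["", "US ", "U.S. "]  # assumed prefix set; extend as needed
    candidates = [prefix + name for prefix in prefixes]
    candidates.append(acronym)

    # Drop case-insensitive duplicates while keeping the original order.
    seen = set()
    unique = []
    for candidate in candidates:
        key = candidate.casefold()
        if key not in seen:
            seen.add(key)
            unique.append(candidate)
    return unique
```

Each returned string would then be matched against organization names, whichever source (GHTorrent or the GraphQL API) ends up being used.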

For obtaining a list of U.S. federal depts/agencies and other entities, we were thinking of using the A-Z Index of U.S. Government Departments and Agencies, which has a directory based on the U.S. Government Manual supplemented with entities that "directly serve the public" (e.g., USDA National Agricultural Statistics Service, NASS). Ideally, M-16-21 should have the dept/agency heads report and monitor it, but I suspect it isn't occurring as diligently as it could be.

That proposal seems like the best approach considering the lack of access to an exhaustive list of (at least public) U.S. federal dept/agencies and the different organizational levels.

For example,

It is unclear at what levels a GitHub organization would be set up. It is also unclear which levels would show up in budget / organizational databases.

One heuristic we might use is to require the GitHub organization to list a website with a .gov/.mil domain. Not all federal government websites use those domains, though (e.g., goarmy.com). Another is to not require organizations to be verified (verification usually requires including some metadata on the website, which is usually reserved for the department level).
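
The .gov/.mil heuristic could look roughly like this. It is a sketch under assumptions: `has_federal_domain` and `FEDERAL_SUFFIXES` are hypothetical names, and the input is assumed to be whatever string the organization profile lists as its website (possibly without a scheme).

```python
from urllib.parse import urlparse

FEDERAL_SUFFIXES = (".gov", ".mil")  # assumed suffix list for this heuristic


def has_federal_domain(website: str) -> bool:
    """Return True when the website's host name ends in .gov or .mil."""
    if not website:
        return False
    # urlparse only recognizes the host when a scheme or "//" is present.
    if "//" not in website:
        website = "//" + website
    host = (urlparse(website).hostname or "").lower()
    return host.endswith(FEDERAL_SUFFIXES)
```

As noted, this misses sites such as goarmy.com, and because state and local governments also use .gov, a match alone would not confirm that an organization is federal.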