Open gbinal opened 9 years ago
I believe these should be the URLs for the above:
http://usda.gov/data.json
http://www.commerce.gov/data.json
http://nist.gov/data.json
http://data.noaa.gov/data.json
http://www.defense.gov/data.json
http://www2.ed.gov/data.json
http://www.energy.gov/data.json
http://nrel.gov/data.json
http://healthdata.gov/data.json
http://www.dhs.gov/sites/default/files/publications/digital-strategy/data.json
http://www.hud.gov/data.json
http://www.doi.gov/data.json
http://www.justice.gov/data.json
http://www.dol.gov/data.json
http://www.state.gov/data.json
http://www.dot.gov/data.json
http://treasury.gov/data.json
http://www.va.gov/data.json
http://www.usaid.gov/data.json
http://www.epa.gov/data.json
http://www.gsa.gov/data.json
http://www.nasa.gov/data.json
http://www.archives.gov/data.json
http://www.nrc.gov/data.json
http://www.nsf.gov/data.json
http://www.opm.gov/data.json
https://www.sba.gov/sites/default/files/data.json
http://www.ssa.gov/data.json
http://www.consumerfinance.gov/data.json
http://www.fhfa.gov/data.json
http://www.imls.gov/data.json
http://data.mcc.gov/raw/index.json
http://www.nitrd.gov/data.json
http://www.ntsb.gov/data.json
http://www.sec.gov/data.json
https://open.whitehouse.gov/data.json
This is the broken way the NSF represents API records in its data.json:
{
  "@type": "dcat:Dataset",
  "title": "NSF Award Search Web API",
  "accessLevel": "public",
  "contactPoint": {
    "@type": "vcard:Contact",
    "fn": "Nancy Kaplan",
    "hasEmail": "mailto:nkaplan@nsf.gov"
  },
  "description": "The NSF Award Search web API provides a web API interface to the Research.gov's Research Spending and Results data, which provides NSF research award information from 2007.",
  "identifier": "1102",
  "keyword": [
    "nasa",
    "national aeronautics and space administration national aeronautics and space administration stem",
    "national science foundation",
    "nsf",
    "research and education",
    "science and engineering"
  ],
  "license": "http://www.nsf.gov/",
  "modified": "P1D",
  "publisher": {
    "@type": "org:Organization",
    "name": "National Science Foundation"
  },
  "distribution": [
    {
      "@type": "dcat:Distribution",
      "downloadURL": "http://www.research.gov/common/webapi/awardapisearch-v1.htm",
      "mediaType": "application/json"
    }
  ],
  "bureauCode": [
    "422:00"
  ],
  "programCode": [
    "422:011"
  ]
}
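One plausible problem with the record above is that its `downloadURL` points at an `.htm` documentation page while `mediaType` claims `application/json`. As I understand the Project Open Data v1.1 schema, an API is expected to be listed with an `accessURL` and a `format` of "API" instead. Here is a minimal sketch of a check for that, assuming that reading of the schema is right. The function name `check_distribution` and the `DOC_EXTENSIONS` list are hypothetical, not part of any spec or existing crawler.

```python
from urllib.parse import urlparse

# Hypothetical heuristic: extensions that suggest a documentation page, not data.
DOC_EXTENSIONS = (".htm", ".html")


def check_distribution(dist):
    """Return a list of human-readable problems found in one distribution entry."""
    problems = []
    download_url = dist.get("downloadURL", "")
    media_type = dist.get("mediaType", "")
    path = urlparse(download_url).path.lower()

    # A downloadURL ending in .htm/.html should not claim a non-HTML media type.
    if path.endswith(DOC_EXTENSIONS) and media_type != "text/html":
        problems.append(
            "downloadURL looks like an HTML page but mediaType is %r" % media_type
        )

    # Assumption: API distributions are marked with format "API" plus an accessURL.
    is_api = dist.get("format", "").lower() == "api"
    if is_api and "accessURL" not in dist:
        problems.append("format is API but accessURL is missing")

    return problems


# The distribution entry from the NSF record above.
nsf_distribution = {
    "@type": "dcat:Distribution",
    "downloadURL": "http://www.research.gov/common/webapi/awardapisearch-v1.htm",
    "mediaType": "application/json",
}
problems = check_distribution(nsf_distribution)
```

Running this over every `distribution` entry in a data.json should surface records that need a manual look. It cannot tell whether the URL really is an API.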
This is the nonstandard data.json format that NREL uses:
{
  "title": "PVWatts",
  "description": "PVWatts calculates the energy production and cost savings of grid-connected photovoltaic (PV) energy systems. This service estimates the performance of hypothetical residential and small commercial PV installations.",
  "keyword": "solar, photovoltaic, PV, calculator, payback",
  "modified": "2013-05-01",
  "publisher": "National Renewable Energy Laboratory",
  "person": "NREL Open Data",
  "mbox": "data@nrel.gov",
  "identifier": "392cf124-b37e-4d3c-b04c-29a2fd3cfabd",
  "accessLevel": "public",
  "webService": "http://developer.nrel.gov/api/pvwatts/v4.json",
  "landingPage": "http://developer.nrel.gov/doc/pvwatts",
  "references": "http://developer.nrel.gov/doc/api/pvwatts/v4",
  "spatial": "United States"
},
Agencies with no API records in their data.json:
This is how the MCC represents API records in its data.json:
{
  "publisher": "Millennium Challenge Corporation",
  "license": "data.mcc.gov terms of use - http://data.mcc.gov/termsofuse.html",
  "description": "MCC Open Data API",
  "language": "English",
  "title": "Open Data API",
  "issued": "5/1/13 0:00",
  "format": "json",
  "landingPage": "http://data.mcc.gov/developers",
  "modified": "5/1/13 0:00",
  "systemOfRecords": "Open Data Catalog",
  "person": "Open Data Initiative",
  "theme": "Open Data API",
  "keyword": "data, api",
  "identifier": "data-api",
  "dataDictionary": "http://data.mcc.gov/performance/projects.html",
  "accessLevel": "Public",
  "mbox": "opendata@mcc.gov",
  "webService": "http://data.mcc.gov/api"
},
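Both the NREL and MCC samples put the endpoint in a top-level `webService` field, while newer records nest URLs under `distribution`. A crawler that wants to count both would need to look in both places. Below is a rough sketch of that; `extract_api_urls` is a hypothetical helper, and the `format == "API"` test on distributions is my assumption about how newer records mark APIs.

```python
def extract_api_urls(record):
    """Collect candidate API URLs from a single data.json record."""
    urls = []

    # Older-style records (as in the NREL and MCC samples) use a top-level
    # webService field.
    web_service = record.get("webService")
    if web_service:
        urls.append(web_service)

    # Newer-style records nest URLs under distribution entries.
    # Assumption: an API distribution has format "API" and an accessURL.
    for dist in record.get("distribution", []):
        if dist.get("format", "").lower() == "api" and dist.get("accessURL"):
            urls.append(dist["accessURL"])

    return urls


# Trimmed copies of the two records shown above.
nrel_record = {
    "title": "PVWatts",
    "webService": "http://developer.nrel.gov/api/pvwatts/v4.json",
}
mcc_record = {
    "title": "Open Data API",
    "webService": "http://data.mcc.gov/api",
}

nrel_urls = extract_api_urls(nrel_record)
mcc_urls = extract_api_urls(mcc_record)
```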
The following agencies are giving me errors when I attempt to crawl them:
Timeouts
Malformed JSON
404 Not Found
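For anyone who wants to reproduce these buckets, here is a minimal sketch of how a crawl result could be sorted into them using only the Python standard library. This is not the crawler I used. The function names, the extra "HTTP nnn" and "Other error" buckets, and the 30-second timeout are all assumptions made for illustration.

```python
import json
import socket
import urllib.error
import urllib.request


def classify_body(status, body):
    """Classify an already-fetched response into one of the buckets, or 'OK'."""
    if status == 404:
        return "404 Not Found"
    if status != 200:
        return "HTTP %d" % status
    try:
        json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return "Malformed JSON"
    return "OK"


def crawl(url, timeout=30):
    """Fetch url and return a bucket name. The timeout value is arbitrary."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return classify_body(resp.status, body)
    except urllib.error.HTTPError as err:
        # HTTPError must be caught before URLError, which is its parent class.
        return classify_body(err.code, "")
    except (socket.timeout, TimeoutError):
        # Raised when the read itself times out.
        return "Timeout"
    except urllib.error.URLError as err:
        # A connection timeout arrives wrapped in URLError.
        if isinstance(err.reason, (socket.timeout, TimeoutError)):
            return "Timeout"
        return "Other error: %s" % err.reason
```

Calling `crawl(url)` for each data.json URL and grouping by the returned string gives the three lists above, plus anything that falls outside them.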
Here are the counts I have so far:
http://www.usaid.gov/data.json
+ 5 APIs found
https://www.sba.gov/sites/default/files/data.json
+ 3 APIs found
http://www.consumerfinance.gov/data.json
+ 2 APIs found
http://www.archives.gov/data.json
+ 3 APIs found
http://www.dot.gov/data.json
+ 12 APIs found
http://www.ssa.gov/data.json
+ 3 APIs found
http://www.opm.gov/data.json
+ 5 APIs found
http://www.gsa.gov/data.json
+ 9 APIs found
http://treasury.gov/data.json
+ 5 APIs found
http://usda.gov/data.json
+ 62 APIs found
http://www.hud.gov/data.json
+ 34 APIs found
http://www.epa.gov/data.json
+ 430 APIs found
http://www.commerce.gov/data.json
+ 5 APIs found
http://www.energy.gov/data.json
+ 19 APIs found
http://www.dol.gov/data.json
+ 181 APIs found
http://healthdata.gov/data.json
+ 3 APIs found
http://www.va.gov/data.json
+ 2 APIs found
http://www.nasa.gov/data.json
+ 4 APIs found
Excessively high counts indicate a problem in the data. For instance, here are some of the "APIs" returned from the EPA's data.json:
+ file:////r6gis1/share1/Facilities/FRP/R6_FRP_20110505.gdb
+ file:////r6gis1/share1/Facilities/NPL/2012/NPL_2012.gdb/NPLpy09182012
+ http://www.epa.gov/superfund/sites/npl/status.htm
+ file:////r6gis1/share1/Admin/OK/OK_EmergencyManagementDirectors.gdb/OEMDirectors_Table
+ file:////r6gis1/share1/admin/OK/OK_Corporation_Commision_Districts.gdb/OK_Corporation_Commision_Districts
+ file:////r6gis1/share1/Admin/OK/OK_OHS_Regional_Response_System.gdb
+ file:////r6gis1/share1/Admin/Parcels/Parcel_status_r6.shp
+ https://www.edg.epa.gov/data/public/R6/Brownfields/R6Brownfields_kmz.zip
+ https://edg.epa.gov/data/public/R6/Brownfields/R6Brownfields_062612.zip
+ http://edg.epa.gov/data/public/R6/Brownfields/R6Brownfields.zip
+ file:////r6gis1/share1/Census/Census2010/PL94171_2010.gdb/R6_PL2010_Block
+ file:////r6gis1/share1/Census/Census2010/PL94171_2010_SumByOtherGeog.gdb/R6_PL2010_BlockGroup
+ file:////r6gis1/share1/admin/NM/NM_OCD_Divisions.gdb/NM_Oil_Conservation_Divisions
+ file:////r6gis1/share1/Air/Nonattainment/Nonattainment_July2012.shp
+ file:////r6gis1/share1/Air/Nonattainment/Nonattainment_2012.gdb/Nonattainment_2012
+ file:////r6gis1/share1/Air/Nonattainment/Nonattainment_2013.gdb/Nonattainment_July2012
+ https://edg.epa.gov/data/public/r6/NPL/NPLpt05122014.zip
+ https://edg.epa.gov/data/public/r6/npl/NPLpy05122014.zip
+ https://edg.epa.gov/data/Public/R6/Aquifers/R6SSAquifers.zip
+ file:////r6gis1/share1/Facilities/TRI/tri2011/r6tri2011.gdb/R6TRI2011
+ file:////r6gis1/share1/Border/Mexico/Mexico_Data.gdb/rail
+ https://edg.epa.gov/data/Public/R6/TEAP/TEAP_Data.zip
+ file:////r6gis1/share1/Facilities/RCRA/RCRA_Sites_Mar_2012.lyr
There is likely similar fragmentation in some other data.json examples from other agencies.
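A simple filter would remove most of the noise shown in the EPA list before counting. The sketch below drops anything that is not `http`/`https` and anything whose path ends in a static-file extension. `looks_like_api` and the `STATIC_EXTENSIONS` list are hypothetical and based only on the entries seen above, so they will certainly miss other cases.

```python
from urllib.parse import urlparse

# Hypothetical heuristic: extensions that indicate a static download or page
# rather than an API. Drawn only from the EPA entries listed above.
STATIC_EXTENSIONS = (".zip", ".shp", ".gdb", ".lyr", ".htm", ".html")


def looks_like_api(url):
    """Return False for URLs that are clearly local files or static downloads."""
    parsed = urlparse(url)
    # file:// paths on an internal share are never reachable as a public API.
    if parsed.scheme not in ("http", "https"):
        return False
    return not parsed.path.lower().endswith(STATIC_EXTENSIONS)


# A few of the EPA entries from the list above.
epa_sample = [
    "file:////r6gis1/share1/Facilities/FRP/R6_FRP_20110505.gdb",
    "http://www.epa.gov/superfund/sites/npl/status.htm",
    "https://edg.epa.gov/data/Public/R6/TEAP/TEAP_Data.zip",
]
kept = [u for u in epa_sample if looks_like_api(u)]
```

None of the three sample entries survive the filter, while an endpoint such as the NREL `webService` URL would. A passing URL is only a candidate and still needs to be checked by hand.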
More data files here:
Data.gov API Results