Closed nightsh closed 4 years ago
Tech feedback after attempting to load the first output datajson into CKAN:
### ERROR #1: 'temporal':'' is not valid under any of the given schemas;
### ERROR #2: 'theme':[] is not valid under any of the given schemas;
### ERROR #3: 'programCode':[] is not valid under any of the given schemas;
### ERROR #4: 'bureauCode':[] is not valid under any of the given schemas;
### ERROR #5: 'contactPoint':'fn' is a required property;
### ERROR #6: 'contactPoint':'hasEmail' is a required property;
### ERROR #7: 'keyword':[] is not valid under any of the given schemas;
### ERROR #8: 'modified':'03/04/2020' is not valid under any of the given schemas;
### ERROR #9: 'distribution': <the entire file contents here> is not valid under any of the given schemas;
### ERROR #10: 'identifier':'' is too short.
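Error #8 suggests the schema rejects the US-style date, and the working file further down uses `2020-03-05` for `modified`. A minimal sketch of the conversion follows, assuming the scraper emits `MM/DD/YYYY` (the thread does not say whether `03/04/2020` is month-first or day-first); `to_iso_date` is a hypothetical helper name, not part of the existing code.

```python
from datetime import datetime


def to_iso_date(value: str) -> str:
    """Convert an MM/DD/YYYY string (assumed input format) to YYYY-MM-DD."""
    # strptime raises ValueError on anything that does not match the pattern,
    # so malformed dates surface early instead of reaching the harvester.
    return datetime.strptime(value, "%m/%d/%Y").date().isoformat()


# Usage: to_iso_date("03/04/2020") returns "2020-03-04"
```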
Updated errors:
### ERROR #1: 'bureauCode' is a required property;
### ERROR #2: 'programCode' is a required property;
### ERROR #3: 'keyword' is a required property;
### ERROR #4: 'contactPoint' is a required property;
### ERROR #5: 'distribution' <same as above>
Last one standing:
### ERROR #1: 'distribution' <datajson contents> not valid under any of the given schemas.
Datajson file that works:
{
  "@context": "https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld",
  "@id": "datopian_data_json_ocr",
  "@type": "dcat:Catalog",
  "conformsTo": "https://project-open-data.cio.gov/v1.1/schema",
  "describedBy": "https://project-open-data.cio.gov/v1.1/schema/catalog.json",
  "dataset": [
    {
      "@type": "dcat:Dataset",
      "title": "2011-12 Discipline Estimations for Nation and by State",
      "description": "This set of Excel files contains data for all disciplinary actions, presented for the nation and by state. \n For the nation and each state, there are three spreadsheets: students with and without disabilities, students with disabilities, and students without disabilities.",
      "modified": "2020-03-05",
      "publisher": {
        "@type": "org:Organization",
        "name": "Office for Civil Rights"
      },
      "landingPage": "https://ocrdata.ed.gov/StateNationalEstimations/Estimations_2011_12",
      "identifier": "2011-12-discipline-estimations-for-nation-and-by-state",
      "accessLevel": "public",
      "license": "https://creativecommons.org/publicdomain/zero/1.0/",
      "spatial": "United States",
      "bureauCode": ["018:50"],
      "programCode": ["018:000"],
      "keyword": ["ocr"],
      "contactPoint": {
        "@type": "vcard:Contact",
        "hasEmail": "mailto:info@viderum.com",
        "fn": "Office for Civil Rights"
      },
      "distribution": [
        {
          "@type": "dcat:Distribution",
          "title": "National total",
          "description": "National total",
          "downloadURL": "../downloads/projections/2011-12/States/National Totals.xls",
          "format": "xls",
          "mediaType": "application/zip"
        }
      ]
    }
  ]
}
To clear the first round of errors, all blank fields are now left out of the output file.
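One way to drop blank fields before writing the file is sketched below. It assumes "blank" means an empty string, `None`, an empty list, or an empty object, which matches the values in the first round of errors (`''` and `[]`); `drop_blank_fields` is a hypothetical helper, not the project's actual code.

```python
def drop_blank_fields(value):
    """Recursively remove entries whose value is '', None, [] or {}."""
    blank = ("", None, [], {})
    if isinstance(value, dict):
        # Clean children first so a dict that becomes empty is removed too.
        cleaned = {key: drop_blank_fields(item) for key, item in value.items()}
        return {key: item for key, item in cleaned.items() if item not in blank}
    if isinstance(value, list):
        cleaned = [drop_blank_fields(item) for item in value]
        return [item for item in cleaned if item not in blank]
    return value
```

Cleaning children before parents means an object such as `{"fn": ""}` disappears entirely rather than being emitted as `{}`, which would presumably fail validation in the same way.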
To clear the updated errors, every field required for harvesting is now included in each dataset.
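A quick pre-flight check for the required fields could look like the sketch below. The field lists are taken only from the error messages in this thread (`bureauCode`, `programCode`, `keyword`, `contactPoint`, plus `fn` and `hasEmail` inside `contactPoint`), so they are an assumption and may not cover everything the schema requires; `missing_required` is a hypothetical helper name.

```python
# Fields named in the validation errors above; not a complete schema listing.
REQUIRED_DATASET_FIELDS = ("bureauCode", "programCode", "keyword", "contactPoint")
REQUIRED_CONTACT_FIELDS = ("fn", "hasEmail")


def missing_required(dataset: dict) -> list:
    """Return the names of required fields that are absent or empty."""
    missing = [name for name in REQUIRED_DATASET_FIELDS if not dataset.get(name)]
    contact = dataset.get("contactPoint") or {}
    if contact:
        # Only inspect the nested fields when contactPoint itself is present.
        missing += [
            f"contactPoint.{name}"
            for name in REQUIRED_CONTACT_FIELDS
            if not contact.get(name)
        ]
    return missing
```

An empty list back from `missing_required` only means these particular fields are present; it does not replace validation against the full schema.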
Tested on OCR (P1 Parser) and OCTAE (P2 Parser).
The datajson file fails schema validation; reopening.
Upon running the scrapers, the collected data is dumped into an output directory structure. We need to traverse it, and for each scraping source (i.e. child directory) create a data.json file to incorporate all the dumped items.
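The traversal described above can be sketched as follows. Several things here are assumptions rather than facts from the issue: that each dumped item is a `.json` file holding one dataset object directly inside its source directory, that the catalog `@id` can be derived from the directory name, and the helper names `build_catalog` and `write_catalogs`. The catalog header values are copied from the working file in this thread.

```python
import json
from pathlib import Path

# Catalog-level fields copied from the working data.json shown in this thread.
CATALOG_HEADER = {
    "@context": "https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld",
    "@type": "dcat:Catalog",
    "conformsTo": "https://project-open-data.cio.gov/v1.1/schema",
    "describedBy": "https://project-open-data.cio.gov/v1.1/schema/catalog.json",
}


def build_catalog(source_dir: Path) -> dict:
    """Collect every dumped item in one source directory into a catalog."""
    datasets = []
    for item in sorted(source_dir.glob("*.json")):
        if item.name == "data.json":
            continue  # skip a catalog left over from a previous run
        datasets.append(json.loads(item.read_text(encoding="utf-8")))
    # Using the directory name as "@id" is a placeholder assumption.
    return {**CATALOG_HEADER, "@id": source_dir.name, "dataset": datasets}


def write_catalogs(output_dir: Path) -> list:
    """Write one data.json per child directory and return the written paths."""
    written = []
    for source_dir in sorted(p for p in output_dir.iterdir() if p.is_dir()):
        target = source_dir / "data.json"
        catalog = build_catalog(source_dir)
        target.write_text(json.dumps(catalog, indent=2), encoding="utf-8")
        written.append(target)
    return written
```

Calling `write_catalogs(Path("output"))` would then produce `output/<source>/data.json` for each scraping source; the `output` path is only an example.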
Tasks:
Acceptance criteria: