CivicActions / edscrapers

US Department of Education Data Scraping Kit; see https://us-ed-scraping.ckan.io/dataset
GNU Affero General Public License v3.0

Transform the collected JSON datasets to CKAN harvester data.json format #25

Closed nightsh closed 4 years ago

nightsh commented 4 years ago

Blocked by #15 #19 #20 #21 #22 #23 #24

When the scrapers run, they dump the collected data into an output directory structure. We need to traverse it and, for each scraping source (i.e. each child directory), create a data.json file that incorporates all the dumped items.
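The traversal described above could be sketched roughly as follows. This is only an illustration under assumptions: the function names `build_catalog` and `write_catalogs` are hypothetical, and the sketch assumes each dumped item is a `.json` file that can be placed in the catalog's `dataset` list as-is, which may not match the real dump layout.

```python
import json
from pathlib import Path


def build_catalog(source_dir: Path) -> dict:
    """Collect every dumped JSON item under source_dir into one catalog dict."""
    datasets = []
    for item_path in sorted(source_dir.rglob("*.json")):
        if item_path.name == "data.json":
            continue  # skip a catalog written by a previous run
        with item_path.open(encoding="utf-8") as fh:
            datasets.append(json.load(fh))
    return {
        "@type": "dcat:Catalog",
        "conformsTo": "https://project-open-data.cio.gov/v1.1/schema",
        "dataset": datasets,
    }


def write_catalogs(output_root: Path) -> list:
    """Write one data.json per scraping source (child directory of output_root)."""
    written = []
    for source_dir in sorted(p for p in output_root.iterdir() if p.is_dir()):
        target = source_dir / "data.json"
        target.write_text(json.dumps(build_catalog(source_dir), indent=2), encoding="utf-8")
        written.append(target)
    return written
```

Calling `write_catalogs(Path("output"))` would leave a `data.json` next to the dumped items of every source; the directory name `output` is a placeholder.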

Tasks:

Acceptance criteria:

nightsh commented 4 years ago

Tech feedback after attempting to load the first output datajson into CKAN:


### ERROR #1: 'temporal':'' is not valid under any of the given schemas;

### ERROR #2: 'theme':[] is not valid under any of the given schemas;

### ERROR #3: 'programCode':[] is not valid under any of the given schemas;

### ERROR #4: 'bureauCode':[] is not valid under any of the given schemas;

### ERROR #5: 'contactPoint':'fn' is a required property;

### ERROR #6: 'contactPoint':'hasEmail' is a required property;

### ERROR #7: 'keyword':[] is not valid under any of the given schemas;

### ERROR #8: 'modified':'03/04/2020' is not valid under any of the given schemas;

### ERROR #9: 'distribution': <the entire file contents here> is not valid under any of the given schemas;

### ERROR #10: 'identifier':'' is too short.

nightsh commented 4 years ago

Updated errors:

### ERROR #1: 'bureauCode' is a required property;
### ERROR #2: 'programCode' is a required property;
### ERROR #3: 'keyword' is a required property;
### ERROR #4: 'contactPoint' is a required property;
### ERROR #5: 'distribution' <same as above>

nightsh commented 4 years ago

Last one standing:

### ERROR #1: 'distribution' <datajson contents> not valid under any of the given schemas.

nightsh commented 4 years ago

Datajson file that works:

{
  "@context": "https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld",
  "@id": "datopian_data_json_ocr",
  "@type": "dcat:Catalog",
  "conformsTo": "https://project-open-data.cio.gov/v1.1/schema",
  "describedBy": "https://project-open-data.cio.gov/v1.1/schema/catalog.json",
  "dataset": [
    {
      "@type": "dcat:Dataset",
      "title": "2011-12 Discipline Estimations for Nation and by State",
      "description": "This set of Excel files contains data for all disciplinary actions, presented for the nation and by state. \n                    For the nation and each state, there are three spreadsheets: students with and without disabilities, students with disabilities, and students without disabilities.",
      "modified": "2020-03-05",
      "publisher": {
        "@type": "org:Organization",
        "name": "Office for Civil Rights"
      },
      "landingPage": "https://ocrdata.ed.gov/StateNationalEstimations/Estimations_2011_12",
      "identifier": "2011-12-discipline-estimations-for-nation-and-by-state",
      "accessLevel": "public",
      "license": "https://creativecommons.org/publicdomain/zero/1.0/",
      "spatial": "United States",
      "bureauCode": ["018:50"],
      "programCode": ["018:000"],
      "keyword": ["ocr"],
      "contactPoint": {
        "@type": "vcard:Contact",
        "hasEmail": "mailto:info@viderum.com",
        "fn": "Office for Civil Rights"
      },
      "distribution": [
        {
          "@type": "dcat:Distribution",
          "title": "National total",
          "description": "National total",
          "downloadURL": "../downloads/projections/2011-12/States/National Totals.xls",
          "format": "xls",
          "mediaType": "application/zip"
        }
      ]
    }
  ]
}

higorspinto commented 4 years ago

Tech feedback after attempting to load the first output datajson into CKAN:

### ERROR #1: 'temporal':'' is not valid under any of the given schemas;

### ERROR #2: 'theme':[] is not valid under any of the given schemas;

### ERROR #3: 'programCode':[] is not valid under any of the given schemas;

### ERROR #4: 'bureauCode':[] is not valid under any of the given schemas;

### ERROR #5: 'contactPoint':'fn' is a required property;

### ERROR #6: 'contactPoint':'hasEmail' is a required property;

### ERROR #7: 'keyword':[] is not valid under any of the given schemas;

### ERROR #8: 'modified':'03/04/2020' is not valid under any of the given schemas;

### ERROR #9: 'distribution': <the entire file contents here> is not valid under any of the given schemas;

### ERROR #10: 'identifier':'' is too short.
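
ERROR #8 rejects `03/04/2020`, while the file that loads uses `2020-03-05`, so the `modified` value apparently needs to be an ISO 8601 date. A minimal sketch of that conversion follows. The helper name `to_iso_date` is hypothetical, and it assumes the scraped value is in US month/day/year order, which the thread does not confirm.

```python
from datetime import datetime


def to_iso_date(value: str, source_format: str = "%m/%d/%Y") -> str:
    """Convert a scraped date string to YYYY-MM-DD.

    Assumption: the input uses month/day/year order; pass another
    source_format if a scraper emits something different.
    """
    return datetime.strptime(value, source_format).date().isoformat()
```

A value that does not match `source_format` raises `ValueError`, which makes malformed dates visible before the harvester sees them.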

Leaving all the blank fields out of the output file.
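
Dropping blank fields could look something like the sketch below. The function name `drop_blank_fields` is hypothetical, and treating empty strings, empty lists, empty objects and `None` as "blank" is an assumption based on the values shown in the errors above (`''` and `[]`).

```python
BLANK_VALUES = ("", [], {}, None)


def drop_blank_fields(value):
    """Recursively remove blank entries from nested dicts and lists."""
    if isinstance(value, dict):
        cleaned = {key: drop_blank_fields(item) for key, item in value.items()}
        return {key: item for key, item in cleaned.items() if item not in BLANK_VALUES}
    if isinstance(value, list):
        cleaned = [drop_blank_fields(item) for item in value]
        return [item for item in cleaned if item not in BLANK_VALUES]
    return value  # scalars (strings, numbers, booleans) are returned unchanged
```

Note that this also removes an object such as `contactPoint` once all of its members are blank, so required fields still have to be filled in separately.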

higorspinto commented 4 years ago

Updated errors:

### ERROR #1: 'bureauCode' is a required property;
### ERROR #2: 'programCode' is a required property;
### ERROR #3: 'keyword' is a required property;
### ERROR #4: 'contactPoint' is a required property;
### ERROR #5: 'distribution' <same as above>

Including all the fields required for harvesting a dataset.
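
One way to guarantee the required fields is to merge per-source defaults into every dataset, as in this sketch. The names `REQUIRED_DEFAULTS` and `with_required_fields` are hypothetical. The default values are copied from the OCR sample file posted in this thread and are an assumption for any other source.

```python
import copy

# Assumed defaults, taken from the OCR sample data.json in this thread.
REQUIRED_DEFAULTS = {
    "bureauCode": ["018:50"],
    "programCode": ["018:000"],
    "keyword": ["ocr"],
    "contactPoint": {
        "@type": "vcard:Contact",
        "fn": "Office for Civil Rights",
        "hasEmail": "mailto:info@viderum.com",
    },
}


def with_required_fields(dataset: dict, defaults: dict = REQUIRED_DEFAULTS) -> dict:
    """Return a copy of dataset with missing or empty required fields filled in."""
    merged = copy.deepcopy(dataset)
    for key, default in defaults.items():
        if not merged.get(key):  # absent, empty string, empty list or empty dict
            merged[key] = copy.deepcopy(default)
    return merged
```

Values the scraper did collect are kept; only absent or empty fields fall back to the defaults.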

higorspinto commented 4 years ago

Datajson file that works:

{
  "@context": "https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld",
  "@id": "datopian_data_json_ocr",
  "@type": "dcat:Catalog",
  "conformsTo": "https://project-open-data.cio.gov/v1.1/schema",
  "describedBy": "https://project-open-data.cio.gov/v1.1/schema/catalog.json",
  "dataset": [
    {
      "@type": "dcat:Dataset",
      "title": "2011-12 Discipline Estimations for Nation and by State",
      "description": "This set of Excel files contains data for all disciplinary actions, presented for the nation and by state. \n                    For the nation and each state, there are three spreadsheets: students with and without disabilities, students with disabilities, and students without disabilities.",
      "modified": "2020-03-05",
      "publisher": {
        "@type": "org:Organization",
        "name": "Office for Civil Rights"
      },
      "landingPage": "https://ocrdata.ed.gov/StateNationalEstimations/Estimations_2011_12",
      "identifier": "2011-12-discipline-estimations-for-nation-and-by-state",
      "accessLevel": "public",
      "license": "https://creativecommons.org/publicdomain/zero/1.0/",
      "spatial": "United States",
      "bureauCode": ["018:50"],
      "programCode": ["018:000"],
      "keyword": ["ocr"],
      "contactPoint": {
        "@type": "vcard:Contact",
        "hasEmail": "mailto:info@viderum.com",
        "fn": "Office for Civil Rights"
      },
      "distribution": [
        {
          "@type": "dcat:Distribution",
          "title": "National total",
          "description": "National total",
          "downloadURL": "../downloads/projections/2011-12/States/National Totals.xls",
          "format": "xls",
          "mediaType": "application/zip"
        }
      ]
    }
  ]
}

Tested on OCR (P1 Parser) and OCTAE (P2 Parser).

nightsh commented 4 years ago

The datajson file fails schema validation; reopening.