Transform the collected JSON datasets to CKAN harvester data.json format

nightsh commented 4 years ago

Blocked by #15 #19 #20 #21 #22 #23 #24

When the scrapers run, the collected data is dumped into an output directory structure. We need to traverse it and, for each scraping source (i.e. child directory), create a data.json file that incorporates all the dumped items.

Tasks:

- [x] Create a transformer Python module
- [x] Iterate through the list of output files in each directory
- [x] Generate a data.json file according to a shared structure filled with data from the files
- [x] Test by loading the data in the CKAN harvester, locally or remotely
- [x] Avoid duplicates
- [x] Avoid printable versions of the resources
- [x] Test (update if needed) when the parsers are done
  - [x] P1 Parser (OCR)
  - [x] P2 Parser (OCTAE)
  - [x] P3 Parser (OPE)
  - [x] P4 Parser (OELA)
  - [x] P5 Parser (OSERS)
  - [x] P6 Parser (OPEPD)
  - [x] P7 Parser (OESE)
  - [x] NCES
  - [x] ed.gov
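
The traversal and de-duplication tasks above could be sketched roughly as follows. This is a minimal sketch, not the project's actual transformer: the function name `build_catalogs`, the layout of one child directory per source holding one JSON file per dataset, and the use of `identifier` as the de-duplication key are all assumptions.

```python
import json
from pathlib import Path


def build_catalogs(output_dir):
    """Write one data.json per child directory of output_dir (assumed layout)."""
    written = {}
    source_dirs = sorted(p for p in Path(output_dir).iterdir() if p.is_dir())
    for source_dir in source_dirs:
        datasets = {}
        for path in sorted(source_dir.glob("*.json")):
            if path.name == "data.json":
                continue  # skip a catalog left over from a previous run
            item = json.loads(path.read_text(encoding="utf-8"))
            # Assumption: 'identifier' is the de-duplication key;
            # the first occurrence wins, later copies are ignored.
            key = item.get("identifier")
            if key and key not in datasets:
                datasets[key] = item
        catalog = {
            "@type": "dcat:Catalog",
            "conformsTo": "https://project-open-data.cio.gov/v1.1/schema",
            "dataset": list(datasets.values()),
        }
        target = source_dir / "data.json"
        target.write_text(json.dumps(catalog, indent=2), encoding="utf-8")
        written[source_dir.name] = len(datasets)
    return written
```

The returned mapping (source name to number of datasets written) is only there to make the result easy to inspect or log.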

Acceptance criteria:

- [x] A data.json type file is generated for each scraping dump discovered
- [x] There are no duplicate datasets per scraping source (even if they are collected multiple times, we only want to add them once)
- [x] The transformer generates a transform log recording the number of input files and output datasets/resources
- [x] The data.json file is loadable by the CKAN harvester
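
For the transform-log criterion, something along these lines might be enough. It is a sketch under assumptions: the logger name `transformer` and the function `summarize_transform` are hypothetical, and "resources" is taken to mean the entries under each dataset's `distribution`.

```python
import logging

logger = logging.getLogger("transformer")  # hypothetical logger name


def summarize_transform(source, input_files, catalog):
    """Count input files and output datasets/resources for one source."""
    datasets = catalog.get("dataset", [])
    summary = {
        "source": source,
        "input_files": len(input_files),
        "datasets": len(datasets),
        # Assumption: a "resource" is one entry in a dataset's distribution.
        "resources": sum(len(d.get("distribution", [])) for d in datasets),
    }
    logger.info(
        "%s: %d input files -> %d datasets, %d resources",
        summary["source"],
        summary["input_files"],
        summary["datasets"],
        summary["resources"],
    )
    return summary
```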

nightsh commented 4 years ago

Tech feedback after attempting to load the first output datajson into CKAN:


### ERROR #1: 'temporal':'' is not valid under any of the given schemas;

### ERROR #2: 'theme':[] is not valid under any of the given schemas;

### ERROR #3: 'programCode':[] is not valid under any of the given schemas;

### ERROR #4: 'bureauCode':[] is not valid under any of the given schemas;

### ERROR #5: 'contactPoint':'fn' is a required property;

### ERROR #6: 'contactPoint':'hasEmail' is a required property;

### ERROR #7: 'keyword':[] is not valid under any of the given schemas;

### ERROR #8: 'modified':'03/04/2020' is not valid under any of the given schemas;

### ERROR #9: 'distribution': <the entire file contents here> is not valid under any of the given schemas;

### ERROR #10: 'identifier':'' is too short.

nightsh commented 4 years ago

Updated errors:

### ERROR #1: 'bureauCode' is a required property;
### ERROR #2: 'programCode' is a required property;
### ERROR #3: 'keyword' is a required property;
### ERROR #4: 'contactPoint' is a required property;
### ERROR #5: 'distribution' <same as above>

nightsh commented 4 years ago

Last one standing:

### ERROR #1: 'distribution' <datajson contents> not valid under any of the given schemas.

nightsh commented 4 years ago

Datajson file that works:

{
  "@context": "https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld",
  "@id": "datopian_data_json_ocr",
  "@type": "dcat:Catalog",
  "conformsTo": "https://project-open-data.cio.gov/v1.1/schema",
  "describedBy": "https://project-open-data.cio.gov/v1.1/schema/catalog.json",
  "dataset": [
    {
      "@type": "dcat:Dataset",
      "title": "2011-12 Discipline Estimations for Nation and by State",
      "description": "This set of Excel files contains data for all disciplinary actions, presented for the nation and by state. \n                    For the nation and each state, there are three spreadsheets: students with and without disabilities, students with disabilities, and students without disabilities.",
      "modified": "2020-03-05",
      "publisher": {
        "@type": "org:Organization",
        "name": "Office for Civil Rights"
      },
      "landingPage": "https://ocrdata.ed.gov/StateNationalEstimations/Estimations_2011_12",
      "identifier": "2011-12-discipline-estimations-for-nation-and-by-state",
      "accessLevel": "public",
      "license": "https://creativecommons.org/publicdomain/zero/1.0/",
      "spatial": "United States",
      "bureauCode": ["018:50"],
      "programCode": ["018:000"],
      "keyword": ["ocr"],
      "contactPoint": {
        "@type": "vcard:Contact",
        "hasEmail": "mailto:info@viderum.com",
        "fn": "Office for Civil Rights"
      },
      "distribution": [
        {
          "@type": "dcat:Distribution",
          "title": "National total",
          "description": "National total",
          "downloadURL": "../downloads/projections/2011-12/States/National Totals.xls",
          "format": "xls",
          "mediaType": "application/zip"
        }
      ]
    }
  ]
}

higorspinto commented 4 years ago

Tech feedback after attempting to load the first output datajson into CKAN:

### ERROR #1: 'temporal':'' is not valid under any of the given schemas;

### ERROR #2: 'theme':[] is not valid under any of the given schemas;

### ERROR #3: 'programCode':[] is not valid under any of the given schemas;

### ERROR #4: 'bureauCode':[] is not valid under any of the given schemas;

### ERROR #5: 'contactPoint':'fn' is a required property;

### ERROR #6: 'contactPoint':'hasEmail' is a required property;

### ERROR #7: 'keyword':[] is not valid under any of the given schemas;

### ERROR #8: 'modified':'03/04/2020' is not valid under any of the given schemas;

### ERROR #9: 'distribution': <the entire file contents here> is not valid under any of the given schemas;

### ERROR #10: 'identifier':'' is too short.

Leaving all blank fields out of the output file.
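
A possible shape for that clean-up step, as a sketch only: `drop_blank_fields` and `to_iso_date` are hypothetical helper names, and the date helper assumes the scraped `modified` values are in US `MM/DD/YYYY` order, which the error output does not confirm.

```python
from datetime import datetime


def drop_blank_fields(value):
    """Recursively remove dict keys whose value is '', None, [] or {}."""
    if isinstance(value, dict):
        cleaned = {k: drop_blank_fields(v) for k, v in value.items()}
        return {k: v for k, v in cleaned.items() if v not in ("", None, [], {})}
    if isinstance(value, list):
        # List items are cleaned but kept in place, even if they end up empty.
        return [drop_blank_fields(v) for v in value]
    return value


def to_iso_date(text):
    """Convert 'MM/DD/YYYY' (assumed input order) to ISO 8601 'YYYY-MM-DD'."""
    return datetime.strptime(text, "%m/%d/%Y").date().isoformat()
```

Dropping a blank required field only trades one error for another ("is a required property"), so required fields still need real or default values.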

higorspinto commented 4 years ago

Updated errors:

### ERROR #1: 'bureauCode' is a required property;
### ERROR #2: 'programCode' is a required property;
### ERROR #3: 'keyword' is a required property;
### ERROR #4: 'contactPoint' is a required property;
### ERROR #5: 'distribution' <same as above>

Including all fields required for harvesting:

Dataset:

- bureauCode: using a default value - 018:40
- programCode: using a default value - 018:000
- keyword: using the department's acronym (ocr, octae, etc.)
- contactPoint: using a default value - email: info@viderum.com, fn: department's name

Resource:

- mediaType: a description based on the file format
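
Filling in those defaults might look like the sketch below. The helper name `fill_required` and the small `MEDIA_TYPES` table are assumptions for illustration. The bureauCode default follows the value stated in this comment (018:40); note that the sample file in this thread uses 018:50.

```python
DATASET_DEFAULTS = {
    "bureauCode": ["018:40"],   # value stated in this comment
    "programCode": ["018:000"],
}

# Illustrative subset only; extend with whatever formats the scrapers emit.
MEDIA_TYPES = {
    "xls": "application/vnd.ms-excel",
    "csv": "text/csv",
    "pdf": "application/pdf",
    "zip": "application/zip",
}


def fill_required(dataset, acronym, office_name):
    """Return a copy of dataset with the required fields defaulted."""
    filled = dict(dataset)
    for field, default in DATASET_DEFAULTS.items():
        filled.setdefault(field, list(default))
    filled.setdefault("keyword", [acronym])
    filled.setdefault("contactPoint", {
        "@type": "vcard:Contact",
        "hasEmail": "mailto:info@viderum.com",
        "fn": office_name,
    })
    resources = []
    for resource in dataset.get("distribution", []):
        resource = dict(resource)
        media_type = MEDIA_TYPES.get(str(resource.get("format", "")).lower())
        if media_type and "mediaType" not in resource:
            resource["mediaType"] = media_type
        resources.append(resource)
    if resources:
        filled["distribution"] = resources
    return filled
```

Existing values are never overwritten, so anything a parser did manage to scrape takes precedence over the defaults.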

higorspinto commented 4 years ago

Datajson file that works:

{
  "@context": "https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld",
  "@id": "datopian_data_json_ocr",
  "@type": "dcat:Catalog",
  "conformsTo": "https://project-open-data.cio.gov/v1.1/schema",
  "describedBy": "https://project-open-data.cio.gov/v1.1/schema/catalog.json",
  "dataset": [
    {
      "@type": "dcat:Dataset",
      "title": "2011-12 Discipline Estimations for Nation and by State",
      "description": "This set of Excel files contains data for all disciplinary actions, presented for the nation and by state. \n                    For the nation and each state, there are three spreadsheets: students with and without disabilities, students with disabilities, and students without disabilities.",
      "modified": "2020-03-05",
      "publisher": {
        "@type": "org:Organization",
        "name": "Office for Civil Rights"
      },
      "landingPage": "https://ocrdata.ed.gov/StateNationalEstimations/Estimations_2011_12",
      "identifier": "2011-12-discipline-estimations-for-nation-and-by-state",
      "accessLevel": "public",
      "license": "https://creativecommons.org/publicdomain/zero/1.0/",
      "spatial": "United States",
      "bureauCode": ["018:50"],
      "programCode": ["018:000"],
      "keyword": ["ocr"],
      "contactPoint": {
        "@type": "vcard:Contact",
        "hasEmail": "mailto:info@viderum.com",
        "fn": "Office for Civil Rights"
      },
      "distribution": [
        {
          "@type": "dcat:Distribution",
          "title": "National total",
          "description": "National total",
          "downloadURL": "../downloads/projections/2011-12/States/National Totals.xls",
          "format": "xls",
          "mediaType": "application/zip"
        }
      ]
    }
  ]
}

Tested on OCR (P1 Parser) and OCTAE (P2 Parser).

nightsh commented 4 years ago

Datajson file fails schema validation, reopening.
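
The comment does not say which check failed. As a cheap pre-flight step before handing a file to the harvester, a required-field check along these lines could catch the earlier class of errors. It is a sketch, not a replacement for real schema validation: `find_missing_fields` is a hypothetical name, and the field list reflects my understanding of the required dataset fields in the Project Open Data v1.1 schema.

```python
# Assumed list of required dataset fields (Project Open Data v1.1).
REQUIRED_DATASET_FIELDS = (
    "title", "description", "keyword", "modified", "publisher",
    "contactPoint", "identifier", "accessLevel", "bureauCode", "programCode",
)


def find_missing_fields(catalog):
    """Return (dataset index, field) pairs for missing or blank required fields."""
    problems = []
    for index, dataset in enumerate(catalog.get("dataset", [])):
        for field in REQUIRED_DATASET_FIELDS:
            if dataset.get(field) in ("", None, [], {}):
                problems.append((index, field))
    return problems
```

An empty result only means the required keys are present and non-blank; format rules (dates, URLs, code patterns) are not checked here.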

CivicActions / edscrapers

Transform the collected JSON datasets to CKAN harvester data.json format #25