As we are now crawling and setting up the pipeline for the scraping sources, we want to start developing the data extraction tools as soon as possible. This task covers only a single, rather isolated aspect of the entire pipeline: extracting the data from an HTML structure.
Tasks:
[x] using a list of pages as raw HTML input, write a script that identifies whether a page contains resources and, if it does, extracts all the metadata needed to create a dataset from it
[x] test the parser script and output the data in spreadsheet format for all pages in the list
[x] integrate the script into the pipeline after the validation above
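For the spreadsheet output in the second task, the stdlib `csv` module is probably enough. A minimal sketch — the column names below are hypothetical placeholders until the desired-properties list for the datasets is fixed:

```python
import csv

def write_spreadsheet(rows, path):
    """Write a list of per-page metadata dicts to a CSV file.

    The field names are illustrative placeholders; the real columns
    should follow the desired-properties list for the datasets.
    """
    fields = ["page_url", "url", "title"]  # assumed columns
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
```

`extrasaction="ignore"` lets the extractor attach extra metadata per page without breaking the export while the column set is still being decided.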
Acceptance criteria:
[x] script accepts raw HTML as input
[x] correctly identifies pages that do or do not contain resources
[x] produces a Python structure with the properties in the list above
[x] returns None for no resources and a Python dictionary with the result otherwise
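The acceptance criteria above amount to a small contract: raw HTML in, `None` or a dictionary out. A minimal stdlib sketch of that contract, assuming (hypothetically) that resources are marked with a `resource` CSS class on a `<div>` — the real selectors depend on the crawled sites:

```python
from html.parser import HTMLParser

class _ResourceParser(HTMLParser):
    """Collects links found inside <div class="resource"> containers.

    The 'resource' class and the attributes read below are assumptions
    for illustration; the real markers depend on the scraped sources.
    """

    def __init__(self):
        super().__init__()
        self._divs = []    # True for each open <div> that is a resource container
        self._inside = 0   # how many resource containers we are nested in
        self.links = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "div":
            is_resource = "resource" in (a.get("class") or "").split()
            self._divs.append(is_resource)
            if is_resource:
                self._inside += 1
        elif self._inside and tag == "a" and a.get("href"):
            self.links.append({"url": a["href"], "title": a.get("title", "")})

    def handle_endtag(self, tag):
        if tag == "div" and self._divs and self._divs.pop():
            self._inside -= 1

def extract_resources(html):
    """Return None when the page has no resources, otherwise a dict
    with the extracted metadata, matching the acceptance criteria."""
    parser = _ResourceParser()
    parser.feed(html)
    return {"resources": parser.links} if parser.links else None
```

Keeping a per-`<div>` stack (rather than a single flag) makes the nesting bookkeeping correct even when resource containers hold other `<div>`s, which is the main pitfall of ad hoc HTML scanning.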