CivicActions / edscrapers

US Department of Education Data Scraping Kit; see https://us-ed-scraping.ckan.io/dataset
GNU Affero General Public License v3.0

P1 Parsing: Office Civil Rights #15

Closed: nightsh closed this issue 4 years ago

nightsh commented 4 years ago

Subtask of #3

Now that we are crawling the scraping sources and setting up the pipeline for them, we want to start developing the data extraction tools as soon as possible. This task covers a single, fairly isolated part of the pipeline: extracting data from an HTML structure.
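To make the scope concrete, here is a minimal sketch of what extracting data from an HTML structure could look like. It is not the project's actual parser. It uses only the Python standard library, and the names `ResourceLinkParser`, `extract_dataset`, `DATA_EXTENSIONS`, and the output keys (`title`, `source_url`, `resources`) are assumptions made for illustration. A page with no data-file links returns `None`, which is one possible way to treat a "false positive" page.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical set of file extensions treated as data resources.
DATA_EXTENSIONS = {".csv", ".xls", ".xlsx", ".zip", ".pdf"}


class ResourceLinkParser(HTMLParser):
    """Collect the page title and any links that look like data files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.resources = []
        self._in_title = False
        self._current = None  # resource link currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if not href:
                return
            # Resolve relative links against the page URL.
            url = urljoin(self.base_url, href)
            ext = os.path.splitext(urlparse(url).path)[1].lower()
            if ext in DATA_EXTENSIONS:
                self._current = {"url": url, "name": "", "format": ext.lstrip(".")}

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._current is not None:
            # Link text becomes the resource name.
            self._current["name"] += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._current is not None:
            self._current["name"] = self._current["name"].strip()
            self.resources.append(self._current)
            self._current = None


def extract_dataset(html, base_url):
    """Return a dataset dict for the page, or None if it has no data links."""
    parser = ResourceLinkParser(base_url)
    parser.feed(html)
    parser.close()
    if not parser.resources:
        return None  # treated here as a page bearing no dataset information
    return {
        "title": parser.title.strip(),
        "source_url": base_url,
        "resources": parser.resources,
    }
```

Calling `extract_dataset(page_html, page_url)` on a crawled page would yield either one dataset record or `None`. The real extension list and output fields would need to follow the desired dataset properties agreed for this task.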

Desired properties for the resulting datasets:

List of pages to get information from:

List of "false positives" that should bear no dataset information:

Tasks:

Acceptance criteria:

osahon-okungbowa commented 4 years ago

Tasks are clear

nightsh commented 4 years ago

New page parsing rules: