As we are now crawling and setting up the pipeline for the scraping sources, we want to start developing the data extraction tools as soon as possible. This task covers only a single, rather isolated aspect of the entire pipeline: extracting the data from an HTML structure.
Tasks:
[x] using a list of pages as raw HTML input, write a script that identifies whether a page contains resources and, if it does, extracts all the metadata needed to create a dataset from it
[x] test the parser script and output the data in spreadsheet format for all pages in the list
[x] integrate the script into the pipeline after the validation above
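For the spreadsheet output in the second task, the stdlib `csv` module is probably enough. A minimal sketch — the column names below are hypothetical placeholders until the desired-properties list for the datasets is fixed:

```python
import csv

def write_spreadsheet(rows, path):
    """Write a list of per-page metadata dicts to a CSV file.

    The field names are illustrative placeholders; the real columns
    should follow the desired-properties list for the datasets.
    """
    fields = ["page_url", "url", "title"]  # assumed columns
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
```

`extrasaction="ignore"` lets the extractor attach extra metadata per page without breaking the export while the column set is still being decided.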
Acceptance criteria:
[x] script accepts raw HTML as input
[x] correctly identifies pages that do or do not contain resources
[x] produces a Python structure with the properties in the list above
[x] returns None for no resources and a Python dictionary with the result otherwise
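The acceptance criteria above amount to a small contract: raw HTML in, `None` or a dictionary out. A minimal stdlib sketch of that contract, assuming (hypothetically) that resources are marked with a `resource` CSS class on a `<div>` — the real selectors depend on the crawled sites:

```python
from html.parser import HTMLParser

class _ResourceParser(HTMLParser):
    """Collects links found inside <div class="resource"> containers.

    The 'resource' class and the attributes read below are assumptions
    for illustration; the real markers depend on the scraped sources.
    """

    def __init__(self):
        super().__init__()
        self._divs = []    # True for each open <div> that is a resource container
        self._inside = 0   # how many resource containers we are nested in
        self.links = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "div":
            is_resource = "resource" in (a.get("class") or "").split()
            self._divs.append(is_resource)
            if is_resource:
                self._inside += 1
        elif self._inside and tag == "a" and a.get("href"):
            self.links.append({"url": a["href"], "title": a.get("title", "")})

    def handle_endtag(self, tag):
        if tag == "div" and self._divs and self._divs.pop():
            self._inside -= 1

def extract_resources(html):
    """Return None when the page has no resources, otherwise a dict
    with the extracted metadata, matching the acceptance criteria."""
    parser = _ResourceParser()
    parser.feed(html)
    return {"resources": parser.links} if parser.links else None
```

Keeping a per-`<div>` stack (rather than a single flag) makes the nesting bookkeeping correct even when resource containers hold other `<div>`s, which is the main pitfall of ad hoc HTML scanning.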