Closed by palewire 2 years ago
I think we should switch to a strict mapping of raw headers to clean headers, rather than infer anything.
Here's what I get when I print out the set of raw headers parsed from each PDF.
2022-07-03 10:38:42,641 - warn.scrapers.ca - Opening .warn-scraper/cache/ca/warn-report-for-7-1-2021-to-06-30-2022.pdf for PDF parsing
{'County', 'Effective \nDate', 'Address', 'No. Of Employees', 'Notice\nDate', 'Company', 'Layoff/Closure Type', 'Received \nDate'}
2022-07-03 10:38:50,220 - warn.scrapers.ca - Opening .warn-scraper/cache/ca/WARN-Report-for-7-1-2020-to-06-30-2021.pdf for PDF parsing
{'Effective\nDate', 'County', 'Notice\nDate', 'Layoff/Closure Type', 'Company', 'No. Of \nEmployees', 'City', 'Received\nDate'}
2022-07-03 10:39:13,612 - warn.scrapers.ca - Opening .warn-scraper/cache/ca/WARN-Report-for-7-1-2019-to-6-30-2020.pdf for PDF parsing
{'Layoff/Closure', 'Notice Date', 'County', 'Received Date', 'Employees', 'Effective Date', 'Company', 'City'}
2022-07-03 10:39:59,089 - warn.scrapers.ca - Opening .warn-scraper/cache/ca/WARN-Report-for-7-1-2018-to-06-30-2019.pdf for PDF parsing
{'Layoff/Closure', 'Notice Date', 'County', 'Effective \nDate', 'Company', 'No. Of \nEmployees', 'City', 'Received \nDate'}
2022-07-03 10:40:07,821 - warn.scrapers.ca - Opening .warn-scraper/cache/ca/WARN-Report-for-7-1-2017-to-06-30-2018.pdf for PDF parsing
{'Layoff/Closure', 'Notice Date', 'County', 'Effective \nDate', 'Company', 'No. Of \nEmployees', 'City', 'Received \nDate'}
2022-07-03 10:40:20,329 - warn.scrapers.ca - Opening .warn-scraper/cache/ca/WARN-Report-for-7-1-2016-to-06-30-2017.pdf for PDF parsing
{'Layoff/Closure', 'Notice Date', 'Effective \nDate', 'Company', 'No. Of \nEmployees', 'City', 'Received \nDate'}
2022-07-03 10:40:32,300 - warn.scrapers.ca - Opening .warn-scraper/cache/ca/WARN-Report-for-7-1-2015-to-06-30-2016.pdf for PDF parsing
{'Layoff/Closure', 'Notice Date', 'Effective \nDate', 'Company', 'No. Of \nEmployees', 'City', 'Received \nDate'}
2022-07-03 10:40:45,686 - warn.scrapers.ca - Opening .warn-scraper/cache/ca/WARNReportfor7-1-2014to06-30-2015.pdf for PDF parsing
{'Layoff/Closure', 'Notice Date', 'Effective \nDate', 'Company', 'No. Of \nEmployees', 'City', 'Received \nDate'}
It looks like our parser depends on the header order always being the same. Guess what: they changed it in the newest file.
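For illustration, a strict mapping could look something like the sketch below. The raw keys are copied from the header dumps above. The clean names and the `HEADER_MAP`, `clean_headers` and `row_to_dict` identifiers are hypothetical and are not taken from the scraper's actual code. Anything not listed raises an error instead of being guessed at.

```python
# Raw headers exactly as they appear in the dumps above, mapped to clean names.
# The clean names on the right are assumptions made for this sketch.
HEADER_MAP = {
    "Company": "company",
    "County": "county",
    "City": "city",
    "Address": "address",
    "Notice Date": "notice_date",
    "Notice\nDate": "notice_date",
    "Effective Date": "effective_date",
    "Effective\nDate": "effective_date",
    "Effective \nDate": "effective_date",
    "Received Date": "received_date",
    "Received\nDate": "received_date",
    "Received \nDate": "received_date",
    "Employees": "employees",
    "No. Of Employees": "employees",
    "No. Of \nEmployees": "employees",
    "Layoff/Closure": "layoff_or_closure",
    "Layoff/Closure Type": "layoff_or_closure",
}


def clean_headers(raw_headers):
    """Translate raw headers to clean names, failing loudly on unknown ones."""
    unknown = [h for h in raw_headers if h not in HEADER_MAP]
    if unknown:
        raise ValueError(f"Unmapped raw headers: {unknown!r}")
    return [HEADER_MAP[h] for h in raw_headers]


def row_to_dict(raw_headers, row):
    """Key each cell by its clean header so column position stops mattering."""
    return dict(zip(clean_headers(raw_headers), row))
```

With rows keyed by clean header name, a reordered file still parses the same way, and a file with a brand new header fails right away rather than silently shifting values into the wrong column.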