This needs some re-engineering: three formats are in play, and the overhead of maintaining parsers for all of them serves no practical purpose. The current scraper parses an older PDF and also scrapes data from inside paragraph tags, a format that was used through 2023.
The new format is marked up by JavaScript in the browser, which makes parsing a bit more annoying, but it uses tables and rows. This shouldn't be difficult to finish.
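Since the new format uses tables and rows, the parsing step could look roughly like the sketch below. It assumes the table markup is already present in the HTML string handed to the parser (whether fetched directly or captured after the JavaScript has run), and it uses only the standard library. The names `TableRowParser` and `parse_table_rows` are hypothetical, not part of the existing scraper.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_table_rows(html):
    """Return a list of rows, each a list of cell strings (hypothetical helper)."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

If the live page only builds the table after its scripts run, the HTML would need to come from a rendered copy of the page; that part is not covered here.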
Archived copies of the PDF and HTML file should be zipped up and stored in the appropriate BLN bucket, and the already-parsed CSVs from a previously successful run should be downloaded on each scraper run instead of re-parsing the old formats.
The scraper can drop the PDF dependency after that.
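A minimal sketch of the archiving idea, under stated assumptions: it only covers the local side (zipping the legacy PDF and HTML files, and reading back a CSV that was parsed in an earlier run). The upload to and download from the BLN bucket are deliberately left out, because the storage API is not described here. The function names and file names are hypothetical.

```python
import csv
import zipfile
from pathlib import Path


def zip_legacy_files(source_paths, zip_path):
    """Bundle the legacy source files into one zip archive (hypothetical helper)."""
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for src in map(Path, source_paths):
            # Store each file under its bare name, not its full local path.
            zf.write(src, arcname=src.name)
    return zip_path


def load_cached_rows(csv_path):
    """Read rows from a CSV produced by an earlier successful run."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))
```

With the old formats served from cached CSVs like this, the scraper would only need to parse the current table-based format on each run.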