arderyp / scotuswebcites

United States Supreme Court web citation discovery, presentation, and validation
GNU General Public License v3.0

some PDFs are poorly formatted and fail to scrape with pdfminer #15

Closed arderyp closed 8 years ago

arderyp commented 8 years ago

example: FERC v. Electric Power Supply Assn. [REVISION]

Delete it from the database and run discovery to see the error.

arderyp commented 8 years ago

Testing with slate. It gets through the broken PDFs, but seems to affect the URL string parsing/gluing. Tests are failing.
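For readers unfamiliar with the gluing problem: text extractors often break a long URL across two lines, so the pieces have to be rejoined before the URL can be validated. The following is only a rough sketch of that idea, not the rules this project uses. The function name `glue_wrapped_urls`, the `WRAPPED_URL` pattern, and the heuristic itself (a line that ends mid-URL on `/`, `-`, `_`, `?`, `=`, or `&` is continued by the first token of the next line) are all assumptions made for illustration.

```python
import re

# Assumed heuristic: a URL that runs to the end of the line and stops on a
# character that commonly precedes a wrap point is treated as "split".
WRAPPED_URL = re.compile(r"https?://\S*[/\-_?=&]$")


def glue_wrapped_urls(text):
    """Rejoin URLs that were split across a line break during extraction."""
    glued = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if glued and WRAPPED_URL.search(glued[-1]):
            # The previous line ended mid-URL: attach this line's first token.
            head, _, rest = line.partition(" ")
            glued[-1] += head
            if rest:
                # Whatever follows the token stays on its own line.
                glued.append(rest)
        else:
            glued.append(line)
    return "\n".join(glued)
```

A different extractor may wrap lines at different points, which is why rules of this kind tend to need re-tuning when the extraction library changes.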

The slate method seems to pull text differently, so I will have to readjust the splitting/gluing rules. That said, this method gets through all PDFs without failing, and it picked up 2 extra citations to boot. There is a new Unicode warning to look into, though:

scotuswebcites.io/citations/models.py:64: UnicodeWarning: Unicode unequal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
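In Python 2, this warning is raised when a byte string containing non-ASCII bytes is compared with a unicode string: the implicit ASCII decode fails and the comparison silently evaluates as unequal. A minimal sketch of one way to avoid it, assuming the two values compared at `models.py:64` are a mix of bytes and text (the thread does not confirm this). The helpers `to_text` and `same_text` are hypothetical names, and UTF-8 is an assumed encoding.

```python
def to_text(value, encoding="utf-8"):
    """Return value as a text string, decoding byte strings first."""
    if isinstance(value, bytes):
        # errors="replace" keeps malformed bytes from raising mid-scrape;
        # they become U+FFFD instead.
        return value.decode(encoding, errors="replace")
    return value


def same_text(left, right):
    """Compare two values as text, whatever mix of bytes and text they are."""
    return to_text(left) == to_text(right)
```

Normalizing both sides before comparing makes the result depend on the content rather than on which string type the extractor happened to return.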

arderyp commented 8 years ago

tests are passing now. More on slate here: https://github.com/timClicks/slate