camelot-dev / camelot

A Python library to extract tabular data from PDFs
https://camelot-py.readthedocs.io
MIT License

Double quotation mark causes issue - used to denote inches as measurement (common occurrence) #274

Open optimizasean opened 2 years ago

optimizasean commented 2 years ago

Running the latest Python and Camelot.

PDFs with tables that contain " in a cell misbehave and fail.

Example: 200", where " is used to denote inches. I assume ' (for feet) would cause the same problem.

If a table contains several such values, like 100", 200", or 300", it also misbehaves and produces strange output.

With the lattice parser and a well-formed table, this should not be a problem: the cells are separated by the table's ruling lines, so 100" should simply end up in its own cell -> | 100" |. Is there any reason for it to misbehave with lattice?

anakin87 commented 2 years ago

Please attach an example PDF, if you can.

optimizasean commented 2 years ago

Unfortunately, I am not able to share the PDF I discovered this on. I also don't have the tools to create a PDF on my machine. I generally avoid working with PDFs at all costs.
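
For anyone who wants to try reproducing this without the original file, a small test PDF can be generated with only the Python standard library. The sketch below is an assumption-laden illustration, not the reporter's document: `build_pdf` and `escape_pdf_text` are hypothetical helper names, the page layout and sample values are made up, and it is untested whether Camelot detects the resulting grid as a table or whether it triggers the behaviour described above.

```python
def escape_pdf_text(text):
    """Escape the characters that are special inside a PDF literal string.

    Only backslash and parentheses need escaping; a double quote does not.
    """
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def build_pdf(rows, col_width=120, row_height=24):
    """Return the bytes of a one-page PDF with a ruled grid of text cells."""
    left, top = 72, 720  # top-left corner of the grid, in PDF points
    ops = ["0.5 w"]  # thin line width for the cell borders
    for r, row in enumerate(rows):
        y = top - (r + 1) * row_height
        for c, text in enumerate(row):
            x = left + c * col_width
            # Stroke the cell border, then draw the cell text inside it.
            ops.append(f"{x} {y} {col_width} {row_height} re S")
            ops.append(
                f"BT /F1 12 Tf {x + 6} {y + 7} Td ({escape_pdf_text(text)}) Tj ET"
            )
    stream = "\n".join(ops).encode("latin-1")

    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object, for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # mandatory free entry for object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


# Made-up sample data with inch and foot marks in the cells.
SAMPLE_ROWS = [["Part", "Length"], ["A", '100"'], ["B", '200"'], ["C", "3'"]]
```

Writing `build_pdf(SAMPLE_ROWS)` to a file such as `inches.pdf` (opened in binary mode) gives a file that could then be passed to `camelot.read_pdf("inches.pdf", flavor="lattice")` to check whether the quoted cells come back intact.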