camelot-dev / camelot

A Python library to extract tabular data from PDFs
https://camelot-py.readthedocs.io
MIT License

Decimal Points Missed Entirely by New Version #290

Open sometimesabird opened 2 years ago

sometimesabird commented 2 years ago

Describe the bug

Decimal points are sometimes not read by the program despite being in the PDF text, i.e., it reads "1.5" as "15". This is a new bug, as version 0.7.2 was working correctly. The current version (0.10.1) and 0.7.3 both fail.

Steps to reproduce the bug

  1. Download this table.
  2. Install camelot: pip install camelot-py==0.10.1
  3. Run

camelot -p all -o "test-NEW.csv" -f csv -split -strip ".\n" lattice -scale 100 -copy v "369746.pdf"

  4. Install older version: pip install camelot-py==0.7.2
  5. Run

camelot -p all -o "test-OLD.csv" -f csv -split -strip ".\n" lattice -scale 100 -copy v "369746.pdf"

  6. Open test-NEW-page-1-table-1.csv and test-OLD-page-1-table-1.csv.

Expected behavior

Line 2 of test-OLD.csv is what we should have:

"SMM camera at Donetsk Filtration Station (15km N of Donetsk)","0.5-1.5km","S","Recorded","2","Projectile","From E to W","N/K","31-Jan","19:35"

Line 2 of test-NEW.csv is misread:

"SMM camera at Donetsk Filtration Station (15km N of Donetsk)","05-15km","S","Recorded","1","Projectile","From E to W","N/K","31-Jan","19:34"

(Note that the same thing happens to the column name located in the first row -- "No." is converted into "No".)

PDF

Environment

ramSeraph commented 2 years ago

This sounds more like a bug that has been fixed. You seem to be passing '.' in the strip argument, and that is supposed to strip the decimal points.

sometimesabird commented 2 years ago

Oh, so it's meant to strip any of the characters, not this particular sequence?

ramSeraph commented 2 years ago

It looks that way from the code:

https://github.com/camelot-dev/camelot/blob/644bbe7c6d57b95aefa2f049a9aacdbc061cc04f/camelot/utils.py#L503-L505

It used to only strip at the end of the line, but now it strips from the whole line.

It was changed in this commit.

But even in the previous version, it looks like it always matched any of the characters rather than the exact sequence.
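To make the three readings of the strip argument concrete, here is a minimal sketch in plain Python. It is not camelot's actual code: the function names are hypothetical and only illustrate the behaviours discussed above, applied to the value from the report with a trailing ".\n" added for illustration.

```python
# Hypothetical helpers, NOT taken from camelot; they only illustrate the
# three possible meanings of a strip argument such as ".\n".

def strip_ends(text: str, chars: str) -> str:
    """Remove any of `chars`, but only from the two ends of the string."""
    return text.strip(chars)

def strip_everywhere(text: str, chars: str) -> str:
    """Remove every occurrence of any character in `chars`."""
    return text.translate(str.maketrans("", "", chars))

def remove_sequence(text: str, seq: str) -> str:
    """Remove only the exact sequence `seq` wherever it occurs."""
    return text.replace(seq, "")

cell = "0.5-1.5km.\n"   # illustrative cell text ending in ".\n"
chars = ".\n"

print(repr(strip_ends(cell, chars)))        # '0.5-1.5km'
print(repr(strip_everywhere(cell, chars)))  # '05-15km'
print(repr(remove_sequence(cell, chars)))   # '0.5-1.5km'
```

Only the "everywhere" reading turns "0.5-1.5km" into "05-15km", which matches the output reported for 0.10.1. If the goal is just to drop newlines, passing only "\n" as the strip argument should leave the decimal points alone.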