camelot-dev / camelot

A Python library to extract tabular data from PDFs
https://camelot-py.readthedocs.io
MIT License
2.96k stars 466 forks

Only the first page is ever extracted from this type of PDF generated by Oracle #378

Open elyparker opened 1 year ago

elyparker commented 1 year ago

Describe the bug

I have to extract tables from a PDF file generated by Oracle Reports 10gR2.

Extraction works for the first page only. I cannot get at the tables on the second and later pages.

Example file: oda_2.pdf

Steps to reproduce the bug

In the folder that contains oda_2.pdf, I launch a Python shell:

import camelot
handler = camelot.handlers.PDFHandler("oda_2.pdf", 'all')
page_list = handler._get_pages("oda_2.pdf")

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\OSGeo4W\apps\Python39\lib\site-packages\camelot\handlers.py", line 90, in _get_pages
    page_numbers.append({"start": int(r), "end": int(r)})
ValueError: invalid literal for int() with base 10: 'oda_2.pdf'
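The traceback suggests that `_get_pages` received the file name where it expected a page specification such as "2", "1,3" or "all", and then tried to convert it with `int()`. The following is a minimal sketch of that kind of parsing, written only to illustrate the failure. `parse_page_spec` and `num_pages` are hypothetical names, and this is not Camelot's actual implementation; only the `int(r)` expression is taken from the traceback.

```python
def parse_page_spec(pages, num_pages):
    """Simplified, hypothetical stand-in for page-spec parsing (not Camelot's code)."""
    page_numbers = []
    if pages == "all":
        page_numbers.append({"start": 1, "end": num_pages})
        return page_numbers
    for r in pages.split(","):
        if "-" in r:
            # Range such as "3-5" or "3-end".
            a, b = r.split("-")
            end = num_pages if b == "end" else int(b)
            page_numbers.append({"start": int(a), "end": end})
        else:
            # Same expression as the failing line in the traceback.
            page_numbers.append({"start": int(r), "end": int(r)})
    return page_numbers


print(parse_page_spec("2", 5))
print(parse_page_spec("1,3-end", 5))

try:
    # A file name is not a page specification, so int() rejects it.
    parse_page_spec("oda_2.pdf", 5)
except ValueError as exc:
    print("ValueError:", exc)
```

Under this reading, the `ValueError` comes from the argument passed to `_get_pages`, not from the PDF itself.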

tables = camelot.read_pdf("oda_2.pdf", flavor='stream', page=2)
print(tables[0].parsing_report)

{'accuracy': 93.8, 'whitespace': 70.16, 'order': 1, 'page': 1}

print(tables[0].df)

Output: the table from the first page, not the second.
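One possible explanation for the `'page': 1` in the report: Camelot's documented keyword is `pages` (a string such as "2", "1,3" or "all"), while the call above passes `page=2`. If `read_pdf` collects unknown keywords instead of rejecting them, `page=2` would be ignored and the default first page would be used. The sketch below uses a hypothetical stand-in, `read_pdf_like`, to show that effect; it is an assumption about the cause, not a copy of the library's code.

```python
def read_pdf_like(filepath, pages="1", flavor="lattice", **kwargs):
    """Hypothetical stand-in: `pages` defaults to "1", unknown keywords land in kwargs."""
    return {"filepath": filepath, "pages": pages, "extra": kwargs}


# Misspelled keyword: `pages` keeps its default and `page` is silently collected.
wrong = read_pdf_like("oda_2.pdf", flavor="stream", page=2)
print(wrong["pages"], wrong["extra"])

# Intended keyword, passed as a string.
right = read_pdf_like("oda_2.pdf", flavor="stream", pages="2")
print(right["pages"], right["extra"])
```

If this is what happens here, calling `camelot.read_pdf("oda_2.pdf", flavor='stream', pages='2')`, or `pages='all'` for every page, may return the tables from the later pages.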

camelot, version 0.11.0
Python 3.9.5 (tags/v3.9.5:0a7dcbd, May 3 2021, 17:27:52) [MSC v.1928 64 bit (AMD64)] on win32

I suspect the PDF file is malformed, but Acrobat Reader and other programs open it without problems. Why can't Camelot?

Is there a way to make this work? Thanks.