atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

Camelot - infinite waiting for extracting tables #333

Closed nielsvth closed 4 years ago

nielsvth commented 5 years ago

Dear all,

I used the following code to extract tables from a pdf file using the camelot module in python:

    import camelot

    tables = camelot.read_pdf('report.pdf', pages='1-15')
    print(tables)

However, the program never returns anything. It seems to run indefinitely, so I end up killing the process. It does the same for different types of PDFs, some of which worked in the past with the same code.

I don't get any errors when importing the module, and the pip install was successful. Has anyone experienced a similar problem, or does anyone have any clues on how to solve this issue?

Kind regards,
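One way to avoid killing the process by hand is to run the extraction in a child interpreter with a time limit. The following is a minimal sketch under stated assumptions, not something from the camelot docs: `run_with_timeout` is a hypothetical helper built on the standard library's `subprocess.run`, and the 120-second limit is arbitrary.

```python
import subprocess
import sys


def run_with_timeout(code, timeout_s):
    """Run `code` in a fresh Python interpreter.

    Returns the child's stdout as text, or None if it did not finish
    within `timeout_s` seconds (subprocess.run kills the child then).
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout


# The snippet from the report above, passed to the child as source text.
EXTRACT = """
import camelot
tables = camelot.read_pdf('report.pdf', pages='1-15')
print(tables)
"""

if __name__ == "__main__":
    # 120 seconds is an arbitrary, hypothetical limit.
    out = run_with_timeout(EXTRACT, timeout_s=120)
    print("timed out" if out is None else out)
```

Changing `pages='1-15'` to a single page per run should also show whether one particular page triggers the hang.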

vinayak-mehta commented 5 years ago

Can you link the PDF so that I can reproduce the issue?

LanesG commented 5 years ago

I have the same problem. Even with the test pdf.

LanesG commented 5 years ago

Version 0.7.1 works, version 0.7.2 doesn't.
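Since the behaviour differs between releases, it helps to state the exact installed version when reporting. A small sketch, assuming Python 3.8+ (for `importlib.metadata`) and that the package was installed under its PyPI name `camelot-py`; `installed_version` is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints None if camelot-py is not installed in this environment.
    print(installed_version("camelot-py"))
```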

vinayak-mehta commented 5 years ago

Sorry for the delay in replies, please give me some time to look into this.

nielsvth commented 5 years ago

> Can you link the PDF so that I can reproduce the issue?

The PDF I tried to process is a commercial (Fitch Solutions) market report, so I don't have the authority to share it with third parties. I'm sorry, but I will try to help with other information if you need it.

LanesG commented 5 years ago

@nielsvth Does the test pdf I linked work for you?

LanesG commented 5 years ago

Works for me again with 0.7.3.

vinayak-mehta commented 5 years ago

@nielsvth Can you check if the new release fixed your issue?

nielsvth commented 5 years ago

Hi,

For me, the problem is still not solved with the new version 0.7.3:

    Traceback (most recent call last):
      File "C:\Users\Niels\Documents\Web Scraping with Python\scrapingEnv\project.py", line 3, in <module>
        tables = camelot.read_pdf(r'C:\Users\Niels\Documents\Python PDF Scraper\final\Files\report1.pdf',pages='1-15')
      File "C:\Users\Niels\Documents\Web Scraping with Python\scrapingEnv\lib\site-packages\camelot\io.py", line 117, in read_pdf
        **kwargs
      File "C:\Users\Niels\Documents\Web Scraping with Python\scrapingEnv\lib\site-packages\camelot\handlers.py", line 165, in parse
        self._save_page(self.filepath, p, tempdir)
      File "C:\Users\Niels\Documents\Web Scraping with Python\scrapingEnv\lib\site-packages\camelot\handlers.py", line 115, in _save_page
        outfile.write(f)
      File "C:\Users\Niels\Documents\Web Scraping with Python\scrapingEnv\lib\site-packages\PyPDF2\pdf.py", line 482, in write
        self._sweepIndirectReferences(externalReferenceMap, self._root)
      File "C:\Users\Niels\Documents\Web Scraping with Python\scrapingEnv\lib\site-packages\PyPDF2\pdf.py", line 571, in _sweepIndirectReferences
        self._sweepIndirectReferences(externMap, realdata)
      File "C:\Users\Niels\Documents\Web Scraping with Python\scrapingEnv\lib\site-packages\PyPDF2\pdf.py", line 547, in _sweepIndirectReferences
        value = self._sweepIndirectReferences(externMap, value)
      File "C:\Users\Niels\Documents\Web Scraping with Python\scrapingEnv\lib\site-packages\PyPDF2\pdf.py", line 571, in _sweepIndirectReferences
        self._sweepIndirectReferences(externMap, realdata)
      File "C:\Users\Niels\Documents\Web Scraping with Python\scrapingEnv\lib\site-packages\PyPDF2\pdf.py", line 547, in _sweepIndirectReferences
        value = self._sweepIndirectReferences(externMap, value)
      File "C:\Users\Niels\Documents\Web Scraping with Python\scrapingEnv\lib\site-packages\PyPDF2\pdf.py", line 556, in _sweepIndirectReferences
        value = self._sweepIndirectReferences(externMap, data[i])
      File "C:\Users\Niels\Documents\Web Scraping with Python\scrapingEnv\lib\site-packages\PyPDF2\pdf.py", line 571, in _sweepIndirectReferences
        self._sweepIndirectReferences(externMap, realdata)
      File "C:\Users\Niels\Documents\Web Scraping with Python\scrapingEnv\lib\site-packages\PyPDF2\pdf.py", line 547, in _sweepIndirectReferences
        value = self._sweepIndirectReferences(externMap, value)
      File "C:\Users\Niels\Documents\Web Scraping with Python\scrapingEnv\lib\site-packages\PyPDF2\pdf.py", line 547, in _sweepIndirectReferences
        value = self._sweepIndirectReferences(externMap, value)
      File "C:\Users\Niels\Documents\Web Scraping with Python\scrapingEnv\lib\site-packages\PyPDF2\pdf.py", line 547, in _sweepIndirectReferences
        value = self._sweepIndirectReferences(externMap, value)
      File "C:\Users\Niels\Documents\Web Scraping with Python\scrapingEnv\lib\site-packages\PyPDF2\pdf.py", line 577, in _sweepIndirectReferences
        newobj = data.pdf.getObject(data)
      File "C:\Users\Niels\Documents\Web Scraping with Python\scrapingEnv\lib\site-packages\PyPDF2\pdf.py", line 1626, in getObject
        retval = self._decryptObject(retval, key)
      File "C:\Users\Niels\Documents\Web Scraping with Python\scrapingEnv\lib\site-packages\PyPDF2\pdf.py", line 1640, in _decryptObject
        obj._data = utils.RC4_encrypt(key, obj._data)
      File "C:\Users\Niels\Documents\Web Scraping with Python\scrapingEnv\lib\site-packages\PyPDF2\utils.py", line 177, in RC4_encrypt
        i = (i + 1) % 256
    KeyboardInterrupt

Hope this helps!

Kind regards,

Niels
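The traceback above was interrupted inside PyPDF2's `RC4_encrypt`, reached from `_decryptObject`. That suggests the PDF is encrypted and that, at the moment of the interrupt, time was going into a byte-by-byte cipher loop written in pure Python. This is a reading of the traceback, not a confirmed diagnosis. For illustration only, here is a textbook RC4 sketch (not PyPDF2's actual code) showing the kind of per-byte loop involved; the line `i = (i + 1) % 256` from the traceback corresponds to the first step of the second loop.

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """Textbook RC4: encrypts or decrypts `data` with `key` (same operation)."""
    # Key-scheduling: permute the 256-entry state table using the key.
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]

    # Keystream generation: one iteration of Python-level work per data byte.
    i = j = 0
    out = bytearray()
    for byte in data:
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)


if __name__ == "__main__":
    # Well-known RC4 test vector.
    print(rc4(b"Key", b"Plaintext").hex())  # bbf316e8d940af0ad3
```

Because every byte of every decrypted object goes through this loop, a large encrypted PDF can plausibly take a very long time when each page is written out separately.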

vinayak-mehta commented 4 years ago

Closing this as I can't reproduce it without the PDF.
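For anyone hitting the same hang with a PDF that cannot be shared, a quick local check is whether the file declares encryption at all. The sketch below is a rough heuristic using only the standard library: it looks for the `/Encrypt` key, which an encrypted PDF carries in its trailer dictionary. `looks_encrypted` is a hypothetical helper name, and a match elsewhere in the file could produce a false positive.

```python
import os


def looks_encrypted(path):
    """Heuristic: True if the raw bytes of the file contain the /Encrypt key."""
    with open(path, "rb") as f:
        return b"/Encrypt" in f.read()


if __name__ == "__main__":
    pdf_path = "report.pdf"  # placeholder file name taken from the report
    if os.path.exists(pdf_path):
        print(looks_encrypted(pdf_path))
```

If this returns True, saving a decrypted copy with a dedicated PDF tool before passing the file to camelot may be worth trying.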