camelot-dev / camelot

A Python library to extract tabular data from PDFs
https://camelot-py.readthedocs.io
MIT License

NotImplementedError: File format not supported #505

Open kushalmraut opened 3 months ago

kushalmraut commented 3 months ago

For some PDF links I get this error: NotImplementedError: File format not supported

<ipython-input-11-0615a449639b> in <cell line: 1>()
----> 1 tables = camelot.read_pdf('https://downloads.usda.library.cornell.edu/usda-esmis/files/cj82k728n/2v23wr658/v405t658m/wwcb2921.pdf', pages='1', flavor='lattice')

2 frames
/usr/local/lib/python3.10/dist-packages/camelot/utils.py in download_url(url)
     87         content_type = obj.info().get_content_type()
     88         if content_type != "application/pdf":
---> 89             raise NotImplementedError("File format not supported")
     90         f.write(obj.read())
     91     filepath = os.path.join(os.path.dirname(f.name), filename)

NotImplementedError: File format not supported
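The traceback shows `download_url` rejecting the response because its `Content-Type` header is not exactly `application/pdf`. What header this particular server sends is not stated in the thread. One possible workaround, sketched below under that assumption, is to download the file yourself, check the PDF magic bytes instead of the header, and pass the local path to `camelot.read_pdf`. The helpers `looks_like_pdf` and `download_pdf` are hypothetical names for illustration and are not part of camelot.

```python
import os
import tempfile
import urllib.request


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes begin with the PDF magic header."""
    # Tolerate leading whitespace before the "%PDF-" marker.
    return data[:1024].lstrip().startswith(b"%PDF-")


def download_pdf(url: str) -> str:
    """Download `url` to a temporary .pdf file and return its path.

    Hypothetical helper (not part of camelot): it validates the file
    content rather than the server's Content-Type header.
    """
    # Sending a User-Agent is an assumption; some servers reject bare requests.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request) as response:
        data = response.read()
    if not looks_like_pdf(data):
        raise ValueError(f"{url} did not return PDF content")
    fd, path = tempfile.mkstemp(suffix=".pdf")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path


# Usage (needs network access and camelot installed):
# import camelot
# path = download_pdf(
#     "https://downloads.usda.library.cornell.edu/usda-esmis/files/"
#     "cj82k728n/2v23wr658/v405t658m/wwcb2921.pdf"
# )
# tables = camelot.read_pdf(path, pages="1", flavor="lattice")
```

Passing a local path skips the URL download step entirely, so the header check in `download_url` is not reached. Whether the tables then parse correctly is a separate question.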

Steps to reproduce the bug

Run the code below to reproduce the error.

tables = camelot.read_pdf('https://downloads.usda.library.cornell.edu/usda-esmis/files/cj82k728n/2v23wr658/v405t658m/wwcb2921.pdf', pages='1', flavor='lattice')

Expected behavior

A list of tables was expected.

PDF

https://downloads.usda.library.cornell.edu/usda-esmis/files/cj82k728n/2v23wr658/v405t658m/wwcb2921.pdf

Environment

Linux-6.1.85+-x86_64-with-glibc2.35
Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0]
NumPy 1.26.4
OpenCV 4.10.0
Camelot 0.8.2

Also tried with Camelot 0.9.0 and Camelot 0.11.0 in the same environment.

bosd commented 3 months ago

Hey!

As discussed in https://github.com/camelot-dev/camelot/issues/343, we are building a maintained fork at pypdf_table_extraction.

Can you check with the latest code over there whether the issue still exists? If so, please open an issue there.

jatinchhabriya commented 2 months ago

@MartinThoma @vinayak-mehta @bosd I am facing the same error as Kushal.

Expected output: a list of tables.
Standard output since this week: "Attribute Error: File Format not supported".

Could you please let me know whether a fix has been deployed on the fork? This was working a week ago, and my use case requires the lattice boundary support provided exclusively in camelot-py[cv].

bosd commented 2 months ago

> Could you please let me know if a fix has been deployed on the forked branch,

I assume the fork is OK; the tests are passing there.

Please test your use case with a fresh pip install of pypdf_table_extraction.

If that doesn't work, please install from source from the main branch.

If you still encounter an error, please open an issue on the new repo.

jatinchhabriya commented 2 months ago

@MartinThoma @bosd @vinayak-mehta I tried installing the main branch of the fork as you suggested. Could you please add a usage example showing how camelot should be imported after installing pypdf-table-extraction from the GitHub main branch? I have also opened an issue on the fork; please tag the active maintainers: https://github.com/py-pdf/pypdf_table_extraction/issues/63
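The thread does not confirm which import name the fork exposes after installation. As a minimal sketch, assuming it is either `camelot` or `pypdf_table_extraction` (both module names are assumptions here), the snippet below tries each candidate in turn and reports which one is importable in the current environment. `import_first_available` is a hypothetical helper written for this example.

```python
import importlib


def import_first_available(candidates):
    """Return (name, module) for the first importable name, else (None, None)."""
    for name in candidates:
        try:
            return name, importlib.import_module(name)
        except ImportError:
            # This candidate is not installed; try the next one.
            continue
    return None, None


# Candidate names are assumptions, not confirmed by the maintainers.
name, module = import_first_available(["camelot", "pypdf_table_extraction"])
if module is None:
    print("Neither candidate module is installed in this environment.")
else:
    print(f"Imported table-extraction module as: {name}")
```

If one of the names resolves, `module.read_pdf(...)` would be the call to try next, assuming the fork keeps camelot's `read_pdf` entry point.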