chezou / tabula-py

Simple wrapper of tabula-java: extract table from PDF into pandas DataFrame
MIT License

Add a way to set areas for non-existent pages in template #353

Closed ZeeD closed 1 year ago

ZeeD commented 1 year ago

Is your feature request related to a problem? Please describe.

I need to import a bunch of PDFs, and I'm using a template to set the areas of interest. Most of my documents are 2 pages long, but some have 3 pages and one is just 1 page long. Apart from the first page, the others share the same structure (basically, the PDFs contain tabular data that may not fit on one or two pages).

Describe the solution you'd like

Ideally, I want to set up a single template with the details I need for the first page, and set an area for pages 2 and 3 (or at most copy an area with a different page attribute). If I do that now, however, read_pdf_with_template raises a CalledProcessError (basically because it tries to invoke tabula-1.0.5-jar-with-dependencies.jar on page 3 of a 2-page document).

I had a quick look at the jar's --help output, but it seems there is no "ignore wrong pages" option. I've also tried to explicitly pass the number of pages, but it seems that, when using a template, the options are copied and passed to multiple invocations of the jar, so the area the template defines for page 1 gets applied to pages 1 and 2...

Describe alternatives you've considered

A workaround (that I don't really like) could be to prepare multiple templates, one for each "size" of the PDFs I need to import, but that also means I need to use at least one other PDF library to get the page count of each PDF and choose the "right" template.
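The workaround described above could be sketched roughly as follows. This is only an illustration: the template file names are hypothetical, and pypdf is an assumed choice for counting pages (any PDF library that reports the page count would do).

```python
def choose_template(page_count: int, templates: dict[int, str]) -> str:
    """Return the template path registered for the given page count."""
    try:
        return templates[page_count]
    except KeyError:
        raise ValueError(f"no template registered for {page_count}-page PDFs") from None


def count_pages(pdf_path: str) -> int:
    """Count pages with pypdf (an assumption; not part of tabula-py)."""
    from pypdf import PdfReader  # imported lazily so choose_template has no dependency

    return len(PdfReader(pdf_path).pages)


# Hypothetical template files, one per document length.
TEMPLATES = {
    1: "one-page.tabula-template.json",
    2: "two-pages.tabula-template.json",
    3: "three-pages.tabula-template.json",
}
```

With these helpers, the extraction call would look like tabula.read_pdf_with_template(pdf_path, choose_template(count_pages(pdf_path), TEMPLATES)).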

Additional context

chezou commented 1 year ago

Thanks for creating an issue.

If you want to use the same options for all pages, I would suggest calling tabula.template.load_template directly.

Here is an example:

>>> import tabula
>>> fname = "./tests/resources/data.tabula-template.json"
>>> o = tabula.template.load_template(fname)
>>> o
[TabulaOption(pages=1, guess=False, area=[124.0, 154.0, 531.745, 565.57], relative_area=False, lattice=False, stream=True, password=None, silent=None, columns=None, relative_columns=False, format=None, batch=None, output_path=None, options='', multiple_tables=True), TabulaOption(pages=2, guess=True, area=[[123.999, 154.0, 210.444, 453.88], [410.996, 154.0, 497.441, 487.54]], relative_area=False, lattice=False, stream=False, password=None, silent=None, columns=None, relative_columns=False, format=None, batch=None, output_path=None, options='', multiple_tables=True), TabulaOption(pages=3, guess=True, area=[123.999, 154.0, 322.899, 235.855], relative_area=False, lattice=False, stream=False, password=None, silent=None, columns=None, relative_columns=False, format=None, batch=None, output_path=None, options='', multiple_tables=True)]
>>> o[0]
TabulaOption(pages=1, guess=False, area=[124.0, 154.0, 531.745, 565.57], relative_area=False, lattice=False, stream=True, password=None, silent=None, columns=None, relative_columns=False, format=None, batch=None, output_path=None, options='', multiple_tables=True)
>>> o[0].pages
1
>>> o[0].pages="all"
>>> tabula.read_pdf(pdf_path, options=" ".join(o[0].build_option_list()))
'pages' argument isn't specified.Will extract only from page 1 by default.
Got stderr: Aug. 22, 2023 9:08:52 P.M. org.apache.pdfbox.pdmodel.font.FileSystemFontProvider loadDiskCache
WARNING: New fonts found, font cache will be re-built
Aug. 22, 2023 9:08:52 P.M. org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
WARNING: Building on-disk font cache, this may take a while
Aug. 22, 2023 9:08:53 P.M. org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
WARNING: Finished building on-disk font cache, found 808 fonts

[             Unnamed: 0   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
0             Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
1         Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
2            Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
3        Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
4     Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2
5               Valiant  18.1    6  225.0  105  2.76  3.460  20.22   1   0     3     1
6            Duster 360  14.3    8  360.0  245  3.21  3.570  15.84   0   0     3     4
7             Merc 240D  24.4    4  146.7   62  3.69  3.190  20.00   1   0     4     2
8              Merc 230  22.8    4  140.8   95  3.92  3.150  22.90   1   0     4     2
9              Merc 280  19.2    6  167.6  123  3.92  3.440  18.30   1   0     4     4
10            Merc 280C  17.8    6  167.6  123  3.92  3.440  18.90   1   0     4     4
11           Merc 450SE  16.4    8  275.8  180  3.07  4.070  17.40   0   0     3     3
12           Merc 450SL  17.3    8  275.8  180  3.07  3.730  17.60   0   0     3     3
13          Merc 450SLC  15.2    8  275.8  180  3.07  3.780  18.00   0   0     3     3
14   Cadillac Fleetwood  10.4    8  472.0  205  2.93  5.250  17.98   0   0     3     4
15  Lincoln Continental  10.4    8  460.0  215  3.00  5.424  17.82   0   0     3     4
16    Chrysler Imperial  14.7    8  440.0  230  3.23  5.345  17.42   0   0     3     4
17             Fiat 128  32.4    4   78.7   66  4.08  2.200  19.47   1   1     4     1
18          Honda Civic  30.4    4   75.7   52  4.93  1.615  18.52   1   1     4     2
19       Toyota Corolla  33.9    4   71.1   65  4.22  1.835  19.90   1   1     4     1
20        Toyota Corona  21.5    4  120.1   97  3.70  2.465  20.01   1   0     3     1
21     Dodge Challenger  15.5    8  318.0  150  2.76  3.520  16.87   0   0     3     2
22          AMC Javelin  15.2    8  304.0  150  3.15  3.435  17.30   0   0     3     2
23           Camaro Z28  13.3    8  350.0  245  3.73  3.840  15.41   0   0     3     4
24     Pontiac Firebird  19.2    8  400.0  175  3.08  3.845  17.05   0   0     3     2
25            Fiat X1-9  27.3    4   79.0   66  4.08  1.935  18.90   1   1     4     1
26        Porsche 914-2  26.0    4  120.3   91  4.43  2.140  16.70   0   1     5     2
27         Lotus Europa  30.4    4   95.1  113  3.77  1.513  16.90   1   1     5     2
28       Ford Pantera L  15.8    8  351.0  264  4.22  3.170  14.50   0   1     5     4
29         Ferrari Dino  19.7    6  145.0  175  3.62  2.770  15.50   0   1     5     6
30        Maserati Bora  15.0    8  301.0  335  3.54  3.570  14.60   0   1     5     8
31           Volvo 142E  21.4    4  121.0  109  4.11  2.780  18.60   1   1     4     2,    Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
5           5.4          3.9           1.7          0.4  setosa,    Unnamed: 0  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
0         145           6.7          3.3           5.7          2.5  virginica
1         146           6.7          3.0           5.2          2.3  virginica
2         147           6.3          2.5           5.0          1.9  virginica
3         148           6.5          3.0           5.2          2.0  virginica
4         149           6.2          3.4           5.4          2.3  virginica
5         150           5.9          3.0           5.1          1.8  virginica,      len supp  dose
0    4.2   VC   0.5
1   11.5   VC   0.5
2    7.3   VC   0.5
3    5.8   VC   0.5
4    6.4   VC   0.5
5   10.0   VC   0.5
6   11.2   VC   0.5
7   11.2   VC   0.5
8    5.2   VC   0.5
9    7.0   VC   0.5
10  16.5   VC   1.0
11  16.5   VC   1.0
12  15.2   VC   1.0
13  17.3   VC   1.0
14  22.5   VC   1.0]

Of course, there is room for improvement, such as allowing a TabulaOption to be passed to tabula.read_pdf directly, but before that, I'd love to hear your feedback.

chezou commented 1 year ago

Closing since there has been no response.

ZeeD commented 1 year ago

Uhh... sorry I didn't reply sooner, but this is a hobby project I'm working on. While I understand your suggestion, it means that the templates are no longer defined only in the JSON file, but are also manipulated explicitly in code... I think that for the moment I'll stick with multiple templates and some simple logic to choose which one to use for the extraction.

chezou commented 1 year ago

Thanks for your response.

Unfortunately, tabula-py also doesn't know how many pages a PDF has, so we can only use the pages="all" option for handling an unknown number of pages.
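If the page count is obtained some other way (for example, from a separate PDF library, as mentioned in the original post), one possible approach is to drop the template entries that point past the end of the document before running the extraction. The helper below is a hypothetical sketch, not part of tabula-py; it assumes each entry returned by tabula.template.load_template has an integer pages attribute, as in the output shown earlier in this thread.

```python
def options_for_existing_pages(options, page_count):
    """Keep only template entries whose page number exists in the document.

    `options` is assumed to be a list of objects with an integer `pages`
    attribute, such as the TabulaOption entries returned by
    tabula.template.load_template.
    """
    return [opt for opt in options if opt.pages <= page_count]
```

Each surviving entry could then be passed to tabula.read_pdf through the options argument, in the same way as the o[0] example earlier in the thread. Whether that behaves as intended for every entry has not been verified here.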