ashima / pdf-table-extract

Extract tables from PDF pages.
MIT License

Consider merging with Pandas #6

Closed · cancan101 closed this 11 years ago

cancan101 commented 11 years ago

Consider adding pdf-table-extract as a source in the Pandas lib. Currently, pandas supports using a number of external libs for data I/O.

For example see: https://github.com/pydata/pandas/issues/4556

jtratner commented 11 years ago

Looks like this uses numpy at various points. If you had a function that output numpy arrays, that would make it simple to integrate with pandas.

ijm commented 11 years ago

Apologies for the lack of response. We are looking at the possibility of providing a more useful output for Pandas, but it's not obvious how to do that from the output we currently generate.

This code is really good at extracting data cells from a well-delineated PDF table, but it doesn't understand overall table structure very well. For example, it doesn't track any hierarchical information, so the entire page, within the cropping rectangle, is a table (with maybe just big cells around the outside). This means it cannot handle multiple tables on one page cleanly, or provide hints about headings versus contents. The numpy arrays used internally don't look like the table: they are either the grid structure or a flattened database of cell locations and contents.

It would be easy to copy and modify the function o_table_csv() to produce a 3D numpy array rather than a list of lists, but it would lose cell size information, and it isn't clear how to get that into Pandas because of the problem laid out in the next paragraph. Really, I need to take some time to figure out how Pandas stores structures internally and output a file that can be re-read without loss of information.

There are way too many style differences in PDF tables to reliably discover the format automatically without input from the user. For example, in the Pandas issue (https://github.com/pydata/pandas/issues/4556) there are 3 examples. The first (supermarkets) table should be parseable, but the other two are missing the dividers we use to delineate cells. There's no way to parse those files into a Pandas structure without manual intervention (to add lines to the PDF before parsing). Also, many PDFs don't render correctly at some resolutions, and the delimiters disappear. This is the reason for all the -check options, the colourized table_chtml output, and the -colmult option; these are not my debugging tools, they are for you! So turning this extraction into a function embedded in another app is problematic. It really was intended as one step in a pipeline that can be tweaked independently as needed.

Lastly, there are licensing issues. While I try to release everything with an MIT Expat license, most of the rendering and text parsing is done using the Poppler library tools, which are GPLed in full virus mode. This means that the only access to that library that keeps it quarantined is via a command-line tool. Right now we make a myriad of individually forked calls to the pdfto* executables, which slows everything down. Writing a wrapper (with a GPL license) that can take multiple requests is on the TODO list and will improve performance greatly. Even so, one has to be careful about which licenses apply where.

jtratner commented 11 years ago

Hey, I'm a pandas dev and I think we're on the same page :smile:. I don't think we want to incorporate this directly (exactly for the reasons you cite, and particularly if there are difficult deps). But we could write a wrapper that uses this library to parse PDFs (just like we currently do with Excel and other libraries).

Fundamentally, we can only do as well as your library can, right? (and that's better than nothing!) For example, we have a read_html function that can handle many kinds of tables, but doesn't try to deal with malformed tables, etc.

That said, if you can either produce a list of lists or a list of ndarrays (or an ndarray of ndarrays), it's very simple to incorporate into pandas (we'd just pass it to our underlying TextParser and do whatever type inference the user requests). Ping me if you're interested.

jtratner commented 11 years ago

Sorry, I'm rereading your comment some more; I skimmed it once and then realized I missed some things. Pandas can handle heterogeneous data types and is pretty good at figuring out how to take text input (csv, excel, html, json) and convert it correctly (including converting types correctly). Pandas can also handle hierarchical columns and indices. If you want a really simple interface, you could just transform a cell of size 5 into the same value repeated 5 times (and then the user can decide whether they want it as a MultiIndex, i.e. hierarchical columns or indices, or leave it as is).
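
To make that concrete, here's a minimal sketch of the repeat-then-MultiIndex idea (the header names and values are invented):

import pandas as pd

# A spanning "Sales" cell of width 2 and a "Costs" cell of width 2,
# repeated into each column they cover:
top = ["Sales", "Sales", "Costs", "Costs"]
sub = ["Q1", "Q2", "Q1", "Q2"]

# The user can then opt into hierarchical columns:
df = pd.DataFrame([[1, 2, 3, 4]],
                  columns=pd.MultiIndex.from_arrays([top, sub]))
print(df["Sales"])  # selects the Q1/Q2 sub-columns under "Sales"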

cancan101 commented 11 years ago

The read_html example is great. It definitely requires user tweaking and intervention to get the desired results. For example, the ordering of the tables returned isn't even guaranteed.

jtratner commented 11 years ago

Heck, you could even do:

read_csv(get_csv_from_pdf_extract(somepdf))
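
That one-liner could look something like the sketch below; get_csv_from_pdf_extract is hypothetical, and the command-line flags are placeholders rather than the tool's documented interface:

import subprocess
from io import StringIO

import pandas as pd

def get_csv_from_pdf_extract(pdf_path):
    # Hypothetical helper: shell out to the extraction tool and capture
    # its CSV output. The flags below are placeholders; check the
    # project's README for the real ones.
    result = subprocess.run(
        ["pdf-table-extract", "-i", pdf_path, "-t", "table_csv"],
        capture_output=True, text=True, check=True,
    )
    return StringIO(result.stdout)

df = pd.read_csv(get_csv_from_pdf_extract("some.pdf"))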

eelsirhc commented 11 years ago

The interface needs to be improved, but I created an installable/importable module out of the extracttab.py script (https://github.com/eelsirhc/pdf-table-extract).

After installing, you should be able to do something like:

import pdftableextract as pdf
import pandas as pd
...
cells = pdf.process_page(pages)              # extract the cell tuples from the page(s)
list_data = pdf.table_to_list(cells, pages)  # unflatten into rows of strings
data = pd.DataFrame(list_data[1:], columns=list_data[0])  # first row becomes the header
...

The process still requires some tuning to extract the correct table. Have a look at https://github.com/eelsirhc/pdf-table-extract/blob/master/example/test_to_pandas.py for the 'Walmart' example from https://github.com/pydata/pandas/issues/4556

(table_to_list is effectively the read_csv(get_csv_from_pdf_extract()) above, but because it never goes through a CSV parser, numbers don't get parsed into floats/ints.)
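
Since everything comes back as strings, one way to recover numeric columns afterwards is a small coercion pass; this is a sketch, not part of the package:

import pandas as pd

def coerce_numeric(df):
    # Convert a column to numbers only if every cell in it parses
    # cleanly; otherwise leave the original strings alone.
    out = df.copy()
    for col in out.columns:
        converted = pd.to_numeric(out[col], errors="coerce")
        if converted.notna().all():
            out[col] = converted
    return out

data = coerce_numeric(data)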

jtratner commented 11 years ago

If you were willing to expose the underlying ndarray, it'd require less memory, but given that it's a PDF, I'd guess that's less of a concern (I can't imagine a PDF table having more than 1-2K rows). Can you give me a heads-up about how you handle spanning cells? It might be easy to convert them to a pandas MultiIndex (which is somewhat equivalent).

cancan101 commented 11 years ago

@cpcloud In case you are interested in this thread.

ijm commented 11 years ago

So we've moved a few things around, and packaged things up.

> Fundamentally, we can only do as well as your library can, right? (and that's better than nothing!)

I've updated the TODO.md file, which you should read through to get a better idea of what I think is broken, what needs improving, and why; there is a lot this code cannot do :p

So as Chris describes above, we've broken the code out so the bits can be used much more cleanly. There is now:

  • a process_page() function that doesn't use argparse, and will return the location-contents data structure,
  • emitters that will take that structure and output the formats we did before,
  • a simple shim that will return a 3D gridded list (page, column, row),
  • and a command-line wrapper that calls the above.

> If you were willing to expose the underlying ndarray, ...

There isn't really an underlying tabular array to expose. The numpy arrays you see in the code only contain potential location and shape information, not the cell contents.

> Can you give me a heads-up about how you handle spanning cells?

The TODO file, under 'cell finding', describes the algorithm (odd place, I know :) ), but the primary return type from process_page() is the cell-location-contents structure. It is a list of tuples: (x-location, y-location, cell-width, cell-height, page, cell-contents). The function table_to_list() and the emitters that output tables unflatten this into a 3D grid, putting the contents in the cell (page, x-location, y-location) and ignoring the widths, as there isn't an easy way to include them. I guess we could add an option to replicate a cell's contents into all of the cells it covers, but that is really part of the divining-more-structure discussion (again, in the TODO file).
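
For anyone following along, here is a rough sketch of that unflattening; it is an illustration of the idea, not the package's actual table_to_list:

def cells_to_grids(cells, replicate_spans=False):
    # cells: iterable of (x, y, width, height, page, contents) tuples.
    pages = {}
    for x, y, w, h, page, contents in cells:
        grid = pages.setdefault(page, {})
        if replicate_spans:
            # Optionally copy a spanning cell's contents into every slot
            # it covers (the "repeat 5 times" idea from above).
            for dy in range(h):
                for dx in range(w):
                    grid[(y + dy, x + dx)] = contents
        else:
            grid[(y, x)] = contents  # widths/heights are simply dropped

    # Expand each sparse dict into a dense row-major list of lists.
    result = {}
    for page, grid in pages.items():
        nrows = 1 + max(r for r, c in grid)
        ncols = 1 + max(c for r, c in grid)
        result[page] = [[grid.get((r, c), "") for c in range(ncols)]
                        for r in range(nrows)]
    return result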

I'm hesitant to return a Pandas data structure from the core code, because I'm concerned about introducing a cross-dependency just for a 5-or-6-line shim, but I'm happy to move things around to make that shim as small as possible!

jtratner commented 11 years ago

Interesting. When I have more time I'll take a look.

I don't think it makes sense for this to return a pandas structure either. I think we can start by creating a list of lists and go from there.
