atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

Getting the title of a table by extracting the object and text closest to above the table #395

Open jrodioukova opened 4 years ago

jrodioukova commented 4 years ago

Hello,

I am using camelot to extract all tables from several PDFs. Camelot works well for table extraction, but I am having trouble extracting the table title (which usually appears as text right above the table, or sometimes below).

After searching online and on Stack Overflow, it seems the suggestion is to get access to the layout of the page: either by creating a Lattice() directly and using its extract_tables method (and then accessing its layout attribute), or by parsing the page with utils.get_page_layout().

However, both of these methods need to be passed a single page of the PDF, and I am not sure how to do this. Would I need to split the PDF into single-page PDFs myself, or is there a better way?

Is there a way to get the layout of a specific page by giving the page number or (even better) for every page?

Thank you. Any help would be appreciated.

brifordwylie commented 3 years ago

Here's my hilariously bad implementation just so that someone can laugh and get inspired to do a better one and contribute to the great camelot package :)

Caveats:

import math
import os
import tempfile

import camelot
from camelot.handlers import PDFHandler
from camelot.utils import get_page_layout, get_text_objects

# Helper methods for _bbox (treated here as (x0, y0, x1, y1), with y1 the top edge)
def top_mid(bbox):
    """Midpoint of the top edge of a bbox."""
    return ((bbox[0] + bbox[2]) / 2, bbox[3])

def bottom_mid(bbox):
    """Midpoint of the bottom edge of a bbox."""
    return ((bbox[0] + bbox[2]) / 2, bbox[1])

def distance(p1, p2):
    """Euclidean distance between two (x, y) points."""
    return math.sqrt((p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2)

def get_closest_text(table, htext_objs):
    """Return the text whose bottom-middle is closest to the table's top-middle."""
    min_distance = math.inf  # Any real distance beats this
    best_guess = None
    table_mid = top_mid(table._bbox)  # Middle of the TOP of the table
    for obj in htext_objs:
        text_mid = bottom_mid(obj.bbox)  # Middle of the BOTTOM of the text
        d = distance(text_mid, table_mid)
        if d < min_distance:
            best_guess = obj.get_text().strip()
            min_distance = d
    return best_guess

def get_tables_and_titles(pdf_filename, pages='2,3,4'):
    """Here's my hacky code for grabbing tables and guessing at their titles"""
    my_handler = PDFHandler(pdf_filename)
    tables = camelot.read_pdf(pdf_filename, pages=pages)
    print('Extracting {:d} tables...'.format(tables.n))
    titles = []
    # Standard-library temp dir; cleaned up automatically on exit
    with tempfile.TemporaryDirectory() as tempdir:
        for table in tables:
            # _save_page is a private camelot method: it writes page-<n>.pdf into tempdir
            my_handler._save_page(pdf_filename, table.page, tempdir)
            tmp_file_path = os.path.join(tempdir, f'page-{table.page}.pdf')
            layout, dim = get_page_layout(tmp_file_path)
            htext_objs = get_text_objects(layout, ltype="horizontal_text")
            titles.append(get_closest_text(table, htext_objs))  # Might be None

    return titles, tables
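
One possible refinement, offered as a rough sketch rather than anything tested against camelot itself: the nearest-text search above will happily pick text inside or below the table if it happens to be closest. The version below only considers text that sits above the table's top edge and within a cutoff. It works on plain tuples so it can be tried without a PDF. The function name `closest_title_above`, the `max_gap` parameter, and its default of 50 points are all made up for illustration, and the bboxes are assumed to be (x0, y0, x1, y1) with y increasing upward.

```python
import math

def closest_title_above(table_bbox, text_boxes, max_gap=50.0):
    """Return the stripped text of the nearest box above the table, or None.

    table_bbox: (x0, y0, x1, y1), assumed to have y increasing upward.
    text_boxes: iterable of (bbox, text) pairs using the same convention.
    max_gap:    hypothetical cutoff (in points) for how far above the
                table a title may sit; tune it per document.
    """
    tx0, _, tx1, table_top = table_bbox
    table_top_mid = ((tx0 + tx1) / 2, table_top)

    best_text, best_distance = None, math.inf
    for (x0, y0, x1, _), text in text_boxes:
        gap = y0 - table_top  # vertical gap from table top to text bottom
        if gap < 0 or gap > max_gap:
            continue  # inside/below the table, or too far above it
        d = math.hypot((x0 + x1) / 2 - table_top_mid[0], y0 - table_top_mid[1])
        if d < best_distance:
            best_text, best_distance = text.strip(), d
    return best_text


# Toy data: one table and four text boxes around it
table = (100, 200, 400, 500)
boxes = [
    ((100, 480, 400, 495), "cell text"),           # inside the table
    ((150, 510, 350, 525), " Table 1: Sales \n"),  # just above the table
    ((100, 700, 400, 715), "Page header"),         # too far above
    ((100, 180, 400, 195), "Source: foo"),         # below the table
]
title = closest_title_above(table, boxes)
```

To plug this into the code above, something like `closest_title_above(table._bbox, [(obj.bbox, obj.get_text()) for obj in htext_objs])` should work, assuming `table._bbox` really follows the same coordinate convention as the text objects. For titles that appear below the table, the same idea applies with the gap measured from the table's bottom edge instead.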