conjuncts / gmft

Lightweight, performant, deep table extraction
MIT License

Captions improvements #8

Open conjuncts opened 2 months ago

conjuncts commented 2 months ago

.captions() is pretty slow: I estimate about 415 ms, which is much longer than df().

snexus commented 1 month ago

Thanks for the great work and for adding the caption functionality. It works well in most cases, but it feels like the speed could indeed be better.

Is there a parameter that allows specifying the neighborhood to search when looking for captions, or are you using an automatic heuristic? I am asking because it sometimes seems to grab a few sentences that don't belong to the table caption.

conjuncts commented 1 month ago

Speed should be much improved in c57a8a2 due to caching text and positions for pypdfium2 (pymupdf does not seem to have this problem). On my machine, this speeds up the captions tests from 4.28s → 2.47s. Sorry, I've been sitting on this fix for some time.
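
For illustration only (not the actual commit), the general shape of that caching is to extract words and their positions once per page and then reuse the result across later calls such as df() and captions(). A minimal, self-contained sketch with a stand-in extractor:

from functools import cached_property

class _DemoPage:
    """Toy page object; the real caching is done for the pypdfium2 backend."""

    def __init__(self, raw_text: str):
        self.raw_text = raw_text

    @cached_property
    def words_with_positions(self):
        # The expensive extraction runs only on first access; subsequent
        # callers (e.g. table extraction and caption search) reuse the result.
        return [(i * 10.0, 0.0, i * 10.0 + 8.0, 10.0, word)
                for i, word in enumerate(self.raw_text.split())]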

For captions, the current heuristic (table_captioning._find_captions) is:

  1. Get the words before and after the table in reading order.
  2. Place them into 2 available slots: one above and one below.
  3. If one slot is still empty, look for words closest to the top/bottom of the table.
  4. Determine when to end the caption (a rough sketch follows this list):
     a. Look for breaks in the text, where a break is a vertical gap of 2.5 times the median height of the text.
     b. Also end the caption if it grows to more than 10 word-heights.
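
For concreteness, here is a rough, self-contained sketch of the stopping rule in step 4 (a simplification, not the actual _find_captions; it only handles the caption below the table, with words given as (x0, y0, x1, y1, text) tuples sorted top-to-bottom):

from statistics import median

def collect_caption_below(words, line_spacing=2.5, max_caption_heights=10.0):
    if not words:
        return ""
    med_h = median(y1 - y0 for (_, y0, _, y1, _) in words)
    kept = [words[0]]
    for prev, cur in zip(words, words[1:]):
        gap = cur[1] - prev[3]                    # vertical gap to the previous word
        total = cur[3] - words[0][1]              # caption height so far
        if gap > line_spacing * med_h:            # 4a: break in the text
            break
        if total > max_caption_heights * med_h:   # 4b: caption grew too tall
            break
        kept.append(cur)
    return " ".join(w[4] for w in kept)

Here line_spacing mirrors the config option described below, and max_caption_heights stands in for the currently hardcoded constant of 10 word-heights.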

This is what the current config options do:

  - line_spacing: the minimum vertical separation in text that is considered the end of the caption. Decreasing this value might work.
  - margin: the bbox in which to start looking for captions; captions are allowed to leave this bbox.

Thus, perhaps I should add a parameter for a region that captions are required to stay within?

snexus commented 1 month ago

A few questions / discussion points:

look for breaks in the text (which is defined to be 2.5 times the median height of text)

According to the code, it looks like you are calculating an EMA of the word height, starting from the table's average word height? I think the initial estimate (the table's average word height) can skew the stats, because caption font and spacing are often quite different from the table's.

Is it perhaps worth gathering stats starting from the first line below and above the table, and then stopping the caption once the gap deviates sufficiently from that stat?
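
Purely as an illustration of what I mean (not a proposal for the actual implementation), seeding the height stat from the first caption line and stopping once a gap deviates too much from it could look like:

def collect_caption_seeded(lines, deviation_factor=2.5):
    """lines: (y0, y1, text) tuples for the text lines below the table,
    sorted top-to-bottom. Illustrative only."""
    if not lines:
        return ""
    base_h = lines[0][1] - lines[0][0]       # stat seeded from the first caption line
    kept = [lines[0]]
    for prev, cur in zip(lines, lines[1:]):
        gap = cur[0] - prev[1]               # vertical gap between consecutive lines
        if gap > deviation_factor * base_h:  # gap deviates too much from the seed
            break
        kept.append(cur)
    return " ".join(text for (_, _, text) in kept)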

Thus, perhaps I should add a parameter for a region that captions are required to stay within?

That would be good, but it should be in some relative units rather than an absolute height, e.g. a maximum number of sentences above and below?

In addition, it would be good to have separate fields for the "above" and "below" content rather than concatenating them together, since they are logically separate, and it would be beneficial for RAG applications to keep them that way.
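
For example (hypothetical names, just to illustrate the shape, not an actual gmft API):

from dataclasses import dataclass

@dataclass
class TableCaptions:
    above: str = ""   # caption text found above the table
    below: str = ""   # caption text found below the table

# For RAG, the two parts could then be stored as separate metadata fields
# rather than one concatenated string, e.g.
# metadata = {"caption_above": caps.above, "caption_below": caps.below}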

conjuncts commented 1 month ago

Those are good points. The moving average was intended to account for the difference between table and caption word height, but that's a good point, I can calculate that directly. I can also parameterize the hardcoded constant of 10 word-heights. captions() should already return a list[str] containing the caption above and the caption below.

snexus commented 1 month ago

Not sure if it helps, but I tried the following simple approach using PyMuPDF and it worked reasonably well: calculate a normalized distance from the text block above or below the table (normalizing by the average word height in the block being checked) and threshold it by some value. Basically it says that a caption can't be further than N word-heights from the current text line, because it would be visually too far...

from typing import Tuple

# `block` is a PyMuPDF text block tuple: (x0, y0, x1, y1, text, block_no, block_type)
def detect_caption(
    table_bbox: Tuple[float, float, float, float],
    block: Tuple[float, float, float, float, str],
    max_abs_dist: float = 2.5,
) -> Tuple[str, str]:
    x1, y1, x2, y2 = block[:4]
    text = block[4]

    # Block in PyMupdf can consist of multiple lines of text
    n_lines = text.count('\n') + 1

    normalized_dist = 1000
    top_caption, bottom_caption = "", ""

    # Take care of captions above the table
    if y2 < table_bbox[1]: # block in question is above the table
        # Normalized distance = how many word "lines" this current sentence is from the table
        normalized_dist =  (y2-table_bbox[1])/((y2-y1) / n_lines)
        if abs(normalized_dist) < max_abs_dist:
            top_caption = block[4]

    # Take care of captions below the table
    elif y1 > table_bbox[3]: # block in question is below the table
        normalized_dist =  (y1-table_bbox[3])/((y2-y1)/n_lines)
        if abs(normalized_dist) < max_abs_dist:
            bottom_caption = block[4]
    return top_caption, bottom_caption

# Extract captions using PyMuPDF; assumes `page` (a PyMuPDF Page) and the
# table's `table_bbox` are already available.
blocks = page.get_text_blocks()

top_captions, bottom_captions = [], []
for block in blocks:
    top_cap, bottom_cap = detect_caption(table_bbox=table_bbox, block = block)
    top_captions.append(top_cap)
    bottom_captions.append(bottom_cap)

top_captions = [c for c in top_captions if c] # clear out empty captions
bottom_captions = [c for c in bottom_captions if c]

conjuncts commented 1 month ago

That is quite nice, I can try it out on the pdfs I have.

It does rely on pymupdf's block separation to delineate where a caption ends, which to my knowledge pypdfium2 does not have. It is tricky because I want to remain PDF-parser agnostic.

I might need to look at some PDFs in particular; if you want, you can submit them through Google Forms: https://docs.google.com/forms/d/e/1FAIpQLSfeWd8LEN_GuVRUNIuWSnbC9RAr1TeUVDiUNdIgARtNpfS2ZA/viewform?usp=sf_link

snexus commented 1 month ago

I've submitted a link to a PDF containing a variety of caption formats (captions on the top, bottom, and left) as well as tables that are tricky to parse: hierarchical columns, hierarchical rows, sub-tables embedded in a single table, etc. Hope that helps.

conjuncts commented 1 month ago

Thank you so much!