Open conjuncts opened 2 months ago
Thanks for the great work and for adding the caption functionality. It works well in most cases, but the speed could indeed be better.
Is there a parameter for specifying the neighborhood to search when looking for captions, or are you using an automatic heuristic? I am asking because it sometimes seems to grab a few sentences that don't belong to the table caption.
Speed should be much improved in c57a8a2, which caches text and positions for pypdfium2 (pymupdf does not seem to have this problem). On my machine, this speeds up the captions tests from 4.28s to 2.47s. Sorry, I've been sitting on this fix for some time.
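As a rough illustration of the kind of caching described above (not the actual change in c57a8a2), the per-page text extraction could be memoized so that repeated caption lookups reuse the same words and positions. `PageTextCache`, `extract_words`, and `fake_extractor` are hypothetical names; the extractor stands in for whatever call the PDF backend provides.

```python
from typing import Callable, Dict, List, Tuple

# (x0, y0, x1, y1, text) for one word; an assumed layout for this sketch
Word = Tuple[float, float, float, float, str]


class PageTextCache:
    """Hypothetical cache: runs the (slow) extractor at most once per page."""

    def __init__(self, extract_words: Callable[[int], List[Word]]) -> None:
        self._extract_words = extract_words
        self._cache: Dict[int, List[Word]] = {}
        self.misses = 0  # number of times the real extractor was called

    def words(self, page_no: int) -> List[Word]:
        if page_no not in self._cache:
            self.misses += 1
            self._cache[page_no] = self._extract_words(page_no)
        return self._cache[page_no]


def fake_extractor(page_no: int) -> List[Word]:
    # Stand-in for a backend call that returns words with positions
    return [(0.0, 0.0, 10.0, 5.0, f"page{page_no}")]


cache = PageTextCache(fake_extractor)
first = cache.words(0)   # extractor runs
second = cache.words(0)  # served from the cache
```

The trade-off is memory: the cache holds every extracted page until it is discarded, which is usually acceptable for a single document.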
For captions, the current heuristic is this: (table_captioning._find_captions)
This is what the current config options do:
line_spacing: minimum threshold where vertical separation in text is considered the end of the caption. Decreasing this value might work.
margin: bbox in which to start looking for captions, but captions are allowed to leave this bbox.
Thus, perhaps I should add a parameter in which captions are required to be in?
A few questions / discussion points:
> look for breaks in the text (which is defined to be 2.5 times the median height of text)
According to the code, it looks like you are calculating an EMA of the word height, starting from the table's average word height? I think the initial estimate (avg. table word height) can skew the stats, because caption font and spacing are often quite different from the table's.
Would it be worth gathering stats starting from the first line below and above the table, and then ending the caption once the gap deviates sufficiently from those stats?
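A minimal sketch of that suggestion, assuming text lines are already available as (y_top, y_bottom, text) tuples with y growing downward. `caption_below` and `max_gap_ratio` are hypothetical names, and this is not the library's current heuristic. It handles only the region below the table; the region above would mirror it.

```python
from statistics import median
from typing import List, Tuple

Line = Tuple[float, float, str]  # (y_top, y_bottom, text), y grows downward


def caption_below(table_bottom: float, lines: List[Line],
                  max_gap_ratio: float = 1.5) -> List[str]:
    """Collect lines under the table until the vertical gap grows too large.

    Each gap (to the previous line, or to the table for the first line) is
    compared against the median height of the caption lines seen so far,
    so the table's own font size never enters the estimate.
    """
    below = sorted((ln for ln in lines if ln[0] >= table_bottom),
                   key=lambda ln: ln[0])
    caption: List[str] = []
    heights: List[float] = []
    prev_bottom = table_bottom
    for y_top, y_bottom, text in below:
        height = y_bottom - y_top
        if height <= 0:
            continue  # skip degenerate lines
        # For the first line there are no stats yet; use its own height
        ref = median(heights) if heights else height
        if (y_top - prev_bottom) > max_gap_ratio * ref:
            break  # gap deviates too much: the caption has ended
        caption.append(text)
        heights.append(height)
        prev_bottom = y_bottom
    return caption


lines = [
    (50.0, 60.0, "Text above the table."),
    (102.0, 112.0, "Table 1: Results"),
    (114.0, 124.0, "on the test set."),
    (150.0, 160.0, "Body text starts here."),
]
result = caption_below(100.0, lines)
```

Here the first two lines below the table sit 2 units apart and are kept, while the 26-unit gap before the body text exceeds 1.5 line heights and ends the caption.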
> Thus, perhaps I should add a parameter in which captions are required to be in?
That would be good, but it should be in some relative units rather than absolute height, e.g. maximum number of word sentences above and below?
In addition, it would be good to have separate fields for "above" and "below" content, rather than concatenating them together, since they are logically separate and it would be beneficial for RAG applications to keep them so?
Those are good points. The moving average was intended to take into consideration a difference between table and caption word height. But that's a good point, I can calculate that directly. I can parameterize the hardcoded constant of 10 word-heights. captions() should already return a list[str] containing the caption above and the caption below.
Not sure if it helps, but I tried the following simple approach using PyMuPDF, and it worked reasonably well: calculate a normalized distance from the text block above or below the table (normalizing by the average word height in the block being checked), and threshold it by some value. Basically, it says that a caption can't be further than N word heights from the table, because it would be visually too far away...
from typing import Tuple

def detect_caption(table_bbox: Tuple[float, float, float, float], block: Tuple[float, float, float, float, str], max_abs_dist: float = 2.5) -> Tuple[str, str]:
    # Only the first five fields of a PyMuPDF block are used: x0, y0, x1, y1, text
    x1, y1, x2, y2 = block[:4]
    text = block[4]
    # Block in PyMuPDF can consist of multiple lines of text;
    # strip first so that a trailing newline is not counted as an extra line
    n_lines = text.strip().count('\n') + 1
    normalized_dist = 1000
    top_caption, bottom_caption = "", ""

    # Take care of captions above the table
    if y2 < table_bbox[1]:  # block in question is above the table
        # Normalized distance = how many word "lines" this current sentence is from the table
        normalized_dist = (y2 - table_bbox[1]) / ((y2 - y1) / n_lines)
        if abs(normalized_dist) < max_abs_dist:
            top_caption = text
    # Take care of captions below the table
    elif y1 > table_bbox[3]:  # block in question is below the table
        normalized_dist = (y1 - table_bbox[3]) / ((y2 - y1) / n_lines)
        if abs(normalized_dist) < max_abs_dist:
            bottom_caption = text
    return top_caption, bottom_caption

# Extract captions using PyMuPDF, assumes we have table bbox
blocks = page.get_text_blocks()
top_captions, bottom_captions = [], []
for block in blocks:
    top_cap, bottom_cap = detect_caption(table_bbox=table_bbox, block=block)
    top_captions.append(top_cap)
    bottom_captions.append(bottom_cap)
top_captions = [c for c in top_captions if c]  # clear out empty captions
bottom_captions = [c for c in bottom_captions if c]
That is quite nice, I can try it out on the pdfs I have.
It does rely on pymupdf's block separation to delineate caption ending, which to my knowledge pypdfium2 does not have. It is tricky because I want to remain pdf parser agnostic.
I might need to look at some pdfs in particular, if you want you can submit through google forms? https://docs.google.com/forms/d/e/1FAIpQLSfeWd8LEN_GuVRUNIuWSnbC9RAr1TeUVDiUNdIgARtNpfS2ZA/viewform?usp=sf_link
I've submitted a link to a PDF containing a variety of caption formats (captions at the top, bottom, and left), as well as tricky-to-parse tables containing hierarchical columns, hierarchical rows, sub-tables embedded in a single table, etc. Hope that helps.
Thank you so much!
.captions() is pretty slow: I estimate about 415 ms, which is much longer than df().