allenai / pdffigures2

Given a scholarly PDF, extract figures, tables, captions, and section titles.
http://pdffigures2.allenai.org/
Apache License 2.0

For figure heavy book, unable to disambiguate caption candidates #53

Open snibbor opened 2 years ago

snibbor commented 2 years ago

Hello,

I started experimenting with this tool on pathology books and noticed that for figure-heavy books (little to no paragraph text, just figures and tables) the algorithm currently cannot extract the figures with their captions.

Here is an example snapshot of a book I was testing (Differential Diagnosis in Surgical Pathology: Breast, Jean F. Simpson MD, Melinda E. Sanders MD): [image: example_surg_path_breast]

As you can see, the chapters start with a table and then proceed with just large figures with captions.

I am getting the "Unable to disambiguate caption candidates..." error for all the figures in this book.

Could you give some tips on how to enhance or troubleshoot the code so it works with books like this? I would really like to use this tool to scrape image-caption pairs from them if possible.

Thanks, Jack