CeON / CERMINE

Content ExtRactor and MINEr
GNU Affero General Public License v3.0

Add pdfimages support for image extraction? #34

Closed axfelix closed 7 years ago

axfelix commented 7 years ago

Hi,

Are there plans to add a call to pdfimages (from xpdf/poppler) to ensure images are extracted when parsing full text via Grobid? pdfimages' accuracy and performance seem to be very good, but I don't think it's directly used by any PDF parsers currently.

dtkaczyk commented 7 years ago

Hi @axfelix, CERMINE and GROBID are two different, separate projects. Please use GROBID's issue tracker for this.

axfelix commented 7 years ago

Oh, sorry, I actually meant to say Cermine. That's embarrassing -- I was thinking of Grobid earlier this morning before I opened the issue.

dtkaczyk commented 7 years ago

Ah, ok then. We actually use the iText library to parse the PDF stream, not Poppler. I believe, however, that iText also supports extracting images, so this might be possible.

Could you describe the use case in more detail? In particular, what would you like to obtain in the output? Just a set of images extracted from a PDF file, or more information about them?

axfelix commented 7 years ago

Sure -- the use case is getting <fig> elements in the output, providing a relative link to a .png file that is produced in the same output directory as the XML. Right now, for us to do this, we need to run pdfimages on top of Cermine or Grobid and add all of the <fig> elements to the end of the article body just to get them in there at all.
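
For illustration, the post-processing workaround described above might look roughly like the sketch below. It is not code from CERMINE or from anyone in this thread. It assumes the images have already been written as PNG files next to the XML (for example by `pdfimages -png`), and the function name `append_figures`, the `fig1`, `fig2`, ... ids, and the sample file names are all hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Iterable

# Namespace used by JATS for the href attribute on <graphic>.
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK_NS)


def append_figures(jats_xml: str, image_names: Iterable[str]) -> str:
    """Append one <fig> per image to the end of <body>, linked by relative path."""
    root = ET.fromstring(jats_xml)
    body = root.find("body")
    if body is None:
        # Assumption: if the extractor produced no <body>, create an empty one.
        body = ET.SubElement(root, "body")
    for index, name in enumerate(image_names, start=1):
        # Hypothetical id scheme; captions are deliberately left out.
        fig = ET.SubElement(body, "fig", {"id": f"fig{index}"})
        ET.SubElement(fig, "graphic", {f"{{{XLINK_NS}}}href": name})
    return ET.tostring(root, encoding="unicode")
```

Calling `append_figures(xml_text, ["article-000.png", "article-001.png"])` returns the document with two `<fig>` elements after the existing body content. The figures land at the end of the body rather than where they appear in the article, which is the limitation described above.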

dtkaczyk commented 7 years ago

Ok, I will take a closer look at this and I'll get back to you when I know more.

dtkaczyk commented 7 years ago

@axfelix I finally found time to look at this :) It seems fairly easy to extract the images and add relative links at the end of the article body, as you described. Extracting the right captions, however, is not as trivial and would require more work and time. Do you need the captions? Would the images alone, without captions, still be helpful?

axfelix commented 7 years ago

Hi Dominika,

The images without the captions would still be very useful -- the captions are of interest, but the priority is not having to call an external library and hack the JATS afterward just to preserve the images.

dtkaczyk commented 7 years ago

@axfelix I implemented image extraction in the "images_extraction" branch. Would you be interested in testing it and providing feedback?

It requires building the code from the branch, but this should be straightforward (more information in the main README). Images are extracted by the ContentExtractor class by default, so it should suffice to provide the path to the PDFs using the -path option (again, the extraction command from the README should work).

One thing I noticed: from some PDFs a lot of 1x1 pixel images (dots) are extracted, and in some cases I also saw horizontal lines (images with 1-pixel height). Do you think the code should filter those out?

axfelix commented 7 years ago

Wow, thanks for the quick implementation, Dominika! Built and tested and appears to be working great -- this saves us an additional library call and is hugely appreciated.

I'd be in favour of filtering out images that have a 1-pixel height or width; we can do this with an additional imagemagick pass but I don't see much reason not to do it by default.

dtkaczyk commented 7 years ago

Great. I've added filtering of images with 1-pixel height or width and merged everything into master. I am closing this issue for now; if you find any problems or bugs, it can be reopened.
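
As a rough illustration of the filtering rule discussed above (drop any image whose width or height is a single pixel), here is a minimal sketch. It is not the code merged into CERMINE, which is written in Java. It assumes the extracted images are PNG data held in memory, and the names `png_size`, `is_degenerate`, and `keep_useful` are hypothetical.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data: bytes) -> tuple:
    """Return (width, height) read from the IHDR chunk at the start of PNG data."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG image")
    # Width and height are two big-endian 32-bit integers right after "IHDR".
    return struct.unpack(">II", data[16:24])


def is_degenerate(width: int, height: int) -> bool:
    """True for dots and lines: images that are at most 1 pixel wide or high."""
    # Assumption: "<= 1" is used instead of "== 1" so a zero dimension is rejected too.
    return width <= 1 or height <= 1


def keep_useful(images: dict) -> dict:
    """Filter a {file name: PNG bytes} mapping, dropping degenerate images."""
    return {
        name: data
        for name, data in images.items()
        if not is_degenerate(*png_size(data))
    }
```

Only the PNG header is inspected, so no imaging library is needed. The same width/height check would apply to any format once the dimensions are known.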

axfelix commented 7 years ago

No need to reopen, but curious: when is your next release scheduled?

dtkaczyk commented 7 years ago

There is no exact date yet, but most likely in a few weeks.