Closed axfelix closed 7 years ago
Hi @axfelix, CERMINE and GROBID are two separate projects; please use GROBID's issue tracker for this.
Oh, sorry, I actually meant to say Cermine. That's embarrassing -- I was thinking of Grobid earlier this morning before I opened the issue.
Ah, ok then. We actually use the iText library to parse the PDF stream, not Poppler. I believe, however, that iText also supports extracting images, so this might be possible.
Could you describe the use case in more detail? In particular, what would you like to obtain in the output? Just a set of images extracted from a PDF file, or more information about them?
Sure -- the use case is getting <fig> elements in the output, providing a relative link to a .png file that is produced in the same output directory as the XML. Right now, for us to do this, we need to run pdfimages on top of Cermine or Grobid and add all of the <fig> elements to the end of the article body just to get them in there at all.
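For readers unfamiliar with the workaround described above, here is a minimal Python sketch of that kind of post-processing step. It is not code from Cermine or Grobid: the function names `extract_images` and `append_figs`, the `fig1`, `fig2`, ... ids, and the assumption that `<body>` is a direct child of the root element are all hypothetical. It only assumes that Poppler's `pdfimages` is on the PATH and that its `-png` option is available.

```python
import subprocess
import xml.etree.ElementTree as ET
from pathlib import Path

XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK_NS)


def extract_images(pdf_path, out_dir, prefix="img"):
    """Run Poppler's pdfimages and return the names of the PNG files it wrote.

    Assumes the `pdfimages` binary is installed; not called by append_figs.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["pdfimages", "-png", str(pdf_path), str(out_dir / prefix)],
        check=True,
    )
    return sorted(p.name for p in out_dir.glob(prefix + "-*.png"))


def append_figs(jats_xml, image_names):
    """Append one <fig> per image to the end of the article <body>.

    Each <fig> holds a <graphic> whose xlink:href is the relative file name.
    The ids (fig1, fig2, ...) are made up for illustration.
    """
    root = ET.fromstring(jats_xml)
    body = root.find("body")
    if body is None:
        # Hypothetical fallback: create a body if the document has none.
        body = ET.SubElement(root, "body")
    for i, name in enumerate(image_names, start=1):
        fig = ET.SubElement(body, "fig", {"id": "fig%d" % i})
        ET.SubElement(fig, "graphic", {"{%s}href" % XLINK_NS: name})
    return ET.tostring(root, encoding="unicode")
```

The figures land at the end of the body rather than near the text that refers to them, which is the limitation this issue is about.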
Ok, I will take a closer look at this and I'll get back to you when I know more.
@axfelix I finally found time to look at this :) It seems fairly easy to extract the images and add relative links at the end of the article body, as you described. Extracting the right captions, however, is not as trivial and would require more work and time. Do you need the captions? Would the images alone, without captions, still be helpful?
hi Dominika,
The images without the captions would still be very useful -- the captions are of interest, but not needing to call an external library to hack the JATS afterward in order to preserve the images is a priority.
@axfelix I implemented image extraction in the "images_extraction" branch. Would you be interested in testing it and providing feedback?
It requires building the code from the branch, but this should be straightforward (more information in the main README). Images are extracted by the ContentExtractor class by default, so it should suffice to provide the path to the PDFs using the -path option (again, the extraction command from the README should suffice).
One thing I noticed: from some PDFs, a lot of 1x1 pixel images (dots) are extracted, and in some cases I also saw horizontal lines (images with 1-pixel height). Do you think the code should filter those out?
Wow, thanks for the quick implementation, Dominika! Built and tested and appears to be working great -- this saves us an additional library call and is hugely appreciated.
I'd be in favour of filtering out images that have a 1-pixel height or width; we can do this with an additional imagemagick pass but I don't see much reason not to do it by default.
Great. I've added filtering of images with 1-pixel height or width and merged everything into master. I am closing this issue for now; if you find any problems or bugs, it can be reopened.
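As an illustration of the filtering rule discussed above (drop any image whose width or height is 1 pixel), here is a small Python sketch that works on already-extracted PNG files. It is not Cermine's implementation, which is written in Java. The names `png_size`, `is_degenerate`, and `keep_real_images`, and the `min_side` parameter, are hypothetical. Dimensions are read straight from the PNG IHDR header, so no imaging library is needed.

```python
import struct
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(header):
    """Return (width, height) from the first 24 bytes of a PNG file.

    Layout: 8-byte signature, 4-byte chunk length, b"IHDR",
    then width and height as big-endian unsigned 32-bit integers.
    """
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError("not a PNG header")
    return struct.unpack(">II", header[16:24])


def is_degenerate(width, height, min_side=2):
    """True for dots and hairlines: any side shorter than min_side pixels."""
    return width < min_side or height < min_side


def keep_real_images(directory):
    """List the PNG files in a directory that are not 1-pixel dots or lines."""
    kept = []
    for path in sorted(Path(directory).glob("*.png")):
        with path.open("rb") as fh:
            width, height = png_size(fh.read(24))
        if not is_degenerate(width, height):
            kept.append(path)
    return kept
```

With the default `min_side=2`, a 1x1 dot and a 300x1 horizontal line are both rejected, while a 2x2 image is kept.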
No need to reopen, but curious: when is your next release scheduled?
There is no exact date, most likely in a few weeks.
Hi,
Are there plans to add a call to pdfimages (from xpdf/poppler) to ensure images are extracted when parsing full text via Grobid? pdfimages' accuracy and performance seem to be very good, but I don't think it's directly used by any PDF parsers currently.