HazyResearch / fonduer

A knowledge base construction engine for richly formatted data
https://fonduer.readthedocs.io/
MIT License

unable to read images in the pdf file #525

Closed ashleo25 closed 4 years ago

ashleo25 commented 4 years ago

Hi

I am passing HTML to fonduer, and it reports that it is unable to read an image from a figure. I took a PDF, converted it to HTML via pdftotree, and passed that HTML to fonduer. Is this an issue with pdftotree, in that it is not able to render images? I want to know what the mechanism is for linking or embedding images in the HTML so that fonduer can read them.

Please help or advise, as I am stuck on this issue.

HiromuHota commented 4 years ago

@ashleo25 Are you seeing this warning message: "No image found in Figure."?

Assuming you are seeing the above message: yes, this is caused by pdftotree currently not producing <img> elements (https://github.com/HazyResearch/pdftotree/issues/88). This is a known issue, and I'll be addressing it in the near future. If you can contribute a PR to fix https://github.com/HazyResearch/pdftotree/issues/88, that would be much appreciated.

HiromuHota commented 4 years ago

As of the next release, pdftotree can extract images (JPEG & BMP) and embed them inline. https://github.com/HazyResearch/pdftotree/pull/99
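One way to confirm that a converted HTML file really contains inline images before passing it to fonduer is to count the <img> elements whose src is a data: URI. The following is a minimal sketch using only the Python standard library. It assumes that "embed them inline" means data: URIs in the src attribute, and the names InlineImageCounter and count_inline_images are hypothetical helpers, not part of pdftotree or fonduer.

```python
from html.parser import HTMLParser


class InlineImageCounter(HTMLParser):
    """Counts <img> tags and how many of them carry a data: URI."""

    def __init__(self):
        super().__init__()
        self.total_imgs = 0
        self.inline_imgs = 0

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <img ... />.
        if tag != "img":
            return
        self.total_imgs += 1
        src = dict(attrs).get("src") or ""
        # Assumption: inline images are embedded as data: URIs.
        if src.startswith("data:"):
            self.inline_imgs += 1


def count_inline_images(html_text):
    """Return (total <img> count, count of <img> with a data: URI)."""
    counter = InlineImageCounter()
    counter.feed(html_text)
    counter.close()
    return counter.total_imgs, counter.inline_imgs


# Illustrative input: one figure with an inline image, one without.
sample = '<figure><img src="data:image/jpeg;base64,AAAA"/></figure><figure></figure>'
print(count_inline_images(sample))
```

If the second number is zero for HTML produced by pdftotree, the figures most likely still have no image for fonduer to read.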

ashleo25 commented 4 years ago

@HiromuHota Thanks for taking this up.

When is this release planned, if you can share?

HiromuHota commented 4 years ago

@ashleo25 We could release pdftotree now, but I'd appreciate it if you could test it. You can install it from the git repository with: pip install git+https://github.com/HazyResearch/pdftotree.git Let us know how it works for your use case.

ashleo25 commented 3 years ago

Hi @HiromuHota

Thanks for it. I checked, and now I can have images in the HTML; however, the color of the images needs to be looked at. I have one more question: how can I extract images from the PDF? For example, visualize=True gives me images, but it gives me everything (headers, tables). I only want the embedded images/figures to be extracted from the PDF, with no borders or tagging such as section, table, etc. Is that possible?

HiromuHota commented 3 years ago

@ashleo25 Thanks for testing and confirming that you can have images in the HTML. visualize=True is meant for debugging. If you want to extract images from the PDF instead of creating an HTML file with images, the "pdfimages" tool from poppler would be better suited to your use case.
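As a rough illustration of the pdfimages suggestion, the sketch below builds and runs the command from Python. It assumes poppler's pdfimages is installed and on the PATH, and it uses the -png option to write PNG files. The function names and the file paths (paper.pdf, out/figure) are placeholders invented for this example.

```python
import shutil
import subprocess


def build_pdfimages_command(pdf_path, output_prefix, as_png=True):
    """Build the argument list for poppler's pdfimages CLI."""
    cmd = ["pdfimages"]
    if as_png:
        cmd.append("-png")  # write PNG files instead of the default PPM/PBM
    cmd += [pdf_path, output_prefix]
    return cmd


def extract_images(pdf_path, output_prefix):
    """Run pdfimages if it is installed; raise a clear error otherwise."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages not found; install poppler first")
    subprocess.run(build_pdfimages_command(pdf_path, output_prefix), check=True)


# Only show the command here; call extract_images(...) to actually run it.
print(build_pdfimages_command("paper.pdf", "out/figure"))
```

Unlike visualize=True, this writes only the images embedded in the PDF, one file per image named after the given prefix, with no layout annotations.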

HiromuHota commented 3 years ago

@ashleo25 Also please open an issue at https://github.com/HazyResearch/pdftotree if your question/issue is related to pdftotree instead of fonduer.

ashleo25 commented 3 years ago

Hi @HiromuHota

One more point: all the images have a black background in the generated HTML. How can this be changed? Can I also get the figure name when I run fonduer on this HTML via HTMLDocPreprocessor?