ashleo25 closed this issue 4 years ago
@ashleo25 Are you seeing this warning message: "No image found in Figure."?
Assuming you are seeing the above message: yes, this happens because pdftotree currently does not produce <img> elements.
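To check whether a generated HTML file has this problem, something like the following could be used. It is only a minimal sketch using the Python standard library. It assumes the warning corresponds to a <figure> element with no <img> inside it, and the names FigureImageCounter and count_empty_figures are made up for illustration; they are not part of pdftotree or Fonduer.

```python
from html.parser import HTMLParser


class FigureImageCounter(HTMLParser):
    """Count <figure> elements and how many of them contain an <img>."""

    def __init__(self):
        super().__init__()
        self.figures = 0           # total <figure> elements seen
        self.figures_with_img = 0  # closed <figure> elements holding an <img>
        self._stack = []           # one flag per open <figure>: <img> seen yet?

    def handle_starttag(self, tag, attrs):
        if tag == "figure":
            self.figures += 1
            self._stack.append(False)
        elif tag == "img" and self._stack:
            # Mark the innermost open <figure> as containing an image.
            self._stack[-1] = True

    def handle_endtag(self, tag):
        if tag == "figure" and self._stack:
            if self._stack.pop():
                self.figures_with_img += 1


def count_empty_figures(html: str) -> int:
    """Return the number of <figure> elements that contain no <img>."""
    counter = FigureImageCounter()
    counter.feed(html)
    counter.close()
    return counter.figures - counter.figures_with_img
```

A non-zero result for the HTML produced from your PDF would suggest that the figures were emitted without images.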
https://github.com/HazyResearch/pdftotree/issues/88
This is a known issue and I'll be addressing it in the near future.
If you can contribute a PR to fix https://github.com/HazyResearch/pdftotree/issues/88, that would be much appreciated.
As of the next release, pdftotree can extract images (JPEG & BMP) and embed them inline. https://github.com/HazyResearch/pdftotree/pull/99
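For reference, one common way to embed an image inline in HTML is a base64 data URI. The sketch below shows that general technique only; whether pdftotree produces exactly this markup is an assumption, and image_to_img_tag is a hypothetical helper, not a pdftotree function.

```python
import base64


def image_to_img_tag(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Build an <img> tag whose src is a base64 data URI (illustrative helper)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f'<img src="data:{mime_type};base64,{encoded}" />'
```

Because the image data travels inside the HTML itself, no separate image files need to be shipped alongside the document.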
@HiromuHota Thanks for taking this up.
When is this release planned? Please share if you can.
@ashleo25 we could release pdftotree now, but I'd appreciate it if you could test it first.
You can install it from the git repository as: pip install git+https://github.com/HazyResearch/pdftotree.git
Let us know how it works for your use case.
Hi @HiromuHota
Thanks for this. I checked, and I can now get images in the HTML; however, the colors of the images need to be looked at. I have one more question: how can I extract images from the PDF? For example, visualize=True gives me images, but it gives me everything (headers, tables). I only want the embedded images/figures to be extracted from the PDF, with no borders or tags such as section or table. Is that possible?
@ashleo25
Thanks for testing and confirming that you can have images in html.
visualize=True is meant for debugging. If you want to extract images from the PDF instead of creating an HTML file with images, the "pdfimages" tool from Poppler would be better suited to your use case.
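If you want to drive pdfimages from Python, a rough sketch could look like this. It assumes the pdfimages binary from Poppler is installed and on your PATH; the function names and the "img" output prefix are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdfimages_command(pdf_path: str, output_prefix: str) -> list:
    """Build the pdfimages command line (illustrative helper)."""
    # -png asks pdfimages to write PNG files instead of its default PBM/PPM.
    return ["pdfimages", "-png", pdf_path, output_prefix]


def extract_images(pdf_path: str, output_dir: str) -> list:
    """Extract the embedded images of a PDF into output_dir as PNG files."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages not found; please install Poppler")
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    # pdfimages names its output files <prefix>-NNN.png
    prefix = str(out / "img")
    subprocess.run(build_pdfimages_command(pdf_path, prefix), check=True)
    return sorted(out.glob("img-*.png"))
```

Calling extract_images("paper.pdf", "images") would then return the paths of the extracted files, without the section or table boxes that visualize=True draws.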
@ashleo25 Also please open an issue at https://github.com/HazyResearch/pdftotree if your question/issue is related to pdftotree instead of fonduer.
Hi @HiromuHota
One more point: all the images have a black background in the generated HTML. How can this be changed? Can I also get the figure name when I run Fonduer on this HTML via HTMLDocPreprocessor?
Hi
I am passing HTML to Fonduer, and it says it is unable to read the image from a figure. I took a PDF, converted it to HTML via pdftotree, and am passing that HTML to Fonduer. Is this an issue with pdftotree, in that it is not able to render images? I want to know what the mechanism is for linking or embedding images in the HTML so that Fonduer can read them.
Please help/advise, as I am stuck on this issue.