HazyResearch / fonduer

A knowledge base construction engine for richly formatted data
https://fonduer.readthedocs.io/
MIT License

No image in the child image node #437

Closed by wajdikhattel 4 years ago

wajdikhattel commented 4 years ago

Describe the bug

When the HTML contains a figure without an image, _parse_figure in parser.py does not account for the fact that the imgs list can be empty.

To Reproduce

Steps to reproduce the behavior:

  1. Have an HTML document with a figure element that contains no image (e.g. <figure></figure>)
  2. Run the Parser with an HTMLDocPreprocessor instance
  3. An exception is raised in parser.py (line 278, in apply)

Expected behavior

_parse_figure should return early when no image is found in the figure.
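The expected behavior can be sketched as a guard on the list of child images. This is a minimal illustration, not Fonduer's actual code: the helper name first_child_img is hypothetical, and it uses the standard library's xml.etree.ElementTree rather than whatever parser Fonduer uses internally.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def first_child_img(node: ET.Element) -> Optional[ET.Element]:
    """Return the first direct <img> child of node, or None if there is none.

    Hypothetical helper for illustration; not part of Fonduer.
    """
    imgs = [child for child in node if child.tag == "img"]
    if not imgs:
        # Empty list: return early instead of indexing imgs[0].
        return None
    return imgs[0]


empty_figure = ET.fromstring("<figure></figure>")
figure_with_img = ET.fromstring('<figure><img src="pic_trulli.jpg" /></figure>')

print(first_child_img(empty_figure))                # None
print(first_child_img(figure_with_img).get("src"))  # pic_trulli.jpg
```

A caller would check for None and skip creating a figure entry for that node.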

HiromuHota commented 4 years ago

Let me clarify how to reproduce the issue.

_parse_figure parses <figure> and <img>. In the case of <figure>, Fonduer assumes that the <figure> has one or more child <img> elements, as below:

<figure>
  <img src="pic_trulli.jpg" alt="Trulli" style="width:100%">
  <figcaption>Fig.1 - Trulli, Puglia, Italy.</figcaption>
</figure>

(Example from https://www.w3schools.com/tags/tag_figure.asp)

Correct me if I'm wrong, but I think this issue happens when <figure> has no child <img>, as below:

<figure>
  <figcaption>Fig.1 - Trulli, Puglia, Italy.</figcaption>
</figure>
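
The difference between the two snippets can be checked with a few lines of Python. The sketch below is only an assumption about the failure mode: the thread does not show the traceback, so indexing the first element of an empty list is one plausible cause, not a confirmed one. The helper child_imgs is hypothetical, and the <img> tag is written self-closed so that the standard library's XML parser accepts it.

```python
import xml.etree.ElementTree as ET

# The two <figure> variants from the comment above.
with_img = ET.fromstring(
    '<figure><img src="pic_trulli.jpg" alt="Trulli" />'
    "<figcaption>Fig.1 - Trulli, Puglia, Italy.</figcaption></figure>"
)
without_img = ET.fromstring(
    "<figure><figcaption>Fig.1 - Trulli, Puglia, Italy.</figcaption></figure>"
)


def child_imgs(figure):
    """Collect the direct <img> children of a figure (hypothetical helper)."""
    return [child for child in figure if child.tag == "img"]


print(len(child_imgs(with_img)))     # 1
print(len(child_imgs(without_img)))  # 0

# Assuming the first image is taken unconditionally, the empty case fails.
try:
    child_imgs(without_img)[0]
except IndexError as exc:
    print("IndexError:", exc)
```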

wajdikhattel commented 4 years ago

I just checked the HTML I'm working with, and yes, you are correct. The element is actually an empty <figure bbox... ></figure>. I'm using pdftotree to convert the PDF to HTML, and my PDF contains some images. I'll mention that in the steps to reproduce.
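
A quick way to check converted HTML for such figures before running the parser is a small audit pass. This is a sketch under stated assumptions: the FigureAudit class is hypothetical, the bbox value in the sample is made up for illustration, and the check counts an <img> anywhere inside a <figure>, which is looser than a direct-child test.

```python
from html.parser import HTMLParser
from typing import List


class FigureAudit(HTMLParser):
    """Record, for each <figure>, whether an <img> appears inside it."""

    def __init__(self) -> None:
        super().__init__()
        self._open: List[bool] = []    # one flag per currently open <figure>
        self.has_img: List[bool] = []  # one flag per closed <figure>, in closing order

    def handle_starttag(self, tag, attrs):
        if tag == "figure":
            self._open.append(False)
        elif tag == "img" and self._open:
            # Mark the innermost open figure as containing an image.
            self._open[-1] = True

    def handle_endtag(self, tag):
        if tag == "figure" and self._open:
            self.has_img.append(self._open.pop())


# Sample input; the bbox attribute value is invented for this example.
sample = """
<figure bbox="0,0,10,10"></figure>
<figure><img src="pic_trulli.jpg" alt="Trulli"><figcaption>Fig.1</figcaption></figure>
"""

audit = FigureAudit()
audit.feed(sample)
print(audit.has_img)  # [False, True]
```

Any False entry points at a <figure> that would reach the parser without an image.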