Closed wajdikhattel closed 4 years ago
Let me clarify how to reproduce the issue.
_parse_figure
parses <figure>
and <img>
.
In the case of <figure>
, Fonduer assumes that <figure>
has one or more of child <img>
like below:
<figure>
<img src="pic_trulli.jpg" alt="Trulli" style="width:100%">
<figcaption>Fig.1 - Trulli, Puglia, Italy.</figcaption>
</figure>
(Example from https://www.w3schools.com/tags/tag_figure.asp)
Correct if I'm wrong, but I think this issue happens when <figure>
has no child <img>
like below:
<figure>
<figcaption>Fig.1 - Trulli, Puglia, Italy.</figcaption>
</figure>
Let me clarify how to reproduce the issue.
_parse_figure
parses<figure>
and<img>
. In the case of<figure>
, Fonduer assumes that<figure>
has one or more of child<img>
like below:<figure> <img src="pic_trulli.jpg" alt="Trulli" style="width:100%"> <figcaption>Fig.1 - Trulli, Puglia, Italy.</figcaption> </figure>
(Example from https://www.w3schools.com/tags/tag_figure.asp)
Correct if I'm wrong, but I think this issue happens when
<figure>
has no child<img>
like below:<figure> <figcaption>Fig.1 - Trulli, Puglia, Italy.</figcaption> </figure>
I just checked the HTML I'm working with and yes, you are correct.
It's actually a <figure bbox... ></figure>
somehow, and I'm actually using pdftotree for the pdf to html and my pdf contains some images.
I'll mention that in the steps to reproduce
Describe the bug When the html doesn't have an image, the
_parse_figure
from theparser.py
is not considering that in some cases theimgs
list could be empty.To Reproduce Steps to reproduce the behavior:
<figure></figure>
)Parser
with aHTMLDocPreprocessor
instanceParser.py (line 278 in apply)
Expected behavior It should exit the
_parse_figure
in case no figure was foundEnvironment (please complete the following information):