Open pss-123 opened 3 weeks ago
Hi @pss-123 - thanks for opening this issue. Could you attach an example file to this issue and describe the behavior you're observing?
Separately, if you'd like to process HTML documents visually, you can convert the HTML file to a PDF or an image and then process the converted file using the hi_res strategy.
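A minimal sketch of that workaround, in case it helps. It assumes the wkhtmltopdf command-line tool is installed for the HTML-to-PDF step (any converter would do) and that unstructured is installed with its PDF extras. The helper names build_convert_command and partition_html_visually are hypothetical and not part of the library.

```python
import subprocess


def build_convert_command(html_path: str, pdf_path: str) -> list[str]:
    """Build the converter command line (assumes wkhtmltopdf is on PATH)."""
    return ["wkhtmltopdf", html_path, pdf_path]


def partition_html_visually(html_path: str, pdf_path: str):
    """Render the HTML file to a PDF, then partition the PDF with hi_res."""
    # Step 1: convert HTML to PDF with the external tool.
    subprocess.run(build_convert_command(html_path, pdf_path), check=True)

    # Step 2: partition the rendered PDF. The import is deferred so this
    # module can be loaded even where unstructured is not installed.
    from unstructured.partition.pdf import partition_pdf

    return partition_pdf(filename=pdf_path, strategy="hi_res")
```

Calling partition_html_visually("page.html", "page.pdf") would return the list of elements detected on the rendered page. Note that hi_res is slower than the default HTML partitioner because it runs a layout model.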
This is some sample HTML that will reproduce the issue:
<p>
American football is by several measures the most popular spectator sport in the United States;
<sup id="cite_ref-574" class="reference">
<a href="#cite_note-574">[554]</a>
</sup>
the
<a href="https://en.wikipedia.org/wiki/National_Football_League" title="National Football League">
National Football League
</a> has the highest average attendance of any sports league in the world, and the
<a href="https://en.wikipedia.org/wiki/Super_Bowl" title="Super Bowl">Super Bowl</a> is watched
by tens of millions globally.
<sup id="cite_ref-575" class="reference">
<a href="#cite_note-575">[555]</a>
</sup>
However, baseball has been regarded as the U.S.
"<a href="https://en.wikipedia.org/wiki/National_sport" title="National sport">national sport</a>"
since the late 19th century. After American football, the next four most popular professional
team sports are basketball, baseball, soccer, and ice hockey. Their premier leagues are,
respectively, the
<a
href="https://en.wikipedia.org/wiki/National_Basketball_Association"
title="National Basketball Association"
>
National Basketball Association
</a>,
<a href="https://en.wikipedia.org/wiki/Major_League_Baseball" title="Major League Baseball">
Major League Baseball
</a>,
<a
href="https://en.wikipedia.org/wiki/Major_League_Soccer"
title="Major League Soccer"
>
Major League Soccer
</a>,
and the
<a href="https://en.wikipedia.org/wiki/National_Hockey_League" title="National Hockey League">
National Hockey League
</a>.
The most-watched
<a href="https://en.wikipedia.org/wiki/Individual_sport" title="Individual sport">
individual sports
</a>
in the U.S. are
<a
href="https://en.wikipedia.org/wiki/Golf_in_the_United_States"
title="Golf in the United States"
>
golf
</a> and
<a href="https://en.wikipedia.org/wiki/Auto_racing" title="Auto racing">auto racing</a>,
particularly
<a href="https://en.wikipedia.org/wiki/NASCAR" title="NASCAR">NASCAR</a>
and
<a href="https://en.wikipedia.org/wiki/IndyCar" title="IndyCar">IndyCar</a>.
<sup id="cite_ref-576" class="reference"><a href="#cite_note-576">[556]</a></sup>
<sup id="cite_ref-577" class="reference"><a href="#cite_note-577">[557]</a></sup>
</p>
produces:
HTMLText('[554]')
HTMLNarrativeText('National Football League\n has the highest average attendance of any sports league in the world, and the')
HTMLNarrativeText('Super Bowl is watched\n by tens of millions globally.')
HTMLText('[555]')
HTMLNarrativeText(
'national sport"\n since the late 19th century. After American football, the next four most popular'
' professional\n team sports are basketball, baseball, soccer, and ice hockey. Their premier leagues'
' are,\n respectively, the'
)
HTMLText('National Basketball Association\n ,')
HTMLText('Major League Baseball\n ,')
HTMLTitle('Major League Soccer\n ,\n and the')
HTMLTitle('National Hockey League\n .\n The most-watched')
HTMLNarrativeText('individual sports\n \n in the U.S. are')
HTMLTitle('golf\n and')
HTMLTitle('auto racing,\n particularly')
HTMLTitle('NASCAR\n and')
HTMLTitle('IndyCar.')
HTMLText('[556]')
HTMLText('[557]')
Note also that the initial <p> element text "American football is by several measures ..." is dropped, along with other portions of text such as "However, baseball has been regarded as the U.S. ...".
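To make the dropped text easier to spot, here is a small standard-library sketch that flattens the source HTML to plain text and reports which phrases never show up in the partitioned output. It is only a diagnostic aid, not how unstructured partitions HTML, and the names TextCollector, flatten_text, and missing_phrases are made up for this example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect every text node in document order, ignoring all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def flatten_text(html: str) -> str:
    """Return all text in `html` with whitespace runs collapsed to one space."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join("".join(collector.chunks).split())


def missing_phrases(html: str, element_texts: list[str], phrases: list[str]) -> list[str]:
    """Return phrases that occur in the source HTML but not in the output."""
    source = flatten_text(html)
    # Normalize whitespace in each element's text before comparing.
    output = " ".join(" ".join(text.split()) for text in element_texts)
    return [p for p in phrases if p in source and p not in output]


sample = (
    '<p>American football is popular;<sup><a href="#n">[554]</a></sup> the '
    '<a href="#nfl">National Football League</a> has high attendance.</p>'
)
texts = ["[554]", "National Football League\n has high attendance."]
print(missing_phrases(sample, texts, ["American football", "National Football League"]))
```

Passing the text of each returned element as element_texts, with a handful of phrases taken from the original paragraph, gives a quick list of what was lost.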
@MthwRobinson Thank you for providing this idea. I will give it a try for sure.
As discussed in this Slack link, there are some issues in the way that HTML pages are partitioned and chunked.
I would like to ask the relevant development team to look into the issue.
One more piece of feedback: it would be very helpful for extraction accuracy and relevance if the unstructured.io team could offer vision-based content extraction for HTML pages instead of rule-based extraction. LLMs nowadays are very proficient at capturing such details.
Thank you.