Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

Issue in partition_html and chunk_by_title #3168

Open pss-123 opened 3 weeks ago

pss-123 commented 3 weeks ago

As discussed in this Slack thread, there are some issues with the way HTML pages are partitioned and chunked.

I would ask the development team to take a look at this issue.

One more piece of feedback: it would be very helpful if the unstructured.io team could offer vision-based content extraction for HTML pages instead of rule-based extraction, since that would improve the accuracy and relevance of the extracted content. LLMs nowadays are very proficient at capturing such details.

Thank you.

MthwRobinson commented 3 weeks ago

Hi @pss-123 - thanks for opening this issue. Could you attach an example file to this issue and describe the behavior you're observing?

Separately, if you'd like to process HTML documents visually, you can convert the HTML file to a PDF or an image and then process the converted file using the hi_res strategy.

scanny commented 3 weeks ago

This is some sample HTML that will reproduce the issue:

<p>
  American football is by several measures the most popular spectator sport in the United States;
  <sup id="cite_ref-574" class="reference">
    <a href="#cite_note-574">[554]</a>
  </sup>
  the
  <a href="https://en.wikipedia.org/wiki/National_Football_League" title="National Football League">
    National Football League
  </a> has the highest average attendance of any sports league in the world, and the
  <a href="https://en.wikipedia.org/wiki/Super_Bowl" title="Super Bowl">Super Bowl</a> is watched
  by tens of millions globally.
  <sup id="cite_ref-575" class="reference">
    <a href="#cite_note-575">[555]</a>
  </sup>
  However, baseball has been regarded as the U.S.
  "<a href="https://en.wikipedia.org/wiki/National_sport" title="National sport">national sport</a>"
  since the late 19th century. After American football, the next four most popular professional
  team sports are basketball, baseball, soccer, and ice hockey. Their premier leagues are,
  respectively, the
  <a
    href="https://en.wikipedia.org/wiki/National_Basketball_Association"
    title="National Basketball Association"
  >
    National Basketball Association
  </a>,
  <a href="https://en.wikipedia.org/wiki/Major_League_Baseball" title="Major League Baseball">
    Major League Baseball
  </a>,
  <a
    href="https://en.wikipedia.org/wiki/Major_League_Soccer"
    title="Major League Soccer"
  >
    Major League Soccer
  </a>,
  and the
  <a href="https://en.wikipedia.org/wiki/National_Hockey_League" title="National Hockey League">
    National Hockey League
  </a>.
  The most-watched
  <a href="https://en.wikipedia.org/wiki/Individual_sport" title="Individual sport">
    individual sports
  </a>
  in the U.S. are
  <a
    href="https://en.wikipedia.org/wiki/Golf_in_the_United_States"
    title="Golf in the United States"
  >
    golf
  </a> and
  <a href="https://en.wikipedia.org/wiki/Auto_racing" title="Auto racing">auto racing</a>,
  particularly
  <a href="https://en.wikipedia.org/wiki/NASCAR" title="NASCAR">NASCAR</a>
  and
  <a href="https://en.wikipedia.org/wiki/IndyCar" title="IndyCar">IndyCar</a>.
  <sup id="cite_ref-576" class="reference"><a href="#cite_note-576">[556]</a></sup>
  <sup id="cite_ref-577" class="reference"><a href="#cite_note-577">[557]</a></sup>
</p>

produces:

HTMLText('[554]')
HTMLNarrativeText('National Football League\n   has the highest average attendance of any sports league in the world, and the')
HTMLNarrativeText('Super Bowl is watched\n  by tens of millions globally.')
HTMLText('[555]')
HTMLNarrativeText(
    'national sport"\n  since the late 19th century. After American football, the next four most popular'
    ' professional\n  team sports are basketball, baseball, soccer, and ice hockey. Their premier leagues'
    ' are,\n  respectively, the'
)
HTMLText('National Basketball Association\n  ,')
HTMLText('Major League Baseball\n  ,')
HTMLTitle('Major League Soccer\n  ,\n  and the')
HTMLTitle('National Hockey League\n  .\n  The most-watched')
HTMLNarrativeText('individual sports\n  \n  in the U.S. are')
HTMLTitle('golf\n   and')
HTMLTitle('auto racing,\n  particularly')
HTMLTitle('NASCAR\n  and')
HTMLTitle('IndyCar.')
HTMLText('[556]')
HTMLText('[557]')

Note also that the initial <p> element text "American football is by several measures ..." is dropped, along with other portions of text such as "However, baseball has been regarded as the U.S. ...".
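For comparison, here is a minimal sketch of the behavior one would expect: the whole <p> becomes a single text element, with the text of inline children such as <a> and <sup> kept in place. It uses only Python's standard-library html.parser and is not how unstructured implements partition_html; the names ParagraphTextExtractor and extract_paragraphs are hypothetical, chosen for illustration.

```python
from html.parser import HTMLParser


class ParagraphTextExtractor(HTMLParser):
    """Collect the text of each <p>, keeping inline children (<a>, <sup>, ...) in the flow."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._parts = None  # None while the parser is outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "p" and self._parts is not None:
            # Join all text fragments, then collapse runs of whitespace to one space.
            text = " ".join("".join(self._parts).split())
            if text:
                self.paragraphs.append(text)
            self._parts = None

    def handle_data(self, data):
        # Text inside nested inline elements arrives here too, in document order.
        if self._parts is not None:
            self._parts.append(data)


def extract_paragraphs(html):
    """Return one whitespace-normalized string per <p> element (hypothetical helper)."""
    parser = ParagraphTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Shortened version of the sample above.
sample = (
    "<p>\n"
    "  American football is by several measures the most popular spectator sport;\n"
    '  <sup id="cite_ref-574" class="reference">\n'
    '    <a href="#cite_note-574">[554]</a>\n'
    "  </sup>\n"
    "  the\n"
    '  <a href="https://en.wikipedia.org/wiki/National_Football_League">\n'
    "    National Football League\n"
    "  </a> has the highest average attendance of any sports league in the world.\n"
    "</p>"
)

for paragraph in extract_paragraphs(sample):
    print(paragraph)
```

With this approach the leading text and the text between inline elements stay in the same paragraph instead of being dropped or split into separate Title and Text elements. The whitespace collapse is deliberately naive: where a closing tag sits on its own line before punctuation, a space remains before the punctuation mark.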

pss-123 commented 3 weeks ago

> Separately, if you'd like to process HTML documents visually, you can convert the HTML file to a PDF or an image and then process the converted file using the hi_res strategy.

@MthwRobinson Thank you for providing this idea. I will give it a try for sure.