Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

feat(html): list-item adopts single child block element as its text #3499

Open SeanIsYoung opened 3 months ago

SeanIsYoung commented 3 months ago

Describe the bug: I'm using loose lists in Markdown (each item is separated by a blank line), and the HTML parser fails to identify the list. Depending on the context, it categorises the elements as either Title or NarrativeText.

Comparing the loose and tight lists, it seems to have something to do with the paragraph tag, but I'm not sure exactly how that affects the parser.

I can mostly work around this by parsing the Markdown text first and removing any blank lines that precede a list item, or by removing the paragraph tags from the HTML. But I'm not sure whether either of those will run into edge cases at some point, so it would be nice if the HTML parser could handle this.
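
A minimal sketch of the second workaround (removing the paragraph tags from the HTML) might look like the following. The helper name `unwrap_single_paragraph_items` is hypothetical and not part of unstructured, and the regex assumes bare `<li>` and `<p>` tags without attributes, as in the output of the converters used here. List items holding more than one paragraph are left untouched.

```python
import re

# Hypothetical helper, not part of unstructured.
# Matches an <li> whose only content is a single <p>...</p>. The tempered
# token (?!</?(?:p|li)\b) stops the capture from running across another
# paragraph or list-item boundary.
_LONE_P_IN_LI = re.compile(
    r"<li>\s*<p>((?:(?!</?(?:p|li)\b).)*)</p>\s*</li>",
    re.DOTALL,
)


def unwrap_single_paragraph_items(html: str) -> str:
    """Rewrite <li><p>text</p></li> as <li>text</li>; leave other items as-is."""
    return _LONE_P_IN_LI.sub(r"<li>\1</li>", html)


loose_html = "<ol>\n<li>\n<p>list item one.</p>\n</li>\n<li>\n<p>list item two.</p>\n</li>\n</ol>"
print(unwrap_single_paragraph_items(loose_html))
```

The rewritten HTML has the same shape as what a tight list produces, so it could then be passed to partition_html(text=...). Regex-based HTML rewriting is brittle, which is the edge-case concern mentioned above.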

To Reproduce

import markdown
from unstructured.partition.md import partition_md
from unstructured.partition.html import partition_html

loose_list = """
1. list item one.

2. list item two.

3. list item three.
"""

tight_list = """
1. list item one.
2. list item two.
3. list item three.
"""

print("markdown_loose:")
elements = partition_md(text=loose_list)
for el in elements:
    print(f"{el.category}: {el}")

print("\nmarkdown_tight:")
elements = partition_md(text=tight_list)
for el in elements:
    print(f"{el.category}: {el}")

print("\nhtml_loose:")
html = markdown.markdown(loose_list, extensions=["tables"])
print(html)

print("\nhtml_tight:")
html = markdown.markdown(tight_list, extensions=["tables"])
print(html)

Expected behavior: Both the tight and the loose lists should produce ListItem elements.

Environment Info

unstructured/scripts/collect_env.py:5: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  import pkg_resources
OS version:  Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python version:  3.11.9
unstructured version:  0.15.1
unstructured-inference version:  0.7.36
pytesseract version:  0.3.10
Torch version:  2.3.1.post300
Detectron2 version:  0.6
PaddleOCR is not installed
Libmagic version: file-5.41
magic file from /etc/magic:/usr/share/misc/magic
LibreOffice version:  LibreOffice 24.2.5.2 420(Build:2)

scanny commented 3 months ago

@SeanIsYoung I'm betting this is down to us currently using the markdown package for parsing MD into HTML.

Can you try converting the MD file to HTML with markdown-it and then partitioning that HTML with partition_html(), and see if that remedies the problem?

See this behavior that I believe is related: https://github.com/Unstructured-IO/unstructured/issues/3280

If swapping the converter works for this we can look at changing it out permanently.

pip install markdown-it-py==3.0.0 should get you the package if it's not installed already.

SeanIsYoung commented 3 months ago

Unfortunately that doesn't work. I came across that issue and tried it, but there's no difference in the HTML returned.

import markdown
from markdown_it import MarkdownIt

# loose_list is the string defined in the reproduction script above.
print("\nmarkdown_html:")
html = markdown.markdown(loose_list, extensions=["tables"])
print(html)

print("\nmarkdown_it_html:")
md = MarkdownIt("commonmark", {"html": True}).enable("table")
html = md.render(loose_list)
print(html)

markdown_html:
<ol>
<li>
<p>list item one.</p>
</li>
<li>
<p>list item two.</p>
</li>
<li>
<p>list item three.</p>
</li>
</ol>

markdown_it_html:
<ol>
<li>
<p>list item one.</p>
</li>
<li>
<p>list item two.</p>
</li>
<li>
<p>list item three.</p>
</li>
</ol>

scanny commented 3 months ago

Ah, interesting. So the problem here looks to be as you observed initially, the "nesting" of a <p> element within the <li> element.

tl;dr is we can probably enhance the HTML parser to work in this case.


Both <li> and <p> are block-level (roughly speaking, paragraph) HTML elements, and normally each block-level HTML element gets its own document element. In this case, internal to the HTML parser, elements are produced for both the <li> and the <p>, but since the <li> has no text of its own, that ListItem element gets dropped.
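
To illustrate what "no text of its own" means, here is a small sketch. It uses the standard library's xml.etree.ElementTree purely for demonstration (an assumption of this example, not a description of how the unstructured parser is implemented), and `own_text` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

LOOSE = "<ol><li>\n<p>list item one.</p>\n</li></ol>"
TIGHT = "<ol><li>list item one.</li></ol>"


def own_text(element: ET.Element) -> str:
    """Text that sits directly inside `element`, excluding text inside its children.

    That is the text before the first child plus the tail text after each child.
    """
    parts = [element.text or ""] + [child.tail or "" for child in element]
    return "".join(parts).strip()


loose_li = ET.fromstring(LOOSE).find("li")
tight_li = ET.fromstring(TIGHT).find("li")

print(repr(own_text(loose_li)))  # '' (only whitespace around the <p>)
print(repr(own_text(tight_li)))  # 'list item one.'
```

In the loose case all of the visible text belongs to the <p>, so an element built from the <li> alone would be empty.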

In HTML, <li> can serve both as a block-level element (when it contains only text) and as a block-item container when it contains other block items. You can see that behavior when a list contains multiple paragraphs. For example, the HTML:

<ul>
  <li>
    <p>First list-item paragraph</p>
    <p>List-continuation paragraph</p>
  </li>
  <li>Second list-item</li>
</ul>

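renders in the browser with both paragraphs under the first bullet, the second as an indented continuation, followed by a separate second bullet.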

The challenge is that an unstructured document-element like ListItem cannot contain "child" document elements. So there's no ready representation for a list-continuation paragraph.

What we can probably do, though, is make the HTML parser a little more sophisticated, such that when an <li> contains exactly one block element, like the <p> in this case, it "adopts" that element's content as the content of the list item.
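
As a rough sketch of that adoption rule, and not the eventual implementation, the logic could look something like this. The function name `adopted_text`, the `ADOPTABLE_TAGS` set, and the use of xml.etree.ElementTree are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Assumed set of block tags whose text an <li> may adopt; a real parser
# would likely use its own, fuller notion of "block element".
ADOPTABLE_TAGS = {"p", "div", "blockquote", "pre"}


def _normalize(text: str) -> str:
    """Collapse runs of whitespace and trim the ends."""
    return " ".join(text.split())


def adopted_text(li: ET.Element) -> Optional[str]:
    """Text adopted from a lone block child, or None when the rule does not apply."""
    children = list(li)
    if len(children) != 1 or children[0].tag not in ADOPTABLE_TAGS:
        return None  # zero or several children: not the single-block case
    own = _normalize((li.text or "") + (children[0].tail or ""))
    if own:
        return None  # the <li> has text of its own, so there is nothing to adopt
    return _normalize("".join(children[0].itertext()))


html = (
    "<ol>"
    "<li>\n<p>list item one.</p>\n</li>"
    "<li>list item two.</li>"
    "<li><p>First paragraph</p><p>Continuation paragraph</p></li>"
    "</ol>"
)
for li in ET.fromstring(html).findall("li"):
    print(adopted_text(li))
```

Only the first item qualifies for adoption here. The second already has its own text, and the third is the multi-paragraph container case, which would still need separate handling.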