Open SeanIsYoung opened 3 months ago
@SeanIsYoung I'm betting this is down to us currently using the markdown package for parsing MD into HTML.
Can you try converting the MD file to HTML with markdown-it, then partitioning that HTML with partition_html(), and see if that remedies the problem?
See this behavior that I believe is related: https://github.com/Unstructured-IO/unstructured/issues/3280
If swapping the converter works for this we can look at changing it out permanently.
pip install markdown-it-py==3.0.0
should get you the package if it's not installed already.
Unfortunately that doesn't work. I came across that issue and tried it, but there's no difference in the HTML returned.
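import markdown
from markdown_it import MarkdownIt

# Assumed input (not shown in the original comment): a loose ordered list.
loose_list = "1. list item one.\n\n2. list item two.\n\n3. list item three."

print("\nmarkdown_html:")
html = markdown.markdown(loose_list, extensions=["tables"])
print(html)
print("\nmarkdown_it_html:")
md = MarkdownIt("commonmark", {"html": True}).enable("table")
html = md.render(loose_list)
print(html)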
markdown_html:
<ol>
<li>
<p>list item one.</p>
</li>
<li>
<p>list item two.</p>
</li>
<li>
<p>list item three.</p>
</li>
</ol>
markdown_it_html:
<ol>
<li>
<p>list item one.</p>
</li>
<li>
<p>list item two.</p>
</li>
<li>
<p>list item three.</p>
</li>
</ol>
Ah, interesting. So the problem here looks to be, as you observed initially, the "nesting" of a <p> element within the <li> element.
tl;dr is we can probably enhance the HTML parser to work in this case.
Both <li> and <p> are block-level (roughly speaking, paragraph) HTML elements, and normally each block-level HTML element gets its own document-element. In this case, internal to the HTML parser, elements are produced for both the <li> and the <p>, but since the <li> has no text of its own, that ListItem element gets dropped.
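To make "no text of its own" concrete, here is a minimal sketch that uses only the standard library's ElementTree. It is not the unstructured parser; the helper name own_text and the sample strings are assumptions for illustration. It compares the text sitting directly inside each <li> for a loose and a tight list:

```python
import xml.etree.ElementTree as ET

# Loose-list HTML, shaped like the converter output shown earlier in the thread.
loose_html = """<ol>
<li>
<p>list item one.</p>
</li>
<li>
<p>list item two.</p>
</li>
</ol>"""

# Tight-list HTML: the text sits directly inside each <li>.
tight_html = "<ol>\n<li>list item one.</li>\n<li>list item two.</li>\n</ol>"


def own_text(li):
    """Return text directly inside <li>, ignoring text inside child elements."""
    parts = [li.text or ""] + [child.tail or "" for child in li]
    return "".join(parts).strip()


loose_own = [own_text(li) for li in ET.fromstring(loose_html).iter("li")]
tight_own = [own_text(li) for li in ET.fromstring(tight_html).iter("li")]

print(loose_own)  # ['', '']
print(tight_own)  # ['list item one.', 'list item two.']
```

In the loose case every <li> holds only whitespace around its <p>, which matches the explanation for why the ListItem ends up empty.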
In HTML, <li> can serve both as a block-level element (when it contains only text) and as a block-item container (when it contains other block items). You can see that behavior when a list item contains multiple paragraphs. For example, the HTML:
<ul>
<li>
<p>First list-item paragraph</p>
<p>List-continuation paragraph</p>
</li>
<li>Second list-item</li>
</ul>
produces this rendering in the browser:
• First list-item paragraph

  List-continuation paragraph

• Second list-item
The challenge is that an unstructured document-element like ListItem cannot contain "child" document elements. So there's no ready representation for a list-continuation paragraph.
What we can probably do, though, is make the HTML parser a little more sophisticated, such that when an <li> contains exactly one block element, like the <p> in this case, the parser "adopts" that content as the content of the list-item.
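The "adoption" idea might look roughly like the sketch below. This is a hypothetical illustration built on the standard library's ElementTree, not the actual unstructured implementation; BLOCK_TAGS, list_item_text, and list_item_texts are made-up names, and the tag set is only an illustrative subset. An <li> with several block children returns None here, because the thread leaves open how a list-continuation paragraph should be represented.

```python
import xml.etree.ElementTree as ET

# Illustrative subset of block-level tags (an assumption, not the parser's list).
BLOCK_TAGS = {"p", "div", "ul", "ol", "blockquote", "pre"}


def _normalize(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def list_item_text(li):
    """Return the text for one <li>, adopting a lone block child's text."""
    own = _normalize("".join([li.text or ""] + [child.tail or "" for child in li]))
    blocks = [child for child in li if child.tag in BLOCK_TAGS]
    if not blocks:
        # Plain list item: text plus any inline children.
        return _normalize("".join(li.itertext()))
    if len(blocks) == 1 and not own:
        # Exactly one block child and no text of its own: adopt the child's text.
        return _normalize("".join(blocks[0].itertext()))
    # Several blocks, or mixed text and blocks: left undecided in this sketch.
    return None


def list_item_texts(html):
    """Apply list_item_text to every <li> in a well-formed HTML fragment."""
    return [list_item_text(li) for li in ET.fromstring(html).iter("li")]


loose = "<ol>\n<li>\n<p>list item one.</p>\n</li>\n<li>\n<p>list item two.</p>\n</li>\n</ol>"
print(list_item_texts(loose))  # ['list item one.', 'list item two.']
```

With this rule, loose and tight lists yield the same list-item text, while the multi-paragraph case stays a separate design question.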
Describe the bug
I'm using loose lists in markdown (each item is separated by a blank line) and the HTML parser fails to identify the list. Depending on the context, it categorises the elements as either Title or Narrative Text.
Comparing the loose and tight lists, it seems to have something to do with the paragraph tag, but I'm not sure exactly how that affects the parser.
I can mostly work around this by parsing the markdown text first and removing any blank lines that precede a list item, or by removing the paragraph tags from the HTML. But I'm not sure whether either of those will run up against edge cases at some point, so it would be nice if the HTML parser could handle this.
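For reference, the two workarounds described above could be sketched as follows. These are regex-based approximations with hypothetical function names, not the reporter's actual code. They deliberately handle only simple cases (single-line items, a bare <li> wrapping one <p>), which is where the edge-case concern comes from.

```python
import re

# A list marker such as "1.", "2)", "-", "*" or "+", followed by whitespace.
_LIST_MARKER = r"[ \t]*(?:\d+[.)]|[-*+])[ \t]+"

# A list-item line followed by blank line(s) and then another list item.
_BLANK_BETWEEN_ITEMS = re.compile(
    rf"^({_LIST_MARKER}.*)\n(?:[ \t]*\n)+(?={_LIST_MARKER})",
    re.MULTILINE,
)

# An <li> whose only content is a single <p>...</p> with no nested <p>.
_LI_SINGLE_P = re.compile(
    r"<li>\s*<p>((?:(?!</?p\b).)*?)</p>\s*</li>",
    re.DOTALL,
)


def tighten_lists(md_text):
    """Workaround 1: drop blank lines between consecutive list items."""
    return _BLANK_BETWEEN_ITEMS.sub(r"\1\n", md_text)


def unwrap_single_paragraph_items(html):
    """Workaround 2: replace <li><p>text</p></li> with <li>text</li>."""
    return _LI_SINGLE_P.sub(r"<li>\1</li>", html)


loose_md = "1. list item one.\n\n2. list item two.\n\n3. list item three."
print(tighten_lists(loose_md))

loose_html = "<ol>\n<li>\n<p>list item one.</p>\n</li>\n</ol>"
print(unwrap_single_paragraph_items(loose_html))
```

List items that contain more than one paragraph are left untouched by the second function, so they would still need separate handling.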
To Reproduce
Expected behavior
Both the tight and the loose lists should result in ListItem elements.
Screenshots
Environment Info