Open nickphilip opened 5 months ago
@nickphilip - Thanks for reporting this, we'll take a look as soon as we're able.
@MthwRobinson Turns out this behavior is produced by the markdown
package, which is what partition_md()
uses to convert Markdown to HTML.
md_text = """
My list
- item 1
- item 2
- item 3
"""
>>> markdown.markdown(md_text)
<p>My list
- item 1
- item 2
- item 3</p>
Adding a blank line before the list results in the expected behavior:
md_text = """
My list
- item 1
- item 2
- item 3
"""
>>> markdown.markdown(md_text)
<p>My list</p>
<ul>
<li>item 1</li>
<li>item 2</li>
<li>item 3</li>
</ul>
The "official" John Gruber spec and reference implementation specifies this behavior, that a list requires a blank line before it. The gist is that a word-wrapped line that just happens to start with a "-" or "*" should not be interpreted as a list item, like:
The computed value of 7
* 3 equals 21.
In practice, Markdown parsers are more clever about this and are able to distinguish author intent reliably without the blank line before the list.
Note that using the markdown_it
package (already a dependency I believe, possibly for ingest), we get a "correct" conversion with or without the blank line.
My general impression, after poking around a bit, is that the markdown
package is overly dogmatic, like adhering closely to the "original" standard from 2005 or so. I think what we care about is that the maximum number of legitimate Markdown files parse as they would be expected to, like how they appear on GitHub or whatever other site they came from.
This "defacto-standard" behavior is what the CommonMark standard articulates and markdown_it
is CommonMark compliant. It's also somewhat more modern, being only 1 decade old rather than 2 :)
I'm inclined to think we'd be better off in general using the markdown_it
package since I believe it would parse more Markdown documents the way the author intended and the way they appear when rendered on their source.
@nickphilip In the meantime, a workaround for this would be:
from markdown_it import MarkdownIt
md = MarkdownIt("commonmark", {"html": True}).enable("table")
html_text = md.render(md_text)
elements = partition_html(text=html_text)
Basically converting the MD to HTML yourself and then using partition_html()
on the result.
Btw, @nickphilip what was the provenance of that problematic Markdown?
Is that something you just typed in by hand for experimentation purposes or did you have an actual "mis"-partitioning from a document from say StackOverflow or something?
@scanny - I found the issue testing unstructured.io against Markdown documents from public repos, as I noticed that my RAG app was not getting the expected context.
Great assessment of the root cause. Your workaround makes sense, but the fix would be preferable if one uses the runner (pipeline). Thanks!
@nickphilip when you say "public repos", does that imply this would be GFM (GitHub Flavored markdown) you're parsing?
@MthwRobinson After noodling this a bit and a little more research, I think there are four possible courses:
I think the right answer is 2 (CommonMark).
<br>
element (the way GFM does)I'm not sure there's a perfect answer, but I think CommonMark provides the broadest coverage for what we care about when partitioning and so is a good step in the right direction. I think we'd all prefer to avoid a new partitioning option until we can show a compelling use case that requires it.
I noticed that when the number of characters per line is very short in a list block in a Markdown document, the list is identified as a
Title
instead of aNarrativeText
.It prevents the chunking by title to work properly afterwards as I expect all content under a Markdown header to become a
CompositeText
to be indexed with an embedding function.It is reproducible with the following python code:
The moment any line of the list block (including the intro line) has more characters, the list block is then properly recognized as a
NarrativeText
.Package version: unstructured 0.14.5