Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
8.61k stars 704 forks source link

List block in a partitioned Markdown doc identified as a `Title` element under special conditions #3280

Open nickphilip opened 3 months ago

nickphilip commented 3 months ago

I noticed that when the number of characters per line is very short in a list block in a Markdown document, the list is identified as a Title instead of a NarrativeText.

It prevents the chunking by title to work properly afterwards as I expect all content under a Markdown header to become a CompositeText to be indexed with an embedding function.

It is reproducible with the following python code:

from unstructured.partition.md import partition_md

md_text = """
# header 1
## header 2
My list
- item 1
- item 2
- item 3
- item 4
- item 5
## header 3
"""

elements = partition_md(text=md_text)

for el in elements:
    if el.category == "Title": print("{}: {}".format(el.category,el.text))

The moment any line of the list block (including the intro line) has more characters, the list block is then properly recognized as a NarrativeText.

Package version: unstructured 0.14.5

MthwRobinson commented 2 months ago

@nickphilip - Thanks for reporting this, we'll take a look as soon as we're able.

scanny commented 2 months ago

@MthwRobinson Turns out this behavior is produced by the markdown package, which is what partition_md() uses to convert Markdown to HTML.

md_text = """
My list
- item 1
- item 2
- item 3
"""

>>> markdown.markdown(md_text)
<p>My list
- item 1
- item 2
- item 3</p>

Adding a blank line before the list results in the expected behavior:

md_text = """
My list

- item 1
- item 2
- item 3
"""

>>> markdown.markdown(md_text)
<p>My list</p>
<ul>
  <li>item 1</li>
  <li>item 2</li>
  <li>item 3</li>
</ul>

The "official" John Gruber spec and reference implementation specifies this behavior, that a list requires a blank line before it. The gist is that a word-wrapped line that just happens to start with a "-" or "*" should not be interpreted as a list item, like:

The computed value of 7
* 3 equals 21.

In practice, Markdown parsers are more clever about this and are able to distinguish author intent reliably without the blank line before the list.


Note that using the markdown_it package (already a dependency I believe, possibly for ingest), we get a "correct" conversion with or without the blank line.

My general impression, after poking around a bit, is that the markdown package is overly dogmatic, like adhering closely to the "original" standard from 2005 or so. I think what we care about is that the maximum number of legitimate Markdown files parse as they would be expected to, like how they appear on GitHub or whatever other site they came from.

This "defacto-standard" behavior is what the CommonMark standard articulates and markdown_it is CommonMark compliant. It's also somewhat more modern, being only 1 decade old rather than 2 :)

I'm inclined to think we'd be better off in general using the markdown_it package since I believe it would parse more Markdown documents the way the author intended and the way they appear when rendered on their source.

scanny commented 2 months ago

@nickphilip In the meantime, a workaround for this would be:

from markdown_it import MarkdownIt

md = MarkdownIt("commonmark", {"html": True}).enable("table")
html_text = md.render(md_text)
elements = partition_html(text=html_text)

Basically converting the MD to HTML yourself and then using partition_html() on the result.

scanny commented 2 months ago

Btw, @nickphilip what was the provenance of that problematic Markdown?

Is that something you just typed in by hand for experimentation purposes or did you have an actual "mis"-partitioning from a document from say StackOverflow or something?

nickphilip commented 2 months ago

@scanny - I found the issue testing unstructured.io against Markdown documents from public repos, as I noticed that my RAG app was not getting the expected context.

Great assessment of the root cause. Your workaround makes sense, but the fix would be preferable if one uses the runner (pipeline). Thanks!

scanny commented 2 months ago

@nickphilip when you say "public repos", does that imply this would be GFM (GitHub Flavored markdown) you're parsing?

scanny commented 2 months ago

@MthwRobinson After noodling this a bit and a little more research, I think there are four possible courses:

  1. support "traditional" (maybe "legacy") Markdown (what we're doing today)
  2. support CommonMark Markdown
  3. support GFM (GitHub Flavored Markdown)
  4. support multiple using a partitioning option of some sort

I think the right answer is 2 (CommonMark).

I'm not sure there's a perfect answer, but I think CommonMark provides the broadest coverage for what we care about when partitioning and so is a good step in the right direction. I think we'd all prefer to avoid a new partitioning option until we can show a compelling use case that requires it.