bug/skipping-figures - Githubissues

Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

Apache License 2.0

9.31k stars, 773 forks

Describe the bug

Perhaps on the cusp between bug and feature. When parsing HTML pages, I found it surprising that any sub-tree wrapped in a <figure> is silently removed in partition_html(). This affects just about every Wikipedia article, since they often contain useful image URLs and text descriptions in <figure>s.

I haven't dug in much further, but from a quick examination of the code, it looks like this may extend to other less-common element types.

To Reproduce

from unstructured.partition.html import partition_html

elems = partition_html(url="https://en.wikipedia.org/wiki/Neo-Riemannian_theory")


def find(text: str) -> None:
    """Print the first element whose text contains `text`, or "nope"."""
    for elem in elems:
        if text in elem.text:
            print("found it:\n", elem)
            return
    print("nope")


find("loose collection of ideas")  # found: this is in the initial paragraph
find("minor as upside down major")  # not found: this text is buried in a <figure>

Expected behavior

The <figure> contents would either be found by default, or there would be an option controlling which elements to skip.

Environment Info

I don't have a local build going yet, but I promise it's a trivial repro in any environment.

@joelgwebber this behavior is as designed, so not a bug per se.

<figure> is classified as a "removed block" for HTML parsing purposes:

https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/html/parser.py#L986
https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/html/parser.py#L517
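
To make the "removed block" idea concrete, here is a minimal standalone sketch using only the Python standard library. It is not the code in parser.py; `TextCollector`, `collect_text`, and the `skip_tags` parameter are hypothetical names chosen for illustration. It shows how dropping a whole sub-tree hides the caption text, and what an option controlling the skipped tags might look like.

```python
from html.parser import HTMLParser

# Hypothetical default for this sketch; the library's real list of
# removed blocks may contain other tags as well.
DEFAULT_SKIP_TAGS = frozenset({"figure"})


class TextCollector(HTMLParser):
    """Collect text chunks, dropping everything inside any tag in skip_tags."""

    def __init__(self, skip_tags=DEFAULT_SKIP_TAGS):
        super().__init__()
        self.skip_tags = frozenset(skip_tags)
        self.skip_depth = 0  # greater than 0 while inside a skipped sub-tree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when we are outside every skipped sub-tree.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def collect_text(html, skip_tags=DEFAULT_SKIP_TAGS):
    """Return the text chunks of `html`, omitting sub-trees of `skip_tags`."""
    parser = TextCollector(skip_tags)
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = (
    "<p>loose collection of ideas</p>"
    '<figure><img src="a.png">'
    "<figcaption>minor as upside down major</figcaption></figure>"
)
print(collect_text(sample))                # figure sub-tree is dropped
print(collect_text(sample, skip_tags=()))  # nothing is dropped
```

With the default, the caption never reaches the output, which mirrors the behavior reported above. Passing an empty `skip_tags` keeps it.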

We don't currently capture image URLs, although that is something that I've seen requested. There's some question as to how to represent those in the element-stream and metadata, so you might want to weigh in on that. I'm inclined for them to become Image elements; the link could then go in the metadata as .metadata.image_url, and any caption we could detect could go into Image.text. The same behavior would be applied to <img> elements.
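
As a rough illustration of that proposal, the sketch below maps each <figure> (and each bare <img>) to an image-like record with the URL under `metadata["image_url"]` and any <figcaption> text in `text`. Everything here is an assumption for discussion: `ImageSketch`, `FigureExtractor`, and `extract_images` are hypothetical names, not part of unstructured, and nested figures are not handled. It could also serve as a stopgap for pulling figure content out of a page separately from partition_html().

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class ImageSketch:
    """Hypothetical stand-in for an Image element; not the library's class."""
    text: str = ""
    metadata: dict = field(default_factory=dict)


class FigureExtractor(HTMLParser):
    """Turn <figure> and bare <img> tags into ImageSketch records."""

    def __init__(self):
        super().__init__()
        self.images = []
        self._current = None      # record being built while inside <figure>
        self._in_caption = False  # True while inside <figcaption>
        self._caption_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "figure":
            self._current = ImageSketch()
            self._caption_parts = []
        elif tag == "img":
            src = dict(attrs).get("src")
            if src is None:
                return
            if self._current is not None:
                # Keep only the first image URL found inside a figure.
                self._current.metadata.setdefault("image_url", src)
            else:
                self.images.append(ImageSketch(metadata={"image_url": src}))
        elif tag == "figcaption" and self._current is not None:
            self._in_caption = True

    def handle_endtag(self, tag):
        if tag == "figcaption":
            self._in_caption = False
        elif tag == "figure" and self._current is not None:
            self._current.text = " ".join(self._caption_parts)
            self.images.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._in_caption and data.strip():
            self._caption_parts.append(data.strip())


def extract_images(html):
    """Return ImageSketch records for the figures and images in `html`."""
    parser = FigureExtractor()
    parser.feed(html)
    parser.close()
    return parser.images


sample = (
    '<figure><img src="https://example.org/a.png">'
    "<figcaption>minor as <i>upside down</i> major</figcaption></figure>"
    '<img src="b.png">'
)
for image in extract_images(sample):
    print(repr(image.text), image.metadata)
```

One open design question this leaves aside is whether a figure with several images should yield one record or several; the sketch keeps only the first URL per figure.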

bug/skipping-figures #3606