Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
9.31k stars, 773 forks

bug/skipping-figures #3606

Open joelgwebber opened 2 months ago

joelgwebber commented 2 months ago

Describe the bug

Perhaps on the cusp between bug and feature. When parsing HTML pages, I found it surprising that any sub-tree wrapped in a <figure> is silently removed in partition_html(). This affects just about every Wikipedia article, since they often contain useful image URLs and text descriptions in <figure>s.

I haven't dug in much further, but from a quick examination of the code, it looks like this may extend to other less-common element types.

To Reproduce

from unstructured.partition.html import partition_html

elems = partition_html(url="https://en.wikipedia.org/wiki/Neo-Riemannian_theory")


def find(text: str) -> None:
    """Print the first element whose text contains `text`, or "nope" if none does."""
    for elem in elems:
        if text in elem.text:
            print("found it:\n", elem)
            return
    print("nope")


find("loose collection of ideas")  # finds this in the initial paragraph
find("minor as upside down major")  # can't find this because it's buried in a <figure>

Expected behavior

The <figure> contents would either be found by default, or there would be an option controlling which elements to skip.

Environment Info

I don't have a local build going yet, but I promise it's a trivial repro in any environment.
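
In the meantime, one possible workaround is to pull the figure text out separately and merge it with whatever partition_html() returns. Here is a minimal sketch using only the standard library's html.parser. FigureTextCollector and figure_texts are hypothetical names made up for illustration (they are not part of unstructured), and the sample HTML is invented rather than taken from the Wikipedia page.

```python
from __future__ import annotations

from html.parser import HTMLParser


class FigureTextCollector(HTMLParser):
    """Collect the visible text inside each top-level <figure> subtree.

    Hypothetical helper for illustration; not part of unstructured.
    """

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0                # nesting depth of open <figure> tags
        self.chunks: list[str] = []   # text pieces of the figure being read
        self.figures: list[str] = []  # one string per finished figure

    def handle_starttag(self, tag, attrs):
        if tag == "figure":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "figure" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Join the pieces and collapse runs of whitespace.
                text = " ".join(" ".join(self.chunks).split())
                if text:
                    self.figures.append(text)
                self.chunks = []

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def figure_texts(html: str) -> list[str]:
    """Return the text of every top-level <figure> in `html`."""
    collector = FigureTextCollector()
    collector.feed(html)
    collector.close()  # flush any buffered text
    return collector.figures


sample = (
    "<p>Intro text.</p>"
    '<figure><img src="a.png">'
    "<figcaption>Minor as <i>upside down</i> major</figcaption></figure>"
)
print(figure_texts(sample))
```

Joining the pieces with a space keeps captions from separate block elements apart, at the cost of occasionally splitting a word that is broken up by inline tags. It also ignores image URLs entirely, so it only addresses the missing text.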

scanny commented 2 months ago

@joelgwebber this behavior is as designed, so not a bug per se.

<figure> is classified as a "removed block" for HTML parsing purposes:
https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/html/parser.py#L986
https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/html/parser.py#L517

We don't currently capture image URLs, although that is something I've seen requested. There's some question as to how to represent them in the element stream and metadata, so you might want to weigh in on that. I'm inclined to have them become Image elements: the link could go in the metadata as .metadata.image_url, and any caption we can detect could go into Image.text. The same behavior would apply to <img> elements.
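
To make the proposal concrete, here is a rough sketch of what that mapping might look like. It is built on the standard library only and does not touch the real parser: SketchImage, SketchMetadata, and parse_images are hypothetical stand-ins for illustration, not unstructured classes or functions. Falling back to the alt attribute when there is no caption is an extra assumption on top of the proposal.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Optional


@dataclass
class SketchMetadata:
    image_url: Optional[str] = None  # stand-in for the proposed .metadata.image_url


@dataclass
class SketchImage:
    text: str = ""  # caption if one is detected, else alt text (assumption)
    metadata: SketchMetadata = field(default_factory=SketchMetadata)


class ImageSketchParser(HTMLParser):
    """Turn every <img> into a SketchImage; a <figcaption> overrides its text.

    Nested figures are not handled in this sketch.
    """

    def __init__(self) -> None:
        super().__init__()
        self.images: list[SketchImage] = []
        self._in_figure = False
        self._in_caption = False
        self._figure_images: list[SketchImage] = []
        self._caption_parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "figure":
            self._in_figure = True
        elif tag == "figcaption" and self._in_figure:
            self._in_caption = True
        elif tag == "img":
            attr = dict(attrs)
            image = SketchImage(
                text=attr.get("alt") or "",
                metadata=SketchMetadata(image_url=attr.get("src")),
            )
            self.images.append(image)
            if self._in_figure:
                self._figure_images.append(image)

    def handle_data(self, data):
        if self._in_caption:
            self._caption_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "figcaption":
            self._in_caption = False
        elif tag == "figure":
            # Apply the caption at </figure>, so it may precede or follow the <img>.
            caption = " ".join("".join(self._caption_parts).split())
            if caption:
                for image in self._figure_images:
                    image.text = caption
            self._in_figure = False
            self._figure_images = []
            self._caption_parts = []


def parse_images(html: str) -> list[SketchImage]:
    parser = ImageSketchParser()
    parser.feed(html)
    parser.close()
    return parser.images
```

In this sketch a bare <img> and an <img> inside a <figure> produce the same kind of record, which matches the idea of applying the same behavior to both. How the real element stream should order these relative to the surrounding text is left open here.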