Open joelgwebber opened 2 months ago
@joelgwebber this behavior is as designed, so not a bug per se.
<figure> is classified as a "removed block" for HTML parsing purposes:
https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/html/parser.py#L986
https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/html/parser.py#L517
We don't currently capture image URLs, although that is something that I've seen requested. There's some question as to how to represent those in the element-stream and metadata, so you might want to weigh in on that. I suppose I'm inclined for them to become Image elements, then the link could go in the metadata as .metadata.image_url and any caption we could detect could go into Image.text. The same behavior would be applied to <img> elements.
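To make the proposed representation concrete, here is a minimal sketch using only the standard library's html.parser. It is not the library's parser: ImageElement, FigureImageCollector and extract_images are hypothetical names made up for illustration, and using the alt attribute as a fallback caption is my own assumption. Each <img> becomes one element with the URL under metadata["image_url"]; if it sits inside a <figure> with a <figcaption>, the caption text replaces the fallback.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class ImageElement:
    """Hypothetical stand-in for the proposed Image element."""

    text: str = ""  # caption, if one was detected
    metadata: dict = field(default_factory=dict)  # holds "image_url"


class FigureImageCollector(HTMLParser):
    """Collects one ImageElement per <img>; nested <figure>s are not handled."""

    def __init__(self):
        super().__init__()
        self.images = []
        self._in_figure = False
        self._in_caption = False
        self._figure_images = []  # images seen inside the open <figure>
        self._caption_parts = []  # raw text seen inside <figcaption>

    def handle_starttag(self, tag, attrs):
        if tag == "figure":
            self._in_figure = True
            self._figure_images = []
            self._caption_parts = []
        elif tag == "figcaption" and self._in_figure:
            self._in_caption = True
        elif tag == "img":
            attr_map = dict(attrs)
            image = ImageElement(
                text=attr_map.get("alt") or "",  # assumption: alt as fallback
                metadata={"image_url": attr_map.get("src")},
            )
            self.images.append(image)
            if self._in_figure:
                self._figure_images.append(image)

    def handle_endtag(self, tag):
        if tag == "figcaption":
            self._in_caption = False
        elif tag == "figure":
            # Normalize whitespace and attach the caption to the figure's images.
            caption = " ".join("".join(self._caption_parts).split())
            if caption:
                for image in self._figure_images:
                    image.text = caption
            self._in_figure = False

    def handle_data(self, data):
        if self._in_caption:
            self._caption_parts.append(data)


def extract_images(html: str) -> list[ImageElement]:
    collector = FigureImageCollector()
    collector.feed(html)
    collector.close()
    return collector.images
```

Whether the caption belongs in Image.text or in a separate metadata field is exactly the open question above; the sketch just picks the first option.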
Describe the bug
Perhaps on the cusp between bug and feature. When parsing HTML pages, I found it surprising that any sub-tree wrapped in a <figure> is silently removed in partition_html(). Common cases include just about every Wikipedia article, which often contains useful image URLs and text descriptions in <figure>s.
I haven't dug in much further, but from a quick examination of the code, it looks like this may extend to other less-common element types.
To Reproduce
Expected behavior
That the <figure> contents would either be found by default, or with an option controlling which elements to skip.
Environment Info
I don't have a local build going yet, but I promise it's a trivial repro in any environment.
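For what it's worth, the kind of option described under Expected behavior could look roughly like the sketch below. It uses the standard library's html.parser rather than the library's own parser, and everything in it is hypothetical: extract_text, the skip_tags parameter and the DEFAULT_SKIP_TAGS set are illustrative names, not existing partition_html() arguments, and the default set is an assumption rather than the actual list of removed blocks.

```python
from html.parser import HTMLParser

# Assumed default for illustration; not the library's actual list.
DEFAULT_SKIP_TAGS = frozenset({"script", "style", "figure"})


class TextExtractor(HTMLParser):
    """Collects text chunks, dropping everything inside the configured tags.

    Assumes skip tags are container elements that have a closing tag.
    """

    def __init__(self, skip_tags=DEFAULT_SKIP_TAGS):
        super().__init__()
        self.skip_tags = frozenset(skip_tags)
        self._skip_depth = 0  # > 0 while inside at least one skipped sub-tree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self._skip_depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags carry no text, so they never affect skipping.
        pass

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = " ".join(data.split())
            if text:
                self.chunks.append(text)


def extract_text(html, skip_tags=DEFAULT_SKIP_TAGS):
    """Return the text chunks of `html`, omitting sub-trees under `skip_tags`."""
    parser = TextExtractor(skip_tags)
    parser.feed(html)
    parser.close()
    return parser.chunks


html = (
    "<p>Body text.</p>"
    '<figure><img src="a.png"><figcaption>Caption text.</figcaption></figure>'
    "<script>var x = 1;</script>"
)
default_chunks = extract_text(html)                          # figure is dropped
with_figures = extract_text(html, skip_tags={"script", "style"})  # figure kept
```

With the assumed default, the caption is dropped silently, which is the surprise reported here; passing a narrower skip set keeps it without changing behavior for anyone relying on the default.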