Closed FanaHOVA closed 3 months ago
Thanks for this @FanaHOVA, taking a look.
Hi @FanaHOVA, ok, I've been working on the HTML code for the last couple days and in the process I've become familiar with every nook and cranny :)
The tl;dr version of the response is you should use .category for classification purposes when dealing with "in-memory" Element objects returned by a partitioner.
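A minimal sketch of what filtering on .category could look like. The classes here are simplified stand-ins written for illustration, not the library's actual definitions, and filter_by_category is a hypothetical helper; the only thing taken from the discussion is that every element exposes a .category attribute.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Simplified stand-in for an in-memory element (not the library's class)."""
    text: str
    category: str = "UncategorizedText"


@dataclass
class Title(Element):
    """Stand-in for the standard Title element."""
    category: str = "Title"


@dataclass
class HTMLTitle(Title):
    """Stand-in for an HTML-specific subtype; assumed to report the same category."""


def filter_by_category(elements, category):
    """Hypothetical helper: keep the elements whose .category matches."""
    return [el for el in elements if el.category == category]


elements = [
    HTMLTitle(text="Release notes"),
    Element(text="Some body text"),
    Title(text="Appendix"),
]
titles = filter_by_category(elements, "Title")
print([el.text for el in titles])
```

Because the check reads an attribute rather than the class name, the HTML-specific stand-in and the standard one are selected together.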
Some additional context:
HTMLTitle and HTMLNarrativeText (which have additional attributes used by the partitioning code) used to be converted to the standard Title and NarrativeText elements once partitioning was complete, but somehow that step got dropped. That is remedied in a PR I should be able to merge this week. Note those HTML-specific element-types behave in every way the same as their "regular" counterparts, except for the class name or "type".
No Element subtype has a .type attribute. The documentation could be clearer on this. What the documentation shows is the serialization of an element to dict or JSON, and those do have a "type" key. This is so they can be deserialized ("rehydrated") later into an Element of the right sub-type.
Element subtypes do, however, have a .category attribute, and this is commonly used for filtering or other classification purposes.
Classification is also sometimes done with isinstance(element, Title) or type(element).__name__ == "Title", which is one reason the HTML-specific element-type leakage is a problem, but .category should give the expected results in every case.
Let me know if this doesn't solve your problem or if you need more to go on. I'll leave this issue open for now and retire it when I get the type-leakage fix merged in the next couple days.
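To make the difference between these checks concrete, here is a rough sketch using stand-in classes. Nothing below is the library's real code: the class bodies, the assumption that the HTML-specific subtype inherits from Title, and the assumption that the serialized "type" value mirrors .category are all illustrative.

```python
class Element:
    """Simplified stand-in; not the library's actual class."""
    category = "UncategorizedText"

    def __init__(self, text):
        self.text = text

    def to_dict(self):
        # Assumption: the serialized form carries a "type" key mirroring the
        # category, while the in-memory object itself has no .type attribute.
        return {"type": self.category, "text": self.text}


class Title(Element):
    category = "Title"


class HTMLTitle(Title):
    """Hypothetical leaked HTML-specific subtype (assumed to subclass Title)."""


leaked = HTMLTitle("Release notes")

by_name = type(leaked).__name__ == "Title"   # compares against the class name
by_isinstance = isinstance(leaked, Title)    # relies on the assumed inheritance
by_category = leaked.category == "Title"     # attribute-based check
has_type_attr = hasattr(leaked, "type")      # in-memory object has no .type
serialized_type = leaked.to_dict()["type"]   # "type" exists only once serialized

print(by_name, by_isinstance, by_category, has_type_attr, serialized_type)
```

Under these assumptions the class-name comparison is the check that breaks on a leaked subtype, while the .category comparison is unaffected, and "type" only shows up in the serialized dict.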
HTML-specific document element leakage fixed by #3207.
Describe the bug HTMLTitle elements don't have a type attribute.
To Reproduce You can try in your own notebook here: https://colab.research.google.com/drive/1U8VCjY2-x8c6y5TYMbSFtQGlQVFHCVIW#scrollTo=X2jIJGn6GM-d
You cannot call doc.pages[2].elements[0].type since it doesn't have a type attribute.
Expected behavior Each element should return its type, as described in the docs: https://docs.unstructured.io/open-source/concepts/document-elements#element-type, or you should add docs that explain how to achieve that. I'm looking to filter elements by type, and it breaks with HTML parsing.
Screenshots //
Environment Info See Colab
Additional context //