Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/HTMLTitle doesn't have `type` attribute #3144

Closed (FanaHOVA closed 3 months ago)

FanaHOVA commented 4 months ago

Describe the bug: HTMLTitle elements don't have a `type` attribute.

To Reproduce: You can try it in your own notebook here: https://colab.research.google.com/drive/1U8VCjY2-x8c6y5TYMbSFtQGlQVFHCVIW#scrollTo=X2jIJGn6GM-d. You cannot access doc.pages[2].elements[0].type, since the element doesn't have a `type` attribute.

Expected behavior: Each element should return its type, as described in the docs: https://docs.unstructured.io/open-source/concepts/document-elements#element-type, or the docs should explain how to achieve that. I'm looking to filter elements by type, and it breaks with HTML parsing.

Screenshots: //

Environment Info: See Colab

Additional context: //

scanny commented 4 months ago

Thanks for this @FanaHOVA, taking a look.

scanny commented 3 months ago

Hi @FanaHOVA, ok, I've been working on the HTML code for the last couple days and in the process I've become familiar with every nook and cranny :)

The tl;dr version of the response is you should use .category for classification purposes when dealing with "in-memory" Element objects returned by a partitioner.

Some additional context:

  1. The documentation for this definitely needs some refinement; I've set that in motion.
  2. One problem is that HTML-specific element-types have started "leaking" into the output. Elements like HTMLTitle and HTMLNarrativeText (having additional attributes used by the partitioning code) used to be converted to the standard Title and NarrativeText elements once partitioning was complete, but somehow that step got dropped. That is remedied in a PR I should be able to merge this week. Note those HTML-specific element-types behave in every way the same as their "regular" counterpart, except the class name or "type".
  3. No Element subtypes have a .type attribute. The documentation could be clearer on this. What the documentation shows is the serialization of an element to dict or JSON, and those serialized forms do have a "type" key. This is so they can be deserialized ("rehydrated") later into an Element of the right subtype.
  4. All Element subtypes do however have a .category attribute and this is commonly used for filtering or other classification purposes.
  5. It's also common to use isinstance(element, Title) or type(element).__name__ == "Title", which is one reason the HTML-specific element-type leakage is a problem, but .category should give the expected results in every case.
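
To make the distinction in the points above concrete, here is a minimal sketch. It uses small stand-in classes, not the unstructured library itself, so the names `Element`, `Title`, `NarrativeText`, and `HTMLTitle` below are simplified models written for illustration. It also assumes the HTML-specific subtype inherits its `.category` from the regular counterpart, as the list suggests. It shows why filtering on `.category` keeps working when an HTML-specific subtype leaks through, while a class-name comparison does not.

```python
# Stand-in classes for illustration only. These are NOT the unstructured
# library's implementation, just a model of the behavior described above.
class Element:
    category = "UncategorizedText"

    def __init__(self, text):
        self.text = text

    def to_dict(self):
        # The serialized form carries a "type" key; the object itself
        # has no .type attribute.
        return {"type": self.category, "text": self.text}


class Title(Element):
    category = "Title"


class NarrativeText(Element):
    category = "NarrativeText"


class HTMLTitle(Title):
    """Hypothetical HTML-specific subtype that "leaked" into the output."""


elements = [HTMLTitle("Intro"), NarrativeText("Body text."), Title("Summary")]

# Filtering on .category catches both the leaked and the regular title.
titles_by_category = [el for el in elements if el.category == "Title"]

# Comparing class names misses the leaked HTMLTitle.
titles_by_class_name = [el for el in elements if type(el).__name__ == "Title"]

# No element exposes .type, but every serialized dict has a "type" key.
has_type_attr = [hasattr(el, "type") for el in elements]
serialized_types = [el.to_dict()["type"] for el in elements]
```

With the real library the same pattern would be a comprehension such as `[el for el in elements if el.category == "Title"]` over whatever a partitioner returns.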

Let me know if this doesn't solve your problem or if you need more to go on. I'll leave this issue open for now and retire it when I get the type-leakage fix merged in the next couple days.

scanny commented 3 months ago

HTML-specific document element leakage fixed by #3207.