Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

FIX: The <div> text element with one <br> will not be regarded as a text element by `_is_text_tag` #3209

Open heya5 opened 3 months ago

heya5 commented 3 months ago

The element below is not recognized as a text element by `_is_text_tag`:

    <div>AI solutions <br> which suit you</div>

So we should ignore <br> tags when evaluating the check described by this comment:

    # NOTE(robinson) - This indicates that a div tag has no children. If that's the
    # case and the tag has text, its potential a text tag
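
To illustrate the proposal, here is a minimal sketch of a "no children" check that skips <br>. It is not the library's implementation: it uses the standard-library `xml.etree.ElementTree` rather than the parser the project uses, it writes the tag as `<br/>` so the snippet parses as XML, and the helper names `has_no_children_ignoring_br` and `element_text` are hypothetical.

```python
import xml.etree.ElementTree as ET


def has_no_children_ignoring_br(elem):
    """Hypothetical helper: True if elem has no child elements other than <br>."""
    return all(child.tag == "br" for child in elem)


def element_text(elem):
    """Collect the element's own text plus the tail text that follows each child."""
    return "".join(elem.itertext()).strip()


# Assumption: <br/> is self-closed here only so the stdlib XML parser accepts it.
div = ET.fromstring("<div>AI solutions <br/> which suit you</div>")

# A plain "no children" check fails because <br> counts as a child element.
plain_check = len(div) == 0

# Skipping <br> lets the div qualify as a potential text tag again.
relaxed_check = has_no_children_ignoring_br(div) and bool(element_text(div))

print(plain_check, relaxed_check)
```

Note that the text after the <br> is stored as the tail of the <br> element, so a fix along these lines would also need to read tail text, not only the div's leading text.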