Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/<Only element from partition is "Please enable JS and disable any ad blocker"> #2562

Closed StatsAI closed 2 weeks ago

StatsAI commented 5 months ago

Describe the bug

The only element returned from partition is (unstructured.documents.html.HTMLTitle, 'Please enable JS and disable any ad blocker').

To Reproduce

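# Colab cell: install unstructured with all document extras
!pip install "unstructured[all-docs]"

from unstructured.partition.auto import partition

url = 'https://www.nytimes.com/2024/02/19/world/europe/navalny-letters-russia.html'

# Fetch the page and partition it into elements
elements = partition(url=url, strategy='hi_res', html_assemble_articles=True)

# Show each element's type and text (display is available in Colab/IPython)
display(*[(type(element), element.text) for element in elements])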

Expected behavior

Partition results (Title, Narrative Text, etc.) should be returned.

Environment Info

Google Colab

scanny commented 1 month ago

Fixed by #3218.

Note that the text extracted by the partitioner for this page is modest (about 25 elements) because most of the content is behind a paywall and generated by JavaScript.

But the "preview" content that is actually present in the HTML is correctly partitioned and the <noscript> tag that was previously rendered (saying "Please enable JS ...") is no longer present.
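For anyone on a release that predates the fix, one possible workaround is to drop <noscript> blocks from the HTML before partitioning. The following is only a rough sketch using the standard library; it is not how #3218 implements the fix, and strip_noscript and the sample markup are hypothetical names and data invented for illustration.

```python
import re

# Matches a whole <noscript>...</noscript> block, case-insensitively and
# across line breaks. A regex is a simplification; it assumes the blocks
# are not nested and are properly closed.
_NOSCRIPT_RE = re.compile(
    r"<noscript\b[^>]*>.*?</noscript\s*>",
    re.IGNORECASE | re.DOTALL,
)


def strip_noscript(html: str) -> str:
    """Return html with every <noscript> block removed (hypothetical helper)."""
    return _NOSCRIPT_RE.sub("", html)


# Made-up sample page: a JS-wall message inside <noscript>, plus preview content.
sample = (
    "<html><body>"
    "<noscript><h1>Please enable JS and disable any ad blocker</h1></noscript>"
    "<h1>Preview headline</h1><p>Preview paragraph.</p>"
    "</body></html>"
)

cleaned = strip_noscript(sample)
print(cleaned)
```

The cleaned string could then presumably be passed to partition_html(text=cleaned) from unstructured.partition.html instead of passing the URL directly, assuming the page HTML has been downloaded separately.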

#3218 should merge within a few days.
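To confirm after upgrading that the symptom from this issue is gone, a small check along these lines might help. It is a sketch only: is_js_wall_only is a hypothetical helper, and the two sample lists stand in for the text of the elements returned by partition.

```python
from typing import Iterable

# The placeholder text reported in this issue.
JS_WALL_TEXT = "Please enable JS and disable any ad blocker"


def is_js_wall_only(texts: Iterable[str]) -> bool:
    """True when the only non-empty extracted text is the JS/ad-blocker message."""
    non_empty = [t.strip() for t in texts if t and t.strip()]
    return len(non_empty) == 1 and JS_WALL_TEXT in non_empty[0]


# Stand-ins for [element.text for element in elements] (made-up data).
before = ["Please enable JS and disable any ad blocker"]
after = ["Preview headline", "Preview paragraph."]

print(is_js_wall_only(before), is_js_wall_only(after))
```

With real output, the call would be is_js_wall_only([element.text for element in elements]).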