Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/partition_html outputs different results with different args #3116

Open KMayank29 opened 1 month ago

KMayank29 commented 1 month ago

Bug Description

When I pass a URL to partition_html, the output is correct. However, when I pass the same page's HTML via the text argument, it extracts far less content. I believe there is a bug in the code path used when text is passed instead of url. Running the underlying source code directly on the same text also works fine. The full code is shared below.

Code snippet

from unstructured.partition.html import partition_html
# pass url
url = "https://www.geeksforgeeks.org/difference-between-compiler-and-interpreter/"
elements = partition_html(url=url, html_assemble_articles=False)
elements_dict = [elem.to_dict() for elem in elements]
print(len(elements))
# 71

# pass text
import requests
url = "https://www.geeksforgeeks.org/difference-between-compiler-and-interpreter/"
response = requests.get(url)

elements = partition_html(text=response.text, html_assemble_articles=False)
elements_dict = [elem.to_dict() for elem in elements]
print(len(elements))
# 7

# use source code
from unstructured.documents.html import HTMLDocument
from unstructured.documents.xml import VALID_PARSERS
from unstructured.partition.common import document_to_element_list
from unstructured.partition.lang import apply_lang_metadata

document = HTMLDocument.from_string(str(response.text))
elements = list(
        apply_lang_metadata(
            document_to_element_list(
                document,
                sortable=False,
                include_page_breaks=False,
                detection_origin=None,
            ),
            languages=['auto'],
            detect_language_per_element=False,
        ),
    )
elements_dict = [elem.to_dict() for elem in elements]
print(len(elements))
# 71

MthwRobinson commented 1 month ago

Hi @KMayank29 - thanks for reporting. To clarify, does partition_html(url=url) or partition_html(text=response.text) give the correct output?

scanny commented 1 month ago

@KMayank29 sounds like an encoding error. What is the type of text when you pass it in?

If it is bytes and the HTML doesn't include an encoding declaration you'll want to decode it to str before passing in. Something like:

html_text = html_bytes.decode("utf-8")

You'll need to work out the encoding for your case, it wouldn't necessarily be "utf-8".
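A minimal sketch of working that out, assuming the only plausible candidates are UTF-8 and Windows-1252 (an assumption for illustration, not something known about this page). decode_html is a hypothetical helper, not part of unstructured:

```python
def decode_html(html_bytes: bytes, candidates=("utf-8", "cp1252")) -> str:
    """Decode raw HTML bytes by trying each candidate encoding in order.

    The candidate list is an assumption; replace it with the encodings
    that are plausible for your source.
    """
    for encoding in candidates:
        try:
            return html_bytes.decode(encoding)
        except UnicodeDecodeError:
            # This encoding cannot represent the bytes; try the next one.
            continue
    # Last resort: decode as UTF-8 and mark undecodable bytes as U+FFFD.
    return html_bytes.decode("utf-8", errors="replace")


html_text = decode_html("<p>café</p>".encode("cp1252"))
```

Strict decoding is tried first so that a wrong guess fails loudly instead of silently producing mojibake.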

KMayank29 commented 1 month ago

> Hi @KMayank29 - thanks for reporting. To clarify, does partition_html(url=url) or partition_html(text=response.text) give the correct output?

partition_html(url=url) gives the correct output. partition_html(text=response.text) returns only two or three sentences and only two element types.
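To make that difference easier to see, here is a small sketch that counts elements per type. summarize_types is a hypothetical helper, and it assumes each dict produced by elem.to_dict() carries a "type" key; the sample data below is made up for illustration and is not the real output for this page.

```python
from collections import Counter


def summarize_types(elements_dict):
    """Count elements per type, given the dicts produced by elem.to_dict().

    Assumes each dict has a "type" key; missing keys are counted as "Unknown".
    """
    return dict(Counter(elem.get("type", "Unknown") for elem in elements_dict))


# Made-up sample, standing in for [elem.to_dict() for elem in elements].
sample = [
    {"type": "Title", "text": "Compiler vs Interpreter"},
    {"type": "NarrativeText", "text": "A compiler translates the whole program."},
    {"type": "NarrativeText", "text": "An interpreter runs it line by line."},
]
print(summarize_types(sample))
```

Running it on the elements_dict from both the url call and the text call shows which element types go missing.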

KMayank29 commented 1 month ago

> @KMayank29 sounds like an encoding error. What is the type of text when you pass it in?
>
> If it is bytes and the HTML doesn't include an encoding declaration you'll want to decode it to str before passing in. Something like:
>
> html_text = html_bytes.decode("utf-8")
>
> You'll need to work out the encoding for your case, it wouldn't necessarily be "utf-8".

I am passing a str to partition_html(text=response.text):

import requests
url = "https://www.geeksforgeeks.org/difference-between-compiler-and-interpreter/"
response = requests.get(url)
type(response.text)
# str

KMayank29 commented 1 month ago

Any update?