Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug(json): partition() places entire JSON file into text of single element when `metadata_filename` has .html extension #3366

Open scanny opened 4 months ago

scanny commented 4 months ago

Describe the bug

When partitioning a JSON file using partition() and providing a metadata_filename argument that has a .html extension, the result is a single element with the entire JSON file contents as its text.

To Reproduce

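from unstructured.partition.auto import partition

# example_doc_path() is assumed to be the helper from the repo's test suite
# that resolves a file name under example-docs/.
from test_unstructured.unit_utils import example_doc_path

file_path = example_doc_path("simple.json")

with open(file_path, "rb") as f:
    elements = partition(file=f, metadata_filename="simple.html")

print(elements)
print(elements[0].text)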

produces:

[<unstructured.documents.elements.NarrativeText object at 0x371953bb0>]
[
    {
        "element_id": "a06d2d9e65212d4aa955c3ab32950ffa",
        "metadata": {
            "category_depth": 0,
            "file_directory": "unstructured/example-docs",
            "filename": "simple.docx",
            "filetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
            "languages": [
                "eng"
            ],
            "last_modified": "2024-07-06T16:44:51"
        },
        "text": "These are a few of my favorite things:",
        "type": "Title"
    },
    {
        "element_id": "b334c93e9b1cbca3b6f6d78ce8bc2484",
        "metadata": {
...

Expected behavior

The same output as elements_from_json("simple.json"). The metadata_filename argument should be ignored.

Additional context

scanny commented 4 months ago

Turns out other weird things happen when metadata_filename has a different extension, like simple.docx. So it would appear the file type is getting misidentified and the file is sent to the wrong partitioner.
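To make the suspected failure mode concrete, here is a minimal, self-contained sketch of a detection order that would avoid it: look at the file contents first and fall back to the extension of metadata_filename only when the contents are inconclusive. This is not unstructured's actual detection code. `detect_file_type` and `EXTENSION_TYPES` are hypothetical names invented for illustration, and the JSON check is deliberately simplistic.

```python
import io
import json
from pathlib import Path
from typing import IO, Optional

# Hypothetical extension table, for illustration only.
EXTENSION_TYPES = {".html": "html", ".docx": "docx", ".json": "json", ".txt": "txt"}


def detect_file_type(file: IO[bytes], metadata_filename: Optional[str] = None) -> str:
    """Guess a file type, trusting the content over the metadata filename."""
    start = file.tell()
    data = file.read()
    file.seek(start)  # leave the stream position where we found it

    # Content check first: anything that parses as a JSON array or object
    # is treated as JSON, whatever the metadata filename claims.
    try:
        parsed = json.loads(data)
    except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
        parsed = None
    if isinstance(parsed, (list, dict)):
        return "json"

    # Only when the content is inconclusive do we consult the extension.
    if metadata_filename:
        suffix = Path(metadata_filename).suffix.lower()
        return EXTENSION_TYPES.get(suffix, "unknown")
    return "unknown"


if __name__ == "__main__":
    payload = io.BytesIO(b'[{"type": "Title", "text": "These are a few of my favorite things:"}]')
    # The misleading .html name does not override what the content says.
    print(detect_file_type(payload, metadata_filename="simple.html"))
```

With this ordering, a metadata_filename of simple.html or simple.docx would affect only the reported metadata, not which partitioner gets selected. Whether the real fix should look like this depends on how the library's file-type detection is structured.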