Describe the bug
When partitioning a JSON file using partition() and providing a metadata_filename argument that has a .html extension, the result is a single element with the entire JSON file contents as its text.
To Reproduce
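from unstructured.partition.auto import partition
from test_unstructured.unit_utils import example_doc_path  # test helper in the unstructured repo

file_path = example_doc_path("simple.json")
with open(file_path, "rb") as f:
    elements = partition(file=f, metadata_filename="simple.html")
print(f"{elements}")
print(f"{elements[0].text}")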
produces:
[<unstructured.documents.elements.NarrativeText object at 0x371953bb0>]
[
  {
    "element_id": "a06d2d9e65212d4aa955c3ab32950ffa",
    "metadata": {
      "category_depth": 0,
      "file_directory": "unstructured/example-docs",
      "filename": "simple.docx",
      "filetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
      "languages": [
        "eng"
      ],
      "last_modified": "2024-07-06T16:44:51"
    },
    "text": "These are a few of my favorite things:",
    "type": "Title"
  },
  {
    "element_id": "b334c93e9b1cbca3b6f6d78ce8bc2484",
    "metadata": {
...
Expected behavior
The same output as elements_from_json("simple.json"). The metadata_filename argument should be ignored.
Additional context
Because this behavior does not occur when using partition_json(), I believe it is an artifact of detect_filetype() somehow using metadata_filename for disambiguation. The original filename of the serialized elements was almost certainly not something.json (in the output above it is simple.docx), so using metadata_filename to infer the file type cannot work for JSON files.
It turns out other odd results occur when metadata_filename has a different extension, such as simple.docx. It appears the file type is being misidentified and the file is sent to the wrong partitioner.
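To make the suspected failure mode concrete, here is a minimal, self-contained sketch of the ordering I would expect: inspect the file contents first, and fall back to the extension of metadata_filename only when the contents are inconclusive. This is not the actual detect_filetype() implementation; guess_filetype(), sniff_json() and EXTENSION_TYPES are hypothetical names used only for illustration.

```python
import io
import json
import os
from typing import BinaryIO, Optional

# Hypothetical extension map, for illustration only.
EXTENSION_TYPES = {".html": "html", ".docx": "docx", ".json": "json", ".txt": "text"}


def sniff_json(file: BinaryIO) -> bool:
    """Return True if the stream parses as a JSON array of objects."""
    position = file.tell()
    try:
        data = json.loads(file.read().decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return False
    finally:
        file.seek(position)  # leave the stream where the caller had it
    return isinstance(data, list) and all(isinstance(item, dict) for item in data)


def guess_filetype(file: BinaryIO, metadata_filename: Optional[str] = None) -> str:
    """Hypothetical detector: content wins, the filename is only a fallback."""
    if sniff_json(file):
        return "json"
    if metadata_filename:
        extension = os.path.splitext(metadata_filename)[1].lower()
        return EXTENSION_TYPES.get(extension, "unknown")
    return "unknown"


serialized = io.BytesIO(b'[{"text": "These are a few of my favorite things:", "type": "Title"}]')
print(guess_filetype(serialized, metadata_filename="simple.html"))
```

With this ordering, serialized elements are identified as JSON regardless of the extension carried by metadata_filename, while non-JSON content still falls back to the extension.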