Closed Pedrito1968 closed 2 months ago
After reviewing the magic code, I'm pretty sure this will affect any legacy Office formats.
@Pedrito1968 do you have an example of code and a file that is not properly identified?
Also, note that the file-type auto-detection code changed quite a bit pretty recently, so make sure you're using the current version :)
The code in the To Reproduce section will do it. Any .doc file (not .docx).
The code I pasted from Unstructured is from the main branch in the repo. I copied and pasted it yesterday. So if it hasn't changed since yesterday, it's still broken.
Again, as I demonstrated here, the issue is in magic and how you're using from_buffer():
>>> import magic
>>> magic.from_buffer(open('things.doc', mode='rb').read(8192), mime=True)
'application/CDFV2'
>>> magic.from_buffer(open('things.doc', mode='rb').read(65536), mime=True)
'application/msword'
>>> magic.from_file('things.doc', mime=True)
'application/msword'
To work around this issue on my end, I ended up writing the data to an actual file and then calling partition(filename=file_path) instead of passing file=, since that code path uses magic.from_file, which correctly returns the MIME type.
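The workaround described above can be sketched roughly as follows. This is a minimal sketch, not code from unstructured: the helper name partition_via_temp_file and its partition_fn parameter are made up for illustration, and the ".doc" suffix is an assumption about what the blob contains.

```python
import os
import tempfile
from typing import Any, Callable


def partition_via_temp_file(
    blob: bytes,
    partition_fn: Callable[..., Any],
    suffix: str = ".doc",
) -> Any:
    """Write blob to a temporary file and partition it by filename.

    Hypothetical helper: partition_fn is assumed to accept a `filename=`
    keyword argument, as unstructured's partition() does.
    """
    # mkstemp (rather than NamedTemporaryFile) so the file is fully closed
    # before another library reopens it by path.
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(blob)
        # Passing a path means detection sees the whole file on disk
        # instead of a truncated in-memory buffer.
        return partition_fn(filename=path)
    finally:
        # Always clean up, even if partitioning raises.
        os.remove(path)
```

With unstructured installed, this would be called as partition_via_temp_file(blob, partition), where partition is imported from unstructured.partition.auto.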
What version of unstructured are you using? magic should no longer be consulted for .doc files.
I'm using 0.11.8. If you're not using magic anymore, then it's not an issue. I've worked around it.
So I guess your "main" branch isn't your release branch? Because your main branch is using magic:
file_utils/filetype.py
import importlib.util
import os
import re
import zipfile
from typing import IO, Callable, Iterator, Optional
import filetype as ft
from typing_extensions import ParamSpec
from unstructured.documents.elements import Element
from unstructured.file_utils.encoding import detect_file_encoding, format_encoding_str
from unstructured.file_utils.model import FileType
from unstructured.logger import logger
from unstructured.nlp.patterns import EMAIL_HEAD_RE, LIST_OF_DICTS_PATTERN
from unstructured.partition.common import (
    add_element_metadata,
    exactly_one,
    remove_element_metadata,
    set_element_hierarchy,
)
from unstructured.utils import get_call_args_applying_defaults, lazyproperty
LIBMAGIC_AVAILABLE = bool(importlib.util.find_spec("magic"))
@Pedrito1968 latest is 0.15.4. 0.11.8 is from seven months ago or so.
Libmagic is still available when it's installed and we still use it, but if you trace through the code path we don't rely on it for DOC files. A DOC file is contained in an OLE (a.k.a. CFBF) "package", vaguely like a Microsoft Zip format, and it's easy and reliable to detect those. That narrows down the choices to DOC, PPT, XLS, or MSG, and we use the filetype package to distinguish between those as it's more reliable for the OLE subtypes.
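To illustrate why the OLE container itself is easy to recognize: every compound file starts with the same fixed 8-byte signature, so a very short read is enough. The sketch below is not the code unstructured uses; looks_like_ole is a hypothetical name, and it only identifies the container. Telling DOC, PPT, XLS, and MSG apart still requires looking inside it.

```python
from typing import IO

# Fixed 8-byte signature at the start of every OLE / CFBF compound file.
OLE_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_ole(file: IO[bytes]) -> bool:
    """Return True if the stream starts with the OLE signature.

    Hypothetical helper for illustration only. The stream position is
    restored so the caller can pass the same file object on afterwards.
    """
    start = file.tell()
    try:
        file.seek(0)
        return file.read(len(OLE_SIGNATURE)) == OLE_SIGNATURE
    finally:
        file.seek(start)
```

Because the check only needs the first 8 bytes, it does not depend on how much of the file was read into a buffer, unlike the from_buffer() behavior discussed in this thread.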
Describe the bug
unstructured fails to parse old Word .doc files. I don't think it happens in all cases. See the mechanism of the bug below.
Instead I get: "The MIME type is 'application/CDFV2'. This file type is not currently supported in unstructured."
To Reproduce
where 'blob' contains a .doc file.
Expected behavior
Should parse the doc file
Additional context
This is actually an issue in the underlying magic library, but you should be able to work around it.
In /unstructured/file_utils/filetype.py, the mime_type() function has:
If you use magic.from_buffer(), and the buffer is less than the size of the file, it will return a mime-type of application/CDFV2. If the buffer is larger than the file or you use magic.from_file(), it will correctly return application/msword.
So in this case, if the doc file is larger than 8192 bytes, it will incorrectly return CDFV2. Here's an example: