Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/Fails to parse .doc files in some cases. #3518

Closed Pedrito1968 closed 2 months ago

Pedrito1968 commented 2 months ago

Describe the bug

unstructured fails to parse legacy Word .doc files. I don't think it happens in every case; the mechanism of the bug is described below.

Instead I get: "The MIME type is 'application/CDFV2'. This file type is not currently supported in unstructured."

To Reproduce

file = request.files['blob']
file_bytes = file.read()
file_io = BytesIO(file_bytes)
elements = partition(file=file_io)

where 'blob' contains a .doc file.

Expected behavior

The .doc file should be parsed.

Additional context

This is actually an issue in the underlying magic library, but you should be able to work around it.

In /unstructured/file_utils/filetype.py, the mime_type() function has:

        mime_type = (
            magic.from_file(file_path, mime=True)
            if file_path
            else magic.from_buffer(self.file_head, mime=True)
        )

If you use magic.from_buffer() and the buffer is smaller than the file, it returns a MIME type of application/CDFV2. If the buffer is at least as large as the file, or you use magic.from_file(), it correctly returns application/msword. The buffer comes from file_head, which reads only the first 8192 bytes:

def file_head(self) -> bytes:
    """The initial bytes of the file to be recognized, for use with libmagic detection."""
    with self.open() as file:
        return file.read(8192)

So in this case, if the doc file is larger than 8192 bytes, it will incorrectly return CDFV2. Here's an example:

Python 3.8.10 (default, Jul 29 2024, 17:02:10)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import magic
>>> magic.from_buffer(open('things.doc', mode='rb').read(8192), mime=True)
'application/CDFV2'
>>> magic.from_buffer(open('things.doc', mode='rb').read(65536), mime=True)
'application/msword'
>>> magic.from_file('things.doc', mime=True)
'application/msword'
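
A minimal sketch of one way to avoid the truncated buffer shown above, assuming the whole stream fits in memory: read the full contents for detection instead of only the first 8192 bytes, then restore the stream position. The helper name `read_full_for_detection` is hypothetical and is not part of unstructured or magic; its result would be what gets handed to `magic.from_buffer(..., mime=True)`.

```python
from io import BytesIO
from typing import IO


def read_full_for_detection(file: IO[bytes]) -> bytes:
    """Return the entire contents of `file`, leaving its position unchanged.

    Hypothetical helper for illustration; not part of unstructured.
    """
    position = file.tell()
    try:
        file.seek(0)
        return file.read()
    finally:
        # Restore the caller's position even if the read fails.
        file.seek(position)


# Stand-in payload larger than the 8192-byte head used for detection.
stream = BytesIO(b"\x00" * 20_000)
stream.seek(100)
data = read_full_for_detection(stream)
# `data` could then be passed to magic.from_buffer(data, mime=True).
```

Reading the whole stream trades memory for accuracy, so it may not suit very large uploads.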

Pedrito1968 commented 2 months ago

After reviewing the magic code, I'm pretty sure this will affect any legacy Office formats.

scanny commented 2 months ago

@Pedrito1968 do you have an example of code and a file that is not properly identified?

Also, note that the file-type auto-detection code changed quite a bit pretty recently, so make sure you're using the current version :)

Pedrito1968 commented 2 months ago

The code in the To Reproduce section will do it. Any .doc file (not .docx).

The code I pasted from Unstructured is from the main branch in the repo. I copied and pasted it yesterday. So if it hasn't changed since yesterday, it's still broken.

Again, as I demonstrated here, the issue is in magic and how you're using the from_buffer():

>>> import magic
>>> magic.from_buffer(open('things.doc', mode='rb').read(8192), mime=True)
'application/CDFV2'
>>> magic.from_buffer(open('things.doc', mode='rb').read(65536), mime=True)
'application/msword'
>>> magic.from_file('things.doc', mime=True)
'application/msword'

To work around this on my end, I wrote the data to an actual file and called partition(filename=file_path) instead of passing file=, since that code path uses magic.from_file(), which returns the correct MIME type.
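
The workaround described above could look roughly like the sketch below. The helper name `partition_via_temp_file` and its parameters are hypothetical; the partition function is passed in as an argument so the sketch does not assume anything about unstructured beyond the `filename=` keyword used in this thread.

```python
import os
import tempfile
from typing import Any, Callable


def partition_via_temp_file(
    file_bytes: bytes,
    partition_fn: Callable[..., Any],
    suffix: str = ".doc",
) -> Any:
    """Write `file_bytes` to a temporary file and call `partition_fn(filename=...)`.

    Hypothetical helper for illustration; not part of unstructured.
    """
    # delete=False so the file can be reopened by name after it is closed.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(file_bytes)
        tmp_path = tmp.name
    try:
        return partition_fn(filename=tmp_path)
    finally:
        # Clean up the temporary file whether or not partitioning succeeded.
        os.remove(tmp_path)
```

With the upload from the To Reproduce snippet, this would be called as `partition_via_temp_file(file_bytes, partition)`.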

scanny commented 2 months ago

What version of unstructured are you using?

magic should no longer be consulted for .doc files.

Pedrito1968 commented 2 months ago

I'm using 0.11.8. If you're not using magic anymore, then it's not an issue. I've worked around it.

So I guess your "main" branch isn't your release branch? Because your main branch is using magic:

file_utils/filetype.py

import importlib.util
import os
import re
import zipfile
from typing import IO, Callable, Iterator, Optional
import filetype as ft
from typing_extensions import ParamSpec
from unstructured.documents.elements import Element
from unstructured.file_utils.encoding import detect_file_encoding, format_encoding_str
from unstructured.file_utils.model import FileType
from unstructured.logger import logger
from unstructured.nlp.patterns import EMAIL_HEAD_RE, LIST_OF_DICTS_PATTERN
from unstructured.partition.common import (
    add_element_metadata,
    exactly_one,
    remove_element_metadata,
    set_element_hierarchy,
)
from unstructured.utils import get_call_args_applying_defaults, lazyproperty

LIBMAGIC_AVAILABLE = bool(importlib.util.find_spec("magic"))

scanny commented 2 months ago

@Pedrito1968 latest is 0.15.4. 0.11.8 is from seven months ago or so.

Libmagic is still available when it's installed and we still use it, but if you trace through the code path, we don't rely on it for DOC files. A DOC file is contained in an OLE (a.k.a. CFBF) "package", vaguely like a Microsoft Zip format, and it's easy and reliable to detect those. That narrows the choices down to DOC, PPT, XLS, or MSG, and we use the filetype package to distinguish between those because it's more reliable for the OLE subtypes.
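
To illustrate why the OLE container itself is easy to detect: every Compound File Binary Format file begins with the same 8-byte signature, so only the first few bytes of the stream are needed. The sketch below is an assumption about how such a check could be written, not the actual unstructured implementation, and the name `looks_like_ole` is hypothetical. It does not distinguish DOC from PPT, XLS, or MSG.

```python
from io import BytesIO
from typing import IO

# Signature that opens every OLE / Compound File Binary Format container.
OLE_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_ole(file: IO[bytes]) -> bool:
    """Return True when the stream starts with the OLE/CFBF signature.

    Hypothetical helper for illustration; the stream position is restored.
    """
    position = file.tell()
    try:
        file.seek(0)
        return file.read(len(OLE_SIGNATURE)) == OLE_SIGNATURE
    finally:
        file.seek(position)
```

Because the check needs only 8 bytes, it is unaffected by how much of the file was buffered.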