Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/<Ingestion error to process attachments for .msg files> #3284

Closed mahmoudaymo closed 3 months ago

mahmoudaymo commented 3 months ago

Describe the bug

When ingesting ".msg" and ".eml" files, I get an error while processing the attachments. However, if I call partition_msg directly on the failed file, everything works fine.

To Reproduce

Case 1, using ingestion:

partition_config = PartitionConfig(
    strategy="hi_res",
    additional_partition_args=dict(process_attachments=True, include_page_breaks=True, analysis=True),
    hi_res_model_name="yolox",
)

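Case 2, using partition_msg:

from unstructured.partition.auto import partition
from unstructured.partition.msg import partition_msg

# Partition the email directly and hand each attachment to the auto partitioner.
elements = partition_msg(
    filename="filename.msg",
    process_attachments=True,
    attachment_partitioner=partition,
)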

Expected behavior

All attachments should be processed correctly, but I get an error in Case 1:

2024-06-24 16:18:06,139 SpawnPoolWorker-18 INFO     Processing filename.msg
2024-06-24 16:18:06,139 SpawnPoolWorker-18 DEBUG    Using local partition
2024-06-24 16:18:06,143 SpawnPoolWorker-15 ERROR    failed to partition doc: {"processor_config": {"reprocess": false, "verbose": true, "work_dir": "/app", "output_dir": "structured-output", "num_processes": 10, "raise_on_error": false}, "read_config": {"download_dir": "", "re_download": false, "preserve_downloads": false, "download_only": false, "max_docs": null}, "connector_config": {"input_path": "/path_to_data", "recursive": true, "file_glob": ["*.msg", "*.eml"]}, "_source_metadata": {"date_created": "2024-06-24 13:20:36.803776", "date_modified": "2024-06-23 08:59:21.073290", "version": null, "source_url": "filename.msg", "exists": true, "permissions_data": [{"mode": 33279}]}, "_date_processed": null, "path": "filename.msg", "registry_name": "local", "base_filename": "filename.msg", "filename": "filename.msg", "_output_filename": "structured-output/filename.msg.json", "record_locator": null, "unique_id": "/app/GLAnswers/data/00001/DOC0000000041.msg", "date_created": "2024-06-24 13:20:36.803776", "date_modified": "2024-06-23 08:59:21.073290", "date_processed": null, "exists": true, "permissions_data": [{"mode": 33279}], "version": null, "source_url": "filename.msg"}, Error in partitioning content: Package not found at '/tmp/tmpzgmhqsg0/PGESpreadvaluationdefinition.docx'
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/unstructured/ingest/error.py", line 19, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/unstructured/ingest/interfaces.py", line 559, in partition_file
    elements = partition(
  File "/usr/local/lib/python3.10/dist-packages/unstructured/partition/auto.py", line 355, in partition
    elements = _partition_msg(
  File "/usr/local/lib/python3.10/dist-packages/unstructured/documents/elements.py", line 593, in wrapper
    elements = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/unstructured/file_utils/filetype.py", line 626, in wrapper
    elements = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/unstructured/file_utils/filetype.py", line 582, in wrapper
    elements = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/unstructured/partition/msg.py", line 67, in partition_msg
    return list(
  File "/usr/local/lib/python3.10/dist-packages/unstructured/partition/lang.py", line 399, in apply_lang_metadata
    elements = list(elements)
  File "/usr/local/lib/python3.10/dist-packages/unstructured/partition/msg.py", line 218, in iter_message_elements
    yield from cls(opts)._iter_message_elements()
  File "/usr/local/lib/python3.10/dist-packages/unstructured/partition/msg.py", line 228, in _iter_message_elements
    yield from _AttachmentPartitioner.iter_elements(attachment, self._opts)
  File "/usr/local/lib/python3.10/dist-packages/unstructured/partition/msg.py", line 277, in _iter_elements
    for element in partition(
  File "/usr/local/lib/python3.10/dist-packages/unstructured/partition/auto.py", line 312, in partition
    elements = _partition_doc(
  File "/usr/local/lib/python3.10/dist-packages/unstructured/documents/elements.py", line 593, in wrapper
    elements = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/unstructured/file_utils/filetype.py", line 626, in wrapper
    elements = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/unstructured/file_utils/filetype.py", line 582, in wrapper
    elements = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/unstructured/partition/doc.py", line 103, in partition_doc
    elements = partition_docx(
  File "/usr/local/lib/python3.10/dist-packages/unstructured/documents/elements.py", line 593, in wrapper
    elements = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/unstructured/file_utils/filetype.py", line 626, in wrapper
    elements = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/unstructured/file_utils/filetype.py", line 582, in wrapper
    elements = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/unstructured/partition/docx.py", line 170, in partition_docx
    elements = _DocxPartitioner.iter_document_elements(opts)
  File "/usr/local/lib/python3.10/dist-packages/unstructured/partition/docx.py", line 385, in iter_document_elements
    if self._document_contains_sections
  File "/usr/local/lib/python3.10/dist-packages/unstructured/utils.py", line 187, in __get__
    value = self._fget(obj)
  File "/usr/local/lib/python3.10/dist-packages/unstructured/partition/docx.py", line 553, in _document_contains_sections
    return bool(self._document.sections)
  File "/usr/local/lib/python3.10/dist-packages/unstructured/utils.py", line 187, in __get__
    value = self._fget(obj)
  File "/usr/local/lib/python3.10/dist-packages/unstructured/partition/docx.py", line 543, in _document
    return self._opts.document
  File "/usr/local/lib/python3.10/dist-packages/unstructured/utils.py", line 187, in __get__
    value = self._fget(obj)
  File "/usr/local/lib/python3.10/dist-packages/unstructured/partition/docx.py", line 225, in document
    return docx.Document(self._docx_file)
  File "/usr/local/lib/python3.10/dist-packages/docx/api.py", line 27, in Document
    document_part = cast("DocumentPart", Package.open(docx).main_document_part)
  File "/usr/local/lib/python3.10/dist-packages/docx/opc/package.py", line 127, in open
    pkg_reader = PackageReader.from_file(pkg_file)
  File "/usr/local/lib/python3.10/dist-packages/docx/opc/pkgreader.py", line 22, in from_file
    phys_reader = PhysPkgReader(pkg_file)
  File "/usr/local/lib/python3.10/dist-packages/docx/opc/phys_pkg.py", line 21, in __new__
    raise PackageNotFoundError("Package not found at '%s'" % pkg_file)
docx.opc.exceptions.PackageNotFoundError: Package not found at '/tmp/tmpzgmhqsg0/PGESpreadvaluationdefinition.docx'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/unstructured/ingest/pipeline/partition.py", line 48, in run
    elements = doc.process_file(
  File "/usr/local/lib/python3.10/dist-packages/unstructured/ingest/interfaces.py", line 597, in process_file
    elements = self.partition_file(partition_config=partition_config, **partition_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/unstructured/ingest/error.py", line 22, in wrapper
    raise cls(cls.error_string.format(str(error))) from error
unstructured.ingest.error.PartitionError: Error in partitioning content: Package not found at '/tmp/tmpzgmhqsg0/PGESpreadvaluationdefinition.docx'

This happens when the email has an attachment of one of the following file types: '.doc', '.docx', '.ppt', '.pptx', '.xls', '.xlsx'.

In Case 2, the same file is partitioned without any error.

Environment Info

OS version: Linux-5.14.0-362.24.1.el9_3.x86_64-x86_64-with-glibc2.35
Python version: 3.10.12
unstructured version: 0.14.8
unstructured-inference version: 0.7.36
pytesseract version: 0.3.10
Torch version: 2.3.1
Detectron2 version: 0.6
PaddleOCR version: 2.7.3
Traceback (most recent call last):
  File "/app/GLAnswers/scripts/collect_env.py", line 242, in <module>
    main()
  File "scripts/collect_env.py", line 224, in main
    libmagic_version = get_libmagic_version()
  File "scripts/collect_env.py", line 146, in get_libmagic_version
    result = subprocess.run(
  File "/usr/lib/python3.10/subprocess.py", line 503, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib/python3.10/subprocess.py", line 971, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.10/subprocess.py", line 1863, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'file'

scanny commented 3 months ago

Hi @mahmoudaymo, is this behavior reliably reproducible? Like does it happen every time you run the ingest on that particular file or only some of the time?

Try adding num_processes=1 in the ingest call. If you update the post to show the full Python code you're using for the ingest call I may be able to direct you where it goes more precisely.

Let us know how you go and we'll take it from there.

mahmoudaymo commented 3 months ago
from unstructured.ingest.connector.local import SimpleLocalConfig
from unstructured.ingest.interfaces import (
    ChunkingConfig,
    PartitionConfig,
    ProcessorConfig,
    ReadConfig,
)
from unstructured.ingest.runner import LocalRunner

processor_config = ProcessorConfig(
    work_dir="work_dir",
    num_processes=10,
    output_dir="output_dir",
    verbose=True,
)
connector_config = SimpleLocalConfig(
    input_path="input_path",
    recursive=True,
    file_glob=["*.msg", "*.eml"],
)
read_config = ReadConfig()
partition_config = PartitionConfig(
    strategy="hi_res",
    additional_partition_args=dict(process_attachments=True, infer_table_structure=True, include_page_breaks=True, analysis=True),
    hi_res_model_name="yolox",
)

chunking_config = ChunkingConfig(
    chunking_strategy="by_title",
    combine_text_under_n_chars=500,
    max_characters=1500,
    include_orig_elements=False,
    multipage_sections=True,
    new_after_n_chars=1000,
    overlap=150,
    overlap_all=False,
)

runner = LocalRunner(
    processor_config=processor_config,
    connector_config=connector_config,
    read_config=read_config,
    partition_config=partition_config,
    chunking_config=chunking_config,
)
runner.run()

When I add attachment_partitioner=partition to additional_partition_args, I get a serialization error. And yes, I get the error every time I run the code.
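
One plausible reading of that serialization error, not confirmed anywhere in this thread, is that the ingest configuration gets serialized to JSON (the error log above prints the config as JSON), and a function object cannot be encoded that way. The sketch below uses only the standard library to show the general failure mode. The attachment_partitioner function is a hypothetical stand-in for partition, and json_error is a helper invented for this illustration; neither is part of unstructured.

```python
import json


def attachment_partitioner(filename):
    """Hypothetical stand-in for `partition`; it only needs to be a function object."""
    return []


def json_error(obj):
    """Return the TypeError message raised by json.dumps(obj), or None on success."""
    try:
        json.dumps(obj)
    except TypeError as exc:
        return str(exc)
    return None


# Plain values encode fine; adding a callable makes the same dict unencodable.
plain_args = {"process_attachments": True, "include_page_breaks": True}
args_with_callable = dict(plain_args, attachment_partitioner=attachment_partitioner)

print(json_error(plain_args))
print(json_error(args_with_callable))
```

If this is indeed the mechanism, it would explain why the callable can be passed to partition_msg directly but not through additional_partition_args.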

mahmoudaymo commented 3 months ago

Setting num_processes=1 seems to solve the problem. Is there any way to find the optimal number of processes to use? With one process it is much slower.

Also, is there a way to add a cleaning step before or after chunking/partitioning?

scanny commented 3 months ago

@mahmoudaymo Setting the num_processes to 1 was purely for diagnostic purposes. The problem appears to be:

The fix for this is in #3287 which should be along shortly :)

After that, I'd say set num_processes to no more than the number of physical cores you have on the machine and then back off one at a time until you see a noticeable performance degradation. Often running all cores at 100% is not the most efficient configuration.
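
The tuning procedure described above can be sketched roughly as follows. This is only an illustration under stated assumptions: candidate_process_counts and pick_num_processes are hypothetical helpers, not part of unstructured, and because the standard library reports logical CPUs only, the fallback guesses physical cores by assuming two hardware threads per core. Pass the real physical core count if you know it.

```python
import os


def candidate_process_counts(physical_cores=None):
    """List num_processes values to try, starting at the physical core count
    and backing off one at a time down to 1."""
    if physical_cores is None:
        # Assumption: os.cpu_count() counts logical CPUs, two per physical core.
        logical = os.cpu_count() or 1
        physical_cores = max(1, logical // 2)
    return list(range(physical_cores, 0, -1))


def pick_num_processes(timings):
    """Given {num_processes: wall-clock seconds} measured on the same sample,
    return the setting that finished fastest."""
    return min(timings, key=timings.get)


print(candidate_process_counts())
```

In practice you would run the ingest on a representative sample of files once per candidate value, record the wall-clock time for each run, and pass the resulting mapping to pick_num_processes.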

mahmoudaymo commented 3 months ago

Thank you very much for the detailed explanation, it all makes sense now.