Closed mahmoudaymo closed 3 months ago
Hi @mahmoudaymo, is this behavior reliably reproducible? Like does it happen every time you run the ingest on that particular file or only some of the time?
Try adding num_processes=1 in the ingest call. If you update the post to show the full Python code you're using for the ingest call, I may be able to direct you more precisely to where it goes.
Let us know how you go and we'll take it from there.
from unstructured.ingest.connector.local import SimpleLocalConfig
from unstructured.ingest.interfaces import (
    ChunkingConfig,
    PartitionConfig,
    ProcessorConfig,
    ReadConfig,
)
from unstructured.ingest.runner import LocalRunner

processor_config = ProcessorConfig(
    work_dir="work_dir",
    num_processes=10,
    output_dir="output_dir",
    verbose=True,
)
connector_config = SimpleLocalConfig(
    input_path="input_path",
    recursive=True,
    file_glob=["*.msg", "*.eml"],
)
read_config = ReadConfig()
partition_config = PartitionConfig(
    strategy="hi_res",
    additional_partition_args=dict(
        process_attachments=True,
        infer_table_structure=True,
        include_page_breaks=True,
        analysis=True,
    ),
    hi_res_model_name="yolox",
)
chunking_config = ChunkingConfig(
    chunking_strategy="by_title",
    combine_text_under_n_chars=500,
    max_characters=1500,
    include_orig_elements=False,
    multipage_sections=True,
    new_after_n_chars=1000,
    overlap=150,
    overlap_all=False,
)
runner = LocalRunner(
    processor_config=processor_config,
    connector_config=connector_config,
    read_config=read_config,
    partition_config=partition_config,
    chunking_config=chunking_config,
)
runner.run()
When I add attachment_partitioner=partition to additional_partition_args, I get a serialization error.
Yes, I get the error every time I run the code.
Setting num_processes=1 seems to solve the problem. Is there a way to find the optimal number of processes to use? With one process it is much slower.
Also, is there a way to add a cleaning step before or after chunking/partitioning?
@mahmoudaymo Setting num_processes to 1 was purely for diagnostic purposes. The problem appears to be:
.doc files as attachments to the emails you're processing
partition_docx()
The fix for this is in #3287 which should be along shortly :)
After that, I'd say set num_processes to no more than the number of physical cores you have on the machine and then back off one at a time until you see a noticeable performance degradation. Often running all cores at 100% is not the most efficient configuration.
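The back-off procedure described above could be sketched roughly as follows. This is only an illustration, not part of unstructured: pick_num_processes, run_ingest, and tolerance are hypothetical names, and run_ingest is assumed to be a callable you write yourself that runs the ingest on a small representative sample with the given process count. Note that os.cpu_count() reports logical cores, so on machines with hyper-threading you may want to pass the physical core count explicitly as max_procs.

```python
import os
import time


def pick_num_processes(run_ingest, max_procs=None, tolerance=0.10, timer=time.perf_counter):
    """Start at max_procs and back off one process at a time.

    run_ingest(n) is a hypothetical callable that runs the workload with n
    processes. The search stops once a run is more than `tolerance` (as a
    fraction) slower than the best run seen so far.
    """
    if max_procs is None:
        # Assumption: logical core count as the upper bound; pass the
        # physical core count yourself if you know it.
        max_procs = os.cpu_count() or 1
    if max_procs < 1:
        raise ValueError("max_procs must be at least 1")

    best_n, best_time = None, None
    for n in range(max_procs, 0, -1):
        start = timer()
        run_ingest(n)
        elapsed = timer() - start
        if best_time is None or elapsed < best_time:
            # Fewer processes were at least as fast: keep backing off.
            best_n, best_time = n, elapsed
        elif elapsed > best_time * (1 + tolerance):
            # Noticeable degradation: stop and keep the best setting so far.
            break
    return best_n
```

In the script earlier in this thread, run_ingest could build a ProcessorConfig with num_processes=n and call runner.run() on a handful of sample emails. Timing a single run per setting is noisy, so repeating each measurement a few times is probably worthwhile.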
Thank you very much for the detailed explanation, it all makes sense now.
Describe the bug
When ingesting ".msg" and ".eml" files, I get an error while processing the attachments. However, if I use partition_msg on the failed file, everything works fine.
To Reproduce
Case 1, using ingestion
Case 2, using partition_msg
Expected behavior
All attachments should be processed correctly, but I get an error in Case 1:
It happens when the email has one of the following file types attached: '.doc', '.docx', '.ppt', '.pptx', '.xls', '.xlsx'
In Case 2, it partitions without any error.
Environment Info (output of python scripts/collect_env.py):
OS version: Linux-5.14.0-362.24.1.el9_3.x86_64-x86_64-with-glibc2.35
Python version: 3.10.12
unstructured version: 0.14.8
unstructured-inference version: 0.7.36
pytesseract version: 0.3.10
Torch version: 2.3.1
Detectron2 version: 0.6
PaddleOCR version: 2.7.3
Traceback (most recent call last):
  File "/app/GLAnswers/scripts/collect_env.py", line 242, in <module>
    main()
  File "scripts/collect_env.py", line 224, in main
    libmagic_version = get_libmagic_version()
  File "scripts/collect_env.py", line 146, in get_libmagic_version
    result = subprocess.run(
  File "/usr/lib/python3.10/subprocess.py", line 503, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib/python3.10/subprocess.py", line 971, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.10/subprocess.py", line 1863, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'file'