IBM / data-prep-kit

Open source project for data preparation of LLM application builders
https://ibm.github.io/data-prep-kit/
Apache License 2.0

[Bug] pdf2parquet ray version erroring out when downloading models for the very first time #667

Open sujee opened 1 week ago

sujee commented 1 week ago

Search before asking

Component

Tools/ingest2parquet

What happened + What you expected to happen

This happens when running the Ray version with NUM_WORKERS > 1. It is reliably reproducible in Google Colab. Running the cell again works, but the first-run failure is a negative user experience.

(orchestrate pid=1575) 05:41:45 ERROR - Failed to process request worker exception The actor died because of an error raised in its creation task, ray::RayTransformFileProcessor.__init__() (pid=1784, ip=172.28.0.12, actor_id=09c62ae6504057816b30599401000000, repr=<data_processing_ray.runtime.ray.transform_file_processor.RayTransformFileProcessor object at 0x7ee7e55fbc40>)
(orchestrate pid=1575)   File "/usr/local/lib/python3.10/dist-packages/data_processing_ray/runtime/ray/transform_file_processor.py", line 46, in __init__
(orchestrate pid=1575)     self.transform = params.get("transform_class", None)(self.transform_params)
(orchestrate pid=1575)   File "/usr/local/lib/python3.10/dist-packages/pdf2parquet_transform_ray.py", line 40, in __init__
(orchestrate pid=1575)     super().__init__(config)
(orchestrate pid=1575)   File "/usr/local/lib/python3.10/dist-packages/pdf2parquet_transform.py", line 105, in __init__
(orchestrate pid=1575)     self._converter = DocumentConverter(
(orchestrate pid=1575)   File "/usr/local/lib/python3.10/dist-packages/docling/document_converter.py", line 54, in __init__
(orchestrate pid=1575)     self.model_pipeline = pipeline_cls(
(orchestrate pid=1575)   File "/usr/local/lib/python3.10/dist-packages/docling/pipeline/standard_model_pipeline.py", line 18, in __init__
(orchestrate pid=1575)     EasyOcrModel(
(orchestrate pid=1575)   File "/usr/local/lib/python3.10/dist-packages/docling/models/easyocr_model.py", line 21, in __init__
(orchestrate pid=1575)     self.reader = easyocr.Reader(config["lang"])
(orchestrate pid=1575)   File "/usr/local/lib/python3.10/dist-packages/easyocr/easyocr.py", line 92, in __init__
(orchestrate pid=1575)     detector_path = self.getDetectorPath(detect_network)
(orchestrate pid=1575)   File "/usr/local/lib/python3.10/dist-packages/easyocr/easyocr.py", line 253, in getDetectorPath
(orchestrate pid=1575)     download_and_unzip(self.detection_models[self.detect_network]['url'], self.detection_models[self.detect_network]['filename'], self.model_storage_directory, self.verbose)
(orchestrate pid=1575)   File "/usr/local/lib/python3.10/dist-packages/easyocr/utils.py", line 631, in download_and_unzip
(orchestrate pid=1575)     os.remove(zip_path)
(orchestrate pid=1575) FileNotFoundError: [Errno 2] No such file or directory: '/root/.EasyOCR//model/temp.zip'

Reproduction script

https://github.com/sujee/data-prep-kit-examples/blob/main/dpk-intro/dpk_intro_1_ray.ipynb

Use open-in-colab link : https://colab.research.google.com/github/sujee/data-prep-kit-examples/blob/main/dpk-intro/dpk_intro_1_ray.ipynb

Anything else

No response

OS

Other

Python

3.11.x

Are you willing to submit a PR?

blublinsky commented 1 week ago

the error is quite obvious:

FileNotFoundError: [Errno 2] No such file or directory: '/root/.EasyOCR//model/temp.zip'

Either the file does not exist or the location is wrong.

sujee commented 1 week ago

Yes, the error is quite obvious 🤣 My suspicion is that it's caused by a race condition between workers trying to clean up downloaded artifacts.

Adding: I see this consistently on Google Colab, because each notebook gets its own sandbox.
To reproduce it locally, please delete the cache directory of downloaded artifacts (I am not sure where this is; it is probably managed by docling?)
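
For illustration, here is a minimal sketch of one way to avoid this kind of race, assuming the cause is what the traceback suggests: several workers download to the same temp.zip and each then calls os.remove on it. This is not EasyOCR's or docling's code. The names fetch_model_once, fake_download, and detector.pth are hypothetical and only stand in for the real download step. Each worker writes to its own temporary file and then renames it into place, so no worker removes a file that a sibling still needs.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def fetch_model_once(model_dir, filename, download):
    """Populate model_dir/filename safely when many workers start together.

    `download` is a hypothetical callable that writes the model bytes to the
    path it is given; it stands in for the real network download.
    """
    os.makedirs(model_dir, exist_ok=True)
    final_path = os.path.join(model_dir, filename)
    if os.path.exists(final_path):
        # Another worker (or an earlier run) already finished the download.
        return final_path

    # Each worker gets its own uniquely named temp file, unlike a shared
    # temp.zip that several workers create and delete concurrently.
    fd, tmp_path = tempfile.mkstemp(dir=model_dir, suffix=".part")
    os.close(fd)
    try:
        download(tmp_path)
        # Rename into place; atomic when both paths are on one filesystem.
        os.replace(tmp_path, final_path)
    finally:
        # Only reached with a leftover file if the download itself failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
    return final_path


def fake_download(path):
    """Hypothetical stand-in for the real model download."""
    with open(path, "wb") as f:
        f.write(b"model-bytes")


if __name__ == "__main__":
    # Simulate four workers starting at once against an empty cache.
    with tempfile.TemporaryDirectory() as cache_dir:
        with ThreadPoolExecutor(max_workers=4) as pool:
            paths = list(
                pool.map(
                    lambda _: fetch_model_once(cache_dir, "detector.pth", fake_download),
                    range(4),
                )
            )
        print(paths[0], os.listdir(cache_dir))
```

The trade-off in this sketch is that several workers may download the same file redundantly on a cold cache. A simpler workaround under the same assumption would be to warm the cache once in a single process before starting the workers.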

sujee commented 1 week ago

Related: #583

blublinsky commented 1 week ago

Yeah, we know exactly why. It's up to the guys to decide what to do.