IBM / data-prep-kit

Open source project for data preparation for LLM application builders
https://ibm.github.io/data-prep-kit/
Apache License 2.0

Problem with converting pdf files in the intro example when using release 0.2.2.dev2 #767

Closed: shahrokhDaijavad closed this issue 1 week ago

shahrokhDaijavad commented 2 weeks ago

Search before asking

Component

Other

What happened + What you expected to happen

In the notebook example at https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/intro/dpk_intro_1_python.ipynb, we convert two PDF files, earth.pdf and mars.pdf, from the input/solar-system directory. The conversion of these two files to parquet worked fine with the older version of the Docling library in the 0.2.2.dev1 release, but with the latest Docling in 0.2.2.dev2 we get the following reproducible error:

```
11:09:57 ERROR - Fatal error with file file_name='/Users/shahrokhdaijavad/Documents/GitHub/data-prep-kit-testing/examples/notebooks/intro/input/solar-system/earth.pdf'. No results produced.
11:09:57 WARNING - Exception processing file /Users/shahrokhdaijavad/Documents/GitHub/data-prep-kit-testing/examples/notebooks/intro/input/solar-system/earth.pdf:
Traceback (most recent call last):
  File "/opt/anaconda3/envs/data-prep-kit/lib/python3.11/site-packages/data_processing/runtime/transform_file_processor.py", line 79, in process_file
    out_files, stats = self.transform.transform_binary(file_name=f_name, byte_array=filedata)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/data-prep-kit/lib/python3.11/site-packages/pdf2parquet_transform.py", line 371, in transform_binary
    table = pa.Table.from_pylist(data)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 1984, in pyarrow.lib._Tabular.from_pylist
  File "pyarrow/table.pxi", line 6044, in pyarrow.lib._from_pylist
  File "pyarrow/table.pxi", line 4625, in pyarrow.lib.Table.from_arrays
  File "pyarrow/table.pxi", line 1547, in pyarrow.lib._sanitize_arrays
  File "pyarrow/table.pxi", line 1528, in pyarrow.lib._schema_from_arrays
  File "pyarrow/array.pxi", line 355, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 88, in pyarrow.lib.check_status
OverflowError: Python int too large to convert to C long

11:09:57 INFO - Completed 1 files (50.0%) in 0.013 min
11:09:57 ERROR - Fatal error with file file_name='/Users/shahrokhdaijavad/Documents/GitHub/data-prep-kit-testing/examples/notebooks/intro/input/solar-system/mars.pdf'. No results produced.
11:09:57 WARNING - Exception processing file /Users/shahrokhdaijavad/Documents/GitHub/data-prep-kit-testing/examples/notebooks/intro/input/solar-system/mars.pdf:
Traceback (most recent call last):
  File "/opt/anaconda3/envs/data-prep-kit/lib/python3.11/site-packages/data_processing/runtime/transform_file_processor.py", line 79, in process_file
    out_files, stats = self.transform.transform_binary(file_name=f_name, byte_array=filedata)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/data-prep-kit/lib/python3.11/site-packages/pdf2parquet_transform.py", line 371, in transform_binary
    table = pa.Table.from_pylist(data)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 1984, in pyarrow.lib._Tabular.from_pylist
  File "pyarrow/table.pxi", line 6044, in pyarrow.lib._from_pylist
  File "pyarrow/table.pxi", line 4625, in pyarrow.lib.Table.from_arrays
  File "pyarrow/table.pxi", line 1547, in pyarrow.lib._sanitize_arrays
  File "pyarrow/table.pxi", line 1528, in pyarrow.lib._schema_from_arrays
  File "pyarrow/array.pxi", line 355, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 88, in pyarrow.lib.check_status
OverflowError: Python int too large to convert to C long
```

```
11:09:57 INFO - Completed 2 files (100.0%) in 0.021 min
11:09:57 INFO - Done processing 2 files, waiting for flush() completion.
11:09:57 INFO - done flushing in 0.0 sec
Traceback (most recent call last):
  File "/opt/anaconda3/envs/data-prep-kit/lib/python3.11/site-packages/data_processing/runtime/pure_python/transform_orchestrator.py", line 131, in orchestrate
    stats["processing_time"] = round(stats["processing_time"], 3)
KeyError: 'processing_time'
11:09:57 ERROR - Exception during execution 'processing_time': None
11:09:57 INFO - Completed execution in 0.085 min, execution result 1
```

```
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
File <timed exec>:40

Exception: ❌ Job failed
```
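The trailing `KeyError` is a secondary symptom: when every file fails, no `processing_time` entry is apparently ever written into the stats dict before the final `round()` reads it back. A minimal sketch of that failure pattern and a defensive variant (hypothetical illustration, not the actual data-prep-kit code):

```python
# When all files fail, no timing entry is ever written, so the stats
# dict is missing the key the final rounding step expects.
stats = {}

# Fragile pattern (as in the traceback):
#   stats["processing_time"] = round(stats["processing_time"], 3)  # KeyError

# Defensive variant: fall back to 0.0 when no file was processed.
stats["processing_time"] = round(stats.get("processing_time", 0.0), 3)
print(stats)  # {'processing_time': 0.0}
```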

### Reproduction script

Run https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/intro/dpk_intro_1_python.ipynb after pip-installing release 0.2.2.dev2.

### Anything else

_No response_

### OS

MacOS (limited support)

### Python

3.11.x

### Are you willing to submit a PR?

- [ ] Yes I am willing to submit a PR!
santoshborse commented 2 weeks ago

I also experienced this issue with 0.2.2.dev2. It works fine with 0.2.2.dev1, though model loading seemed slow (it took around 3 minutes).

sujee commented 2 weeks ago

Confirming with both py311 and py312 envs.

After installing 0.2.2.dev2 I had the following packages:

```
data_prep_toolkit            0.2.2.dev2
data_prep_toolkit_transforms 0.2.2.dev2

docling                      2.3.1
docling-core                 2.3.0
docling-ibm-models           2.0.3
docling-parse                2.0.2
```
dolfim-ibm commented 1 week ago

Confirmed, I can reproduce it locally, using a fresh venv with the following packages:

```shell
python3.11 -m venv venv
source venv/bin/activate

pip install \
    'data-prep-toolkit[ray]==0.2.2.dev2' \
    'data-prep-toolkit-transforms[ray,pdf2parquet,doc_id,doc_chunk,ededup,text_encoder]==0.2.2.dev2'

pip install jupyterlab ipykernel ipywidgets
```
dolfim-ibm commented 1 week ago

The issue seems to be related to pyarrow, which is not able to cast a large uint64 integer to the right type. The new transform uses efficient uint64 hashing for the binary content, which triggers this issue.

This is a minimal example:

```python
import pyarrow as pa
pa.Table.from_pylist([{"binary_hash": 17915699055171962696}])
```
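Two possible workarounds, both assumptions rather than the project's actual fix: pass an explicit `pa.uint64()` schema to `from_pylist` so pyarrow skips its default int64 type inference, or reinterpret the unsigned hash as a signed 64-bit value before building the table. The latter, using the hash value from the example above, sketched in pure Python:

```python
import struct

# pyarrow's default inference targets int64, whose maximum is 2**63 - 1;
# the uint64 hash above exceeds it, hence the OverflowError.
INT64_MAX = 2**63 - 1
binary_hash = 17915699055171962696
assert binary_hash > INT64_MAX

# Reinterpret the unsigned 64-bit hash as a signed 64-bit value.
# The bit pattern is unchanged, so the round-trip is lossless.
signed = struct.unpack("<q", struct.pack("<Q", binary_hash))[0]
restored = struct.unpack("<Q", struct.pack("<q", signed))[0]

print(signed)                   # -531045018537588920
print(restored == binary_hash)  # True
```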
dolfim-ibm commented 1 week ago

PR with fix: https://github.com/IBM/data-prep-kit/pull/793