Closed shahrokhDaijavad closed 1 week ago
I also experienced this issue with 0.2.2.dev2.
It works fine with 0.2.2.dev1, although model loading seemed slow (it took around 3 minutes).
Confirmed with both py311 and py312 envs.
After installing 0.2.2.dev2 I had the following packages:
data_prep_toolkit 0.2.2.dev2
data_prep_toolkit_transforms 0.2.2.dev2
docling 2.3.1
docling-core 2.3.0
docling-ibm-models 2.0.3
docling-parse 2.0.2
Confirmed, I can reproduce it locally using a fresh venv with the following packages:
python3.11 -m venv venv
source venv/bin/activate
pip install \
'data-prep-toolkit[ray]==0.2.2.dev2' \
'data-prep-toolkit-transforms[ray,pdf2parquet,doc_id,doc_chunk,ededup,text_encoder]==0.2.2.dev2'
pip install jupyterlab ipykernel ipywidgets
The issue seems to be related to pyarrow, which is not able to convert large uint64 integers to the right type when it builds a table. The new transform uses an efficient uint64 hash of the binary content, and those hash values trigger the error.
This is a minimal example:
import pyarrow as pa
pa.Table.from_pylist([{"binary_hash": 17915699055171962696}])
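A quick check of why this particular value is a problem. This is a sketch that assumes the failing conversion targets a signed 64-bit integer, which the "C long" wording of the OverflowError hints at but this thread does not confirm. It uses only the standard library, and the variable names are illustrative:

```python
import struct

# Limits of signed and unsigned 64-bit integers.
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1
UINT64_MAX = 2**64 - 1

# Hash value taken from the minimal example above.
value = 17915699055171962696

fits_int64 = INT64_MIN <= value <= INT64_MAX
fits_uint64 = 0 <= value <= UINT64_MAX

# struct enforces the same limits: "<q" is signed 64-bit, "<Q" is unsigned 64-bit.
try:
    struct.pack("<q", value)
    packs_as_int64 = True
except (struct.error, OverflowError):
    packs_as_int64 = False

# The value survives a round trip through an unsigned 64-bit field.
roundtrip = struct.unpack("<Q", struct.pack("<Q", value))[0]

print(fits_int64, fits_uint64, packs_as_int64, roundtrip == value)
```

The value is above 2**63 - 1 but below 2**64, so it is a valid uint64 that does not fit in a signed 64-bit integer.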
PR with fix: https://github.com/IBM/data-prep-kit/pull/793
Component
Other
What happened + What you expected to happen
In the notebook example here: https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/intro/dpk_intro_1_python.ipynb, we convert two PDF files, earth.pdf and mars.pdf, from the input/solar-system directory. Converting these two files to parquet worked without problems with the older version of the Docling library in the 0.2.2.dev1 release, but with the latest Docling in 0.2.2.dev2 we get the following reproducible error:
11:09:57 ERROR - Fatal error with file file_name='/Users/shahrokhdaijavad/Documents/GitHub/data-prep-kit-testing/examples/notebooks/intro/input/solar-system/earth.pdf'. No results produced.
11:09:57 WARNING - Exception processing file /Users/shahrokhdaijavad/Documents/GitHub/data-prep-kit-testing/examples/notebooks/intro/input/solar-system/earth.pdf: Traceback (most recent call last):
  File "/opt/anaconda3/envs/data-prep-kit/lib/python3.11/site-packages/data_processing/runtime/transform_file_processor.py", line 79, in process_file
    out_files, stats = self.transform.transform_binary(file_name=f_name, byte_array=filedata)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/data-prep-kit/lib/python3.11/site-packages/pdf2parquet_transform.py", line 371, in transform_binary
    table = pa.Table.from_pylist(data)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 1984, in pyarrow.lib._Tabular.from_pylist
  File "pyarrow/table.pxi", line 6044, in pyarrow.lib._from_pylist
  File "pyarrow/table.pxi", line 4625, in pyarrow.lib.Table.from_arrays
  File "pyarrow/table.pxi", line 1547, in pyarrow.lib._sanitize_arrays
  File "pyarrow/table.pxi", line 1528, in pyarrow.lib._schema_from_arrays
  File "pyarrow/array.pxi", line 355, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 88, in pyarrow.lib.check_status
OverflowError: Python int too large to convert to C long
11:09:57 INFO - Completed 1 files (50.0%) in 0.013 min
11:09:57 ERROR - Fatal error with file file_name='/Users/shahrokhdaijavad/Documents/GitHub/data-prep-kit-testing/examples/notebooks/intro/input/solar-system/mars.pdf'. No results produced.
11:09:57 WARNING - Exception processing file /Users/shahrokhdaijavad/Documents/GitHub/data-prep-kit-testing/examples/notebooks/intro/input/solar-system/mars.pdf: Traceback (most recent call last):
  File "/opt/anaconda3/envs/data-prep-kit/lib/python3.11/site-packages/data_processing/runtime/transform_file_processor.py", line 79, in process_file
    out_files, stats = self.transform.transform_binary(file_name=f_name, byte_array=filedata)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/data-prep-kit/lib/python3.11/site-packages/pdf2parquet_transform.py", line 371, in transform_binary
    table = pa.Table.from_pylist(data)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 1984, in pyarrow.lib._Tabular.from_pylist
  File "pyarrow/table.pxi", line 6044, in pyarrow.lib._from_pylist
  File "pyarrow/table.pxi", line 4625, in pyarrow.lib.Table.from_arrays
  File "pyarrow/table.pxi", line 1547, in pyarrow.lib._sanitize_arrays
  File "pyarrow/table.pxi", line 1528, in pyarrow.lib._schema_from_arrays
  File "pyarrow/array.pxi", line 355, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 88, in pyarrow.lib.check_status
OverflowError: Python int too large to convert to C long
11:09:57 INFO - Completed 2 files (100.0%) in 0.021 min
11:09:57 INFO - Done processing 2 files, waiting for flush() completion.
11:09:57 INFO - done flushing in 0.0 sec
Traceback (most recent call last):
  File "/opt/anaconda3/envs/data-prep-kit/lib/python3.11/site-packages/data_processing/runtime/pure_python/transform_orchestrator.py", line 131, in orchestrate
    stats["processing_time"] = round(stats["processing_time"], 3)