IBM / data-prep-kit

Open source project for data preparation of LLM application builders
https://ibm.github.io/data-prep-kit/
Apache License 2.0

[Bug] Error while running doc_chunk transform #794

Closed touma-I closed 1 week ago

touma-I commented 2 weeks ago

Component

Other

What happened + What you expected to happen

Error while using doc_chunk transform:

17:03:33 INFO - doc_chunk parameters are : {'chunking_type': <chunking_types.DL_JSON: 'dl_json'>, 'content_column_name': 'contents', 'doc_id_column_name': 'document_id', 'output_chunk_column_name': 'contents', 'output_source_doc_id_column_name': 'source_document_id', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox', 'chunk_size_tokens': 128, 'chunk_overlap_tokens': 30, 'dl_min_chunk_len': None}
17:03:33 INFO - pipeline id pipeline_id
17:03:33 INFO - code location None
17:03:33 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}
17:03:33 INFO - actor creation delay 0
17:03:33 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_chunk', 'job type': 'ray', 'job id': 'job_id'}
17:03:33 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_chunk_out
17:03:33 INFO - data factory data_ max_files -1, n_sample -1
17:03:33 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
17:03:33 INFO - Running locally
2024-11-11 17:03:34,248 INFO worker.py:1777 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 
(orchestrate pid=85221) 17:03:36 INFO - orchestrator started at 2024-11-11 17:03:36
(orchestrate pid=85221) 17:03:36 INFO - Number of files is 2, source profile {'max_file_size': 0.006812095642089844, 'min_file_size': 0.006754875183105469, 'total_file_size': 0.013566970825195312}
(orchestrate pid=85221) 17:03:36 INFO - Cluster resources: {'cpus': 12, 'gpus': 0, 'memory': 15.994012451730669, 'object_store': 2.0}
(orchestrate pid=85221) 17:03:36 INFO - Number of workers - 2 with {'num_cpus': 0.8, 'max_restarts': -1} each
(orchestrate pid=85221) 17:03:39 INFO - Completed 0 files (0.0%)  in 0.0 min. Waiting for completion
(orchestrate pid=85221) 17:03:39 INFO - Completed processing 2 files in 0.0 min
(orchestrate pid=85221) 17:03:39 INFO - done flushing in 0.001 sec
(RayTransformFileProcessor pid=85226) 17:03:39 WARNING - Exception processing file /Users/touma/data-prep-kit-0.2.2.dev2/examples/notebooks/intro/output/01_parquet_out/mars.parquet: Traceback (most recent call last):
(RayTransformFileProcessor pid=85226)   File "/Users/touma/data-prep-kit-0.2.2.dev2/examples/notebooks/rag/venv/lib/python3.11/site-packages/data_processing/runtime/transform_file_processor.py", line 79, in process_file
(RayTransformFileProcessor pid=85226)     out_files, stats = self.transform.transform_binary(file_name=f_name, byte_array=filedata)
(RayTransformFileProcessor pid=85226)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayTransformFileProcessor pid=85226)   File "/Users/touma/data-prep-kit-0.2.2.dev2/examples/notebooks/rag/venv/lib/python3.11/site-packages/data_processing/transform/table_transform.py", line 59, in transform_binary
(RayTransformFileProcessor pid=85226)     out_tables, stats = self.transform(table=table, file_name=file_name)
(RayTransformFileProcessor pid=85226)                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayTransformFileProcessor pid=85226)   File "/Users/touma/data-prep-kit-0.2.2.dev2/examples/notebooks/rag/venv/lib/python3.11/site-packages/doc_chunk_transform.py", line 154, in transform
(RayTransformFileProcessor pid=85226)     table = pa.Table.from_pylist(data)
(RayTransformFileProcessor pid=85226)             ^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayTransformFileProcessor pid=85226)   File "pyarrow/table.pxi", line 1984, in pyarrow.lib._Tabular.from_pylist
(RayTransformFileProcessor pid=85226)   File "pyarrow/table.pxi", line 6044, in pyarrow.lib._from_pylist
(RayTransformFileProcessor pid=85226)   File "pyarrow/table.pxi", line 4625, in pyarrow.lib.Table.from_arrays
(RayTransformFileProcessor pid=85226)   File "pyarrow/table.pxi", line 1547, in pyarrow.lib._sanitize_arrays
(RayTransformFileProcessor pid=85226)   File "pyarrow/table.pxi", line 1528, in pyarrow.lib._schema_from_arrays
(RayTransformFileProcessor pid=85226)   File "pyarrow/array.pxi", line 355, in pyarrow.lib.array
(RayTransformFileProcessor pid=85226)   File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
(RayTransformFileProcessor pid=85226)   File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
(RayTransformFileProcessor pid=85226)   File "pyarrow/error.pxi", line 88, in pyarrow.lib.check_status
(RayTransformFileProcessor pid=85226) OverflowError: Python int too large to convert to C long
(RayTransformFileProcessor pid=85226) 
(RayTransformFileProcessor pid=85227) 
(raylet) [2024-11-11 17:03:44,270 E 85209 25533550] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2024-11-11_17-03-33_299724_85071 is over 95% full, available space: 10144796672; capacity: 249999998976. Object creation will fail if spilling is required.
(RayTransformFileProcessor pid=85227) 17:03:39 WARNING - Exception processing file /Users/touma/data-prep-kit-0.2.2.dev2/examples/notebooks/intro/output/01_parquet_out/earth.parquet: Traceback (most recent call last):
(RayTransformFileProcessor pid=85227)   File "/Users/touma/data-prep-kit-0.2.2.dev2/examples/notebooks/rag/venv/lib/python3.11/site-packages/data_processing/runtime/transform_file_processor.py", line 79, in process_file
(RayTransformFileProcessor pid=85227)     out_files, stats = self.transform.transform_binary(file_name=f_name, byte_array=filedata)
(RayTransformFileProcessor pid=85227)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayTransformFileProcessor pid=85227)   File "/Users/touma/data-prep-kit-0.2.2.dev2/examples/notebooks/rag/venv/lib/python3.11/site-packages/data_processing/transform/table_transform.py", line 59, in transform_binary
(RayTransformFileProcessor pid=85227)     out_tables, stats = self.transform(table=table, file_name=file_name)
(RayTransformFileProcessor pid=85227)                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayTransformFileProcessor pid=85227)   File "/Users/touma/data-prep-kit-0.2.2.dev2/examples/notebooks/rag/venv/lib/python3.11/site-packages/doc_chunk_transform.py", line 154, in transform
(RayTransformFileProcessor pid=85227)     table = pa.Table.from_pylist(data)
(RayTransformFileProcessor pid=85227)             ^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayTransformFileProcessor pid=85227)   File "pyarrow/table.pxi", line 1984, in pyarrow.lib._Tabular.from_pylist
(RayTransformFileProcessor pid=85227)   File "pyarrow/table.pxi", line 6044, in pyarrow.lib._from_pylist
(RayTransformFileProcessor pid=85227)   File "pyarrow/table.pxi", line 4625, in pyarrow.lib.Table.from_arrays
(RayTransformFileProcessor pid=85227)   File "pyarrow/table.pxi", line 1547, in pyarrow.lib._sanitize_arrays
(RayTransformFileProcessor pid=85227)   File "pyarrow/table.pxi", line 1528, in pyarrow.lib._schema_from_arrays
(RayTransformFileProcessor pid=85227)   File "pyarrow/array.pxi", line 355, in pyarrow.lib.array
(RayTransformFileProcessor pid=85227)   File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
(RayTransformFileProcessor pid=85227)   File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
(RayTransformFileProcessor pid=85227)   File "pyarrow/error.pxi", line 88, in pyarrow.lib.check_status
(RayTransformFileProcessor pid=85227) OverflowError: Python int too large to convert to C long
17:03:49 INFO - Completed execution in 0.265 min, execution result 0
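
The `OverflowError` in the traceback is raised inside `pa.Table.from_pylist(data)`, which suggests that at least one integer in the rows passed to it does not fit the integer type Arrow chose for its column. The log does not say which column is responsible. A minimal diagnostic sketch, assuming the limit is the signed 64-bit range and that the rows are plain dicts; `find_int64_overflows` and the `example_hash` column are hypothetical names, not part of data-prep-kit, and only top-level values are checked (nested lists or dicts are not scanned):

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def find_int64_overflows(rows):
    """Return (row_index, column, value) for each int outside the signed 64-bit range."""
    offenders = []
    for i, row in enumerate(rows):
        for column, value in row.items():
            # bool is a subclass of int in Python, so exclude it explicitly
            if isinstance(value, int) and not isinstance(value, bool):
                if not INT64_MIN <= value <= INT64_MAX:
                    offenders.append((i, column, value))
    return offenders


# Hypothetical rows shaped like the list of dicts handed to from_pylist
rows = [
    {"document_id": "doc-1", "example_hash": 2**64 + 5},  # too large for 64 bits
    {"document_id": "doc-2", "example_hash": 42},
]
print(find_int64_overflows(rows))
```

Running a check like this on the list of rows just before the table is built would point at the offending column, which could then be cast to a string or given an explicit schema.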

Reproduction script

Code fragments to reproduce:

import sys

from data_processing.utils import ParamsUtils
from data_processing_ray.runtime.ray import RayTransformLauncher
from doc_chunk_transform_ray import DocChunkRayTransformConfiguration

# input_folder, output_folder, MY_CONFIG and STAGE are defined earlier in the notebook

# Prepare the command-line params
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
worker_options = {"num_cpus": MY_CONFIG.RAY_NUM_CPUS}
params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # orchestrator
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": MY_CONFIG.RAY_RUNTIME_WORKERS,
    # doc_chunk arguments
    # ...
}

# Pass the command-line params
sys.argv = ParamsUtils.dict_to_req(d=params)

# Create the launcher
launcher = RayTransformLauncher(DocChunkRayTransformConfiguration())
# Launch the transform
return_code = launcher.launch()

if return_code == 0:
    print(f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception("❌ Ray job failed")
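
Note that the log above ends with `execution result 0` even though both input files raised exceptions, so checking the return code alone may not catch this failure. A minimal sketch of an additional check, assuming a successful run should leave at least one `.parquet` file in the output folder; `count_parquet_outputs` is a hypothetical helper, not part of data-prep-kit:

```python
from pathlib import Path


def count_parquet_outputs(output_folder: str) -> int:
    """Count .parquet files written anywhere under output_folder."""
    folder = Path(output_folder)
    if not folder.is_dir():
        # A missing folder means nothing was written
        return 0
    return sum(1 for _ in folder.rglob("*.parquet"))
```

In the script above, the success condition could then become `return_code == 0 and count_parquet_outputs(output_folder) > 0`.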

Anything else

Input folder containing the data files that cause the error:

01_parquet_out.tar.gz

OS

MacOS (limited support)

Python

3.11.x

touma-I commented 1 week ago

@sujee If you are interested in testing the latest dev release, which includes this fix, please pip install 0.2.2.dev2. cc: @shahrokhDaijavad

touma-I commented 1 week ago

Ran the test successfully using the RAG notebook. This issue can now be closed. cc: @shahrokhDaijavad @dolfim-ibm