IBM / data-prep-kit

Open source project for data preparation of LLM application builders
https://ibm.github.io/data-prep-kit/
Apache License 2.0

[Bug] Example notebook finding no input files in ededup #128

Open daw3rd opened 1 month ago

daw3rd commented 1 month ago

### Component

Other

### What happened + What you expected to happen

The ededup section of the example notebook logs an error saying there are no input files to process.

### Reproduction script

Download the zip from the data-prep-kit repo to /Users/dawood/Downloads/data-prep-kit-dev.zip, then:

mkdir /tmp/example
cp data-prep-kit-dev.zip /tmp/example
git clone ...
cd data-prep-kit/examples
make venv
make jupyter

Edit the notebook:

zip_input_folder = "/tmp/example"

Run the notebook through the ededup section. The logged messages say there are no input files:


13:26:29 INFO - Running locally
13:26:29 INFO - exact dedup params are {'hash_cpu': 0.5, 'num_hashes': 2, 'doc_column': 'contents'}
13:26:29 INFO - data factory data_ is using local data access: input_folder - test-data/parquet_input output_folder - test-data/ededup_out
13:26:29 INFO - data factory data_ max_files -1, n_sample -1
13:26:29 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet']
13:26:29 INFO - number of workers 3 worker options {'num_cpus': 0.8}
13:26:29 INFO - pipeline id pipeline_id; number workers 3
13:26:29 INFO - job details {'job category': 'preprocessing', 'job name': 'ededup', 'job type': 'ray', 'job id': 'job_id'}
13:26:29 INFO - code location None
13:26:29 INFO - actor creation delay 0
2024-05-14 13:26:31,387 INFO worker.py:1715 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
(orchestrate pid=24207) 13:26:32 INFO - orchestrator started at 2024-05-14 13:26:32
(orchestrate pid=24207) 13:26:32 ERROR - No input files to process - exiting
13:26:42 INFO - Completed execution in 0.21104646523793538 min, execution result 0

### Anything else

_No response_

### OS

MacOS (limited support)

### Python

3.10.x

### Are you willing to submit a PR?

- [ ] Yes I am willing to submit a PR!

blublinsky commented 1 month ago

I do not think this is a bug. The input folder is not configured correctly in this case. The run reports that:

data factory data_ is using local data access: input_folder - test-data/parquet_input output_folder - test-data/ededup_out

whereas your input folder should be /tmp/example
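
One way to confirm this from the notebook is to look at what the configured folder resolves to and whether it holds any matching files. The snippet below is only a sketch: `find_input_files` is a hypothetical helper, not part of data-prep-kit, and the `.parquet` extension is taken from the "files to use" line in the log above.

```python
from pathlib import Path


def find_input_files(input_folder, extensions=(".parquet",)):
    """Hypothetical helper: resolve the folder and list files with the given extensions."""
    # A relative path such as test-data/parquet_input resolves against the
    # current working directory, which may not be where the data actually is.
    folder = Path(input_folder).expanduser().resolve()
    if not folder.is_dir():
        return folder, []
    files = sorted(
        p for p in folder.rglob("*") if p.is_file() and p.suffix in extensions
    )
    return folder, files


# Compare the folder reported in the log with the folder from the reproduction steps.
for candidate in ("test-data/parquet_input", "/tmp/example"):
    folder, files = find_input_files(candidate)
    print(f"{candidate} -> {folder}: {len(files)} matching file(s)")
```

If the first line reports zero matching files, the run is pointed at the wrong folder or the folder has not been populated yet.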

daw3rd commented 1 month ago

@blublinsky Agreed, this is user error. However, we should expect users to make this sort of mistake and help them fix it.

blublinsky commented 1 month ago

@daw3rd Agreed, but this is not a bug. We can open an enhancement request for better error handling, but it should not be classified as a bug.
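
As a rough illustration of what such an enhancement could look like, the sketch below builds a more descriptive message than "No input files to process - exiting". It is an assumption about a possible approach, not the project's implementation: `explain_missing_input` is a hypothetical name, and the wording of the messages is made up for this example.

```python
from pathlib import Path


def explain_missing_input(input_folder, extensions=(".parquet",)):
    """Hypothetical helper: describe why no input files were found in a folder."""
    folder = Path(input_folder).expanduser().resolve()
    wanted = ", ".join(extensions)
    if not folder.exists():
        return (
            f"No input files to process: input folder {folder} does not exist. "
            "Check the configured input folder."
        )
    if not folder.is_dir():
        return f"No input files to process: {folder} is not a directory."
    # Report which extensions are present so the user can spot a mismatch.
    present = sorted(
        {p.suffix or "(no extension)" for p in folder.iterdir() if p.is_file()}
    )
    found = ", ".join(present) if present else "no files"
    return (
        f"No input files to process: {folder} contains no files matching "
        f"{wanted} (found: {found})."
    )


print(explain_missing_input("test-data/parquet_input"))
```

Printing the resolved absolute path and the extensions actually present would let a user see at a glance that the run is reading from the default folder rather than the one they intended.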

Bytes-Explorer commented 1 month ago

@shivdeep-singh-ibm Has this been done?