IBM / data-prep-kit

Open source project for data preparation of LLM application builders
https://ibm.github.io/data-prep-kit/
Apache License 2.0

[Bug] Notebook example not finding input files during ededup when file not directory is specified. #127

Closed by daw3rd 1 month ago

daw3rd commented 5 months ago

Component

Other

What happened + What you expected to happen

I expected to be able to run the example notebook successfully. Instead, the ededup step exits with "No input files to process".

Reproduction script

Download zip from data-prep-kit repo into /Users/dawood/Downloads/data-prep-kit-dev.zip

git clone ...
cd data-prep-kit/examples
make venv
make jupyter

Edit notebook

zip_input_folder = "/Users/dawood/Downloads/data-prep-kit-dev.zip"

Run through to the ededup section, which produces:

13:12:31 INFO - Running locally
13:12:31 INFO - exact dedup params are {'hash_cpu': 0.5, 'num_hashes': 2, 'doc_column': 'contents'}
13:12:31 INFO - data factory data_ is using local data access: input_folder - test-data/parquet_input output_folder - test-data/ededup_out
13:12:31 INFO - data factory data_ max_files -1, n_sample -1
13:12:31 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet']
13:12:31 INFO - number of workers 3 worker options {'num_cpus': 0.8}
13:12:31 INFO - pipeline id pipeline_id; number workers 3
13:12:31 INFO - job details {'job category': 'preprocessing', 'job name': 'ededup', 'job type': 'ray', 'job id': 'job_id'}
13:12:31 INFO - code location None
13:12:31 INFO - actor creation delay 0
2024-05-14 13:12:33,666 INFO worker.py:1715 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
(orchestrate pid=23909) 13:12:34 INFO - orchestrator started at 2024-05-14 13:12:34
(orchestrate pid=23909) 13:12:34 ERROR - No input files to process - exiting
13:12:44 INFO - Completed execution in 0.21483153502146404 min, execution result 0

Anything else

In the end, this was a user error, in that I set input_folder to the name of the zip file. Perhaps a check to make sure input_folder is a directory and not a file?
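
The suggested check could look something like the sketch below. This is only an illustration: `validate_input_folder` is a hypothetical helper name, not an existing data-prep-kit function, and it assumes the input is a local filesystem path. It rejects a path that points to a file (such as a zip archive) before any processing starts, instead of letting the run end with "No input files to process".

```python
from pathlib import Path


def validate_input_folder(input_folder: str) -> Path:
    """Hypothetical helper (not part of data-prep-kit).

    Returns input_folder as a Path if it is an existing directory,
    otherwise raises a descriptive error.
    """
    path = Path(input_folder)
    if path.is_file():
        # Covers the case in this issue: a zip file passed as the folder.
        raise NotADirectoryError(
            f"input_folder must be a directory, but {input_folder!r} is a file"
        )
    if not path.is_dir():
        raise FileNotFoundError(
            f"input_folder {input_folder!r} does not exist"
        )
    return path
```

Called with the zip file path from the reproduction steps, a check like this would fail immediately with a message naming the offending path.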

OS

MacOS (limited support)

Python

3.10.x

blublinsky commented 5 months ago

@daw3rd How is this a bug? It specifically says zip_input_folder =, not zip_input_file =. Again, we can ask for an enhancement for better error handling, but it's definitely not a bug.

daw3rd commented 1 month ago

Closing this one since the notebooks have been redesigned and tested.