IBM / data-prep-kit

Open source project for data preparation of LLM application builders
https://ibm.github.io/data-prep-kit/
Apache License 2.0

[Bug] Error running lang-id workflow #285

Closed revit13 closed 4 months ago

revit13 commented 4 months ago

Component

KFP workflows

What happened + What you expected to happen

Running make PYTHON=python3.10 workflow-test in the transforms/language/lang_id directory produces the following error in the execute Ray job component:

05:43:02 WARNING - ERROR: 'Field "contents" does not exist in schema', skipping the file

The component is marked green despite the error.

Full log of the execute Ray job component:

time="2024-06-17T05:42:14.403Z" level=info msg="capturing logs" argo=true
05:42:20 INFO - request to execute: python lang_id_transform_ray.py --data_max_files=-1 --data_num_samples=-1 --data_s3_config="{'input_folder': 'test/lang_id/input/', 'output_folder': 'test/lang_id/output/'}" --lang_id_content_column_name="contents" --lang_id_model_credential="PUT YOUR OWN HUGGINGFACE CREDENTIAL" --lang_id_model_kind="fasttext" --lang_id_model_url="facebook/fasttext-language-identification" --runtime_code_location="{'github': 'github', 'commit_hash': '12345', 'path': 'path'}" --runtime_job_id="290c13d0-a50c-465f-8da0-db8b7fb7711d" --runtime_num_workers="4" --runtime_pipeline_id="pipeline_id" --runtime_worker_options="{'num_cpus': 0.8}" --data_s3_cred="{'access_key': 'minio', 'secret_key': 'minio123', 'url': 'http://minio-service.kubeflow.svc.cluster.local:9000/'}"
05:42:25 INFO - submitted job successfully, submission id raysubmit_3bby7V8KemepbfwL
05:42:25 INFO - data factory data_ is using S3 data access: input path - test/lang_id/input/, output path - test/lang_id/output/
05:42:25 INFO - data factory data_ max_files -1, n_sample -1
05:42:25 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
05:42:28 INFO - job status is RUNNING
05:42:27 INFO - Launching lang_id transform
05:42:27 INFO - connecting to existing cluster
05:42:27 INFO - lang_id parameters are : {'model_credential': 'PUT YOUR OWN HUGGINGFACE CREDENTIAL', 'model_kind': 'fasttext', 'model_url': 'facebook/fasttext-language-identification', 'content_column_name': 'contents'}
05:42:27 INFO - data factory data_ is using S3 data access: input path - test/lang_id/input/, output path - test/lang_id/output/
05:42:27 INFO - data factory data_ max_files -1, n_sample -1
05:42:27 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
05:42:27 INFO - pipeline id pipeline_id
05:42:27 INFO - code location {'github': 'github', 'commit_hash': '12345', 'path': 'path'}
05:42:27 INFO - number of workers 4 worker options {'num_cpus': 0.8}
05:42:27 INFO - actor creation delay 0
05:42:27 INFO - job details {'job category': 'preprocessing', 'job name': 'lang_id', 'job type': 'ray', 'job id': '290c13d0-a50c-465f-8da0-db8b7fb7711d'}
05:42:27 INFO - Connecting to the existing Ray cluster
2024-06-17 05:42:27,552 INFO client_builder.py:243 -- Passing the following kwargs to ray.init() on the server: ignore_reinit_error
(orchestrate pid=257, ip=10.244.1.22) 05:42:31 INFO - orchestrator started at 2024-06-17 05:42:31
(orchestrate pid=257, ip=10.244.1.22) 05:42:31 INFO - Number of files is 3, source profile {'max_file_size': 0.3023223876953125, 'min_file_size': 0.037346839904785156, 'total_file_size': 0.4433746337890625}
(orchestrate pid=257, ip=10.244.1.22) 05:42:31 INFO - Cluster resources: {'cpus': 4, 'gpus': 0, 'memory': 12.0, 'object_store': 3.151419066824019}
(orchestrate pid=257, ip=10.244.1.22) 05:42:31 INFO - Number of workers - 4 with {'num_cpus': 0.8} each
(orchestrate pid=257, ip=10.244.1.22) 05:42:31 INFO - Completed 0 files in 1.2238820393880209e-05 min. Waiting for completion
(RayTransformFileProcessor pid=256, ip=10.244.2.27) Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.
(RayTransformFileProcessor pid=257, ip=10.244.2.27) Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.
(RayTransformFileProcessor pid=360, ip=10.244.1.22) Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.
(RayTransformFileProcessor pid=454, ip=10.244.1.22) Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.
(RayTransformFileProcessor pid=257, ip=10.244.2.27) 05:42:59 ERROR - Not all required columns are present in the table - required ['ft_lang', 'ft_score'], present ['text', 'count()']
(RayTransformFileProcessor pid=257, ip=10.244.2.27) 05:42:59 WARNING - ERROR: 'Field "contents" does not exist in schema', skipping the file
(orchestrate pid=257, ip=10.244.1.22) 05:43:02 INFO - Completed processing in 0.5180193185806274 min
(orchestrate pid=257, ip=10.244.1.22) 05:43:02 INFO - done flushing in 0.0028450489044189453 sec
(RayTransformFileProcessor pid=360, ip=10.244.1.22) 05:43:02 ERROR - Not all required columns are present in the table - required ['ft_lang', 'ft_score'], present ['text', 'count()']
(RayTransformFileProcessor pid=360, ip=10.244.1.22) 05:43:02 WARNING - ERROR: 'Field "contents" does not exist in schema', skipping the file
(RayTransformFileProcessor pid=454, ip=10.244.1.22) 05:43:02 ERROR - Not all required columns are present in the table - required ['ft_lang', 'ft_score'], present ['text', 'count()']
(RayTransformFileProcessor pid=454, ip=10.244.1.22) 05:43:02 WARNING - ERROR: 'Field "contents" does not exist in schema', skipping the file
05:43:12 INFO - Completed execution in 0.7518523534138998 min, execution result 0
05:43:30 INFO - Job completed with execution status SUCCEEDED
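
The log above ends with execution result 0 and status SUCCEEDED even though three RayTransformFileProcessor workers logged "skipping the file". As a rough illustration of how a post-run check could surface this, the sketch below scans a saved copy of the log text for those warnings and returns a non-zero code when any are found. The function names, the regular expression, and the "fail on any skipped file" policy are assumptions made for this example; they are not part of data-prep-kit or the KFP component.

```python
import re

# Assumed pattern, modelled on the warning lines in the log above, e.g.
#   ... WARNING - ERROR: 'Field "contents" does not exist in schema', skipping the file
SKIP_PATTERN = re.compile(r"WARNING - ERROR: (?P<reason>.+), skipping the file\s*$")


def find_skipped_files(log_text: str) -> list[str]:
    """Return the reason string of every 'skipping the file' warning in the log."""
    reasons = []
    for line in log_text.splitlines():
        match = SKIP_PATTERN.search(line)
        if match:
            reasons.append(match.group("reason"))
    return reasons


def exit_code_for(log_text: str) -> int:
    """Hypothetical policy: return 1 if any file was skipped, otherwise 0."""
    return 1 if find_skipped_files(log_text) else 0


if __name__ == "__main__":
    # Two lines copied from the log in this issue.
    sample = "\n".join(
        [
            "(RayTransformFileProcessor pid=257, ip=10.244.2.27) 05:42:59 WARNING - ERROR: "
            "'Field \"contents\" does not exist in schema', skipping the file",
            "05:43:30 INFO - Job completed with execution status SUCCEEDED",
        ]
    )
    for reason in find_skipped_files(sample):
        print("skipped:", reason)
    print("exit code:", exit_code_for(sample))
```

A check like this only detects the symptom. The underlying mismatch in this run is that the workflow passes --lang_id_content_column_name="contents" while the warning says the input files have no "contents" field.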

Reproduction script

Run make PYTHON=python3.10 workflow-test in the transforms/language/lang_id directory.

Anything else

No response

OS

Ubuntu

Python

3.10.x

blublinsky commented 4 months ago

This has been fixed

revit13 commented 4 months ago

Hmm, I ran it yesterday and still got this error...