@daw3rd I created an issue and am writing up a few follow-up experiments I performed, which I think are very relevant to this problem.
1. Start a Ray cluster in the virtual env by activating the venv in the tokenization/ray directory and running ray start --head, then pass the --run_locally False flag in the make command and run make run-cli-sample. And that works! I am able to connect to the Ray cluster remotely. However, there is a problem with handling input file paths, and I get a new error:
(venv) [kanishka@ml-pipelines ray]$ make run-cli-sample
make \
RUN_ARGS="--run_locally False --data_local_config \"{ 'input_folder' : '../test-data/ds01/input', 'output_folder' : '../output'}\" \
make[1]: Entering directory '/mnt/xvdc/work/data-prep-kit/transforms/universal/tokenization/ray'
source venv/bin/activate; \
cd src; \
python --run_locally False --data_local_config "{ 'input_folder' : '../test-data/ds01/input', 'output_folder' : '../output'}"
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
23:08:59 INFO - Launching Tokenization transform
23:08:59 INFO - connecting to existing cluster
23:08:59 INFO - data factory data_ is using local data access: input_folder - ../test-data/ds01/input output_folder - ../output
23:08:59 INFO - data factory data_ max_files -1, n_sample -1
23:08:59 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
23:08:59 INFO - pipeline id pipeline_id
23:08:59 INFO - code location None
23:08:59 INFO - number of workers 1 worker options {'num_cpus': 0.8, 'max_restarts': -1}
23:08:59 INFO - actor creation delay 0
23:08:59 INFO - job details {'job category': 'preprocessing', 'job name': 'Tokenization', 'job type': 'ray', 'job id': 'job_id'}
23:08:59 INFO - Connecting to the existing Ray cluster
2024-07-02 23:08:59,421 INFO -- Passing the following kwargs to ray.init() on the server: ignore_reinit_error
SIGTERM handler is not set because current thread is not the main thread.
(orchestrate pid=510039) None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
(orchestrate pid=510039) 23:09:04 INFO - orchestrator started at 2024-07-02 23:09:04
(orchestrate pid=510039) 23:09:04 ERROR - No input files to process - exiting
23:09:14 INFO - Completed execution in 0.24884503682454426 min, execution result 0
make[1]: Leaving directory '/mnt/xvdc/work/data-prep-kit/transforms/universal/tokenization/ray'
2. With a Ray cluster started the same way (activating the venv in the tokenization/ray directory and running ray start --head), I tried make run-s3-sample to see whether minio could solve my file path issue, changing the launcher params to "run_locally": False. I faced the same issue: I was able to connect to the cluster, but accessing the files failed:
(venv) [kanishka@ml-pipelines ray]$ make run-s3-sample
make .defaults.minio.verify-running
make[1]: Entering directory '/mnt/xvdc/work/data-prep-kit/transforms/universal/tokenization/ray'
make[1]: Leaving directory '/mnt/xvdc/work/data-prep-kit/transforms/universal/tokenization/ray'
make[1]: Entering directory '/mnt/xvdc/work/data-prep-kit/transforms/universal/tokenization/ray'
source venv/bin/activate; \
cd src; \
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
23:16:17 INFO - connecting to existing cluster
23:16:17 INFO - data factory data_ is using S3 data access: input path - test/tokenization/ds01/input, output path - test/tokenization/ds01/output
23:16:17 INFO - data factory data_ max_files -1, n_sample -1
23:16:17 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
23:16:17 INFO - pipeline id pipeline_id
23:16:17 INFO - code location {'github': 'github', 'commit_hash': '12345', 'path': 'path'}
23:16:17 INFO - number of workers 3 worker options {'num_cpus': 0.8, 'max_restarts': -1}
23:16:17 INFO - actor creation delay 0
23:16:17 INFO - job details {'job category': 'preprocessing', 'job name': 'Tokenization', 'job type': 'ray', 'job id': 'job_id'}
23:16:17 INFO - Connecting to the existing Ray cluster
2024-07-02 23:16:17,296 INFO -- Passing the following kwargs to ray.init() on the server: ignore_reinit_error
SIGTERM handler is not set because current thread is not the main thread.
(orchestrate pid=510035) None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
(orchestrate pid=510035) 23:16:23 INFO - orchestrator started at 2024-07-02 23:16:23
(orchestrate pid=510035) 23:16:23 ERROR - No input files to process - exiting
23:16:33 INFO - Completed execution in 0.2762212514877319 min, execution result 0
make[1]: Leaving directory '/mnt/xvdc/work/data-prep-kit/transforms/universal/tokenization/ray'
You may want to stop the minio server now (see make help)
So I am guessing the handling of local files is the issue here. Let me know if there are any potential workarounds I could try.
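One workaround that might be worth trying for the local-folder case is passing absolute paths instead of relative ones. This is only a sketch under an assumption: that the remote orchestrator resolves '../test-data/ds01/input' against its own working directory rather than against src. The helper name absolutize_local_config is hypothetical and not part of dpk. It would not explain the S3 run, which fails the same way.

```python
import ast
import os


def absolutize_local_config(config_literal: str, base_dir: str) -> str:
    """Return the --data_local_config literal with folder paths made absolute.

    config_literal is the dict literal passed on the command line;
    base_dir is the directory the relative paths were written against.
    """
    config = ast.literal_eval(config_literal)
    for key in ("input_folder", "output_folder"):
        if key in config:
            # os.path.join leaves already-absolute paths unchanged.
            config[key] = os.path.abspath(os.path.join(base_dir, config[key]))
    return repr(config)


if __name__ == "__main__":
    original = "{ 'input_folder' : '../test-data/ds01/input', 'output_folder' : '../output'}"
    # The Makefile runs the launcher from src, so the current directory is used as base.
    print(absolutize_local_config(original, os.getcwd()))
```

The printed value could then be passed as the --data_local_config argument in place of the relative version.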
3. The last thing I tried was to deactivate the venv and run a Ray cluster locally on my machine, outside of the venv, using the same Ray version (v2.24.0) used in dpk. I did not get the JobConfig error, but I still was not able to connect to it. Logs:
(venv) [kanishka@ml-pipelines ray]$ make run-cli-sample
make \
RUN_ARGS="--run_locally False --data_local_config \"{ 'input_folder' : '../test-data/ds01/input', 'output_folder' : '../output'}\" \
make[1]: Entering directory '/mnt/xvdc/work/data-prep-kit/transforms/universal/tokenization/ray'
source venv/bin/activate; \
cd src; \
python --run_locally False --data_local_config "{ 'input_folder' : '../test-data/ds01/input', 'output_folder' : '../output'}"
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
22:56:37 INFO - Launching Tokenization transform
22:56:37 INFO - connecting to existing cluster
22:56:37 INFO - data factory data_ is using local data access: input_folder - ../test-data/ds01/input output_folder - ../output
22:56:37 INFO - data factory data_ max_files -1, n_sample -1
22:56:37 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
22:56:37 INFO - pipeline id pipeline_id
22:56:37 INFO - code location None
22:56:37 INFO - number of workers 1 worker options {'num_cpus': 0.8, 'max_restarts': -1}
22:56:37 INFO - actor creation delay 0
22:56:37 INFO - job details {'job category': 'preprocessing', 'job name': 'Tokenization', 'job type': 'ray', 'job id': 'job_id'}
22:56:37 INFO - Connecting to the existing Ray cluster
2024-07-02 22:56:37,488 INFO -- Passing the following kwargs to ray.init() on the server: ignore_reinit_error
SIGTERM handler is not set because current thread is not the main thread.
Put failed:
22:56:41 INFO - Exception running ray remote orchestration
No module named 'data_processing_ray'
22:56:41 INFO - Completed execution in 0.06163370609283447 min, execution result 1
make[1]: *** [../../../../.make.defaults:374:] Error 1
make[1]: Leaving directory '/mnt/xvdc/work/data-prep-kit/transforms/universal/tokenization/ray'
make: *** [Makefile:43: run-cli-sample] Error 2
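The "No module named 'data_processing_ray'" error suggests the cluster started outside the venv does not have the dpk packages installed. One possible direction (an untested sketch, not how dpk currently connects) is Ray's runtime_env option, which can ship local Python packages to the cluster at connection time. The helper names, the package directory, and the address below are all assumptions for illustration.

```python
import os


def build_runtime_env(module_dirs):
    """Build a Ray runtime_env dict that uploads local package directories.

    Each entry should be the directory of an importable package
    (the folder that contains __init__.py).
    """
    return {"py_modules": [os.path.abspath(d) for d in module_dirs]}


def connect(address, module_dirs):
    """Connect to an existing cluster and ship the given packages to its workers."""
    import ray  # imported lazily so the helper above works without Ray installed

    return ray.init(
        address=address,
        runtime_env=build_runtime_env(module_dirs),
        ignore_reinit_error=True,
    )


if __name__ == "__main__":
    # Hypothetical location of the data_processing_ray package; adjust to the real checkout.
    env = build_runtime_env(["path/to/data_processing_ray"])
    print(env)
    # connect("ray://127.0.0.1:10001", ["path/to/data_processing_ray"])  # assumed Ray Client address
```

Installing the same packages into the environment that runs ray start --head would be the simpler alternative when the cluster is on the same machine.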
So the current problems in dpk are:
- It is not possible to connect to a remote Ray cluster to execute a transform (a feature that is useful when a computationally large transform needs to run in a distributed env).
- Data passing between local storage and a remote Ray runtime is not handled well (or maybe there are some config changes I am missing, in which case feel free to correct me :) ).
