ERROR: test_2_pipeline_parallel (__main__.InstallationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/cery/alpa/alpa/test_install.py", line 33, in test_2_pipeline_parallel
    init(cluster="ray")
  File "/home/cery/alpa/alpa/api.py", line 59, in init
    init_global_cluster(cluster, cluster_address, num_nodes,
  File "/home/cery/alpa/alpa/device_mesh.py", line 2326, in init_global_cluster
    ray.init(address=ray_addr,
  File "/home/cery/anaconda3/envs/py38/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/cery/anaconda3/envs/py38/lib/python3.8/site-packages/ray/_private/worker.py", line 1339, in init
    bootstrap_address = services.canonicalize_bootstrap_address(address, _temp_dir)
  File "/home/cery/anaconda3/envs/py38/lib/python3.8/site-packages/ray/_private/services.py", line 450, in canonicalize_bootstrap_address
    addr = get_ray_address_from_environment(addr, temp_dir)
  File "/home/cery/anaconda3/envs/py38/lib/python3.8/site-packages/ray/_private/services.py", line 341, in get_ray_address_from_environment
    raise ConnectionError(
ConnectionError: Could not find any running Ray instance. Please specify the one to connect to by setting `--address` flag or `RAY_ADDRESS` environment variable.
Please describe the expected behavior
System information and environment
OS Platform and Distribution (e.g., Linux Ubuntu 16.04, docker): Linux
Python version: 3.8.10
CUDA version: 11.3
NCCL version: 2.6.1
cupy version: cupy-11x
GPU model and memory: 4090 24G
Alpa version: 0.2.3
TensorFlow version:
JAX version: 3.2.2
To Reproduce
Steps to reproduce the behavior:
Run python3 -m alpa.test_install
See the error above
Code snippet to reproduce the problem
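The failure may be reproducible with Ray alone, without going through the Alpa test suite. The snippet below is a minimal sketch under that assumption: the traceback only shows `ray.init(address=ray_addr,`, so passing `address="auto"` here is a guess at what Alpa uses when no cluster address is given, and the helper name `try_connect` is hypothetical.

```python
def try_connect(address="auto"):
    """Try to attach to an already-running Ray instance and report the outcome.

    `address="auto"` is an assumption about what Alpa passes to ray.init
    when init(cluster="ray") is called without a cluster address.
    """
    try:
        import ray
    except ImportError:
        return "ray is not installed"

    try:
        # With an explicit address, Ray connects to an existing instance
        # instead of starting a new local one.
        ray.init(address=address)
    except ConnectionError as exc:
        # Same exception type as the last frame of the traceback above.
        return f"no running Ray instance: {exc}"

    # Connected successfully; disconnect again so the check has no side effects.
    ray.shutdown()
    return "connected"


if __name__ == "__main__":
    print(try_connect())
```

On a machine where no Ray head node has been started (for example with `ray start --head`), this is expected to take the "no running Ray instance" branch, which would match the error reported by test_2_pipeline_parallel.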