alpa-projects / alpa

Training and serving large-scale neural networks with auto parallelization.
https://alpa.ai
Apache License 2.0

4090 not supported #940

Closed cery999 closed 1 year ago

cery999 commented 1 year ago

Please describe the bug

ERROR: test_2_pipeline_parallel (__main__.InstallationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/cery/alpa/alpa/test_install.py", line 33, in test_2_pipeline_parallel
    init(cluster="ray")
  File "/home/cery/alpa/alpa/api.py", line 59, in init
    init_global_cluster(cluster, cluster_address, num_nodes,
  File "/home/cery/alpa/alpa/device_mesh.py", line 2326, in init_global_cluster
    ray.init(address=ray_addr,
  File "/home/cery/anaconda3/envs/py38/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/cery/anaconda3/envs/py38/lib/python3.8/site-packages/ray/_private/worker.py", line 1339, in init
    bootstrap_address = services.canonicalize_bootstrap_address(address, _temp_dir)
  File "/home/cery/anaconda3/envs/py38/lib/python3.8/site-packages/ray/_private/services.py", line 450, in canonicalize_bootstrap_address
    addr = get_ray_address_from_environment(addr, temp_dir)
  File "/home/cery/anaconda3/envs/py38/lib/python3.8/site-packages/ray/_private/services.py", line 341, in get_ray_address_from_environment
    raise ConnectionError(
ConnectionError: Could not find any running Ray instance. Please specify the one to connect to by setting `--address` flag or `RAY_ADDRESS` environment variable.

Please describe the expected behavior

System information and environment

To Reproduce

Steps to reproduce the behavior:

  1. Run python3 -m alpa.test_install
  2. See the error

ZYHowell commented 1 year ago

According to the error message, it seems you didn't launch Ray (run ray start --head on the head node).
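
For anyone hitting the same ConnectionError, here is a minimal sketch of a pre-flight check you could run before the install test. It only asks whether an already-running Ray instance can be reached. The helper name ray_cluster_running is hypothetical (it is not part of Alpa or Ray), and the sketch assumes that ray.init(address="auto") raises ConnectionError when no instance is found, as the traceback above suggests.

```python
def ray_cluster_running() -> bool:
    """Return True if an already-running Ray instance can be reached.

    Hypothetical helper for illustration; not part of Alpa or Ray.
    """
    try:
        import ray  # imported lazily so a missing install is reported as False
    except ImportError:
        return False
    try:
        # address="auto" attaches to an existing instance instead of starting
        # a new one; with no instance running it is expected to raise
        # ConnectionError, as in the traceback in this issue.
        ray.init(address="auto", ignore_reinit_error=True)
    except ConnectionError:
        return False
    ray.shutdown()  # disconnect again; this does not stop the cluster itself
    return True


if __name__ == "__main__":
    if ray_cluster_running():
        print("Ray is running; rerun: python3 -m alpa.test_install")
    else:
        print("No Ray instance found; start one with: ray start --head")
```

If it reports that no instance was found, start Ray with ray start --head on the head node and rerun the install test. Other failure modes (for example a stale RAY_ADDRESS value) may surface as different exceptions that this sketch does not handle.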