alpa-projects / alpa

Training and serving large-scale neural networks with auto parallelization.
https://alpa.ai
Apache License 2.0

Could not find any running Ray instance. #762

Closed RuoyuChen10 closed 1 year ago

RuoyuChen10 commented 1 year ago

Please describe the bug

Please describe the expected behavior

System information and environment

To Reproduce

Steps to reproduce the behavior:

  1. Run the following demo:

     from transformers import AutoTokenizer
     from llm_serving.model.wrapper import get_model

     # Load the tokenizer
     tokenizer = AutoTokenizer.from_pretrained("facebook/opt-2.7b", cache_dir="./pretrained_language_model")
     tokenizer.add_bos_token = False

     # Load the model. Alpa automatically downloads the weights to the specified path
     model = get_model(model_name="alpa/opt-2.7b", path="./pretrained_language_model/")

  2. See error:
```shell
Traceback (most recent call last):
  File "/home/cry/anaconda3/envs/epod/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/cry/anaconda3/envs/epod/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/cry/data3/Expert-Prompted-Object-Detection/tools/test_language_model.py", line 9, in <module>
    model = get_model(model_name="alpa/opt-2.7b", path="./pretrained_language_model/")
  File "/home/cry/data3/Expert-Prompted-Object-Detection/alpa/examples/llm_serving/model/wrapper.py", line 643, in get_model
    return get_alpa_model(
  File "/home/cry/data3/Expert-Prompted-Object-Detection/alpa/examples/llm_serving/model/wrapper.py", line 439, in get_alpa_model
    alpa.init()
  File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/alpa/api.py", line 52, in init
    init_global_cluster(cluster, num_nodes, num_devices_per_node, namespace)
  File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/alpa/device_mesh.py", line 2255, in init_global_cluster
    ray.init(address="auto",
  File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/ray/worker.py", line 954, in init
    bootstrap_address = services.canonicalize_bootstrap_address(address)
  File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/ray/_private/services.py", line 451, in canonicalize_bootstrap_address
    addr = get_ray_address_from_environment()
  File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/ray/_private/services.py", line 358, in get_ray_address_from_environment
    addr = _find_gcs_address_or_die()
  File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/ray/_private/services.py", line 340, in _find_gcs_address_or_die
    raise ConnectionError(
ConnectionError: Could not find any running Ray instance. Please specify the one to connect to by setting `--address` flag or `RAY_ADDRESS` environment variable.
```

Code snippet to reproduce the problem

# Load the model. Alpa automatically downloads the weights to the specified path
model = get_model(model_name="alpa/opt-2.7b", path="./pretrained_language_model/")
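
(From the traceback above, `get_model` calls `alpa.init()`, which in turn calls `ray.init(address="auto")`, so a running Ray cluster, e.g. one started with `ray start --head`, is expected before this snippet runs.)
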
RuoyuChen10 commented 1 year ago

This is solved after I run:

ray start --head
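
For reference, a minimal in-process alternative (a sketch only, assuming `alpa.init(cluster="ray")` attaches to a Ray session that is already initialized in the same process, as the later comments in this thread suggest):

```python
import ray
from llm_serving.model.wrapper import get_model

# Sketch: start a local Ray instance from Python instead of running `ray start --head`.
# Assumption: alpa.init(cluster="ray"), which get_model calls internally, reuses an
# already-initialized Ray session instead of calling ray.init(address="auto").
ray.init(ignore_reinit_error=True)

model = get_model(model_name="alpa/opt-2.7b", path="./pretrained_language_model/")
```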

But after I run this script:

from transformers import AutoTokenizer
from llm_serving.model.wrapper import get_model

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-2.7b", cache_dir="./pretrained_language_model")
tokenizer.add_bos_token = False

# Load the model. Alpa automatically downloads the weights to the specified path
model = get_model(model_name="alpa/opt-2.7b", path="./pretrained_language_model/")

# Generate
prompt = "Paris is the capital city of"

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output = model.generate(input_ids=input_ids, max_length=256, do_sample=True)

when execution reaches this line:

output = model.generate(input_ids=input_ids, max_length=256, do_sample=True)

there is an error:

- Load parameters. elapsed: 7.33 second.
(MeshHostWorker pid=10032) 2022-10-31 10:54:17.056406: F external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_executable.cc:575] Check failed: !info.content.empty() 
(MeshHostWorker pid=10032) *** SIGABRT received at time=1667184857 on cpu 0 ***
(MeshHostWorker pid=10032) PC: @     0x7fad1b5e5e87  (unknown)  raise
(MeshHostWorker pid=10032)     @     0x7fad1c14f980  852956864  (unknown)
(MeshHostWorker pid=10032)     @     0x7fa836319808        800  xla::gpu::GpuExecutable::ResolveConstantGlobals()
(MeshHostWorker pid=10032)     @     0x7fa83778cfd4       2688  xla::gpu::GpuExecutable::ExecuteAsyncOnStreamImpl()
(MeshHostWorker pid=10032)     @     0x7fa83778ea7f        128  xla::gpu::GpuExecutable::ExecuteAsyncOnStream()
(MeshHostWorker pid=10032)     @     0x7fa83a295746       1312  xla::Executable::ExecuteAsyncOnStreamWrapper()
(MeshHostWorker pid=10032)     @     0x7fa8374a1220       2304  xla::LocalExecutable::RunAsync()
(MeshHostWorker pid=10032)     @     0x7fa8374a1980        272  xla::LocalExecutable::RunAsync()
(MeshHostWorker pid=10032)     @     0x7fa8373ea755       2528  xla::PjRtStreamExecutorExecutable::EnqueueExecution()
(MeshHostWorker pid=10032)     @     0x7fa8373ebba1       5184  xla::PjRtStreamExecutorExecutable::ExecuteHelper()
(MeshHostWorker pid=10032)     @     0x7fa8373eda0c        960  xla::PjRtStreamExecutorExecutable::Execute()
(MeshHostWorker pid=10032)     @     0x7fa837333f82        544  xla::PyExecutable::ExecuteShardedOnLocalDevices()
(MeshHostWorker pid=10032)     @     0x7fa836661d91        304  pybind11::cpp_function::initialize<>()::{lambda()#3}::operator()()
(MeshHostWorker pid=10032)     @     0x7fa83663de22        592  pybind11::cpp_function::dispatcher()
(MeshHostWorker pid=10032)     @           0x4dfd82  (unknown)  PyCFunction_Call
(MeshHostWorker pid=10032)     @           0x7164c0  (unknown)  (unknown)
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,105 E 10032 10032] logging.cc:325: *** SIGABRT received at time=1667184857 on cpu 0 ***
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,105 E 10032 10032] logging.cc:325: PC: @     0x7fad1b5e5e87  (unknown)  raise
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,106 E 10032 10032] logging.cc:325:     @     0x7fad1c14f980  852956864  (unknown)
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,106 E 10032 10032] logging.cc:325:     @     0x7fa836319808        800  xla::gpu::GpuExecutable::ResolveConstantGlobals()
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,106 E 10032 10032] logging.cc:325:     @     0x7fa83778cfd4       2688  xla::gpu::GpuExecutable::ExecuteAsyncOnStreamImpl()
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,106 E 10032 10032] logging.cc:325:     @     0x7fa83778ea7f        128  xla::gpu::GpuExecutable::ExecuteAsyncOnStream()
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,106 E 10032 10032] logging.cc:325:     @     0x7fa83a295746       1312  xla::Executable::ExecuteAsyncOnStreamWrapper()
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,106 E 10032 10032] logging.cc:325:     @     0x7fa8374a1220       2304  xla::LocalExecutable::RunAsync()
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,106 E 10032 10032] logging.cc:325:     @     0x7fa8374a1980        272  xla::LocalExecutable::RunAsync()
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,106 E 10032 10032] logging.cc:325:     @     0x7fa8373ea755       2528  xla::PjRtStreamExecutorExecutable::EnqueueExecution()
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,106 E 10032 10032] logging.cc:325:     @     0x7fa8373ebba1       5184  xla::PjRtStreamExecutorExecutable::ExecuteHelper()
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,106 E 10032 10032] logging.cc:325:     @     0x7fa8373eda0c        960  xla::PjRtStreamExecutorExecutable::Execute()
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,106 E 10032 10032] logging.cc:325:     @     0x7fa837333f82        544  xla::PyExecutable::ExecuteShardedOnLocalDevices()
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,106 E 10032 10032] logging.cc:325:     @     0x7fa836661d91        304  pybind11::cpp_function::initialize<>()::{lambda()#3}::operator()()
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,106 E 10032 10032] logging.cc:325:     @     0x7fa83663de22        592  pybind11::cpp_function::dispatcher()
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,106 E 10032 10032] logging.cc:325:     @           0x4dfd82  (unknown)  PyCFunction_Call
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,108 E 10032 10032] logging.cc:325:     @           0x7164c0  (unknown)  (unknown)
(MeshHostWorker pid=10032) Fatal Python error: Aborted
(MeshHostWorker pid=10032) 
(MeshHostWorker pid=10032) Stack (most recent call first):
(MeshHostWorker pid=10032)   File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/alpa/mesh_executable.py", line 452 in execute_on_worker
(MeshHostWorker pid=10032)   File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/alpa/mesh_executable.py", line 1001 in execute_on_worker
(MeshHostWorker pid=10032)   File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/alpa/device_mesh.py", line 272 in run_executable
(MeshHostWorker pid=10032)   File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 462 in _resume_span
(MeshHostWorker pid=10032)   File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/alpa/pipeline_parallel/pipeshard_executable.py", line 502 in execute_on_worker
(MeshHostWorker pid=10032)   File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/alpa/device_mesh.py", line 272 in run_executable
(MeshHostWorker pid=10032)   File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 462 in _resume_span
(MeshHostWorker pid=10032)   File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/ray/_private/function_manager.py", line 675 in actor_method_executor
(MeshHostWorker pid=10032)   File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/ray/worker.py", line 451 in main_loop
(MeshHostWorker pid=10032)   File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/ray/workers/default_worker.py", line 238 in <module>
2022-10-31 10:54:17,296 WARNING worker.py:1404 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff542d95af111d74a1604f8b9f01000000 Worker ID: 9a3273284d6c0f6e937713462667985736af6337a1c29e3202ce7af5 Node ID: 2ffe6dd1f9bf540b4e18f5c5fa824cbb0bf0c8940855777e929de479 Worker IP address: 192.168.114.62 Worker port: 10007 Worker PID: 10032
Traceback (most recent call last):
  File "/home/cry/anaconda3/envs/epod/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/cry/anaconda3/envs/epod/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/cry/data3/Expert-Prompted-Object-Detection/tools/test_language_model.py", line 15, in <module>
    output = model.generate(input_ids=input_ids, max_length=256, do_sample=True)
  File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/transformers/generation_utils.py", line 1422, in generate
    return self.sample(
  File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/transformers/generation_utils.py", line 2035, in sample
    outputs = self(
  File "/home/cry/data3/Expert-Prompted-Object-Detection/alpa/examples/llm_serving/model/wrapper.py", line 110, in __call__
    ret = self.inference_func(input_ids,
  File "/home/cry/data3/Expert-Prompted-Object-Detection/alpa/examples/llm_serving/model/wrapper.py", line 585, in inference_func
    logits_step = torch.from_numpy(np.array(output.logits)).to(torch_device).float()
  File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/alpa/device_mesh.py", line 1610, in __array__
    return np.asarray(self._value, dtype=dtype)
  File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/alpa/device_mesh.py", line 1596, in _value
    fetched_np_buffers = self.device_mesh.get_remote_buffers(
  File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/alpa/device_mesh.py", line 1164, in get_remote_buffers
    self.workers[host_id].get_buffers.remote(
TypeError: 'NoneType' object is not subscriptable

How do I deal with it? Thank you.

zhisbug commented 1 year ago

Can you pass the Alpa installation tests (https://alpa.ai/install.html#check-installation)?

zhisbug commented 1 year ago

@RuoyuChen10 is your problem solved? Can you pass the alpa install tests?

zhisbug commented 1 year ago

@RuoyuChen10 check again to see -- have you solved this problem?

asmith26 commented 1 year ago

Hi @zhisbug,

Can you pass the Alpa installation tests (https://alpa.ai/install.html#check-installation)?

I'm trying to install Alpa using Kaggle Kernels (running GPU T4 x2), and when I run the Alpa installation tests I get the following error (I've tried running the two commands in separate notebook cells and in the same cell):

!ray start --head
!python -m alpa.test_install
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

Local node IP: 172.19.2.2

--------------------
Ray runtime started.
--------------------

Next steps
  To connect to this Ray runtime from another node, run
    ray start --address='172.19.2.2:6379'

  Alternatively, use the following Python code:
    import ray
    ray.init(address='auto')

  To connect to this Ray runtime from outside of the cluster, for example to
  connect to a remote cluster from your laptop directly, use the following
  Python code:
    import ray
    ray.init(address='ray://<head_node_ip_address>:10001')

  If connection fails, check your firewall settings and network configuration.

  To terminate the Ray runtime, run
    ray stop
2022-11-17 22:16:22.422757: W external/org_tensorflow/tensorflow/stream_executor/gpu/asm_compiler.cc:111] *** WARNING *** You are using ptxas 11.0.221, which is older than 11.1. ptxas before 11.1 is known to miscompile XLA code, leading to incorrect results or invalid-address errors.

You may not need to update to CUDA 11.1; cherry-picking the ptxas binary is often sufficient.
.E
======================================================================
ERROR: test_2_pipeline_parallel (__main__.InstallationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/alpa/test_install.py", line 33, in test_2_pipeline_parallel
    init(cluster="ray")
  File "/opt/conda/lib/python3.7/site-packages/alpa/api.py", line 52, in init
    init_global_cluster(cluster, num_nodes, num_devices_per_node, namespace)
  File "/opt/conda/lib/python3.7/site-packages/alpa/device_mesh.py", line 2257, in init_global_cluster
    namespace=namespace)
  File "/opt/conda/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/ray/worker.py", line 954, in init
    bootstrap_address = services.canonicalize_bootstrap_address(address)
  File "/opt/conda/lib/python3.7/site-packages/ray/_private/services.py", line 451, in canonicalize_bootstrap_address
    addr = get_ray_address_from_environment()
  File "/opt/conda/lib/python3.7/site-packages/ray/_private/services.py", line 358, in get_ray_address_from_environment
    addr = _find_gcs_address_or_die()
  File "/opt/conda/lib/python3.7/site-packages/ray/_private/services.py", line 341, in _find_gcs_address_or_die
    "Could not find any running Ray instance. "
ConnectionError: Could not find any running Ray instance. Please specify the one to connect to by setting `--address` flag or `RAY_ADDRESS` environment variable.

----------------------------------------------------------------------
Ran 2 tests in 6.641s

FAILED (errors=1)

I'm not sure if you can spot the error; many thanks for any help! :)

merrymercy commented 1 year ago

You can find this warning message in your log (quoted below). Could you update your CUDA toolkit or ptxas?

2022-11-17 22:16:22.422757: W external/org_tensorflow/tensorflow/stream_executor/gpu/asm_compiler.cc:111] *** WARNING *** You are using ptxas 11.0.221, which is older than 11.1. ptxas before 11.1 is known to miscompile XLA code, leading to incorrect results or invalid-address errors.

The Ray connection error "ConnectionError: Could not find any running Ray instance. Please specify the one to connect to by setting --address flag or RAY_ADDRESS environment variable." seems weird. I don't know whether Ray is compatible with Kaggle kernels. Maybe you can get some help from the Ray community (doc, help)?
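
(For what it's worth, `ptxas --version` shows which ptxas binary is on `PATH`, which is typically the one XLA picks up, and the warning itself notes that cherry-picking a ptxas binary from CUDA 11.1 or newer, rather than upgrading the whole toolkit, is often sufficient.)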

asmith26 commented 1 year ago

Thanks very much for the info @merrymercy. I think I've got a little further, but it does appear the CPU resources on Kaggle Kernels/Notebooks are very limited:

# CPU
model name  : Intel(R) Xeon(R) CPU @ 2.00GHz
cpu MHz     : 2000.140
cpu cores   : 1
MemTotal:       16390868 kB

# GPU
Sun Nov 20 14:39:13 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:00:05.0 Off |                    0 |
| N/A   40C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Here's my output in case you have any more ideas (thanks in advance for any help! :)


Ray seems to start fine:

$ ray start --head yields:

Local node IP: 172.19.2.2
--------------------
Ray runtime started.
--------------------
Next steps
...

I found that Kaggle does not seem to like ray.init(address="auto"), but it does accept:

>>> import ray
>>> ray.init(ignore_reinit_error=True, num_gpus=2, num_cpus=1)

RayContext(dashboard_url='', python_version='3.7.12', ray_version='1.13.0', ray_commit='e4ce38d001dbbe09cd21c497fedd03d692b2be3e', address_info={'node_ip_address': '172.19.2.2', 'raylet_ip_address': '172.19.2.2', 'redis_address': None, 'object_store_address': '/tmp/ray/session_2022-11-20_13-57-02_103301_8148/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2022-11-20_13-57-02_103301_8148/sockets/raylet', 'webui_url': '', 'session_dir': '/tmp/ray/session_2022-11-20_13-57-02_103301_8148', 'metrics_export_port': 56336, 'gcs_address': '172.19.2.2:55484', 'address': '172.19.2.2:55484', 'node_id': '9e0cc00fb9ab5ce2fb3b6b3a070598c3b35602ee98c1bc7335cc1f6d'})

Then I tried:

>>> from alpa.api import init
>>> init(cluster="ray", num_nodes=1, num_devices_per_node=1)

$ python -m alpa.test_install

2022-11-20 14:32:32.722239: W external/org_tensorflow/tensorflow/stream_executor/gpu/asm_compiler.cc:111] *** WARNING *** You are using ptxas 11.0.221, which is older than 11.1. ptxas before 11.1 is known to miscompile XLA code, leading to incorrect results or invalid-address errors.

You may not need to update to CUDA 11.1; cherry-picking the ptxas binary is often sufficient.
.E
======================================================================
ERROR: test_2_pipeline_parallel (__main__.InstallationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/alpa/test_install.py", line 33, in test_2_pipeline_parallel
    init(cluster="ray")
  File "/opt/conda/lib/python3.7/site-packages/alpa/api.py", line 52, in init
    init_global_cluster(cluster, num_nodes, num_devices_per_node, namespace)
  File "/opt/conda/lib/python3.7/site-packages/alpa/device_mesh.py", line 2260, in init_global_cluster
    namespace)
  File "/opt/conda/lib/python3.7/site-packages/alpa/device_mesh.py", line 2135, in __init__
    self.num_hosts, self.host_num_devices, pg_name)
  File "/opt/conda/lib/python3.7/site-packages/alpa/util.py", line 1534, in create_placement_group
    "Placement group creation timed out. Make sure your "
TimeoutError: Placement group creation timed out. Make sure your cluster either has enough resources or use an autoscaling cluster. If you are running on a cluster, make sure you specify an address in `ray.init()`, for example, `ray.init("auto")`. You can also increase the timeout by setting the ALPA_PLACEMENT_GROUP_TIMEOUT_S environment variable. Current resources available: {'node:172.19.2.2': 1.0, 'GPU': 1.0, 'memory': 9483205019.0, 'accelerator_type:T4': 1.0, 'bundle_group_259a5719bd85972bbd28a622470001000000': 1000.0, 'GPU_group_0_259a5719bd85972bbd28a622470001000000': 1.0, 'CPU_group_259a5719bd85972bbd28a622470001000000': 1.0, 'bundle_group_0_259a5719bd85972bbd28a622470001000000': 1000.0, 'object_store_memory': 4741602508.0, 'CPU_group_0_259a5719bd85972bbd28a622470001000000': 1.0, 'GPU_group_259a5719bd85972bbd28a622470001000000': 1.0}, resources requested by the placement group: [{'CPU': 1.0, 'GPU': 2.0}]

----------------------------------------------------------------------
Ran 2 tests in 108.457s

FAILED (errors=1)
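
Reading the timeout message above: the install test requests a bundle of `{'CPU': 1.0, 'GPU': 2.0}`, while the listed available resources suggest that the earlier in-notebook `init(cluster="ray", num_nodes=1, num_devices_per_node=1)` call still holds a placement group with one CPU and one GPU, so the request can never be scheduled. A minimal sketch of a retry under that assumption (the resource numbers are hypothetical and Kaggle-specific):

```python
import ray

# Assumption: a stale placement group from the earlier in-notebook alpa init is holding
# the only CPU and one GPU, which blocks the test's {'CPU': 1, 'GPU': 2} bundle request.
ray.shutdown()  # tear down the in-notebook Ray session and release its placement group

# Restart Ray with explicit logical resources that cover the bundle the test asks for.
ray.init(ignore_reinit_error=True, num_cpus=2, num_gpus=2)
```

If the separate `python -m alpa.test_install` process still cannot find this in-notebook session, `ray stop` followed by a fresh `ray start --head` from the shell (as in the first attempt) should also clear the stale placement group.
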
merrymercy commented 1 year ago

Closed due to inactivity.