Closed RuoyuChen10 closed 1 year ago
This was solved after I ran:
ray start --head
But then I run this:
from transformers import AutoTokenizer
from llm_serving.model.wrapper import get_model
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-2.7b", cache_dir="./pretrained_language_model")
tokenizer.add_bos_token = False
# Load the model. Alpa automatically downloads the weights to the specified path
model = get_model(model_name="alpa/opt-2.7b", path="./pretrained_language_model/")
# Generate
prompt = "Paris is the capital city of"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output = model.generate(input_ids=input_ids, max_length=256, do_sample=True)
When I run the command:
output = model.generate(input_ids=input_ids, max_length=256, do_sample=True)
I get this error:
- Load parameters. elapsed: 7.33 second.
(MeshHostWorker pid=10032) 2022-10-31 10:54:17.056406: F external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_executable.cc:575] Check failed: !info.content.empty()
(MeshHostWorker pid=10032) *** SIGABRT received at time=1667184857 on cpu 0 ***
(MeshHostWorker pid=10032) PC: @ 0x7fad1b5e5e87 (unknown) raise
(MeshHostWorker pid=10032) @ 0x7fad1c14f980 852956864 (unknown)
(MeshHostWorker pid=10032) @ 0x7fa836319808 800 xla::gpu::GpuExecutable::ResolveConstantGlobals()
(MeshHostWorker pid=10032) @ 0x7fa83778cfd4 2688 xla::gpu::GpuExecutable::ExecuteAsyncOnStreamImpl()
(MeshHostWorker pid=10032) @ 0x7fa83778ea7f 128 xla::gpu::GpuExecutable::ExecuteAsyncOnStream()
(MeshHostWorker pid=10032) @ 0x7fa83a295746 1312 xla::Executable::ExecuteAsyncOnStreamWrapper()
(MeshHostWorker pid=10032) @ 0x7fa8374a1220 2304 xla::LocalExecutable::RunAsync()
(MeshHostWorker pid=10032) @ 0x7fa8374a1980 272 xla::LocalExecutable::RunAsync()
(MeshHostWorker pid=10032) @ 0x7fa8373ea755 2528 xla::PjRtStreamExecutorExecutable::EnqueueExecution()
(MeshHostWorker pid=10032) @ 0x7fa8373ebba1 5184 xla::PjRtStreamExecutorExecutable::ExecuteHelper()
(MeshHostWorker pid=10032) @ 0x7fa8373eda0c 960 xla::PjRtStreamExecutorExecutable::Execute()
(MeshHostWorker pid=10032) @ 0x7fa837333f82 544 xla::PyExecutable::ExecuteShardedOnLocalDevices()
(MeshHostWorker pid=10032) @ 0x7fa836661d91 304 pybind11::cpp_function::initialize<>()::{lambda()#3}::operator()()
(MeshHostWorker pid=10032) @ 0x7fa83663de22 592 pybind11::cpp_function::dispatcher()
(MeshHostWorker pid=10032) @ 0x4dfd82 (unknown) PyCFunction_Call
(MeshHostWorker pid=10032) @ 0x7164c0 (unknown) (unknown)
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,105 E 10032 10032] logging.cc:325: *** SIGABRT received at time=1667184857 on cpu 0 ***
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,105 E 10032 10032] logging.cc:325: PC: @ 0x7fad1b5e5e87 (unknown) raise
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,106 E 10032 10032] logging.cc:325: @ 0x7fad1c14f980 852956864 (unknown)
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,106 E 10032 10032] logging.cc:325: @ 0x7fa836319808 800 xla::gpu::GpuExecutable::ResolveConstantGlobals()
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,106 E 10032 10032] logging.cc:325: @ 0x7fa83778cfd4 2688 xla::gpu::GpuExecutable::ExecuteAsyncOnStreamImpl()
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,106 E 10032 10032] logging.cc:325: @ 0x7fa83778ea7f 128 xla::gpu::GpuExecutable::ExecuteAsyncOnStream()
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,106 E 10032 10032] logging.cc:325: @ 0x7fa83a295746 1312 xla::Executable::ExecuteAsyncOnStreamWrapper()
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,106 E 10032 10032] logging.cc:325: @ 0x7fa8374a1220 2304 xla::LocalExecutable::RunAsync()
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,106 E 10032 10032] logging.cc:325: @ 0x7fa8374a1980 272 xla::LocalExecutable::RunAsync()
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,106 E 10032 10032] logging.cc:325: @ 0x7fa8373ea755 2528 xla::PjRtStreamExecutorExecutable::EnqueueExecution()
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,106 E 10032 10032] logging.cc:325: @ 0x7fa8373ebba1 5184 xla::PjRtStreamExecutorExecutable::ExecuteHelper()
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,106 E 10032 10032] logging.cc:325: @ 0x7fa8373eda0c 960 xla::PjRtStreamExecutorExecutable::Execute()
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,106 E 10032 10032] logging.cc:325: @ 0x7fa837333f82 544 xla::PyExecutable::ExecuteShardedOnLocalDevices()
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,106 E 10032 10032] logging.cc:325: @ 0x7fa836661d91 304 pybind11::cpp_function::initialize<>()::{lambda()#3}::operator()()
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,106 E 10032 10032] logging.cc:325: @ 0x7fa83663de22 592 pybind11::cpp_function::dispatcher()
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,106 E 10032 10032] logging.cc:325: @ 0x4dfd82 (unknown) PyCFunction_Call
(MeshHostWorker pid=10032) [2022-10-31 10:54:17,108 E 10032 10032] logging.cc:325: @ 0x7164c0 (unknown) (unknown)
(MeshHostWorker pid=10032) Fatal Python error: Aborted
(MeshHostWorker pid=10032)
(MeshHostWorker pid=10032) Stack (most recent call first):
(MeshHostWorker pid=10032) File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/alpa/mesh_executable.py", line 452 in execute_on_worker
(MeshHostWorker pid=10032) File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/alpa/mesh_executable.py", line 1001 in execute_on_worker
(MeshHostWorker pid=10032) File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/alpa/device_mesh.py", line 272 in run_executable
(MeshHostWorker pid=10032) File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 462 in _resume_span
(MeshHostWorker pid=10032) File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/alpa/pipeline_parallel/pipeshard_executable.py", line 502 in execute_on_worker
(MeshHostWorker pid=10032) File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/alpa/device_mesh.py", line 272 in run_executable
(MeshHostWorker pid=10032) File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 462 in _resume_span
(MeshHostWorker pid=10032) File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/ray/_private/function_manager.py", line 675 in actor_method_executor
(MeshHostWorker pid=10032) File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/ray/worker.py", line 451 in main_loop
(MeshHostWorker pid=10032) File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/ray/workers/default_worker.py", line 238 in <module>
2022-10-31 10:54:17,296 WARNING worker.py:1404 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff542d95af111d74a1604f8b9f01000000 Worker ID: 9a3273284d6c0f6e937713462667985736af6337a1c29e3202ce7af5 Node ID: 2ffe6dd1f9bf540b4e18f5c5fa824cbb0bf0c8940855777e929de479 Worker IP address: 192.168.114.62 Worker port: 10007 Worker PID: 10032
Traceback (most recent call last):
File "/home/cry/anaconda3/envs/epod/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/cry/anaconda3/envs/epod/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/cry/data3/Expert-Prompted-Object-Detection/tools/test_language_model.py", line 15, in <module>
output = model.generate(input_ids=input_ids, max_length=256, do_sample=True)
File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
return func(*args, **kwargs)
File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/transformers/generation_utils.py", line 1422, in generate
return self.sample(
File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/transformers/generation_utils.py", line 2035, in sample
outputs = self(
File "/home/cry/data3/Expert-Prompted-Object-Detection/alpa/examples/llm_serving/model/wrapper.py", line 110, in __call__
ret = self.inference_func(input_ids,
File "/home/cry/data3/Expert-Prompted-Object-Detection/alpa/examples/llm_serving/model/wrapper.py", line 585, in inference_func
logits_step = torch.from_numpy(np.array(output.logits)).to(torch_device).float()
File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/alpa/device_mesh.py", line 1610, in __array__
return np.asarray(self._value, dtype=dtype)
File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/alpa/device_mesh.py", line 1596, in _value
fetched_np_buffers = self.device_mesh.get_remote_buffers(
File "/home/cry/anaconda3/envs/epod/lib/python3.8/site-packages/alpa/device_mesh.py", line 1164, in get_remote_buffers
self.workers[host_id].get_buffers.remote(
TypeError: 'NoneType' object is not subscriptable
How do I deal with it? Thank you.
Can you pass alpa installation tests: https://alpa.ai/install.html#check-installation ?
@RuoyuChen10 is your problem solved? Can you pass the alpa install tests?
@RuoyuChen10 check again to see -- have you solved this problem?
Hi @zhisbug,
Can you pass alpa installation tests: https://alpa.ai/install.html#check-installation ?
I'm trying to install Alpa on Kaggle Kernels (running GPU T4 x2), and when I run the alpa installation tests I get the following error (I've tried running the two commands in separate notebook cells and in the same cell):
!ray start --head
!python -m alpa.test_install
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
Local node IP: 172.19.2.2
--------------------
Ray runtime started.
--------------------
Next steps
To connect to this Ray runtime from another node, run
ray start --address='172.19.2.2:6379'
Alternatively, use the following Python code:
import ray
ray.init(address='auto')
To connect to this Ray runtime from outside of the cluster, for example to
connect to a remote cluster from your laptop directly, use the following
Python code:
import ray
ray.init(address='ray://<head_node_ip_address>:10001')
If connection fails, check your firewall settings and network configuration.
To terminate the Ray runtime, run
ray stop
2022-11-17 22:16:22.422757: W external/org_tensorflow/tensorflow/stream_executor/gpu/asm_compiler.cc:111] *** WARNING *** You are using ptxas 11.0.221, which is older than 11.1. ptxas before 11.1 is known to miscompile XLA code, leading to incorrect results or invalid-address errors.
You may not need to update to CUDA 11.1; cherry-picking the ptxas binary is often sufficient.
.E
======================================================================
ERROR: test_2_pipeline_parallel (__main__.InstallationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/alpa/test_install.py", line 33, in test_2_pipeline_parallel
init(cluster="ray")
File "/opt/conda/lib/python3.7/site-packages/alpa/api.py", line 52, in init
init_global_cluster(cluster, num_nodes, num_devices_per_node, namespace)
File "/opt/conda/lib/python3.7/site-packages/alpa/device_mesh.py", line 2257, in init_global_cluster
namespace=namespace)
File "/opt/conda/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/ray/worker.py", line 954, in init
bootstrap_address = services.canonicalize_bootstrap_address(address)
File "/opt/conda/lib/python3.7/site-packages/ray/_private/services.py", line 451, in canonicalize_bootstrap_address
addr = get_ray_address_from_environment()
File "/opt/conda/lib/python3.7/site-packages/ray/_private/services.py", line 358, in get_ray_address_from_environment
addr = _find_gcs_address_or_die()
File "/opt/conda/lib/python3.7/site-packages/ray/_private/services.py", line 341, in _find_gcs_address_or_die
"Could not find any running Ray instance. "
ConnectionError: Could not find any running Ray instance. Please specify the one to connect to by setting `--address` flag or `RAY_ADDRESS` environment variable.
----------------------------------------------------------------------
Ran 2 tests in 6.641s
FAILED (errors=1)
Not sure if you can spot the error, many thanks for any help! :)
You can find this warning message in your log. Could you update your cuda toolkit or ptxas?
2022-11-17 22:16:22.422757: W external/org_tensorflow/tensorflow/stream_executor/gpu/asm_compiler.cc:111] *** WARNING *** You are using ptxas 11.0.221, which is older than 11.1. ptxas before 11.1 is known to miscompile XLA code, leading to incorrect results or invalid-address errors.
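To check which ptxas a notebook environment actually picks up, something like the following could be used. This is a minimal sketch: the helper names are made up for illustration, and it assumes `ptxas --version` prints a line of the form `Cuda compilation tools, release 11.0, V11.0.221`. The 11.1 threshold is taken from the warning above.

```python
import re
import shutil
import subprocess


def parse_ptxas_version(text):
    """Return (major, minor) parsed from `ptxas --version` output, or None."""
    # Assumed output format: "... release 11.0, V11.0.221"
    match = re.search(r"release (\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def ptxas_is_new_enough(text, minimum=(11, 1)):
    """True if the parsed version is at least `minimum` (11.1 per the XLA warning)."""
    version = parse_ptxas_version(text)
    return version is not None and version >= minimum


def installed_ptxas_output():
    """Run the ptxas found on PATH and return its version text, or None if absent."""
    ptxas = shutil.which("ptxas")
    if ptxas is None:
        return None
    result = subprocess.run([ptxas, "--version"], capture_output=True, text=True)
    return result.stdout
```

Calling `ptxas_is_new_enough(installed_ptxas_output() or "")` in the notebook should return False for the 11.0.221 binary reported in the log.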
The Ray connection error "ConnectionError: Could not find any running Ray instance. Please specify the one to connect to by setting `--address` flag or `RAY_ADDRESS` environment variable." seems weird. I don't know whether Ray is compatible with Kaggle kernels. Maybe you can get some help from the Ray community (doc, help)?
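The error message itself names the `RAY_ADDRESS` environment variable, so one thing to try is exporting the address that `ray start --head` printed before running the tests. The sketch below is untested on Kaggle and the helper names are hypothetical. It only parses the `--address='...'` line shown in the startup output, and it will not help if the Ray processes did not survive the notebook cell that started them.

```python
import os
import re


def extract_ray_address(start_output):
    """Pull the 'ip:port' address out of the text printed by `ray start --head`."""
    match = re.search(r"--address='([^']+)'", start_output)
    return match.group(1) if match else None


def export_ray_address(start_output, environ=os.environ):
    """Set RAY_ADDRESS in `environ` from the startup text; return the address or None."""
    address = extract_ray_address(start_output)
    if address is not None:
        environ["RAY_ADDRESS"] = address
    return address
```

In a notebook, the startup text could be captured with `out = !ray start --head` and passed in as `"\n".join(out)`. Commands launched from later cells should then see the variable, assuming they inherit the kernel's environment.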
Thanks very much for the info @merrymercy. I think I've got a little further, but it does appear the CPU resources on Kaggle Kernels/Notebooks are very limited:
# CPU
model name : Intel(R) Xeon(R) CPU @ 2.00GHz
cpu MHz : 2000.140
cpu cores : 1
MemTotal: 16390868 kB
# GPU
Sun Nov 20 14:39:13 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 40C P8 10W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 Off | 00000000:00:05.0 Off | 0 |
| N/A 40C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Here's my output if you have any more ideas (thanks for any help in advance! :)
Ray seems to start fine:
`$ ray start --head` yields
Local node IP: 172.19.2.2
--------------------
Ray runtime started.
--------------------
Next steps
...
I found that Kaggle did not seem to like ray.init(address="auto")
but does like:
>>> import ray
>>> ray.init(ignore_reinit_error=True, num_gpus=2, num_cpus=1)
RayContext(dashboard_url='', python_version='3.7.12', ray_version='1.13.0', ray_commit='e4ce38d001dbbe09cd21c497fedd03d692b2be3e', address_info={'node_ip_address': '172.19.2.2', 'raylet_ip_address': '172.19.2.2', 'redis_address': None, 'object_store_address': '/tmp/ray/session_2022-11-20_13-57-02_103301_8148/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2022-11-20_13-57-02_103301_8148/sockets/raylet', 'webui_url': '', 'session_dir': '/tmp/ray/session_2022-11-20_13-57-02_103301_8148', 'metrics_export_port': 56336, 'gcs_address': '172.19.2.2:55484', 'address': '172.19.2.2:55484', 'node_id': '9e0cc00fb9ab5ce2fb3b6b3a070598c3b35602ee98c1bc7335cc1f6d'})
Then I tried:
>>> from alpa.api import init
>>> init(cluster="ray", num_nodes=1, num_devices_per_node=1)
$ python -m alpa.test_install
2022-11-20 14:32:32.722239: W external/org_tensorflow/tensorflow/stream_executor/gpu/asm_compiler.cc:111] *** WARNING *** You are using ptxas 11.0.221, which is older than 11.1. ptxas before 11.1 is known to miscompile XLA code, leading to incorrect results or invalid-address errors.
You may not need to update to CUDA 11.1; cherry-picking the ptxas binary is often sufficient.
.E
======================================================================
ERROR: test_2_pipeline_parallel (__main__.InstallationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/alpa/test_install.py", line 33, in test_2_pipeline_parallel
init(cluster="ray")
File "/opt/conda/lib/python3.7/site-packages/alpa/api.py", line 52, in init
init_global_cluster(cluster, num_nodes, num_devices_per_node, namespace)
File "/opt/conda/lib/python3.7/site-packages/alpa/device_mesh.py", line 2260, in init_global_cluster
namespace)
File "/opt/conda/lib/python3.7/site-packages/alpa/device_mesh.py", line 2135, in __init__
self.num_hosts, self.host_num_devices, pg_name)
File "/opt/conda/lib/python3.7/site-packages/alpa/util.py", line 1534, in create_placement_group
"Placement group creation timed out. Make sure your "
TimeoutError: Placement group creation timed out. Make sure your cluster either has enough resources or use an autoscaling cluster. If you are running on a cluster, make sure you specify an address in `ray.init()`, for example, `ray.init("auto")`. You can also increase the timeout by setting the ALPA_PLACEMENT_GROUP_TIMEOUT_S environment variable. Current resources available: {'node:172.19.2.2': 1.0, 'GPU': 1.0, 'memory': 9483205019.0, 'accelerator_type:T4': 1.0, 'bundle_group_259a5719bd85972bbd28a622470001000000': 1000.0, 'GPU_group_0_259a5719bd85972bbd28a622470001000000': 1.0, 'CPU_group_259a5719bd85972bbd28a622470001000000': 1.0, 'bundle_group_0_259a5719bd85972bbd28a622470001000000': 1000.0, 'object_store_memory': 4741602508.0, 'CPU_group_0_259a5719bd85972bbd28a622470001000000': 1.0, 'GPU_group_259a5719bd85972bbd28a622470001000000': 1.0}, resources requested by the placement group: [{'CPU': 1.0, 'GPU': 2.0}]
----------------------------------------------------------------------
Ran 2 tests in 108.457s
FAILED (errors=1)
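The timeout message lists both the available resources and the requested bundle, so the shortfall can be computed directly. The sketch below uses a hypothetical helper, not part of Alpa or Ray. It sums the requested bundles and compares them against the available dict, ignoring per-node placement, which is enough for a single-node setup like this one.

```python
def unmet_resources(available, bundles):
    """Return {resource: shortfall} for resources the bundles need but the cluster lacks."""
    requested = {}
    for bundle in bundles:
        for name, amount in bundle.items():
            requested[name] = requested.get(name, 0.0) + amount
    return {
        name: amount - available.get(name, 0.0)
        for name, amount in requested.items()
        if amount > available.get(name, 0.0)
    }


# Values copied from the TimeoutError above (unrelated keys omitted).
available = {"GPU": 1.0, "memory": 9483205019.0, "object_store_memory": 4741602508.0}
requested = [{"CPU": 1.0, "GPU": 2.0}]
print(unmet_resources(available, requested))
```

With the numbers from the log, this reports that one CPU and one GPU are missing. The `CPU_group_...` and `GPU_group_...` keys in the available resources suggest that an earlier placement group, possibly the one created by the interactive `init(cluster="ray", num_nodes=1, num_devices_per_node=1)` call, is still holding them. That is an inference from the log, not something confirmed in this thread. A live dict in the same shape can be obtained from `ray.available_resources()`.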
Closed due to inactivity.