alpa-projects / alpa

Training and serving large-scale neural networks with auto parallelization.
https://alpa.ai
Apache License 2.0

Failure to run alpa test #937

Open gaow0007 opened 1 year ago

gaow0007 commented 1 year ago

Please describe the bug

Running the alpa installation test fails in test_2_pipeline_parallel: initializing the cross-mesh NCCL p2p communicators raises NCCL_ERROR_UNHANDLED_CUDA_ERROR (full log below).

Please describe the expected behavior

Both installation tests should pass.

System information and environment

To Reproduce

Steps to reproduce the behavior:

  1. python3 -m alpa.test_install
  2. See error
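
To narrow things down, the failing test can also be run on its own. This is a minimal sketch: the class and method names are taken from the traceback below, and it assumes alpa.test_install exposes them directly.

    import unittest

    # InstallationTest and test_2_pipeline_parallel appear in the traceback
    # below; adjust the import if the installed version lays them out differently.
    from alpa.test_install import InstallationTest

    suite = unittest.TestSuite()
    suite.addTest(InstallationTest("test_2_pipeline_parallel"))
    unittest.TextTestRunner(verbosity=2).run(suite)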

Screenshots

Full output from running python3 -m alpa.test_install:

2023-06-17 22:59:20,085 INFO worker.py:1342 -- Connecting to existing Ray cluster at address: 155.69.142.146:6379...
2023-06-17 22:59:20,120 INFO worker.py:1528 -- Connected to Ray cluster.
(raylet) [2023-06-17 22:59:27,687 E 25332 25478] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2023-06-17_22-09-42_273283_25013 is over 95% full, available space: 21533958144; capacity: 730542596096. Object creation will fail if spilling is required.
EException ignored in: <function PipeshardDriverExecutable.__del__ at 0x7fe295cbc940>
Traceback (most recent call last):
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/pipeline_parallel/pipeshard_executable.py", line 434, in __del__
2023-06-17 22:59:29,665 ERROR worker.py:400 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::MeshHostWorker.init_p2p_communicator() (pid=16323, ip=155.69.142.146, repr=<alpa.device_mesh.MeshHostWorker object at 0x7fbdcf679430>)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/device_mesh.py", line 391, in init_p2p_communicator
    g.create_p2p_communicator(my_gpu_idx, peer_rank, peer_gpu_idx, nccl_uid)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 662, in create_p2p_communicator
    self._get_nccl_p2p_communicator(comm_key, my_gpu_idx, peer_rank,
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 532, in _get_nccl_p2p_communicator
    comm = nccl_util.create_nccl_communicator(2, nccl_uid, my_p2p_rank)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_util.py", line 115, in create_nccl_communicator
    comm = NcclCommunicator(world_size, nccl_unique_id, rank)
  File "cupy_backends/cuda/libs/nccl.pyx", line 283, in cupy_backends.cuda.libs.nccl.NcclCommunicator.__init__
  File "cupy_backends/cuda/libs/nccl.pyx", line 129, in cupy_backends.cuda.libs.nccl.check_status
cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR: unhandled cuda error
    mesh.delete_remote_executable(self.exec_uuid)
AttributeError: 'PipeshardDriverExecutable' object has no attribute 'exec_uuid'

======================================================================
ERROR: test_2_pipeline_parallel (__main__.InstallationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/test_install.py", line 65, in <module>
    runner.run(suite())
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/unittest/runner.py", line 176, in run
    test(result)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/unittest/suite.py", line 122, in run
    test(result)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/unittest/case.py", line 736, in __call__
    return self.run(*args, **kwds)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/unittest/case.py", line 676, in run
    self._callTestMethod(testMethod)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/unittest/case.py", line 633, in _callTestMethod
    method()
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/test_install.py", line 49, in test_2_pipeline_parallel
    actual_output = p_train_step(state, batch)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback
    return fun(*args, **kwargs)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/api.py", line 121, in __call__
    self._decode_args_and_get_executable(*args))
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/api.py", line 191, in _decode_args_and_get_executable
    executable = _compile_parallel_executable(f, in_tree, out_tree_hashable,
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/jax/linear_util.py", line 309, in memoized_fun
    ans = call(fun, *args)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/api.py", line 223, in _compile_parallel_executable
    return method.compile_executable(fun, in_tree, out_tree_thunk,
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/parallel_method.py", line 240, in compile_executable
    return compile_pipeshard_executable(
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/pipeline_parallel/compile_executable.py", line 118, in compile_pipeshard_executable
    executable = PipeshardDriverExecutable(
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/pipeline_parallel/pipeshard_executable.py", line 105, in __init__
    task.create_resharding_communicators()
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 292, in create_resharding_communicators
    ray.get(task_dones)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/ray/_private/worker.py", line 2289, in get
    raise value.as_instanceof_cause()
jax._src.traceback_util.UnfilteredStackTrace: ray.exceptions.RayTaskError(NcclError): ray::MeshHostWorker.init_p2p_communicator() (pid=16322, ip=155.69.142.146, repr=<alpa.device_mesh.MeshHostWorker object at 0x7f9ab240f460>)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/device_mesh.py", line 391, in init_p2p_communicator
    g.create_p2p_communicator(my_gpu_idx, peer_rank, peer_gpu_idx, nccl_uid)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 662, in create_p2p_communicator
    self._get_nccl_p2p_communicator(comm_key, my_gpu_idx, peer_rank,
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 532, in _get_nccl_p2p_communicator
    comm = nccl_util.create_nccl_communicator(2, nccl_uid, my_p2p_rank)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_util.py", line 115, in create_nccl_communicator
    comm = NcclCommunicator(world_size, nccl_unique_id, rank)
  File "cupy_backends/cuda/libs/nccl.pyx", line 283, in cupy_backends.cuda.libs.nccl.NcclCommunicator.__init__
  File "cupy_backends/cuda/libs/nccl.pyx", line 129, in cupy_backends.cuda.libs.nccl.check_status
cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR: unhandled cuda error

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.

--------------------

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/test_install.py", line 49, in test_2_pipeline_parallel
    actual_output = p_train_step(state, batch)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/pipeline_parallel/compile_executable.py", line 118, in compile_pipeshard_executable
    executable = PipeshardDriverExecutable(
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 292, in create_resharding_communicators
    ray.get(task_dones)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/ray/_private/worker.py", line 2289, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(NcclError): ray::MeshHostWorker.init_p2p_communicator() (pid=16322, ip=155.69.142.146, repr=<alpa.device_mesh.MeshHostWorker object at 0x7f9ab240f460>)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/device_mesh.py", line 391, in init_p2p_communicator
    g.create_p2p_communicator(my_gpu_idx, peer_rank, peer_gpu_idx, nccl_uid)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 662, in create_p2p_communicator
    self._get_nccl_p2p_communicator(comm_key, my_gpu_idx, peer_rank,
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 532, in _get_nccl_p2p_communicator
    comm = nccl_util.create_nccl_communicator(2, nccl_uid, my_p2p_rank)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_util.py", line 115, in create_nccl_communicator
    comm = NcclCommunicator(world_size, nccl_unique_id, rank)
  File "cupy_backends/cuda/libs/nccl.pyx", line 283, in cupy_backends.cuda.libs.nccl.NcclCommunicator.__init__
  File "cupy_backends/cuda/libs/nccl.pyx", line 129, in cupy_backends.cuda.libs.nccl.check_status
cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR: unhandled cuda error

----------------------------------------------------------------------
Ran 2 tests in 20.923s

FAILED (errors=1)
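
For what it's worth, the log above mixes three separate issues. The raylet warning about /tmp/ray being over 95% full only concerns object spilling and is likely unrelated; freeing space on that filesystem or restarting Ray with a temp dir on a larger disk (ray start --temp-dir=/path/with/space) silences it. The AttributeError from PipeshardDriverExecutable.__del__ is a secondary symptom: __init__ raised before exec_uuid was ever assigned, so the destructor touches a missing attribute. A defensive version of that pattern would look roughly like this (a runnable sketch, not alpa's actual code; the print is a stand-in for the remote cleanup call):

    import uuid

    class Executable:
        def __init__(self):
            # If anything before this assignment raises, __del__ still runs.
            self.exec_uuid = uuid.uuid4()

        def __del__(self):
            # getattr guards against __init__ having failed part-way through.
            exec_uuid = getattr(self, "exec_uuid", None)
            if exec_uuid is not None:
                print(f"releasing {exec_uuid}")  # stand-in for remote cleanup

The primary failure is the NCCL_ERROR_UNHANDLED_CUDA_ERROR raised while the p2p communicators are created.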

Code snippet to reproduce the problem

See "To Reproduce" above; running python3 -m alpa.test_install is sufficient.

Additional information
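
One way to tell whether the failure is specific to alpa/Ray or comes from the CUDA/NCCL installation itself is to create a communicator directly through cupy, the same binding that fails in the traceback (cupy_backends.cuda.libs.nccl). A minimal sketch, assuming cupy was built with NCCL support:

    import cupy
    from cupy.cuda import nccl

    print("cupy:", cupy.__version__, "NCCL:", nccl.get_version())

    # A single-rank communicator exercises the CUDA <-> NCCL path
    # without any multi-process coordination.
    uid = nccl.get_unique_id()
    comm = nccl.NcclCommunicator(1, uid, 0)
    print("single-rank NCCL communicator created OK")

If this fails with the same NCCL_ERROR_UNHANDLED_CUDA_ERROR, the problem is below alpa (a driver/CUDA/NCCL version mismatch is a common cause). Setting the environment variable NCCL_DEBUG=INFO before the run usually prints the underlying CUDA error as well.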

oscardddd commented 3 weeks ago

Hi! I encountered a similar error. Did you find a solution to this?