Closed: mrseeker closed this issue 1 month ago
If #265 fixes this issue, let me know, and then I will be happy to test it out.
The error log isn't very helpful. It may give you more info if you kill the server (the failure surfaces asynchronously). It could be due to an internal timeout, but it's hard to tell from this error.
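For what it's worth, here is a rough sketch of how to coax an earlier, more informative failure out of NCCL/CUDA. These are standard PyTorch/NCCL environment variables, not anything Aphrodite-specific, and I haven't verified them against this exact setup:

# Debug only: serialize CUDA launches and make NCCL verbose, so the failing
# collective is reported where it happens instead of by the watchdog thread.
export CUDA_LAUNCH_BLOCKING=1
export NCCL_DEBUG=INFO
export NCCL_ASYNC_ERROR_HANDLING=1

This slows everything down considerably, so it is only useful for reproducing the crash, not for serving.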
[rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1442, OpType=GATHER, NumelIn=16000, NumelOut=16000, Timeout(ms)=600000) ran for 600251 milliseconds before timing out.
--
[rank0]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1182] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1442, OpType=GATHER, NumelIn=16000, NumelOut=16000, Timeout(ms)=600000) ran for 600251 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f860ad81d87 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f85c07df6e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f85c07e2c3d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f85c07e3839 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdc253 (0x7f860a4b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x7f860c341ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f860c3d2a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[2024-03-25 13:52:52,844 E 1 6316] logging.cc:97: Unhandled exception: N3c1016DistBackendErrorE. what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1442, OpType=GATHER, NumelIn=16000, NumelOut=16000, Timeout(ms)=600000) ran for 600251 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f860ad81d87 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f85c07df6e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f85c07e2c3d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f85c07e3839 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdc253 (0x7f860a4b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x7f860c341ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f860c3d2a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f860ad81d87 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xdf6b11 (0x7f85c0539b11 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xdc253 (0x7f860a4b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x94ac3 (0x7f860c341ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x7f860c3d2a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[2024-03-25 13:52:52,860 E 1 6316] logging.cc:104: Stack trace: /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0xfe543a) [0x7f84b00f043a] ray::operator<<()
/usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0xfe7b78) [0x7f84b00f2b78] ray::TerminateHandler()
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c) [0x7f860a48220c]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277) [0x7f860a482277]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae1fe) [0x7f860a4821fe]
/usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0xdf6bcc) [0x7f85c0539bcc] c10d::ProcessGroupNCCL::ncclCommWatchdog()
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f860a4b0253]
/usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f860c341ac3]
/usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x44) [0x7f860c3d2a04] __clone
*** SIGABRT received at time=1711374772 on cpu 76 ***
PC: @ 0x7f860c3439fc (unknown) pthread_kill @ 0x7f860c2ef520 (unknown) (unknown)
[2024-03-25 13:52:52,861 E 1 6316] logging.cc:361: *** SIGABRT received at time=1711374772 on cpu 76 ***
[2024-03-25 13:52:52,861 E 1 6316] logging.cc:361: PC: @ 0x7f860c3439fc (unknown) pthread_kill
[2024-03-25 13:52:52,861 E 1 6316] logging.cc:361: @ 0x7f860c2ef520 (unknown) (unknown)
Fatal Python error: Aborted
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, charset_normalizer.md, yaml._yaml, sentencepiece._sentencepiece, psutil._psutil_linux, psutil._psutil_posix, msgpack._cmsgpack, google._upb._message, setproctitle, ray._raylet, cupy_backends.cuda.api._runtime_enum, cupy_backends.cuda.api.runtime, cupy_backends.cuda.stream, cupy_backends.cuda.libs.cublas, cupy_backends.cuda.libs.cusolver, cupy_backends.cuda._softlink, cupy_backends.cuda.libs.cusparse, cupy._util, cupy.cuda.device, fastrlock.rlock, cupy.cuda.memory_hook, cupy.cuda.graph, cupy.cuda.stream, cupy_backends.cuda.api._driver_enum, cupy_backends.cuda.api.driver, cupy.cuda.memory, cupy._core.internal, cupy._core._carray, cupy.cuda.texture, cupy.cuda.function, cupy_backends.cuda.libs.nvrtc, cupy.cuda.jitify, cupy.cuda.pinned_memory, cupy_backends.cuda.libs.curand, cupy_backends.cuda.libs.profiler, cupy.cuda.common, cupy.cuda.cub, cupy_backends.cuda.libs.nvtx, cupy.cuda.thrust, cupy._core._dtype, cupy._core._scalar, cupy._core._accelerator, cupy._core._memory_range, cupy._core._fusion_thread_local, cupy._core._kernel, cupy._core._routines_manipulation, cupy._core._routines_binary, cupy._core._optimize_config, cupy._core._cub_reduction, cupy._core._reduction, cupy._core._routines_math, cupy._core._routines_indexing, cupy._core._routines_linalg, cupy._core._routines_logic, cupy._core._routines_sorting, cupy._core._routines_statistics, cupy._core.dlpack, cupy._core.flags, cupy._core.core, cupy._core._fusion_variable, cupy._core._fusion_trace, cupy._core._fusion_kernel, cupy._core.new_fusion, cupy._core.fusion, cupy._core.raw, cupyx.cusolver, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._flinalg, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, cupy.cuda.cufft, cupy.fft._cache, cupy.fft._callback, cupy.random._generator_api, cupy.random._bit_generator, scipy._lib._uarray._uarray, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, cupy.lib._polynomial, cupy_backends.cuda.libs.nccl, regex._regex, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, markupsafe._speedups, scipy.optimize._minpack2, scipy.optimize._group_columns, scipy._lib.messagestream, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, 
scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.spatial._ckdtree, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.optimize._direct (total: 157)
[failure_signal_handler.cc : 332] RAW: Signal 11 raised at PC=0x7f860c2d5898 while already in AbslFailureSignalHandler()
*** SIGSEGV received at time=1711374772 on cpu 76 ***
PC: @ 0x7f860c2d5898 (unknown) abort @ 0x7f860c2ef520 (unknown) (unknown) @ 0x7f844056d640 (unknown) (unknown)
[2024-03-25 13:52:52,867 E 1 6316] logging.cc:361: *** SIGSEGV received at time=1711374772 on cpu 76 ***
[2024-03-25 13:52:52,867 E 1 6316] logging.cc:361: PC: @ 0x7f860c2d5898 (unknown) abort
[2024-03-25 13:52:52,870 E 1 6316] logging.cc:361: @ 0x7f860c2ef520 (unknown) (unknown)
[2024-03-25 13:52:52,873 E 1 6316] logging.cc:361: @ 0x7f844056d640 (unknown) (unknown)
Fatal Python error: Segmentation fault
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, charset_normalizer.md, yaml._yaml, sentencepiece._sentencepiece, psutil._psutil_linux, psutil._psutil_posix, msgpack._cmsgpack, google._upb._message, setproctitle, ray._raylet, cupy_backends.cuda.api._runtime_enum, cupy_backends.cuda.api.runtime, cupy_backends.cuda.stream, cupy_backends.cuda.libs.cublas, cupy_backends.cuda.libs.cusolver, cupy_backends.cuda._softlink, cupy_backends.cuda.libs.cusparse, cupy._util, cupy.cuda.device, fastrlock.rlock, cupy.cuda.memory_hook, cupy.cuda.graph, cupy.cuda.stream, cupy_backends.cuda.api._driver_enum, cupy_backends.cuda.api.driver, cupy.cuda.memory, cupy._core.internal, cupy._core._carray, cupy.cuda.texture, cupy.cuda.function, cupy_backends.cuda.libs.nvrtc, cupy.cuda.jitify, cupy.cuda.pinned_memory, cupy_backends.cuda.libs.curand, cupy_backends.cuda.libs.profiler, cupy.cuda.common, cupy.cuda.cub, cupy_backends.cuda.libs.nvtx, cupy.cuda.thrust, cupy._core._dtype, cupy._core._scalar, cupy._core._accelerator, cupy._core._memory_range, cupy._core._fusion_thread_local, cupy._core._kernel, cupy._core._routines_manipulation, cupy._core._routines_binary, cupy._core._optimize_config, cupy._core._cub_reduction, cupy._core._reduction, cupy._core._routines_math, cupy._core._routines_indexing, cupy._core._routines_linalg, cupy._core._routines_logic, cupy._core._routines_sorting, cupy._core._routines_statistics, cupy._core.dlpack, cupy._core.flags, cupy._core.core, cupy._core._fusion_variable, cupy._core._fusion_trace, cupy._core._fusion_kernel, cupy._core.new_fusion, cupy._core.fusion, cupy._core.raw, cupyx.cusolver, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._flinalg, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, cupy.cuda.cufft, cupy.fft._cache, cupy.fft._callback, cupy.random._generator_api, cupy.random._bit_generator, scipy._lib._uarray._uarray, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, cupy.lib._polynomial, cupy_backends.cuda.libs.nccl, regex._regex, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, markupsafe._speedups, scipy.optimize._minpack2, scipy.optimize._group_columns, scipy._lib.messagestream, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, 
scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.spatial._ckdtree, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.optimize._direct (total: 157)
Starting Aphrodite Engine API server...
+ exec python3 -m aphrodite.endpoints.openai.api_server --host 0.0.0.0 --port 8080 --download-dir /tmp/hub --model <REDACTED> --tensor-parallel-size 8 --served-model-name <REDACTED>
2024-03-25 13:53:03,923 INFO worker.py:1752 -- Started a local Ray instance.
INFO: Initializing the Aphrodite Engine (v0.5.2) with the following config:
INFO: Model = '<REDACTED>'
INFO: DataType = torch.float16
INFO: Model Load Format = auto
INFO: Number of GPUs = 8
INFO: Disable Custom All-Reduce = False
INFO: Quantization Format = None
INFO: Context Length = 32768
INFO: Enforce Eager Mode = False
INFO: KV Cache Data Type = auto
INFO: KV Cache Params Path = None
INFO: Device = cuda
(RayWorkerAphrodite pid=6025) INFO: Downloading model weights ['*.safetensors']
INFO: Downloading model weights ['*.safetensors']
INFO: Model weights loaded. Memory usage: 1.70 GiB x 8 = 13.57 GiB
(RayWorkerAphrodite pid=5797) INFO: Model weights loaded. Memory usage: 1.70 GiB x 8 = 13.57 GiB
INFO: # GPU blocks: 115810, # CPU blocks: 16384
INFO: Minimum concurrency: 56.55x
INFO: Maximum sequence length allowed in the cache: 1852960
INFO: Capturing the model for CUDA graphs. This may lead to unexpected
consequences if the model is not static. To run the model in eager mode, set
'enforce_eager=True' or use '--enforce-eager' in the CLI.
WARNING: CUDA graphs can take additional 1~3 GiB of memory per GPU. If you are
running out of memory, consider decreasing `gpu_memory_utilization` or enforcing
eager mode.
Seems like a timeout error. Did you have a sequence that took longer than 60 seconds to process? As a hotfix, you can increase the timeout threshold:
export APHRODITE_ENGINE_ITERATION_TIMEOUT_S=120
That would set the limit to 120 seconds. You can, of course, pass it as an environment variable to the Docker image.
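For example, with Docker that could look something like this (illustrative only; the image name and port are placeholders for whatever you are running):

docker run --gpus all \
  -e APHRODITE_ENGINE_ITERATION_TIMEOUT_S=120 \
  -p 8080:8080 \
  <your-aphrodite-image>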
Currently running on AWS. My setup:
And it crashes on the 60-second timeout, which means it is hogging a single GPU instead of distributing the load.
Normal response time for 1 GPU with 1 client is around 6 seconds.
Most issues fixed with v0.6.0
Your current environment
🐛 Describe the bug
When I set NUM_GPUS to 8 (since the server has 8 GPUs), I get the following error (sorry, but the system is bad at logging errors properly). The annoying part is that the server is not stopped, and "health" still shows 200 (it should not, since the backend crashed).
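For reference, the check that keeps returning 200 is just the server's /health endpoint (hypothetical invocation, using the port from the launch command above):

curl -i http://localhost:8080/health
# still answers HTTP 200 even after the NCCL watchdog has aborted the workers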