PygmalionAI / aphrodite-engine

Large-scale LLM inference engine
https://aphrodite.pygmalion.chat
GNU Affero General Public License v3.0

[Bug]: SnowStorm-v1.15-4x8B: Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=128, NumelOut=128, Timeout(ms)=600000) #493

Open josephrocca opened 5 months ago

josephrocca commented 5 months ago

Your current environment

Using the current official Docker image on Runpod with two RTX 4090s:

alpindale/aphrodite-engine@sha256:b1e72201654a172e044a13d9346264a8b4e562dba8f3572bd92f013cf5420eb1
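As a minimal sketch (not part of the original report), the exact image can be pulled by digest and checked for GPU visibility before launching the engine. The `--entrypoint` override and the availability of `nvidia-smi` inside the container are assumptions, since the image normally starts the API server directly:

```bash
# Hypothetical verification steps, assuming Docker with the NVIDIA container runtime.
docker pull alpindale/aphrodite-engine@sha256:b1e72201654a172e044a13d9346264a8b4e562dba8f3572bd92f013cf5420eb1

# Override the image's server entrypoint just to confirm both 4090s are visible in the container.
docker run --rm --gpus all \
  --entrypoint nvidia-smi \
  alpindale/aphrodite-engine@sha256:b1e72201654a172e044a13d9346264a8b4e562dba8f3572bd92f013cf5420eb1
```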

🐛 Describe the bug

While testing a batch of models in an eval pipeline, Zoyd/xxx777xxxASD_L3-SnowStorm-v1.15-4x8B-A-5_0bpw_exl2 failed with a strange error.

Here is the Runpod template (the server command it expands to is shown after the block):

alpindale/aphrodite-engine@sha256:b1e72201654a172e044a13d9346264a8b4e562dba8f3572bd92f013cf5420eb1
GPU_MEMORY_UTILIZATION="0.97"
KVCACHE="fp8"
CMD_ADDITIONAL_ARGUMENTS="--quantization exl2 --max-model-len 4096 --max-log-len 1000"
MODEL_NAME="Zoyd/xxx777xxxASD_L3-SnowStorm-v1.15-4x8B-A-5_0bpw_exl2"
REVISION="4d07fb963adef3bbb309dfac2179cfb195971f73"
NUM_GPUS="2"
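For clarity, the template above expands to the server invocation below. This is copied verbatim from the `exec` line in the startup logs further down, not reconstructed:

```bash
python3 -m aphrodite.endpoints.openai.api_server \
  --host 0.0.0.0 --port 7860 \
  --download-dir /tmp/hub \
  --model Zoyd/xxx777xxxASD_L3-SnowStorm-v1.15-4x8B-A-5_0bpw_exl2 \
  --revision 4d07fb963adef3bbb309dfac2179cfb195971f73 \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.97 \
  --quantization exl2 \
  --max-model-len 4096 \
  --max-log-len 1000
```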

It was a machine with two RTX 4090s.

Here are the logs:

Click for logs ``` 2024-05-31T12:27:26.071510406Z Starting Aphrodite Engine API server... 2024-05-31T12:27:26.071721103Z + exec python3 -m aphrodite.endpoints.openai.api_server --host 0.0.0.0 --port 7860 --download-dir /tmp/hub --model Zoyd/xxx777xxxASD_L3-SnowStorm-v1.15-4x8B-A-5_0bpw_exl2 --revision 4d07fb963adef3bbb309dfac2179cfb195971f73 --kv-cache-dtype fp8 --tensor-parallel-size 2 --gpu-memory-utilization 0.97 --quantization exl2 --max-model-len 4096 --max-log-len 1000 2024-05-31T12:27:27.154744946Z Starting Aphrodite Engine API server... 2024-05-31T12:27:27.154880922Z + exec python3 -m aphrodite.endpoints.openai.api_server --host 0.0.0.0 --port 7860 --download-dir /tmp/hub --model Zoyd/xxx777xxxASD_L3-SnowStorm-v1.15-4x8B-A-5_0bpw_exl2 --revision 4d07fb963adef3bbb309dfac2179cfb195971f73 --kv-cache-dtype fp8 --tensor-parallel-size 2 --gpu-memory-utilization 0.97 --quantization exl2 --max-model-len 4096 --max-log-len 1000 2024-05-31T12:27:30.235431861Z /usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`. 2024-05-31T12:27:30.235457590Z warnings.warn( 2024-05-31T12:27:30.721315412Z WARNING: exl2 quantization is not fully optimized yet. The speed can be slower 2024-05-31T12:27:30.721516341Z than non-quantized models. 2024-05-31T12:27:30.727105617Z INFO: Using fp8 data type to store kv cache. It reduces the GPU memory 2024-05-31T12:27:30.727123421Z footprint and boosts the performance. But it may cause slight accuracy drop 2024-05-31T12:27:30.727127809Z without scaling factors. FP8_E5M2 (without scaling) is only supported on cuda 2024-05-31T12:27:30.727130985Z version greater than 11.8. On ROCm (AMD GPU), FP8_E4M3 is instead supported for 2024-05-31T12:27:30.727133821Z common inference criteria. 2024-05-31T12:27:32.478243515Z 2024-05-31 12:27:32,477 WARNING utils.py:580 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set `RAY_USE_MULTIPROCESSING_CPU_COUNT=1` as an env var before starting Ray. Set the env var: `RAY_DISABLE_DOCKER_CPU_WARNING=1` to mute this warning. 2024-05-31T12:27:32.478294511Z 2024-05-31 12:27:32,478 WARNING utils.py:592 -- Ray currently does not support initializing Ray with fractional cpus. Your num_cpus will be truncated from 108.8 to 108. 2024-05-31T12:27:32.713925698Z 2024-05-31 12:27:32,713 INFO worker.py:1749 -- Started a local Ray instance. 
2024-05-31T12:27:33.933507694Z INFO: Initializing the Aphrodite Engine (v0.5.3) with the following config: 2024-05-31T12:27:33.933530547Z INFO: Model = 'Zoyd/xxx777xxxASD_L3-SnowStorm-v1.15-4x8B-A-5_0bpw_exl2' 2024-05-31T12:27:33.933537440Z INFO: Speculative Config = None 2024-05-31T12:27:33.933540436Z INFO: DataType = torch.bfloat16 2024-05-31T12:27:33.933542930Z INFO: Model Load Format = auto 2024-05-31T12:27:33.933545595Z INFO: Number of GPUs = 2 2024-05-31T12:27:33.933548842Z INFO: Disable Custom All-Reduce = False 2024-05-31T12:27:33.933551386Z INFO: Quantization Format = exl2 2024-05-31T12:27:33.933553991Z INFO: Context Length = 4096 2024-05-31T12:27:33.933556686Z INFO: Enforce Eager Mode = True 2024-05-31T12:27:33.933559301Z INFO: KV Cache Data Type = fp8 2024-05-31T12:27:33.933561846Z INFO: KV Cache Params Path = None 2024-05-31T12:27:33.933564331Z INFO: Device = cuda 2024-05-31T12:27:33.933567156Z INFO: Guided Decoding Backend = 2024-05-31T12:27:33.933569801Z DecodingConfig(guided_decoding_backend='outlines') 2024-05-31T12:27:36.238747455Z Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 2024-05-31T12:27:41.171898375Z INFO: Using FlashAttention backend. 2024-05-31T12:27:41.810785583Z (RayWorkerAphrodite pid=6618) INFO: Using FlashAttention backend. 2024-05-31T12:27:41.810831449Z INFO: Aphrodite is using nccl==2.20.5 2024-05-31T12:37:42.149497323Z (RayWorkerAphrodite pid=6618) [rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600081 milliseconds before timing out. 2024-05-31T12:37:42.149538120Z (RayWorkerAphrodite pid=6618) [rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: 94338991729783. 2024-05-31T12:37:42.149692541Z (RayWorkerAphrodite pid=6618) [rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 2024-05-31T12:37:42.149710014Z (RayWorkerAphrodite pid=6618) [rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down. 2024-05-31T12:37:42.149713550Z (RayWorkerAphrodite pid=6618) [rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600081 milliseconds before timing out. 
2024-05-31T12:37:42.149717488Z (RayWorkerAphrodite pid=6618) Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): 2024-05-31T12:37:42.149720814Z (RayWorkerAphrodite pid=6618) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f27e4773897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so) 2024-05-31T12:37:42.149724892Z (RayWorkerAphrodite pid=6618) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f1d819801b2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) 2024-05-31T12:37:42.149729270Z (RayWorkerAphrodite pid=6618) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f1d81984fd0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) 2024-05-31T12:37:42.149732907Z (RayWorkerAphrodite pid=6618) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f1d8198631c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) 2024-05-31T12:37:42.149740331Z (RayWorkerAphrodite pid=6618) frame #4: + 0xdc253 (0x7f27f84c5253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6) 2024-05-31T12:37:42.149771530Z (RayWorkerAphrodite pid=6618) frame #5: + 0x94ac3 (0x7f27fa337ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) 2024-05-31T12:37:42.149896665Z (RayWorkerAphrodite pid=6618) frame #6: clone + 0x44 (0x7f27fa3c8a04 in /usr/lib/x86_64-linux-gnu/libc.so.6) 2024-05-31T12:37:42.149952170Z (RayWorkerAphrodite pid=6618) 2024-05-31T12:37:42.149957750Z (RayWorkerAphrodite pid=6618) [2024-05-31 12:37:42,071 E 6618 6779] logging.cc:101: Unhandled exception: N3c1016DistBackendErrorE. what(): [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600081 milliseconds before timing out. 
2024-05-31T12:37:42.149962369Z (RayWorkerAphrodite pid=6618) Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): 2024-05-31T12:37:42.149965836Z (RayWorkerAphrodite pid=6618) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f27e4773897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so) 2024-05-31T12:37:42.150232017Z (RayWorkerAphrodite pid=6618) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f1d819801b2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) 2024-05-31T12:37:42.150247416Z (RayWorkerAphrodite pid=6618) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f1d81984fd0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) 2024-05-31T12:37:42.150270239Z (RayWorkerAphrodite pid=6618) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f1d8198631c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) 2024-05-31T12:37:42.150273625Z (RayWorkerAphrodite pid=6618) frame #4: + 0xdc253 (0x7f27f84c5253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6) 2024-05-31T12:37:42.150277062Z (RayWorkerAphrodite pid=6618) frame #5: + 0x94ac3 (0x7f27fa337ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) 2024-05-31T12:37:42.150279847Z (RayWorkerAphrodite pid=6618) frame #6: clone + 0x44 (0x7f27fa3c8a04 in /usr/lib/x86_64-linux-gnu/libc.so.6) 2024-05-31T12:37:42.150283143Z (RayWorkerAphrodite pid=6618) 2024-05-31T12:37:42.150286550Z (RayWorkerAphrodite pid=6618) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): 2024-05-31T12:37:42.150289796Z (RayWorkerAphrodite pid=6618) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f27e4773897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so) 2024-05-31T12:37:42.150292441Z (RayWorkerAphrodite pid=6618) frame #1: + 0xe32e33 (0x7f1d81608e33 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) 2024-05-31T12:37:42.150295106Z (RayWorkerAphrodite pid=6618) frame #2: + 0xdc253 (0x7f27f84c5253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6) 2024-05-31T12:37:42.150297721Z (RayWorkerAphrodite pid=6618) frame #3: + 0x94ac3 (0x7f27fa337ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) 2024-05-31T12:37:42.150300366Z (RayWorkerAphrodite pid=6618) frame #4: clone + 0x44 (0x7f27fa3c8a04 in /usr/lib/x86_64-linux-gnu/libc.so.6) 2024-05-31T12:37:42.150303241Z (RayWorkerAphrodite pid=6618) 2024-05-31T12:37:42.150366039Z (RayWorkerAphrodite pid=6618) [2024-05-31 12:37:42,085 E 6618 6779] logging.cc:108: Stack trace: 2024-05-31T12:37:42.150378984Z (RayWorkerAphrodite pid=6618) /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0x100fc3a) [0x7f27f9624c3a] ray::operator<<() 2024-05-31T12:37:42.150382681Z (RayWorkerAphrodite pid=6618) /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0x10126f8) [0x7f27f96276f8] ray::TerminateHandler() 2024-05-31T12:37:42.150386107Z (RayWorkerAphrodite pid=6618) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c) [0x7f27f849720c] 2024-05-31T12:37:42.150388902Z (RayWorkerAphrodite pid=6618) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277) [0x7f27f8497277] 2024-05-31T12:37:42.150391768Z (RayWorkerAphrodite pid=6618) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae1fe) [0x7f27f84971fe] 2024-05-31T12:37:42.150394443Z (RayWorkerAphrodite pid=6618) /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0xe32ee4) 
[0x7f1d81608ee4] c10d::ProcessGroupNCCL::ncclCommWatchdog() 2024-05-31T12:37:42.150397980Z (RayWorkerAphrodite pid=6618) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f27f84c5253] 2024-05-31T12:37:42.150404842Z (RayWorkerAphrodite pid=6618) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f27fa337ac3] 2024-05-31T12:37:42.150407668Z (RayWorkerAphrodite pid=6618) /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x44) [0x7f27fa3c8a04] __clone 2024-05-31T12:37:42.150427816Z (RayWorkerAphrodite pid=6618) 2024-05-31T12:37:42.150435180Z (RayWorkerAphrodite pid=6618) *** SIGABRT received at time=1717159062 on cpu 36 *** 2024-05-31T12:37:42.150457391Z (RayWorkerAphrodite pid=6618) PC: @ 0x7f27fa3399fc (unknown) pthread_kill 2024-05-31T12:37:42.150460948Z (RayWorkerAphrodite pid=6618) @ 0x7f27fa2e5520 (unknown) (unknown) 2024-05-31T12:37:42.150486627Z (RayWorkerAphrodite pid=6618) [2024-05-31 12:37:42,085 E 6618 6779] logging.cc:365: *** SIGABRT received at time=1717159062 on cpu 36 *** 2024-05-31T12:37:42.150509970Z (RayWorkerAphrodite pid=6618) [2024-05-31 12:37:42,085 E 6618 6779] logging.cc:365: PC: @ 0x7f27fa3399fc (unknown) pthread_kill 2024-05-31T12:37:42.150518286Z (RayWorkerAphrodite pid=6618) [2024-05-31 12:37:42,085 E 6618 6779] logging.cc:365: @ 0x7f27fa2e5520 (unknown) (unknown) 2024-05-31T12:37:42.150534667Z (RayWorkerAphrodite pid=6618) Fatal Python error: Aborted 2024-05-31T12:37:42.150552070Z (RayWorkerAphrodite pid=6618) 2024-05-31T12:37:42.150567399Z (RayWorkerAphrodite pid=6618) 2024-05-31T12:37:42.150613135Z (RayWorkerAphrodite pid=6618) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, charset_normalizer.md, uvloop.loop, ray._raylet, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, sentencepiece._sentencepiece, scipy._lib._ccallback_c, PIL._imaging (total: 32) ```
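The watchdog fires on the very first collective (SeqNum=1, BROADCAST), which suggests the other rank never reached the call, e.g. it was still loading or sharding weights when the 10-minute NCCL timeout expired. A speculative way to gather more signal, not a confirmed fix: enable NCCL's own logging and, as a guess specific to dual consumer 4090s, disable peer-to-peer transfers before relaunching. Both environment variables are standard NCCL knobs; whether they help here is an assumption.

```bash
# Speculative diagnostics only; not a verified workaround for this issue.
export NCCL_DEBUG=INFO        # make NCCL print its topology/transport decisions and any init errors
export NCCL_P2P_DISABLE=1     # guess: avoid direct GPU-to-GPU paths, which can misbehave on consumer cards

# Relaunch the same command as above and watch for NCCL INFO/WARN lines before the hang.
python3 -m aphrodite.endpoints.openai.api_server \
  --model Zoyd/xxx777xxxASD_L3-SnowStorm-v1.15-4x8B-A-5_0bpw_exl2 \
  --revision 4d07fb963adef3bbb309dfac2179cfb195971f73 \
  --quantization exl2 --kv-cache-dtype fp8 \
  --tensor-parallel-size 2 --max-model-len 4096
```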

I also tested --load-in-8bit with the original (non-exl2) weights and got the same error; roughly the invocation sketched below. I'm not sure whether this happens because Aphrodite doesn't support uncommon MoE architectures, but I figured it can't hurt to report it in case it is indeed a bug. Please feel free to close this if it's a wontfix for any reason, of course. Thanks!
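For completeness, a rough sketch of that 8-bit retest; the model path is a placeholder, since the original (unquantized) repository id isn't stated in this report, and the remaining flags are carried over from the template above:

```bash
# Hypothetical retest with on-the-fly 8-bit loading instead of the exl2 quant.
# <original-unquantized-repo> is a placeholder; substitute the source weights' repo id.
python3 -m aphrodite.endpoints.openai.api_server \
  --model <original-unquantized-repo> \
  --load-in-8bit \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 2 --max-model-len 4096
```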

AlpinDale commented 2 months ago

Can you test this again with v0.6.0, please?