PygmalionAI / aphrodite-engine

Large-scale LLM inference engine
https://aphrodite.pygmalion.chat
GNU Affero General Public License v3.0

[Bug]: With tp > 1, model never loads, and there's hardly any CPU utilization #607

Open BlairSadewitz opened 2 weeks ago

BlairSadewitz commented 2 weeks ago

Your current environment

The output of `python env.py`

PyTorch version: 2.3.1
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (conda-forge gcc 11.3.0-19) 11.3.0
Clang version: Could not collect
CMake version: version 3.30.2
Libc version: glibc-2.35
Python version: 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-6.5.0-45-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA H100 NVL
GPU 1: NVIDIA H100 NVL

Nvidia driver version: 550.107.02
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Vendor ID: AuthenticAMD
Model name: AMD EPYC 9374F 32-Core Processor
CPU family: 25
Model: 17
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 2
Stepping: 1
Frequency boost: enabled
CPU max MHz: 4304.9312
CPU min MHz: 1500.0000
BogoMIPS: 7688.39
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total

πŸ› Describe the bug

On the rc_054 branch, loading a model with tp > 1 seems to just sit idle. That is:

with -tp 2:

WARNING: Launching Kobold API server in addition to OpenAI. Keep in mind that the Kobold API routes are NOT protected via the API key.
INFO: Defaulting to use mp for distributed inference.
INFO: -------------------------------------------------------------------------------------
INFO: Initializing Aphrodite Engine (v0.5.4-dev commit 848731f) with the following config:
INFO: Model = 'KoboldAI/GPT-NeoX-20B-Erebus'
INFO: DataType = torch.float16
INFO: Tensor Parallel Size = 2
INFO: Pipeline Parallel Size = 1
INFO: Disable Custom All-Reduce = False
INFO: Context Length = 2048
INFO: Enforce Eager Mode = True
INFO: Prefix Caching = False
INFO: Device = device(type='cuda')
INFO: Guided Decoding Backend = DecodingConfig(guided_decoding_backend='outlines')
INFO: -------------------------------------------------------------------------------------
/root/micromamba/envs/aphrodite/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
WARNING: Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(AphroditeWorkerProcess pid=25024) INFO: Worker ready; awaiting tasks
INFO: generating GPU P2P access cache in /root/.config/aphrodite/gpu_p2p_access_cache_for_0,1.json

from top(1):

24897 root 20 0 25.5g 1.0g 449536 S 0.7 0.1 0:09.88 pt_main_thread
25024 root 20 0 25.2g 779388 189016 S 0.7 0.0 0:03.11 pt_main_thread

[ ... and she just kinda sits there ]

With -tp 1:

INFO: -------------------------------------------------------------------------------------
INFO: Initializing Aphrodite Engine (v0.5.4-dev commit 848731f) with the following config:
INFO: Model = 'KoboldAI/GPT-NeoX-20B-Erebus'
INFO: DataType = torch.float16
INFO: Tensor Parallel Size = 1
INFO: Pipeline Parallel Size = 1
INFO: Disable Custom All-Reduce = False
INFO: Context Length = 2048
INFO: Enforce Eager Mode = True
INFO: Prefix Caching = False
INFO: Device = device(type='cuda')
INFO: Guided Decoding Backend = DecodingConfig(guided_decoding_backend='outlines')
INFO: -------------------------------------------------------------------------------------
/root/micromamba/envs/aphrodite/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
INFO: Loading model KoboldAI/GPT-NeoX-20B-Erebus...
INFO: Using model weights format ['*.bin']
Loading pt checkpoint shards: [... she behaves as expected]

AlpinDale commented 2 weeks ago

I've noticed the same thing happening sometimes, and I haven't really figured out why. The only solution seems to be downloading the model to disk and loading it from there. Can you try spamming Ctrl+C the next time this happens, so we can see where it's hanging?

BlairSadewitz commented 1 week ago

Yeah, I'll give it a go. What do you mean by downloading it to disk? I am using huggingface_hub to download it.

AlpinDale commented 1 week ago

huggingface-cli download <model id> --local-dir /path/to/download/dir/model-name

Then load from that directory.
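
If you'd rather stay in Python, something along these lines with huggingface_hub's snapshot_download should be equivalent (just a sketch; the local_dir below is a placeholder):

```python
# Roughly equivalent to the CLI command above: fetch the whole repo into a local
# directory, then point `aphrodite run` at that path. local_dir is a placeholder.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="KoboldAI/GPT-NeoX-20B-Erebus",
    local_dir="/path/to/download/dir/model-name",
)
print(local_dir)  # pass this directory as the model argument
```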

BlairSadewitz commented 1 week ago

I was playing around with different quantizations, so I had this one ready to go. It behaves the same way if there's no quantization, though.

Also happens if I use ray as the backend. Does NOT happen if I'm only using one GPU.

root@C.12233482:~$ aphrodite run --launch-kobold-api --quantization fbgemm_fp8 -tp 2 ./Llama-3.1-70B-Instruct-Lorablated-Creative-Writer-fbgemm_fp8/
WARNING: Casting torch.bfloat16 to torch.float16.
INFO: Defaulting to use mp for distributed inference.
WARNING: The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
INFO: -------------------------------------------------------------------------------------
INFO: Initializing Aphrodite Engine (v0.5.4-dev commit 208cd540) with the following config:
INFO: Model = './Llama-3.1-70B-Instruct-Lorablated-Creative-Writer-fbgemm_fp8/'
INFO: DataType = torch.bfloat16
INFO: Tensor Parallel Size = 2
INFO: Pipeline Parallel Size = 1
INFO: Disable Custom All-Reduce = False
INFO: Quantization Format = 'fbgemm_fp8'
INFO: Context Length = 131072
INFO: Enforce Eager Mode = True
INFO: Prefix Caching = False
INFO: Device = device(type='cuda')
INFO: Guided Decoding Backend = DecodingConfig(guided_decoding_backend='outlines')
INFO: -------------------------------------------------------------------------------------
WARNING: Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(AphroditeWorkerProcess pid=12398) INFO: Worker ready; awaiting tasks
INFO: generating GPU P2P access cache in /root/.config/aphrodite/gpu_p2p_access_cache_for_0,1.json
^CINFO: Terminating local Aphrodite worker processes
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run [53/1928] return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 633, in run_server
async with build_async_engine_client(args) as async_engine_client:
File "/opt/conda/lib/python3.11/contextlib.py", line 210, in aenter
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 129, in build_async_engine_client
await async_engine_client.setup()
File "/root/aphrodite-engine/aphrodite/endpoints/openai/rpc/client.py", line 34, in setup
await self.wait_for_server()
File "/root/aphrodite-engine/aphrodite/endpoints/openai/rpc/client.py", line 117, in wait_for_server
await self._send_one_way_rpc_request(
File "/root/aphrodite-engine/aphrodite/endpoints/openai/rpc/client.py", line 97, in _send_one_way_rpc_request
response = cloudpickle.loads(await socket.recv())
^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/conda/bin/aphrodite", line 8, in
sys.exit(main())
^^^^^^
File "/root/aphrodite-engine/aphrodite/endpoints/cli.py", line 205, in main
args.dispatch_function(args)
File "/root/aphrodite-engine/aphrodite/endpoints/cli.py", line 31, in serve
asyncio.run(run_server(args))
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 123, in run
raise KeyboardInterrupt()
KeyboardInterrupt
Process Process-1:
Traceback (most recent call last):
File "/root/aphrodite-engine/aphrodite/distributed/device_communicators/custom_all_reduce_utils.py", line 216, in gpu_p2p_access_check
returned.check_returncode()
File "/opt/conda/lib/python3.11/subprocess.py", line 502, in check_returncode
raise CalledProcessError(self.returncode, self.args, self.stdout,
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '/root/aphrodite-engine/aphrodite/distributed/device_communicators/custom_all_reduce_utils.py']' died with <Signals.SIGINT: 2>.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/opt/conda/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap self.run()
File "/opt/conda/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, self._kwargs)
File "/root/aphrodite-engine/aphrodite/endpoints/openai/rpc/server.py", line 204, in run_rpc_server server = AsyncEngineRPCServer(async_engine_args, port)
File "/root/aphrodite-engine/aphrodite/endpoints/openai/rpc/server.py", line 204, in run_rpc_server [0/1928] server = AsyncEngineRPCServer(async_engine_args, port)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/endpoints/openai/rpc/server.py", line 23, in init
self.engine = AsyncAphrodite.from_engine_args(async_engine_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 470, in from_engine_args
engine = cls(
^^^^
File "/root/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 379, in init
self.engine = self._init_engine(*args, *kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 550, in _init_engine
return engine_class(
*args,
**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 243, in init
self.model_executor = executor_class(
^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/executor/multiproc_gpu_executor.py", line 212, in init
super().init(*args, kwargs)
File "/root/aphrodite-engine/aphrodite/executor/distributed_gpu_executor.py", line 24, in init
super().init(*args, *kwargs)
File "/root/aphrodite-engine/aphrodite/executor/executor_base.py", line 47, in init
self._init_executor()
File "/root/aphrodite-engine/aphrodite/executor/multiproc_gpu_executor.py", line 136, in _init_executor
self._run_workers("init_device")
File "/root/aphrodite-engine/aphrodite/executor/multiproc_gpu_executor.py", line 189, in _run_workers
driver_worker_output = driver_worker_method(
*args,
**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/task_handler/worker.py", line 139, in init_device
init_worker_distributed_environment(self.parallel_config, self.rank,
File "/root/aphrodite-engine/aphrodite/task_handler/worker.py", line 370, in init_worker_distributed_environment
ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
File "/root/aphrodite-engine/aphrodite/distributed/parallel_state.py", line 962, in ensure_model_parallel_initialized
initialize_model_parallel(tensor_model_parallel_size,
File "/root/aphrodite-engine/aphrodite/distributed/parallel_state.py", line 928, in initialize_model_parallel _TP = init_model_parallel_group(group_ranks,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/root/aphrodite-engine/aphrodite/distributed/parallel_state.py", line 773, in init_model_parallel_group
return GroupCoordinator(
^^^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/distributed/parallel_state.py", line 164, in init
self.ca_comm = CustomAllreduce(
^^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/distributed/device_communicators/custom_all_reduce.py", line 126, in init
if not _can_p2p(rank, world_size):
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/distributed/device_communicators/custom_all_reduce.py", line 29, in _can_p2p
if not gpu_p2p_access_check(rank, i):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/distributed/device_communicators/custom_all_reduce_utils.py", line 219, in gpu_p2p_access_check
raise RuntimeError(
RuntimeError: Error happened when batch testing peer-to-peer access from (0, 0, 1, 1) to (0, 1, 0, 1)

AlpinDale commented 1 week ago

I recommend running with --disable-custom-all-reduce to fix this. It seems to be some weird issue with torch's p2p check, and it's outside of Aphrodite's control.
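
If you want to sanity-check what the driver reports outside of Aphrodite, a plain-PyTorch probe along these lines should show it (just a sketch, not Aphrodite's own batched p2p test):

```python
# Minimal sketch: ask PyTorch/CUDA whether each GPU pair reports peer (P2P) access.
# Plain torch only; this is not Aphrodite's batched check from custom_all_reduce_utils.
import torch

n = torch.cuda.device_count()
for src in range(n):
    for dst in range(n):
        if src != dst:
            ok = torch.cuda.can_device_access_peer(src, dst)
            print(f"GPU {src} -> GPU {dst}: p2p {'available' if ok else 'NOT available'}")
```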

BlairSadewitz commented 1 week ago

OK, I'll see if that works.

BlairSadewitz commented 1 week ago

OK, that seems to take care of that with 0.6.0. I think I ran into a buglet, though; see below. It happens whether or not I override the default port. The difference between the two cases is that, if I override the default, the server actually starts, like this (and this happens whether or not I use ray as the backend):

aphrodite run NobodySpecial/Llama-3.1-70B-Instruct-Lorablated-Creative-Writer --disable-custom-all-reduce --enforce-eager False -tp 2 --launch-kobold-api -host 127.0.0.1 --port 6969

WARNING: Casting torch.bfloat16 to torch.float16.
WARNING: The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
2024-09-03 16:18:19,195 WARNING utils.py:580 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set RAY_USE_MULTIPROCESSING_CPU_COUNT=1 as an env var before starting Ray. Set the env var: RAY_DISABLE_DOCKER_CPU_WARNING=1 to mute this warning.
2024-09-03 16:18:19,195 WARNING utils.py:592 -- Ray currently does not support initializing Ray with fractional cpus. Your num_cpus will be truncated from 30.71999 to 30.
2024-09-03 16:18:19,348 INFO worker.py:1783 -- Started a local Ray instance.
INFO: -------------------------------------------------------------------------------------
INFO: Initializing Aphrodite Engine (v0.6.0 commit 54d6d87f) with the following config:
INFO: Model = './Llama-3.1-70B-Instruct-Lorablated-Creative-Writer-FP8-Dynamic'
INFO: DataType = torch.bfloat16
INFO: Tensor Parallel Size = 2
INFO: Pipeline Parallel Size = 1
INFO: Disable Custom All-Reduce = True
INFO: Quantization Format = 'compressed-tensors'
INFO: Context Length = 131072
INFO: Enforce Eager Mode = True
INFO: Prefix Caching = False
INFO: Device = device(type='cuda')
INFO: Guided Decoding Backend = DecodingConfig(guided_decoding_backend='outlines')
INFO: -------------------------------------------------------------------------------------
INFO: use_ray_spmd_worker: False
INFO: driver_ip: 172.17.0.3
INFO: Port 2242 is already in use, trying port 2243
INFO: Loading model ./Llama-3.1-70B-Instruct-Lorablated-Creative-Writer-FP8-Dynamic...
Loading safetensors checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 15/15 [00:07<00:00, 2.04it/s]
INFO: Model loaded in 7.66 seconds.
INFO: Weights memory usage: 33.88 GiB x 2 = 67.76 GiB
INFO: Profiling peak memory usage...
INFO: Model profiling took 20.05 seconds.
INFO: KV Cache memory usage for 131072 tokens: 26.88 x 2 = 53.76 GB
INFO: # GPU blocks: 10153, # CPU blocks: 1638
INFO: Minimum concurrency: 1.24x
INFO: Maximum sequence length allowed in the cache: 162448
WARNING: Launching Kobold API server in addition to OpenAI. Keep in mind that the Kobold API routes are NOT protected via the API key.
WARNING: embedding_mode is False. Embedding API will not work.
INFO: Started server process [22908]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:6969 (Press CTRL+C to quit)
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.

If I don't override the default, the output is the same as above up to the point where uvicorn starts. Then it throws an exception and terminates:

ERROR: [Errno 98] error while attempting to bind on address ('::', 2242, 0, 0): address already in use

And here's the traceback:

INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Gracefully stopping http server
INFO: Aphrodite ZMQ RPC Server was interrupted.
/root/micromamba/envs/aphrodite/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
^CINFO: Shutting down
unhandled exception during asyncio.run() shutdown
task: <Task finished name='Task-1' coro=<run_server() done, defined at /root/aphrodite-engine/aphrodite/endpoints/openai/api_server.py:631> exception=AttributeError("'Server' object has no attribute 'servers'")>
Traceback (most recent call last):
File "/root/micromamba/envs/aphrodite/lib/python3.11/site-packages/uvicorn/server.py", line 162, in startup
server = await loop.create_server(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/micromamba/envs/aphrodite/lib/python3.11/asyncio/base_events.py", line 1536, in create_server
raise OSError(err.errno, msg) from None
OSError: [Errno 98] error while attempting to bind on address ('::', 2242, 0, 0): address already in use

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/root/micromamba/envs/aphrodite/lib/python3.11/site-packages/uvicorn/server.py", line 69, in serve await self._serve(sockets) File "/root/micromamba/envs/aphrodite/lib/python3.11/site-packages/uvicorn/server.py", line 84, in _serve await self.startup(sockets=sockets) File "/root/micromamba/envs/aphrodite/lib/python3.11/site-packages/uvicorn/server.py", line 172, in startup sys.exit(1) SystemExit: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/root/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 650, in run_server await shutdown_task File "/root/micromamba/envs/aphrodite/lib/python3.11/site-packages/uvicorn/server.py", line 261, in shutdown for server in self.servers: ^^^^^^^^^^^^ AttributeError: 'Server' object has no attribute 'servers' ^C (aphrodite) root@C.12263514:~$

If you need anything else, let me know. Thanks.

AlpinDale commented 1 week ago

Right, this seems like an issue with how we're parsing env variables. I might've messed up the default assignment for the port. I'll make a hotfix later and do a post1 release.
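
For illustration only, the kind of fallback being discussed would look roughly like this (the env var name and the 2242 default below are assumptions taken from the logs above, not Aphrodite's actual code):

```python
# Illustrative sketch only -- APHRODITE_PORT is a hypothetical variable name and
# 2242 is the default seen in the logs above; this is not the project's real code.
import os

def resolve_port(cli_port: int | None) -> int:
    if cli_port is not None:          # an explicit --port should always win
        return cli_port
    return int(os.environ.get("APHRODITE_PORT", 2242))  # env var, then default
```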