PygmalionAI / aphrodite-engine

PygmalionAI's large-scale inference engine
https://pygmalion.chat
GNU Affero General Public License v3.0

[Usage]: odd interaction between GPU count and tensor parallelism #426

Closed: puppetm4st3r closed this issue 3 weeks ago

puppetm4st3r commented 3 weeks ago

Your current environment

For some reason, when executing the environment-collection script on a fresh new server, I get:

Collecting environment information...
Traceback (most recent call last):
  File "/home/dario/work/dm/Dolf/server/env.py", line 623, in <module>
    main()
  File "/home/dario/work/dm/Dolf/server/env.py", line 600, in main
    output = get_pretty_env_info()
  File "/home/dario/work/dm/Dolf/server/env.py", line 595, in get_pretty_env_info
    return pretty_str(get_env_info())
  File "/home/dario/work/dm/Dolf/server/env.py", line 404, in get_env_info
    pip_version, pip_list_output = get_pip_packages(run_lambda)
  File "/home/dario/work/dm/Dolf/server/env.py", line 374, in get_pip_packages
    out = run_with_pip([sys.executable, '-mpip'])
  File "/home/dario/work/dm/Dolf/server/env.py", line 370, in run_with_pip
    return "\n".join(line for line in out.splitlines()
AttributeError: 'NoneType' object has no attribute 'splitlines'
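For what it's worth, the crash is `run_with_pip()` calling `.splitlines()` on a `None` result when the pip subprocess fails. A hedged sketch of a guard (the function name follows the traceback; `run_pip` and the internals are assumptions, not the script's real API):

```python
# Hedged sketch of a guard for the collect-env crash above: the pip
# subprocess can return None (e.g. pip missing or failing), and env.py
# then calls .splitlines() on it. `run_pip` stands in for env.py's
# subprocess helper and is an assumption, not the script's real API.
def get_pip_packages_safe(run_pip):
    out = run_pip()  # may be None if `pip list` failed to run
    if out is None:
        return "could not collect pip packages"
    return "\n".join(out.splitlines())

print(get_pip_packages_safe(lambda: None))          # falls back gracefully
print(get_pip_packages_safe(lambda: "torch==2.1"))  # normal path
```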

How would you like to use Aphrodite?

I want to run a model across 3 GPUs, but many models have an attention head count that is a multiple of 2 (not 3), so I constantly get the stack trace below:

Is there a way to shard a model with an asymmetric layer distribution in order to use 3, 5, or 7 GPUs? Best regards.

Starting Aphrodite Engine API server...
+ exec python3 -m aphrodite.endpoints.openai.api_server --host 0.0.0.0 --port 7860 --download-dir /data/hub --model TheBloke/laser-dolphin-mixtral-2x7b-dpo-GPTQ --dtype float16 --kv-cache-dtype fp8_e5m2 --max-model-len 12000 --tensor-parallel-size 3 --gpu-memory-utilization 0.97 --enforce-eager --launch-kobold-api --port 3000 --trust-remote-code --disable-log-stats --api-keys 123 --block-size 8 --max-paddings 512 --swap-space 10 --chat-template /home/workspace/chat_templates/chat_ml.jinja --served-model-name dolf --max-context-len-to-capture 512 --max-num-batched-tokens 24000 --max-num-seqs 46 --quantization gptq
WARNING:  Launching Kobold API server in addition to OpenAI. Keep in mind that 
the Kobold API routes are NOT protected via the API key.
WARNING:  Admin key not provided. Admin operations will be disabled.
WARNING:  Casting torch.bfloat16 to torch.float16.
WARNING:  gptq quantization is not fully optimized yet. The speed can be slower 
than non-quantized models.
INFO:     Using fp8_e5m2 data type to store kv cache. It reduces the GPU memory 
footprint and boosts the performance. But it may cause slight accuracy drop. 
Currently we only support fp8 without scaling factors and make e5m2 as a default
format.
2024-04-23 23:39:14,522 INFO worker.py:1724 -- Started a local Ray instance.
INFO:     Initializing the Aphrodite Engine (v0.5.2) with the following config:
INFO:     Model = 'TheBloke/laser-dolphin-mixtral-2x7b-dpo-GPTQ'
INFO:     DataType = torch.float16
INFO:     Model Load Format = auto
INFO:     Number of GPUs = 3
INFO:     Disable Custom All-Reduce = False
INFO:     Quantization Format = gptq
INFO:     Context Length = 12000
INFO:     Enforce Eager Mode = True
INFO:     KV Cache Data Type = fp8_e5m2
INFO:     KV Cache Params Path = None
INFO:     Device = cuda
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/app/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 599, in <module>
    engine = AsyncAphrodite.from_engine_args(engine_args)
  File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 676, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 341, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 410, in _init_engine
    return engine_class(*args, **kwargs)
  File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 102, in __init__
    self._verify_args()
  File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 290, in _verify_args
    self.model_config.verify_with_parallel_config(self.parallel_config)
  File "/app/aphrodite-engine/aphrodite/common/config.py", line 282, in verify_with_parallel_config
    raise ValueError(
ValueError: Total number of attention heads (32) must be divisible by tensor parallel size (3).

AlpinDale commented 3 weeks ago

Hi, unfortunately not. This is a hard limitation of tensor parallelism: the attention heads are split evenly across the GPUs, so the head count must be divisible by the tensor-parallel size. The only way we could overcome this would be pipeline parallelism, but that's not implemented as of now.
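For illustration, here is a minimal sketch of the kind of check that raises the error above (illustrative names; per the traceback, the real check lives in aphrodite/common/config.py):

```python
# Minimal sketch of the tensor-parallel head check (illustrative names,
# not Aphrodite's exact internals). Each GPU gets num_heads // tp_size
# whole attention heads, so the split only works when the division is
# exact -- there is no way to place a fraction of a head on a GPU.
def heads_per_gpu(total_num_heads: int, tensor_parallel_size: int) -> int:
    if total_num_heads % tensor_parallel_size != 0:
        raise ValueError(
            f"Total number of attention heads ({total_num_heads}) must be "
            f"divisible by tensor parallel size ({tensor_parallel_size}).")
    return total_num_heads // tensor_parallel_size

print(heads_per_gpu(32, 2))  # 16 heads per GPU: works
print(heads_per_gpu(32, 4))  # 8 heads per GPU: works
print(heads_per_gpu(32, 3))  # raises ValueError, as in the log above
```

So for a 32-head model, only tensor-parallel sizes of 1, 2, 4, 8, 16, or 32 are usable.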

puppetm4st3r commented 3 weeks ago

Thanks. I'll have to move to another 3090 for the server 🥲