Closed superobk closed 2 weeks ago
In our tests, web_ability_demo can run normally on 2*A800. You can check your configuration or provide more details so that we can discuss further.
May I ask whether your environment is using vllm-flash-attn v0.5.5 and CUDA 12.1?
The CUDA version is 12.2, vllm is 0.5.5, and vllm-flash-attn is 2.6.1. For other environment details, please refer to issue #20.
I compared my environment against the packages in issue #20, aligned them, and ran again, but the same OOM error happened, as below:
"
(vita_demo) [root@localhost VITA]# python -m web_demo.web_ability_demo demo_VITA_ckpt/
WARNING 09-14 00:56:36 config.py:1563] Casting torch.bfloat16 to torch.float16.
INFO 09-14 00:56:36 llm_engine.py:184] Initializing an LLM engine (v0.5.5) with config: model='demo_VITA_ckpt/', speculative_config=None, tokenizer='demo_VITA_ckpt/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=demo_VITA_ckpt/, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 09-14 00:56:37 model_runner.py:879] Starting to load model demo_VITA_ckpt/...
[rank0]: Traceback (most recent call last):
[rank0]:   File "/data/anaconda3/envs/vita_demo/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/data/anaconda3/envs/vita_demo/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/home/VITA/web_demo/web_ability_demo.py", line 361, in <module>
[rank0]:     main(args.model_path)
[rank0]:   File "/home/VITA/web_demo/web_ability_demo.py", line 340, in main
[rank0]:     llm = LLM(
[rank0]:   File "/data/anaconda3/envs/vita_demo/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 175, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:   File "/data/anaconda3/envs/vita_demo/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 473, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/data/anaconda3/envs/vita_demo/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 270, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/data/anaconda3/envs/vita_demo/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 46, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/data/anaconda3/envs/vita_demo/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 39, in _init_executor
[rank0]:     self.driver_worker.load_model()
[rank0]:   File "/data/anaconda3/envs/vita_demo/lib/python3.10/site-packages/vllm/worker/worker.py", line 182, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/data/anaconda3/envs/vita_demo/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 881, in load_model
[rank0]:     self.model = get_model(model_config=self.model_config,
[rank0]:   File "/data/anaconda3/envs/vita_demo/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:   File "/data/anaconda3/envs/vita_demo/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 341, in load_model
[rank0]:     model = _initialize_model(model_config, self.load_config,
[rank0]:   File "/data/anaconda3/envs/vita_demo/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 170, in _initialize_model
[rank0]:     return build_model(
[rank0]:   File "/data/anaconda3/envs/vita_demo/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 155, in build_model
[rank0]:     return model_class(config=hf_config,
[rank0]:   File "/data/anaconda3/envs/vita_demo/lib/python3.10/site-packages/vllm/model_executor/models/mixtral.py", line 933, in __init__
[rank0]:     self.language_model = MixtralModel(config.text_config, cache_config,
[rank0]:   File "/data/anaconda3/envs/vita_demo/lib/python3.10/site-packages/vllm/model_executor/models/mixtral.py", line 592, in __init__
[rank0]:     self.start_layer, self.end_layer, self.layers = make_layers(
[rank0]:   File "/data/anaconda3/envs/vita_demo/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 195, in make_layers
[rank0]:     [PPMissingLayer() for _ in range(start_layer)] + [
[rank0]:   File "/data/anaconda3/envs/vita_demo/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 196, in <listcomp>
[rank0]:     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
[rank0]:   File "/data/anaconda3/envs/vita_demo/lib/python3.10/site-packages/vllm/model_executor/models/mixtral.py", line 594, in <lambda>
[rank0]:     lambda prefix: MixtralDecoderLayer(
[rank0]:   File "/data/anaconda3/envs/vita_demo/lib/python3.10/site-packages/vllm/model_executor/models/mixtral.py", line 528, in __init__
[rank0]:     self.block_sparse_moe = MixtralMoE(
[rank0]:   File "/data/anaconda3/envs/vita_demo/lib/python3.10/site-packages/vllm/model_executor/models/mixtral.py", line 405, in __init__
[rank0]:     self.experts = FusedMoE(num_experts=num_experts,
[rank0]:   File "/data/anaconda3/envs/vita_demo/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 194, in __init__
[rank0]:     self.quant_method.create_weights(
[rank0]:   File "/data/anaconda3/envs/vita_demo/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 41, in create_weights
[rank0]:     w13_weight = torch.nn.Parameter(torch.empty(num_experts,
[rank0]:   File "/data/anaconda3/envs/vita_demo/lib/python3.10/site-packages/torch/utils/_device.py", line 79, in __torch_function__
[rank0]:     return func(*args, **kwargs)
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.75 GiB. GPU 0 has a total capacity of 79.33 GiB of which 1.31 GiB is free. Including non-PyTorch memory, this process has 78.01 GiB memory in use. Of the allocated memory 77.51 GiB is allocated by PyTorch, and 17.86 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
"
Should I change parameters to make the memory usage smaller?
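As a sanity check before tuning parameters: assuming the checkpoint's language model is Mixtral-8x7B-class with roughly 46.7B total parameters (an assumption about the checkpoint, not something stated in the log), the fp16 weights alone would exceed a single 80 GB card, so the OOM with tensor_parallel_size=1 is expected:

```python
# Back-of-envelope estimate. ASSUMPTION: ~46.7e9 total parameters
# (Mixtral-8x7B-class); fp16 stores 2 bytes per parameter.
total_params = 46.7e9
weight_gib = total_params * 2 / 2**30  # bytes -> GiB
gpu_capacity_gib = 79.33               # capacity reported in the traceback

print(f"fp16 weights: {weight_gib:.1f} GiB")
print("fits on a single GPU:", weight_gib < gpu_capacity_gib)
```

In other words, no single-GPU parameter tweak can help here; the weights must be sharded across GPUs.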
When changing to tensor_parallel_size=4 on a 4×80G machine, it fails with the illegal memory access error below. :( Please kindly assist.
(vita_demo) [root@localhost VITA]# python -m web_demo.web_ability_demo demo_VITA_ckpt/ WARNING 09-14 00:56:36 config.py:1563] Casting torch.bfloat16 to torch.float16. INFO 09-14 00:56:36 llm_engine.py:184] Initializing an LLM engine (v0.5.5) with config: model='demo_VITA_ckpt/', speculative_config=None, tokenizer='demo_VITA_ckpt/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto,
In the config you loaded, tensor_parallel_size=1. You should check your configuration to ensure that the value of tensor_parallel_size is loaded correctly.
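For reference, this is roughly where the value needs to take effect. The sketch below is hypothetical (the keyword arguments are real vLLM 0.5.x parameters, but the surrounding call is illustrative, not the demo's actual code at web_demo/web_ability_demo.py line 340); if the setting reaches the engine, the startup log should print tensor_parallel_size=2 instead of 1:

```python
from vllm import LLM

# Hypothetical sketch of the LLM construction inside the demo;
# the demo's actual call site may pass different arguments.
llm = LLM(
    model="demo_VITA_ckpt/",
    trust_remote_code=True,
    tensor_parallel_size=2,  # shard weights across both 80 GB GPUs
    dtype="float16",
)
```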
We have 2× A100-80G in our environment, but we still hit an "OutOfMemoryError" when running the first demo.
" torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.75 GiB. GPU 0 has a total capacity of 79.15 GiB of which 233.44 MiB is free. Process 2199302 has 420.00 MiB memory in use. Process 2347299 has 78.51 GiB memory in use. Of the allocated memory 77.51 GiB is allocated by PyTorch, and 17.86 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. "
Is there any method or parameter to adjust that would reduce the demo's VRAM usage, so that it can run in a smaller-scale GPU environment? Please kindly advise and share. Thanks!
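A rough estimate of how tensor-parallel sharding changes per-GPU weight memory (caveat: ~46.7B total fp16 parameters is an assumption about the checkpoint, and this counts weights only, not KV cache or activations):

```python
# ASSUMPTION: ~46.7e9 fp16 parameters; weights only, ignoring KV cache.
total_gib = 46.7e9 * 2 / 2**30
budget_gib = 79.33 * 0.9  # vLLM claims gpu_memory_utilization (default 0.9) of each card

for tp in (1, 2, 4):
    per_gpu = total_gib / tp
    print(f"tensor_parallel_size={tp}: {per_gpu:.1f} GiB/GPU, "
          f"within budget: {per_gpu < budget_gib}")
```

Beyond sharding, the usual vLLM levers for shrinking memory are lowering max_model_len (the log shows max_seq_len=32768, which inflates the KV cache reservation) and setting enforce_eager=True to skip CUDA graph capture; whether those suffice for your hardware is something you would need to verify.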