eedmond opened this issue 1 month ago
It seems Mistral v0.3 incorrectly put `consolidated.safetensors` in the repo; you need to remove that file.
OK, thanks. It keeps trying to re-download this file, so to keep it simple, I'll just try Llama3, which is smaller.
Regarding continuous batching, I don't see much in the docs outlining how to accomplish this (I want to execute more prompts than would fit in memory, for example). Is there a good resource on how to loop and constantly add parallel prompts? Does it work to simply call `llm.generate` in a loop with a certain number of prompts?
Thanks!
You can point `model` to the local cache (usually under `~/.cache/huggingface`) after removing that file. By design, Aphrodite reads all `*.safetensors` files in the repo, and that file is superfluous.
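Concretely, the cleanup looks something like this. The snapshot path below is illustrative (a temp directory stands in for the real Hugging Face cache directory, whose exact hash-named path varies per machine):

```shell
# Stand-in for ~/.cache/huggingface/hub/models--mistralai--Mistral-7B-v0.3/snapshots/<hash>
SNAPSHOT=$(mktemp -d)

# Simulate the repo contents: sharded weights plus the superfluous file.
touch "$SNAPSHOT/model-00001-of-00003.safetensors" "$SNAPSHOT/consolidated.safetensors"

# Remove the duplicate consolidated weights so Aphrodite only sees the shards.
rm "$SNAPSHOT/consolidated.safetensors"
ls "$SNAPSHOT"
```

After this, passing the snapshot directory as `model` should load only the sharded `*.safetensors` files.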
If you want continuous batching without using the API server, you need to use the `AsyncAphrodite` class in your program instead of the `LLM` class: https://github.com/PygmalionAI/aphrodite-engine/blob/0178b4d97682dc165ecba184e7db509776847e33/aphrodite/engine/async_aphrodite.py#L281
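The submission pattern around `AsyncAphrodite` can be sketched with plain asyncio. Since the real engine needs a GPU, the snippet below uses a stub coroutine in place of `AsyncAphrodite.generate`, and the idea that you construct the engine via `from_engine_args` and iterate a per-request async generator is an assumption based on the linked source; only the looping pattern itself is the point here:

```python
import asyncio

# Stub standing in for AsyncAphrodite.generate (hypothetical shape): in the
# real engine you would create it once with AsyncAphrodite.from_engine_args(...)
# and consume a streaming async generator per request_id.
async def generate(prompt: str, request_id: str) -> str:
    await asyncio.sleep(0.01)  # simulate per-request decode time
    return f"{request_id}: completion for {prompt!r}"

async def main() -> list[str]:
    prompts = [f"prompt {i}" for i in range(100)]  # more work than fits at once
    sem = asyncio.Semaphore(8)                     # cap in-flight requests

    async def submit(i: int, prompt: str) -> str:
        async with sem:  # the engine batches whatever is in flight together
            return await generate(prompt, f"req-{i}")

    # Submit everything up front; the semaphore keeps only 8 running, and as
    # each request finishes the next one starts -- the continuous-batching loop.
    return await asyncio.gather(*(submit(i, p) for i, p in enumerate(prompts)))

results = asyncio.run(main())
print(len(results))
```

The key difference from calling `llm.generate` in a loop is that requests are added and retired independently, so the batch never drains before new prompts join it.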
Your current environment
How would you like to use Aphrodite?
I want to run Mistral-7B-v0.3 and send it repeated prompts using the continuous batching feature of Aphrodite. I'm following the wiki page for [Offline Inference], but it keeps crashing with an OOM error when running the simple Python script.
The only changes I've made from the wiki are that `tensor_parallel_size` is set to 1 and the model is set to v0.3 instead of v0.1 (and I added a closing `)` to the `llm.generate` line). Here is the output of executing that script with the OOM error:

Any help on how to get this to run would be appreciated. Thanks!