aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost-effective, natively integrated into PyTorch and TensorFlow, and integrated with your favorite AWS services.
https://aws.amazon.com/machine-learning/neuron/

LLM engine not using Neuron device with continuous batching using vLLM #873

Closed. ashwinikumar-sa closed this issue 2 months ago.

ashwinikumar-sa commented 2 months ago

I was trying to test the recently added continuous batching support (beta) with vLLM in transformers-neuronx in the latest Neuron 2.18.1 release. I simply followed the steps given in the example from the Neuron docs on an Inf2.xlarge instance, but vLLM's LLM engine fails to load the model on the Neuron device despite explicitly setting device="neuron" as explained in the example. It still sets device_config=cpu, as seen below:

```
(aws_neuron_venv_pytorch) ubuntu@ip-172-31-64-93:~$ python vllm-inference.py 
INFO 04-16 13:43:36 llm_engine.py:87] Initializing an LLM engine with config: model='openlm-research/open_llama_3b', tokenizer='openlm-research/open_llama_3b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=128, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cpu, seed=0)
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Killed
```

I can also see the OOM error below in /var/log/syslog, showing that the kernel killed the process:

```
Apr 16 13:45:18 ip-172-31-64-93 kernel: [13790.039728] Out of memory: Killed process 11891 (python3) total-vm:19217764kB, anon-rss:15380124kB, file-rss:2808kB, shmem-rss:0kB, UID:1000 pgtables:31920kB oom_score_adj:0
```

Test environment:

- Instance type: Inf2.xlarge
- OS: Ubuntu 20.04
- Python: 3.8.10
- Neuron SDK: 2.18.1
- transformers-neuronx: 0.10.0.360
- neuronx-cc: 2.13.68.0
- vllm: 0.3.3+neuron213
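For context, here is a minimal sketch of what the failing vllm-inference.py presumably looks like, assuming it follows the continuous batching example from the Neuron docs; the prompts and exact parameter values are illustrative, not the actual file contents:

```python
from vllm import LLM, SamplingParams

# Illustrative prompts; the actual script may differ.
prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Matches the engine config in the log above: max_seq_len=128, tensor_parallel_size=1.
llm = LLM(
    model="openlm-research/open_llama_3b",
    max_num_seqs=8,
    max_model_len=128,
    block_size=128,
    device="neuron",  # Neuron requested, but the engine still reports device_config=cpu
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, output.outputs[0].text)
```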

liangfu commented 2 months ago

It seems that we need to add the tensor_parallel_size=2 argument to the script. Try this one instead: https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_neuron.py. It has been fixed in upstream vLLM.
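For reference, a condensed sketch based on the linked offline_inference_neuron.py with the suggested argument; an inf2.xlarge has a single Inferentia2 chip with two NeuronCores, so tensor_parallel_size=2 shards the model across both cores:

```python
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="openlm-research/open_llama_3b",
    max_num_seqs=8,
    # max_model_len and block_size must equal the max sequence length when
    # targeting Neuron (a current limitation of continuous batching support
    # in transformers-neuronx).
    max_model_len=128,
    block_size=128,
    # Explicitly target the Neuron device.
    device="neuron",
    # Shard the model across the two NeuronCores of the Inferentia2 chip.
    tensor_parallel_size=2,
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")
```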

ashwinikumar-sa commented 2 months ago

> It seems that we need to add the tensor_parallel_size=2 argument to the script. Try this one instead: https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_neuron.py. It has been fixed in upstream vLLM.

Yes, that worked. Thank you!