It seems that we need to add a `tensor_parallel_size=2` argument to the script. Try this one instead? https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_neuron.py It has been fixed in upstream vLLM.
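For reference, a minimal sketch of what the adjusted call might look like, along the lines of the linked example; the model name and sizing parameters below are illustrative placeholders rather than values taken from this report:

```python
from vllm import LLM, SamplingParams

# Minimal sketch, assuming a small chat model for illustration.
# tensor_parallel_size=2 matches the two NeuronCores on the single
# Inferentia2 chip of an inf2.xlarge, so the model shards across both
# cores instead of falling back to CPU.
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    max_num_seqs=8,
    max_model_len=128,
    block_size=128,
    device="neuron",
    tensor_parallel_size=2,
)

outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(temperature=0.8, top_p=0.95),
)
for output in outputs:
    print(output.outputs[0].text)
```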
Yes, that worked. Thank you!
I was trying to test the recently added continuous batching support (beta) with vLLM in `transformers-neuronx` in the latest Neuron 2.18.1 release. I simply tried the steps given in the example from the Neuron docs on an Inf2.xlarge instance, but vLLM's LLM engine fails to load the model on the Neuron device despite explicitly setting `device="neuron"` as explained in the example. It's still setting `device_config=cpu`, as seen below:

Also, I can see the OOM error below killing the process in `/var/log/syslog`:

```
Apr 16 13:45:18 ip-172-31-64-93 kernel: [13790.039728] Out of memory: Killed process 11891 (python3) total-vm:19217764kB, anon-rss:15380124kB, file-rss:2808kB, shmem-rss:0kB, UID:1000 pgtables:31920kB oom_score_adj:0
```
Test environment:
- Instance type: Inf2.xlarge
- OS: Ubuntu 20.04
- Python: 3.8.10
- Neuron SDK: 2.18.1
- transformers-neuronx: 0.10.0.360
- neuronx-cc: 2.13.68.0
- vllm: 0.3.3+neuron213