Inf2 Modified Llama 2 Loading Issue

liechtym commented 9 months ago

I've been following the https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb example. However, I ran into an issue when using a modified version of Llama made for MiniGPT-4.

I'm running on an Inf2.8xlarge with the AMI "Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20231205".

I updated to the latest Neuron version via python -m pip install --upgrade neuronx-cc==2. --pre torch-neuronx==2.0. torchvision

Here's my code to compile. This finishes properly.

from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained('wangrongsheng/MiniGPT-4-LLaMA-7B')

import torch
from transformers_neuronx.module import save_pretrained_split

save_pretrained_split(model, './MiniGPT-4-LLaMA-7b-split')

I then attempt to run it with the following code:

import time
import torch
from transformers import AutoTokenizer
from transformers_neuronx.llama.model import LlamaForSampling
from minigpt4.models.modeling_llama import LlamaForCausalLM

import os
# Compiler flag -O1 is a workaround for "Too many instructions after unroll" in SDK 2.14
# os.environ['NEURON_CC_FLAGS'] = '-O1'

# load the split MiniGPT-4 LLaMA-7B checkpoint onto the NeuronCores with 2-way tensor parallelism and run compilation
neuron_model = LlamaForSampling.from_pretrained('./MiniGPT-4-LLaMA-7b-split', batch_size=1, tp_degree=2, amp='f16')
neuron_model.to_neuron()

# construct a tokenizer and encode prompt text
tokenizer = AutoTokenizer.from_pretrained('wangrongsheng/MiniGPT-4-LLaMA-7B')
prompt = "Hello, I'm a language model,"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# run inference with top-k sampling
with torch.inference_mode():
    start = time.time()
    generated_sequences = neuron_model.sample(input_ids, sequence_length=2048, top_k=50)
    elapsed = time.time() - start

generated_sequences = [tokenizer.decode(seq) for seq in generated_sequences]
print(f'generated sequences {generated_sequences} in {elapsed} seconds')

And I get this error:

Traceback (most recent call last):
  File "run.py", line 12, in <module>
    neuron_model = LlamaForSampling.from_pretrained('./MiniGPT-4-LLaMA-7b-split', batch_size=1, tp_degree=2, amp='f16')
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/module.py", line 145, in from_pretrained
    state_dict_path = os.path.join(pretrained_model_path, 'pytorch_model.bin')
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/serialization.py", line 771, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/serialization.py", line 270, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/serialization.py", line 251, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: './MiniGPT-4-LLaMA-7b-split/pytorch_model.bin'

These are the files in ./MiniGPT-4-LLaMA-7b-split:

config.json  generation_config.json  model.safetensors

Any help or direction would be stellar! Thanks.
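
The traceback and the directory listing above point at a file-format mismatch: the loader tries to open pytorch_model.bin, while the split directory only contains model.safetensors. A minimal diagnostic sketch, using only the Python standard library, that reports which of those two weight files a checkpoint directory holds. The helper name checkpoint_formats is hypothetical and not part of transformers-neuronx; the two file names are the ones that appear in this thread.

```python
from pathlib import Path

# File names taken from the traceback and the directory listing in this thread.
WEIGHT_FILES = {
    "pytorch_model.bin": "torch pickle",
    "model.safetensors": "safetensors",
}


def checkpoint_formats(directory):
    """Return the weight formats found in `directory` (hypothetical helper)."""
    root = Path(directory)
    return sorted(
        fmt for name, fmt in WEIGHT_FILES.items() if (root / name).is_file()
    )


if __name__ == "__main__":
    # Path from the original report; prints an empty list if it does not exist.
    print(checkpoint_formats("./MiniGPT-4-LLaMA-7b-split"))
```

If the result contains only "safetensors", a loader that hard-codes pytorch_model.bin will fail with the FileNotFoundError shown above.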

mrnikwaws commented 9 months ago

Hi @liechtym

Thanks for reporting the problem. We've reproduced it and have a fix in an upcoming release. We'll respond here and close this issue once the release is out.

liechtym commented 9 months ago

@mrnikwaws Thank you very much! I appreciate the quick response and look forward to the release.

mrnikwaws commented 9 months ago

2.16 is now released and should address your issue. Please respond on this ticket if the issue is not resolved. If we don't hear back we'll close the issue.

liechtym commented 9 months ago

Thank you very much!

liechtym commented 9 months ago

@mrnikwaws I just tried with the following demo code and I'm still getting the same error.

https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb

I verified my installation from the latest commit in the repo with pip freeze: transformers-neuronx @ git+https://github.com/aws-neuron/transformers-neuronx.git@426629648481095dfbb4f6bd993f25b88a87b505
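
When comparing installs across machines, the commit can be pulled out of a pip freeze line like the one above with a small parser. This is only a sketch: the helper name parse_vcs_requirement is hypothetical, and the regular expression assumes the "name @ git+URL@COMMIT" form shown in this comment.

```python
import re

# Matches the "name @ git+<url>@<commit>" form shown in the pip freeze output above.
_VCS_LINE = re.compile(
    r"^(?P<name>[A-Za-z0-9._-]+) @ git\+(?P<url>\S+)@(?P<commit>[0-9a-f]{7,40})$"
)


def parse_vcs_requirement(line):
    """Split a pip-freeze VCS line into name, url and commit, or return None."""
    match = _VCS_LINE.match(line.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    line = (
        "transformers-neuronx @ git+https://github.com/aws-neuron/"
        "transformers-neuronx.git@426629648481095dfbb4f6bd993f25b88a87b505"
    )
    print(parse_vcs_requirement(line))
```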

I only changed a couple of things from the demo. Instead of using 'llama-2-13b', I used 'meta-llama/Llama-2-7b-chat-hf' in LlamaForCausalLM.from_pretrained(). The only other change was tp_degree=2 in LlamaForSampling.from_pretrained().

Traceback:

Traceback (most recent call last):
  File "run.py", line 11, in <module>
    neuron_model = LlamaForSampling.from_pretrained('./Llama-2-7b-chat-hf-split', batch_size=1, tp_degree=2, amp='f16')
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/module.py", line 148, in from_pretrained
    state_dict = torch.load(state_dict_path)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/serialization.py", line 791, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/serialization.py", line 271, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/serialization.py", line 252, in __init__
    super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: './Llama-2-7b-chat-hf-split/pytorch_model.bin'

Again, I'm on the same instance, AMI, and setup as before.

shebbur-aws commented 9 months ago

@liechtym Sorry for the inconvenience. We have a fix for this in the transformers-neuronx GitHub repo, which was updated today. Can you please check with the latest version?

liechtym commented 9 months ago

@shebbur-aws Yes I'll check with the latest and update you soon.

liechtym commented 9 months ago

@shebbur-aws This issue seems to be resolved after reinstalling from the GitHub repo.

However, I am now getting the following error while running meta-llama-2-13b-sampling.ipynb with the modifications I described in the previous comment. Let me know if you'd like me to create a new issue for this.

2024-01-04 14:33:59.000295:  4197  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-04 14:33:59.000383:  4198  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-04 14:33:59.000471:  4199  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-04 14:33:59.000492:  4197  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.54.0+f631c2365/MODULE_9e281341e7845ee2287f+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-01-04 14:33:59.000563:  4200  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-04 14:33:59.000601:  4198  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.54.0+f631c2365/MODULE_a4faa198082ac5b8d787+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-01-04 14:33:59.000623:  4201  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-04 14:33:59.000703:  4202  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-04 14:33:59.000754:  4203  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-04 14:33:59.000755:  4199  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.54.0+f631c2365/MODULE_d5006487226e226573ea+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-01-04 14:33:59.000756:  4204  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-04 14:33:59.000790:  4205  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-04 14:33:59.000862:  4206  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-04 14:34:00.000087:  4200  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.54.0+f631c2365/MODULE_1bf56f238691e0fd88c8+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-01-04 14:34:00.000440:  4202  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.54.0+f631c2365/MODULE_70d1a1ce4d52a869b9e6+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-01-04 14:34:00.000440:  4201  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.54.0+f631c2365/MODULE_c46e110ea38cea049c6d+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-01-04 14:34:00.000464:  4203  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.54.0+f631c2365/MODULE_b9a15c837cee1bf59e24+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-01-04 14:34:00.000464:  4204  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.54.0+f631c2365/MODULE_1f6eaa498df4dc58af20+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-01-04 14:34:00.000465:  4205  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.54.0+f631c2365/MODULE_d750f56f8d6a41f0372e+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-01-04 14:34:00.000465:  4206  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.54.0+f631c2365/MODULE_e22db4da23e4fde86dd1+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-Jan-04 14:34:00.727597  4120:4181  ERROR  NEFF:neff_parse                              NEFF version: 2.0, features: 0x100 are not supported.  Currently supporting: 0x80000000000000ff
2024-Jan-04 14:34:00.727647  4120:4181  ERROR  NMGR:kmgr_load_nn_post_metrics               Failed to load NN: /tmp/neuroncc_compile_workdir/63403e3c-2309-43cd-8e3d-89f3abb77371/model.MODULE_9e281341e7845ee2287f+2c2d707e.neff, err: 10
2024-Jan-04 14:34:00.727686  4120:4182  ERROR  NEFF:neff_parse                              NEFF version: 2.0, features: 0x100 are not supported.  Currently supporting: 0x80000000000000ff
2024-Jan-04 14:34:00.727716  4120:4182  ERROR  NMGR:kmgr_load_nn_post_metrics               Failed to load NN: /tmp/neuroncc_compile_workdir/63403e3c-2309-43cd-8e3d-89f3abb77371/model.MODULE_9e281341e7845ee2287f+2c2d707e.neff, err: 10
Traceback (most recent call last):
  File "run.py", line 12, in <module>
    neuron_model.to_neuron()
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/base.py", line 72, in to_neuron
    self.setup()
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/base.py", line 63, in setup
    nbs.setup()
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/decoder.py", line 335, in setup
    self.program.setup(self.layers, self.pre_layer_parameters, self.ln_lm_head_params)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/decoder.py", line 1449, in setup
    super().setup(layers, pre_layer_params, ln_lm_head_params)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/decoder.py", line 1325, in setup
    kernel.load(io_ring_cache_size)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/compiler.py", line 454, in load
    self.model.load()
RuntimeError: nrt_load_collectives status=10
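
The NEFF:neff_parse lines above can be read as a bitmask comparison. This is an interpretation of the log text, not documented behavior: it assumes that "features" and "Currently supporting" are feature flags encoded as bits. Under that assumption, 0x100 sets bit 8, which is not part of the supported mask 0x80000000000000ff (bits 0 to 7 plus bit 63). The sketch below works through that arithmetic and also extracts the compiler version from the cache path in the NEURON_CC_WRAPPER lines. Both helper names are hypothetical.

```python
import re


def unsupported_bits(required, supported):
    """Return the bit positions set in `required` but missing from `supported`."""
    missing = required & ~supported
    return [i for i in range(missing.bit_length()) if (missing >> i) & 1]


def compiler_version_from_cache_path(log_line):
    """Extract the version that follows 'neuronxcc-' in a cache path, if any."""
    match = re.search(r"neuronxcc-(\d+(?:\.\d+)*)", log_line)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Values copied from the NEFF:neff_parse error in the log above.
    print(unsupported_bits(0x100, 0x80000000000000FF))

    # Line copied from the NEURON_CC_WRAPPER output in the log above.
    line = (
        "Using a cached neff at /var/tmp/neuron-compile-cache/"
        "neuronxcc-2.12.54.0+f631c2365/MODULE_9e281341e7845ee2287f+2c2d707e/model.neff."
    )
    print(compiler_version_from_cache_path(line))
```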

shebbur-aws commented 9 months ago

@liechtym It looks like there is a mismatch between the compiler and the runtime/tools versions you are using. Can you please upgrade your runtime packages to version 2.16 as well? That should fix the issue you are seeing.
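
One way to see the Python side of such a mismatch is to list the versions of the installed Neuron pip packages. The sketch below uses importlib.metadata from the standard library (Python 3.8+). The package list is an assumption built from the names mentioned in this thread, and the helper name installed_versions is hypothetical. The runtime and tools are usually installed as system packages rather than pip packages, so they need to be checked separately with the OS package manager.

```python
from importlib import metadata

# Distribution names mentioned in this thread; adjust for your environment.
NEURON_PIP_PACKAGES = ("neuronx-cc", "torch-neuronx", "transformers-neuronx")


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(NEURON_PIP_PACKAGES).items():
        print(f"{name}: {version or 'not installed'}")
```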

liechtym commented 9 months ago

Thanks @shebbur-aws. I will try this out and report back soon.

liechtym commented 9 months ago

It's working great! Thanks! If I have any additional issues I'll file a different issue. Thanks again.

aws-neuron / transformers-neuronx

Inf2 Modified Llama 2 Loading Issue #67