aws-neuron / transformers-neuronx


Can't save/serialize any models except GPT2 #58

Closed: awskila closed this issue 6 days ago

awskila commented 7 months ago

I am trying to save the Neuron model and deploy it to SageMaker as an endpoint. The documentation states, under serialization support, that all models can be loaded and saved except the GPTJ and GPTNeoX model classes.

However, I tried several models (Llama2-13B, OPT-30B, OPT-66B, and Llama2-70B), and none of them could be saved with any of the methods below.

1. `<neuron_model>.save`, which doesn't exist; it only appears to exist for GPT2 models.
2. `<neuron_model>.state_dict()`, which fails on all LazyModules.
3. `torch.save`, and TorchScript via `torch.jit.save`, then trying to use the resulting `state_dict()`.

Below is an example using OPT-66B.

```
Traceback (most recent call last):
  File "opt.py", line 63, in <module>
    print(f"\ndecoder: {neuron_model.chkpt_model.model.decoder.state_dict()}")
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1448, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1448, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1448, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1445, in state_dict
    self._save_to_state_dict(destination, prefix, keep_vars)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1356, in _save_to_state_dict
    destination[prefix + name] = param if keep_vars else param.detach()
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/nn/parameter.py", line 144, in __torch_function__
    raise ValueError(
ValueError: Attempted to use an uninitialized parameter in <method 'detach' of 'torch._C._TensorBase' objects>. This error happens when you are using a LazyModule or explicitly manipulating torch.nn.parameter.UninitializedParameter objects. When using LazyModules Call forward with a dummy batch to initialize the parameters before calling torch functions
```
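
For context, here is a minimal sketch of the kind of script that triggers this. It is a hypothetical reconstruction, not the actual `opt.py`: the checkpoint directory and the compilation parameters (`batch_size`, `tp_degree`, `amp`) are placeholders.

```python
# Hypothetical repro sketch -- the checkpoint directory and compilation
# parameters are placeholders, not values from the original opt.py.
from transformers_neuronx.opt.model import OPTForSampling

# Load a locally saved OPT-66B checkpoint and compile it for Neuron
neuron_model = OPTForSampling.from_pretrained(
    './opt-66b',   # placeholder checkpoint directory
    batch_size=1,
    tp_degree=32,
    amp='f16',
)
neuron_model.to_neuron()

# Method (2) above: state_dict() traverses the lazily initialized decoder,
# whose weights are UninitializedParameter objects, and raises the ValueError
print(f"\ndecoder: {neuron_model.chkpt_model.model.decoder.state_dict()}")
```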

Is there anything that can be done to fix this? I've tried the last five versions of transformers-neuronx (see here). Please advise. Thanks!

aws-mvaria commented 7 months ago

Thank you. We are taking a look and will get back to you shortly.

mrnikwaws commented 6 months ago

Hi @awskila,

Can you share your test code? I ran a test using the current (2.15) production wheels and was not able to reproduce your problem.

gsnaws commented 6 months ago

Hi @awskila, a Llama2-13B sample is available with 2.16. Can you please try it and see if that works? Otherwise, a specific code snippet to reproduce the problem would help move this issue forward. Thanks!
https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb
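
For reference, the serialization flow in that sample looks roughly like the sketch below. This is a hedged reconstruction, not a copy of the notebook: the directory names and `tp_degree` are placeholders, and the final `save()` call is the compiled-artifact serialization that the repository's serialization-support documentation describes.

```python
# Sketch of the 2.16 Llama2-13B sample flow; directory names and tp_degree
# are placeholders, not values copied from the notebook.
from transformers import LlamaForCausalLM
from transformers_neuronx.module import save_pretrained_split
from transformers_neuronx.llama.model import LlamaForSampling

# One-time step: re-save the CPU checkpoint in the split format that
# transformers-neuronx loads from
model = LlamaForCausalLM.from_pretrained('meta-llama/Llama-2-13b-hf')
save_pretrained_split(model, './Llama-2-13b-split')

# Load the split checkpoint and compile it for Neuron
neuron_model = LlamaForSampling.from_pretrained(
    './Llama-2-13b-split', batch_size=1, tp_degree=24, amp='f16'
)
neuron_model.to_neuron()

# Supported classes can then persist compiled artifacts, so later runs
# can load() them instead of recompiling
neuron_model.save('./llama-2-13b-neuron')
```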

aws-rhsoln commented 6 days ago

Closing the ticket. Please re-open if the issue persists.