aws-neuron / aws-neuron-sdk


Neuron Runtime compatibility issue when loading a compiled model #660

Closed g-laz77 closed 1 year ago

g-laz77 commented 1 year ago

Description

I'm encountering a compatibility issue when trying to load a Neuron-compiled model (tmp/nllb_neuron_model.pt) on an AWS SageMaker notebook instance (ml.inf1.6xlarge) using the conda_amazonei_pytorch_latest_p37 kernel.

Model Details:

HuggingFace model identifier: "facebook/nllb-200-distilled-600M"
Torch version: 1.13.1
HuggingFace transformers version: 4.17.0
Torch-neuron version: 1.13.1.2.6.6.0
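
For reference, these versions can be confirmed from the notebook kernel with something like the snippet below (a minimal sketch; the exact pip distribution names are assumed):

import pkg_resources

# Print the installed versions of the relevant packages (distribution names assumed).
for pkg in ("torch", "torch-neuron", "transformers"):
    try:
        print(pkg, pkg_resources.get_distribution(pkg).version)
    except pkg_resources.DistributionNotFound:
        print(pkg, "not installed under this name")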

The model was successfully compiled with the following code snippet:

import os
import torch
import torch.neuron
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_id = "facebook/nllb-200-distilled-600M"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, torchscript=True)

# create dummy input for max length 256
dummy_input = "dummy input which will be padded later"
max_length = 256
max_decoder_length = 256
batch = tokenizer.prepare_seq2seq_batch(
    src_texts=[dummy_input],
    tgt_texts=[dummy_input],
    max_length=max_length,
    padding="max_length",
    return_tensors="pt",
)
neuron_inputs = tuple(batch.values())

# neuron trace model
model_neuron = torch.neuron.trace(model, neuron_inputs, separate_weights=True)
model.config.update({"traced_sequence_length": max_length})

# save neuron compiled model
save_dir = "tmp"
os.makedirs(save_dir, exist_ok=True)
model_neuron.save(os.path.join(save_dir,"nllb_neuron_model.pt"))
tokenizer.save_pretrained(save_dir)

However, when I try to load the Neuron-compiled model using the following code:

neuron_tokenizer = AutoTokenizer.from_pretrained(save_dir)
md_neuron = torch.jit.load(f"{save_dir}/nllb_neuron_model.pt")

I get the following error:

2023-Apr-27 15:09:56.0014 28971:28971 ERROR   NRT:nrt_init                                This Neuron Runtime (compatibility id: 6) is not compatible with the installed aws-neuron-dkms package.
2023-Apr-27 15:09:56.0014 28971:28971 ERROR   NRT:nrt_init                                Installed aws-neuron-dkms supports Runtime compatibility id: 2.
2023-Apr-27 15:09:56.0014 28971:28971 ERROR   NRT:nrt_init                                Please make sure to upgrade to latest aws-neuron-dkms; for detailed installation instructions visit Neuron documentation.
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_28971/2161845668.py in <module>
      1 neuron_tokenizer = AutoTokenizer.from_pretrained("tmp")
----> 2 md_neuron = torch.jit.load("tmp/nllb_neuron_model.pt")

~/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages/torch_neuron/jit_load_wrapper.py in wrapper(*args, **kwargs)
     11     @wraps(func)
     12     def wrapper(*args, **kwargs):
---> 13         script_module = jit_load(*args, **kwargs)
     14         found_neuron_function = False
     15         for node in script_module.graph.nodes():

~/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages/torch/jit/_serialization.py in load(f, map_location, _extra_files)
    160     cu = torch._C.CompilationUnit()
    161     if isinstance(f, str) or isinstance(f, pathlib.Path):
--> 162         cpp_module = torch._C.import_ir_module(cu, str(f), map_location, _extra_files)
    163     else:
    164         cpp_module = torch._C.import_ir_module_from_buffer(

RuntimeError: The PyTorch Neuron Runtime could not be initialized. Neuron Driver issues are logged
to your system logs. See the Neuron Runtime's troubleshooting guide for help on this
topic: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/

Additional Information

I have followed the instructions in the Neuron documentation to upgrade the aws-neuron-dkms package, but the issue persists.
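
For reference, the Neuron driver packages the OS reports as installed can be listed from the notebook with something like this sketch (it assumes rpm is available on the Amazon Linux based instance):

import subprocess

# List installed packages and keep only the Neuron-related ones.
result = subprocess.run(["rpm", "-qa"], capture_output=True, text=True)
print([pkg for pkg in result.stdout.splitlines() if "neuron" in pkg])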

micwade-aws commented 1 year ago

I think I see the issue. We discontinued the aws-neuron-dkms package and replaced it with aws-neuronx-dkms. The older runtimes and DKMS packages would have asked users to update the "non-x" packages, since the "x" versions didn't exist back then.

Try this to update the Neuron Driver to latest:

# Remove the old driver:
sudo yum remove aws-neuron-dkms -y

# Ensure the latest kernel headers are installed:
sudo yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r) -y

# Then install the latest Neuron Driver (aws-neuronx-dkms):
sudo yum install aws-neuronx-dkms -y

You can check whether the driver install worked by running this command; it should report that version 2.8 of the Neuron Driver is installed:

modinfo neuron

Output should look like this:

sh-4.2$ modinfo neuron
filename:       /lib/modules/5.10.157-139.675.amzn2.x86_64/kernel/drivers/neuron/neuron.ko
alias:          pci:v00001d0fd00007064sv*sd*bc*sc*i*
version:        2.8.4.0
license:        GPL
description:    Neuron Driver, built from SHA: 06b4257cef1dd9960e543b2cc7c208e178d02672
srcversion:     D4588D647975DE320C19276
depends:        
retpoline:      Y
name:           neuron
vermagic:       5.10.157-139.675.amzn2.x86_64 SMP mod_unload modversions 
parm:           use_rr:use readless reads (int)
parm:           nmetric_metric_post_delay:Minimum time to wait (in milliseconds) before posting metrics again (uint)
parm:           nmetric_log_posts:0: send metrics to CW, 1: send metrics to trace, >=2: send metrics to both (uint)
parm:           no_reset:Dont reset device (int)
parm:           nc_per_dev_param:Number of neuron cores (int)
parm:           mempool_min_alloc_size:Minimum size for memory allocation (int)
parm:           mempool_host_memory_size:Host memory to reserve(in bytes) (int)
parm:           dup_helper_enable:enable duplicate routing id unload helper (int)
parm:           wc_enable:enable write combining (int)
g-laz77 commented 1 year ago

I had already updated to aws-neuronx-dkms once. Nonetheless, I tried it again (screenshot: reinstall output, 2023-04-28, 1:24 PM).

Following that, I ran modinfo neuron and got the expected output:

filename:       /lib/modules/5.10.157-139.675.amzn2.x86_64/kernel/drivers/neuron/neuron.ko
alias:          pci:v00001d0fd00007064sv*sd*bc*sc*i*
version:        2.8.4.0
license:        GPL
description:    Neuron Driver, built from SHA: 06b4257cef1dd9960e543b2cc7c208e178d02672
srcversion:     D4588D647975DE320C19276
depends:        
retpoline:      Y
name:           neuron
vermagic:       5.10.157-139.675.amzn2.x86_64 SMP mod_unload modversions 
parm:           use_rr:use readless reads (int)
parm:           nmetric_metric_post_delay:Minimum time to wait (in milliseconds) before posting metrics again (uint)
parm:           nmetric_log_posts:0: send metrics to CW, 1: send metrics to trace, >=2: send metrics to both (uint)
parm:           no_reset:Dont reset device (int)
parm:           nc_per_dev_param:Number of neuron cores (int)
parm:           mempool_min_alloc_size:Minimum size for memory allocation (int)
parm:           mempool_host_memory_size:Host memory to reserve(in bytes) (int)
parm:           dup_helper_enable:enable duplicate routing id unload helper (int)
parm:           wc_enable:enable write combining (int)

However, I am still facing the same issue when trying to load the Neuron-compiled model (screenshot: the same nrt_init error, 2023-04-28, 1:28 PM).

micwade-aws commented 1 year ago

I'll reach out for more help internally today. I'm a bit stumped since there can be only one Neuron Driver installed, but your errors make it seem like we are still using the old driver.

micwade-aws commented 1 year ago

@g-laz77 - would you mind trying this from the failing JupyterLab kernel?

import subprocess
result = subprocess.run(["modinfo","neuron"], capture_output=True, text=True)
print(result.stdout)

Updating to the latest driver should have worked, but it's confusing that you're still getting this incompatibility warning. I am working on the reproduction, but that sanity check above would help.
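
Since modinfo describes the .ko file installed on disk rather than the module currently loaded in the kernel, a complementary check could look like this sketch (it assumes the loaded module exposes a version attribute in sysfs):

from pathlib import Path

# The version of the neuron module that is actually loaded right now, if any.
version_file = Path("/sys/module/neuron/version")
if version_file.exists():
    print("loaded neuron module version:", version_file.read_text().strip())
else:
    print("neuron module not loaded, or it does not expose a version attribute")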

g-laz77 commented 1 year ago
filename:       /lib/modules/5.10.157-139.675.amzn2.x86_64/kernel/drivers/neuron/neuron.ko
alias:          pci:v00001d0fd00007064sv*sd*bc*sc*i*
version:        2.8.4.0
license:        GPL
description:    Neuron Driver, built from SHA: 06b4257cef1dd9960e543b2cc7c208e178d02672
srcversion:     D4588D647975DE320C19276
depends:        
retpoline:      Y
name:           neuron
vermagic:       5.10.157-139.675.amzn2.x86_64 SMP mod_unload modversions 
parm:           use_rr:use readless reads (int)
parm:           nmetric_metric_post_delay:Minimum time to wait (in milliseconds) before posting metrics again (uint)
parm:           nmetric_log_posts:0: send metrics to CW, 1: send metrics to trace, >=2: send metrics to both (uint)
parm:           no_reset:Dont reset device (int)
parm:           nc_per_dev_param:Number of neuron cores (int)
parm:           mempool_min_alloc_size:Minimum size for memory allocation (int)
parm:           mempool_host_memory_size:Host memory to reserve(in bytes) (int)
parm:           dup_helper_enable:enable duplicate routing id unload helper (int)
parm:           wc_enable:enable write combining (int)
micwade-aws commented 1 year ago

@g-laz77 - Huzzah! I got some help tonight, thanks to another Amazonian who hit the same issue, and figured this out. Even though we've removed the old software and installed the new driver, the old driver is still running because the Jupyter process is holding a reference to it. Here are some quick steps to get out of this:

  1. Stop the Jupyter kernel
  2. Check if the neuron driver is loaded: lsmod | grep neuron
  3. If there's something loaded, execute this: sudo rmmod neuron
  4. Uninstall the driver package: sudo yum remove aws-neuronx-dkms
  5. Install the driver package: sudo yum install aws-neuronx-dkms
  6. Check if the neuron driver is loaded: lsmod | grep neuron

At this point, the old driver is truly removed (step 3) and the new driver will be running. If all of that looks okay, start the Jupyter kernel again; it will pick up the new driver this time.
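
Once the kernel has been restarted, a quick sanity check before retrying the model load could look like this sketch (it reuses the path from the snippets above and assumes the module exposes a version attribute in sysfs, as in the earlier sketch):

from pathlib import Path

import torch
import torch.neuron

# Confirm which neuron module is now loaded; it should be the new 2.8.x build.
print("loaded neuron module version:", Path("/sys/module/neuron/version").read_text().strip())

# Retry loading the compiled model; with the new driver loaded, nrt_init should succeed.
md_neuron = torch.jit.load("tmp/nllb_neuron_model.pt")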

I'll write up specific instructions for upgrading the Neuron Driver when using notebook instances as soon as possible, and work with the team on possible improvements to avoid this issue entirely.

Let me know if you get stuck or if the above didn't help solve the problem.

g-laz77 commented 1 year ago

Hi @micwade-aws, the above solution helped clean up the aws-neuronx-dkms package installation on the Jupyter notebook instance.

Thanks a lot for all the help!! Really appreciate it!