Closed: g-laz77 closed this issue 1 year ago
Think I see the issue. We discontinued the aws-neuron-dkms package, replacing it with aws-neuronx-dkms. The older runtimes and dkms packages would have asked users to update the "non-x" version of things since the "x" versions didn't exist back then.
Try this to update the Neuron Driver to latest:
# Remove the old driver:
sudo yum remove aws-neuron-dkms -y
# Ensure the latest kernel headers are installed:
sudo yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r) -y
# Then install the latest Neuron Driver (aws-neuronx-dkms):
sudo yum install aws-neuronx-dkms -y
You can check if the driver install worked by running this command - it should report back that the 2.8 version of Neuron Driver is running:
modinfo neuron
Output should look like this:
sh-4.2$ modinfo neuron
filename: /lib/modules/5.10.157-139.675.amzn2.x86_64/kernel/drivers/neuron/neuron.ko
alias: pci:v00001d0fd00007064sv*sd*bc*sc*i*
version: 2.8.4.0
license: GPL
description: Neuron Driver, built from SHA: 06b4257cef1dd9960e543b2cc7c208e178d02672
srcversion: D4588D647975DE320C19276
depends:
retpoline: Y
name: neuron
vermagic: 5.10.157-139.675.amzn2.x86_64 SMP mod_unload modversions
parm: use_rr:use readless reads (int)
parm: nmetric_metric_post_delay:Minimum time to wait (in milliseconds) before posting metrics again (uint)
parm: nmetric_log_posts:0: send metrics to CW, 1: send metrics to trace, >=2: send metrics to both (uint)
parm: no_reset:Dont reset device (int)
parm: nc_per_dev_param:Number of neuron cores (int)
parm: mempool_min_alloc_size:Minimum size for memory allocation (int)
parm: mempool_host_memory_size:Host memory to reserve(in bytes) (int)
parm: dup_helper_enable:enable duplicate routing id unload helper (int)
parm: wc_enable:enable write combining (int)
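For anyone who wants to run this check from a script or notebook instead of reading the output by eye, here is a minimal sketch that parses `modinfo` output and extracts the version field. The function names `parse_modinfo` and `on_disk_driver_version` are made up for this example and are not part of any Neuron tooling.

```python
import subprocess


def parse_modinfo(text):
    """Parse `modinfo` output into a dict of lists (keys such as parm repeat)."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values like "use_rr:use ..." stay whole.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not "key: value", e.g. a shell prompt
        fields.setdefault(key.strip(), []).append(value.strip())
    return fields


def on_disk_driver_version(module="neuron"):
    """Return the version modinfo reports for the module, or None if unavailable."""
    try:
        result = subprocess.run(
            ["modinfo", module], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        return None  # modinfo is not installed on this system
    versions = parse_modinfo(result.stdout).get("version", [])
    return versions[0] if versions else None


if __name__ == "__main__":
    print("Neuron Driver version reported by modinfo:", on_disk_driver_version())
```

With the output shown above, `on_disk_driver_version()` would be expected to return the string from the `version:` line.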
I had already updated to aws-neuronx-dkms once. Nonetheless, I tried it again.
Following that, I ran modinfo neuron
Got the expected output:
filename: /lib/modules/5.10.157-139.675.amzn2.x86_64/kernel/drivers/neuron/neuron.ko
alias: pci:v00001d0fd00007064sv*sd*bc*sc*i*
version: 2.8.4.0
license: GPL
description: Neuron Driver, built from SHA: 06b4257cef1dd9960e543b2cc7c208e178d02672
srcversion: D4588D647975DE320C19276
depends:
retpoline: Y
name: neuron
vermagic: 5.10.157-139.675.amzn2.x86_64 SMP mod_unload modversions
parm: use_rr:use readless reads (int)
parm: nmetric_metric_post_delay:Minimum time to wait (in milliseconds) before posting metrics again (uint)
parm: nmetric_log_posts:0: send metrics to CW, 1: send metrics to trace, >=2: send metrics to both (uint)
parm: no_reset:Dont reset device (int)
parm: nc_per_dev_param:Number of neuron cores (int)
parm: mempool_min_alloc_size:Minimum size for memory allocation (int)
parm: mempool_host_memory_size:Host memory to reserve(in bytes) (int)
parm: dup_helper_enable:enable duplicate routing id unload helper (int)
parm: wc_enable:enable write combining (int)
However, I am still facing the same issue when trying to load the neuron compiled model:
I'll reach out for more help internally today. I'm a bit stumped since there can be only one Neuron Driver installed, but your errors make it seem like we are still using the old driver.
@g-laz77 - would you mind trying this from the failing jupyterlab?
import subprocess
result = subprocess.run(["modinfo","neuron"], capture_output=True, text=True)
print(result.stdout)
Updating to the latest driver should have worked, but it's confusing that you're still getting this incompatibility warning. I am working on the reproduction, but that sanity check above would help.
filename: /lib/modules/5.10.157-139.675.amzn2.x86_64/kernel/drivers/neuron/neuron.ko
alias: pci:v00001d0fd00007064sv*sd*bc*sc*i*
version: 2.8.4.0
license: GPL
description: Neuron Driver, built from SHA: 06b4257cef1dd9960e543b2cc7c208e178d02672
srcversion: D4588D647975DE320C19276
depends:
retpoline: Y
name: neuron
vermagic: 5.10.157-139.675.amzn2.x86_64 SMP mod_unload modversions
parm: use_rr:use readless reads (int)
parm: nmetric_metric_post_delay:Minimum time to wait (in milliseconds) before posting metrics again (uint)
parm: nmetric_log_posts:0: send metrics to CW, 1: send metrics to trace, >=2: send metrics to both (uint)
parm: no_reset:Dont reset device (int)
parm: nc_per_dev_param:Number of neuron cores (int)
parm: mempool_min_alloc_size:Minimum size for memory allocation (int)
parm: mempool_host_memory_size:Host memory to reserve(in bytes) (int)
parm: dup_helper_enable:enable duplicate routing id unload helper (int)
parm: wc_enable:enable write combining (int)
@g-laz77 - Huzzah! Got some help tonight thanks to another Amazonian with the same issue and figured this out. Even though we've removed the old software and installed the new driver, the old driver is still running because the Jupyter process is holding a reference to it. Here are some quick steps to get out of this:
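# 1. Check whether a neuron module is currently loaded:
lsmod | grep neuron
# 2. Unload the running module:
sudo rmmod neuron
# 3. Remove the driver package:
sudo yum remove aws-neuronx-dkms
# 4. Reinstall the driver package:
sudo yum install aws-neuronx-dkms
# 5. Confirm that a neuron module is loaded again:
lsmod | grep neuron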
At this point, the old driver is truly removed (step 3) and the new driver should be running. If all of that looks okay, start the Jupyter kernel again; this time it will pick up the new driver.
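A rough way to spot this situation from Python: `modinfo` describes the module file on disk, while the module already loaded in the kernel exposes its own version under `/sys/module/<name>/version` (only for modules that declare a version, which the modinfo output in this thread suggests neuron does). The sketch below compares the two and also reads the reference count from `/proc/modules`, which is the "Used by" count that `lsmod` prints. All function names are hypothetical helpers written for this example, and whether the notebook process shows up in that count is an assumption, not something confirmed in this thread.

```python
import subprocess
from pathlib import Path


def loaded_module_version(module="neuron", sys_root="/sys/module"):
    """Version of the module currently loaded in the kernel, or None."""
    try:
        return (Path(sys_root) / module / "version").read_text().strip()
    except OSError:
        return None  # module not loaded, or it exposes no version file


def on_disk_module_version(module="neuron"):
    """Version of the module file that modinfo finds on disk, or None."""
    try:
        result = subprocess.run(
            ["modinfo", "-F", "version", module], capture_output=True, text=True
        )
    except FileNotFoundError:
        return None  # modinfo is not installed on this system
    return result.stdout.strip() or None


def module_refcount(module="neuron", proc_modules="/proc/modules"):
    """Reference count of a loaded module (lsmod's 'Used by' count), or None."""
    try:
        lines = Path(proc_modules).read_text().splitlines()
    except OSError:
        return None
    for line in lines:
        parts = line.split()
        # /proc/modules columns: name, size, refcount, dependencies, state, address
        if len(parts) >= 3 and parts[0] == module and parts[2].isdigit():
            return int(parts[2])
    return None


def describe(loaded, on_disk):
    """Summarize whether the loaded module matches the module file on disk."""
    if loaded is None:
        return "not loaded: no module version found in the running kernel"
    if on_disk is None:
        return "unknown: module is loaded but modinfo found no version on disk"
    if loaded != on_disk:
        return f"stale: kernel is running {loaded}, disk has {on_disk}"
    return f"ok: kernel is running {loaded}, same as on disk"


if __name__ == "__main__":
    print(describe(loaded_module_version(), on_disk_module_version()))
    print("reference count:", module_refcount())
```

A "stale" result would match the symptom in this thread: the package on disk is new, but the kernel is still running the module that was loaded earlier. A nonzero reference count hints that something still has the driver in use, so stop the Jupyter kernel before trying `rmmod`.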
I'll write up specific instructions for upgrading the Neuron Driver on notebook instances as soon as possible, and work with the team on possible improvements to avoid this issue entirely.
Let me know if you get stuck or if the above didn't help solve the problem.
Hi @micwade-aws, the above solution helped clean up the neuronx-dkms package installation on the jupyter notebook.
Thanks a lot for all the help!! Really appreciate it!
Description
I'm encountering a compatibility issue when trying to load a Neuron-compiled model (tmp/nllb_neuron_model.pt) on an AWS SageMaker Notebook Instance (ml.inf1.6xlarge) with the kernel conda_amazonei_pytorch_latest_p37.
Model Details:
HuggingFace model identifier: "facebook/nllb-200-distilled-600M"
Torch version: 1.13.1
HuggingFace version: 4.17.0
Torch-neuron version: 1.13.1.2.6.6.0
The model was successfully compiled with the following code snippet:
However, when I try to load the Neuron-compiled model using the following code:
I get the following error:
Additional Information
I have followed the instructions in the Neuron documentation to upgrade the aws-neuron-dkms package, but the issue persists.