aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

ECS inf1 neuron hook script fails #911

Open rantoniuk opened 3 months ago

rantoniuk commented 3 months ago

I'm trying to run ECS on an inf1.2xlarge instance, using AMI ID ami-0a9852dd958cde533 (al2023-ami-ecs-neuron-hvm-2023.0.20240610-kernel-6.1-x86_64).

In the EC2 UserData, I'm doing:

      sudo cp /opt/aws/neuron/share/docker-daemon.json /etc/docker/daemon.json
      sudo systemctl restart docker

That however results in Docker failing to start, with the following error:

level=info time=2024-06-19T16:38:31Z msg="Starting Amazon Elastic Container Service Agent"
level=error time=2024-06-19T16:38:32Z msg="could not start Agent: API error (400): failed to create task for container: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/moby/b2b6d9f3947516c395ac544748baaa909e0f7156067a1082c8592a0aea88517a/log.json: no such file or directory): /opt/aws/neuron/bin/oci_neuron_hook_wrapper.sh did not terminate successfully: exit status 2: unknown"

Indeed, when trying to do this manually, I get an error:

[root@ip-10-0-5-15 bin]# /opt/aws/neuron/bin/oci_neuron_hook_wrapper.sh
/usr/bin/which: no oci-add-hooks in (/root/.local/bin:/root/bin:/opt/aws/neuron/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/var/lib/snapd/snap/bin)
/opt/aws/neuron/bin/oci_neuron_hook_wrapper.sh: line 6: exec: --: invalid option
exec: usage: exec [-cl] [-a name] [command [argument ...]] [redirection ...]

[root@ip-10-0-5-15 bin]# rpm -qa | grep neuron
aws-neuronx-dkms-2.16.7.0-dkms.noarch
aws-neuronx-oci-hook-2.3.0.0-1.x86_64
aws-neuronx-tools-2.17.1.0-1.x86_64

After investigating, it seems that the oci-add-hooks.x86_64 package is missing from the AMI. After installing it, Docker starts up fine.
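
For anyone hitting this before the AMI is fixed, here is a rough sketch of a UserData workaround based on the steps above. It checks whether `oci-add-hooks` is on the PATH and installs it only if absent, then applies the Neuron Docker config. The function names `missing_commands` and `setup_neuron_docker` are made up for illustration, and it assumes the `oci-add-hooks` package can be installed with `dnf` from the repositories configured on the AMI and that the script runs as root.

```shell
#!/usr/bin/env bash

# Print each command from the arguments that is not found on PATH.
# Returns 1 if at least one is missing, 0 otherwise.
missing_commands() {
    local rc=0 cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
            rc=1
        fi
    done
    return "$rc"
}

# Install oci-add-hooks when it is absent, then enable the Neuron Docker config.
# Assumption: the package is available to dnf on this AMI.
setup_neuron_docker() {
    if ! missing_commands oci-add-hooks >/dev/null; then
        dnf install -y oci-add-hooks
    fi
    cp /opt/aws/neuron/share/docker-daemon.json /etc/docker/daemon.json
    systemctl restart docker
}
```

Calling `setup_neuron_docker` at the end of the UserData script would replace the two commands shown earlier. Running `missing_commands oci-add-hooks` on its own is a quick way to confirm whether a given AMI is affected.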

geetasg commented 3 months ago

Thank you for reporting the issue. I am trying out the steps on my side. Can you please confirm which runtime is in use? The error string references containerd.

level=error time=2024-06-19T16:38:32Z msg="could not start Agent: API error (400): failed to create task for container: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/moby/b2b6d9f3947516c395ac544748baaa909e0f7156067a1082c8592a0aea88517a/log.json: no such file or directory): /opt/aws/neuron/bin/oci_neuron_hook_wrapper.sh did not terminate successfully: exit status 2: unknown"

Quick reference to the docs: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/tutorial-oci-hook.html

rantoniuk commented 3 months ago

This report is about the AWS AMI, which is supposed to have all the prerequisites for Neuron already installed but does not.