aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

Pytorch application and runtime in same container? #350

Closed diazGT94 closed 2 years ago

diazGT94 commented 2 years ago

I've used the template presented here to create my Docker image and added the dependencies for my app as specified:

# Include your APP dependencies here.
COPY ./package /package/
RUN pip install -r /package/requirements.txt

Then I used the entry point template from here and modified line 41 to start my application: python main.py --key_dev True

Doing this I succeeded in building the Docker image, and I stop the neuron-rtd service before running the image.

The image starts to run and at first it displays information from neuron-top (screenshot attached). As can be seen, there are no models loaded to the core. If I exit neuron-top, I see that my image is stuck at nrtd[7]: [NRTD:RunServer] Server listening on unix:/run/neuron.sock and my application is never executed.

To debug, I printed the value of "$1", which the bash script uses to decide whether to run the application. As can be seen, the condition is never true, and therefore my Python script is never executed.

echo "$1"
if [[ "$1" = "serve" ]]; then
  # Start your application here!
  # e.g: 'python my_server_app.py'
  echo "Hello World"
  'python main.py  --key_dev True'

(screenshot of the echo output attached)

I would like to know why the value is never set to "serve" and what I should do to successfully run my application in the container.
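For context, Docker passes the image's CMD values as arguments to the ENTRYPOINT script, so "$1" only equals serve if the CMD (or the argument passed to docker run) is serve. A minimal sketch of how the two pieces fit together is below; the file names are placeholders for illustration, not the exact ones from the template:

# Dockerfile (sketch)
ENTRYPOINT ["/usr/local/bin/entrypoint.sh"]
CMD ["serve"]            # becomes "$1" inside entrypoint.sh

# entrypoint.sh (sketch)
#!/bin/bash
set -e
if [[ "$1" = "serve" ]]; then
    # Start the application when the container is run with "serve"
    python main.py --key_dev True
else
    # Otherwise run whatever command was passed (e.g. neuron-top)
    exec "$@"
fi

With CMD ["neuron-top"], "$1" is neuron-top, so the serve branch never runs.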

awsrjh commented 2 years ago

let us look.

thanks

diazGT94 commented 2 years ago

I was able to run my Python code by replacing CMD ["neuron-top"] with CMD ["serve"] in the Dockerfile. However, I found the following error when the model is loaded to the chip (screenshot attached).

I checked whether aws-neuron-dkms was installed in the image using dpkg -l | grep neuron and, as can be seen from the attached image, it is not installed.

I followed the steps indicated here and tried to install aws-neuron-dkms in my image, but when the command RUN apt-get install aws-neuron-dkms -y is executed it returns the following error:

Building for 5.4.0-1058-aws
Building for architecture x86_64
Building initial module for 5.4.0-1058-aws
Done.

neuron:
Running module version sanity check.

Running the pre_install script:
/var/lib/dkms/aws-neuron/2.2.6.0/source/./preinstall: line 2: udevadm: command not found
Error! pre_install failed, aborting install.
You may override by specifying --force.
dpkg: error processing package aws-neuron-dkms (--configure):
 installed aws-neuron-dkms package post-installation script subprocess returned error exit status 101
Setting up build-essential (12.4ubuntu1) ...
Processing triggers for libc-bin (2.27-3ubuntu1.4) ...
Errors were encountered while processing:
 aws-neuron-dkms
E: Sub-process /usr/bin/dpkg returned an error code (1)
The command '/bin/sh -c apt-get install aws-neuron-dkms  -y' returned a non-zero code: 100

and the image is never built. I don't know whether the reason my model doesn't load in the previous Docker image is related to the version of Neuron I used to convert it from PyTorch to PyTorch-Neuron.

awsrjh commented 2 years ago

@diazGT94 - our latest release eliminated the need for neuron-rtd and simplified the container deployment experience.

In your specific case, there are two issues that stand out:

  1. The aws-neuron-dkms package cannot be installed into a container. It's a kernel mode driver that must only be installed on the underlying OS/AMI. Please remove that line and any line that's installing aws-neuron-runtime. Neither are needed.
  2. The use of the entrypoint script that starts neuron-rtd is no longer necessary. It was necessary on the older versions of Neuron that required neuron-rtd, but the latest Neuron components come with the runtime baked in as a library.

Please check out this document for more details on getting a working container with Neuron: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-deploy/tutorials/neuron-container.html

If you have any further problems with the container setup, let us know here.

diazGT94 commented 2 years ago

@awsrjh Thanks for your help.

I used the Dockerfile provided here and modified it to the following:

FROM ubuntu:18.04

LABEL maintainer=" "

RUN apt-get update -y \
 && apt-get install -y --no-install-recommends \
    ffmpeg \
    libsm6 \
    libxext6 \
    gnupg2 \
    wget \
    python3-pip \
    python3-setuptools \
    && cd /usr/local/bin \
    && pip3 --no-cache-dir install --upgrade pip \
    && rm -rf /var/lib/apt/lists/* \
    && apt-get clean

RUN echo "deb https://apt.repos.neuron.amazonaws.com bionic main" > /etc/apt/sources.list.d/neuron.list
RUN wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | apt-key add -

# Installing Neuron Tools
RUN apt-get update -y && apt-get install -y \
    aws-neuron-tools

# Sets up Path for Neuron tools
ENV PATH="/opt/bin/:/opt/aws/neuron/bin:${PATH}"

# Include framework tensorflow-neuron or torch-neuron and compiler (compiler not needed for inference)
RUN pip3 install \
    torch-neuron \
    --extra-index-url=https://pip.repos.neuron.amazonaws.com

COPY ./package /package/

RUN pip install -r /package/requirements.txt

WORKDIR "/package"

ENTRYPOINT ["python3", "main.py"]

By doing this I succeeded in building my image. However, when I tried to run it using the commands specified in the tutorial, I still get an error when my script tries to load the model, which indicates that the PyTorch Neuron runtime could not be initialized, as you can see from the image below.

(screenshot of the runtime initialization error attached)
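For reference, the container tutorial runs the image while giving it access to the Neuron device on the host; a sketch of the kind of command involved is below (the device path and image tag are assumptions for illustration, not taken from this thread):

# Run the container with access to the first Neuron device on the host.
# This assumes the Neuron driver is installed on the host OS and /dev/neuron0 exists.
docker run --device=/dev/neuron0 my-neuron-app:latest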

awsrjh commented 2 years ago

reopening

awsrjh commented 2 years ago

Hi

one possibility: when you removed the aws-neuron-dkms from the container config -- did you put that step into the Base OS? The driver needs to be installed on the base operating system.

First you need to remove the old driver (this is described here: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-intro/pytorch-setup/pytorch-install.html#develop-on-aws-ml-accelerator-instance):

Do these steps on the base operating system - not in a container config:

  1. Stop Neuron Runtime 1.x daemon (neuron-rtd) by running: sudo systemctl stop neuron-rtd

  2. Uninstall neuron-rtd by running: sudo apt remove aws-neuron-runtime

  3. Install or upgrade to the latest Neuron driver (aws-neuron-dkms) by following the “Setup Guide” instructions:

sudo apt-get update -y

sudo apt-get install linux-headers-$(uname -r) -y

sudo apt-get install aws-neuron-dkms -y

All of this is found in this guide: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-intro/pytorch-setup/pytorch-install.html
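Before launching the container, a quick way to sanity-check that the driver is present on the base OS (the module and device names below are assumptions based on a typical Neuron install):

# Confirm the Neuron kernel module is loaded on the host
lsmod | grep neuron

# Confirm the Neuron device nodes exist (e.g. /dev/neuron0)
ls /dev/neuron*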

diazGT94 commented 2 years ago

Closing it again, doing this on the Base OS solved the issue.