aws-samples / amazon-sagemaker-endpoint-deployment-of-fastai-model-with-torchserve

Deploy FastAI Trained PyTorch Model in TorchServe and Host in Amazon SageMaker Inference Endpoint
https://aws.amazon.com/blogs/opensource/deploy-fast-ai-trained-pytorch-model-in-torchserve-and-host-in-amazon-sagemaker-inference-endpoint/
MIT No Attribution

ModelError Parameter model_name is required #6

Closed robmarkcole closed 3 years ago

robmarkcole commented 3 years ago

Having successfully deployed the model and endpoint, I receive the following error when calling client.invoke_endpoint:

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from model with message "{
  "code": 400,
  "type": "BadRequestException",
  "message": "Parameter model_name is required."
}

This appears to be https://github.com/pytorch/serve/issues/631, which requires an update to the Dockerfile (adding captum).
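A minimal sketch of that Dockerfile addition, assuming a pip-based serving image (the captum requirement comes from the linked issue, not this repo):

# Per pytorch/serve#631, TorchServe can fail to load models when captum is
# missing, leaving the model store effectively empty at invocation time.
RUN pip install --no-cache-dir captum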

csaroff commented 3 years ago

@robmarkcole I ran into this exact issue recently; this is the error you get when the model store directory contains no models. I believe AWS mounts the model directory at /opt/ml/model.

You can reproduce this error locally by calling the invocations endpoint while your model store directory is empty.

import requests

# Read a sample payload (e.g. an image) to send to the server
file_name = "/path/to/file"

with open(file_name, "rb") as f:
    payload = f.read()

# Call the local TorchServe invocations endpoint
response = requests.post("http://127.0.0.1:8080/invocations/", data=payload)
response.json()

Response:

{'code': 400,
 'type': 'BadRequestException',
 'message': 'Parameter model_name is required.'}

In my case, rather than building the Docker container myself and pointing TorchServe to /opt/ml/model in my Dockerfile, I had tried to use the TorchServe Docker image from Docker Hub (which doesn't seem to use /opt/ml/model as the default model store directory).
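If you do want the stock image, one workaround is to override the model store when starting the container. A minimal sketch, assuming the official pytorch/torchserve image, a local model_store/ directory containing a .mar, and that your TorchServe version supports the --foreground flag:

# Mount the local model store at the path where SageMaker would place model
# artifacts, and tell TorchServe to load every model it finds there.
docker run --rm -it -p 8080:8080 \
    -v "$PWD/model_store:/opt/ml/model" \
    pytorch/torchserve:latest \
    torchserve --model-store /opt/ml/model --models all --foreground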

robmarkcole commented 3 years ago

@csaroff I believe you have identified it: the error is in the Dockerfile, where the contents of the model store are not copied into the container.
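If the model archive is meant to be baked into the image rather than mounted by SageMaker at runtime, the missing step might look like this sketch (model_store/ is a hypothetical local directory holding the .mar):

# Copy the packaged .mar archive(s) into the directory TorchServe uses
# as its model store (hypothetical paths).
COPY model_store/ /opt/ml/model/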

dashjim commented 2 years ago

I had a similar error. It happened because when I tarred the .mar file into a .tar.gz, it was inside a folder, so when SageMaker uncompressed the archive the .mar file ended up inside a folder too. However, TorchServe needs the *.mar file at the top level, as shown in the sketch below.
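In other words, the .mar must sit at the root of model.tar.gz. A minimal packaging sketch, with model.mar and model_dir as placeholder names:

# Create the archive from inside the directory holding the .mar, so the
# file lands at the top level of the tarball rather than in a subfolder.
cd /path/to/model_dir
tar -czvf model.tar.gz model.mar

# Verify: the listing should show "model.mar", not "model_dir/model.mar".
tar -tzvf model.tar.gz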

akqjxx commented 2 years ago

Batch transform error:

2022-08-30T09:01:17.792:[sagemaker logs]: MaxConcurrentTransforms=1, MaxPayloadInMB=50, BatchStrategy=MULTI_RECORD
2022-08-30T09:01:17.883:[sagemaker logs]: st-s3/trainingPlatform/model/ba0ba70ebb2c48f69c61240a199f7a24/inference/dataset/21f9bb9e2a39407691bdb18e04e1b672/202208170843490.jpg: ClientError: 400
2022-08-30T09:01:17.883:[sagemaker logs]: st-s3/trainingPlatform/model/ba0ba70ebb2c48f69c61240a199f7a24/inference/dataset/21f9bb9e2a39407691bdb18e04e1b672/202208170843490.jpg:
2022-08-30T09:01:17.883:[sagemaker logs]: st-s3/trainingPlatform/model/ba0ba70ebb2c48f69c61240a199f7a24/inference/dataset/21f9bb9e2a39407691bdb18e04e1b672/202208170843490.jpg: Message:
2022-08-30T09:01:17.883:[sagemaker logs]: st-s3/trainingPlatform/model/ba0ba70ebb2c48f69c61240a199f7a24/inference/dataset/21f9bb9e2a39407691bdb18e04e1b672/202208170843490.jpg: {
2022-08-30T09:01:17.883:[sagemaker logs]: st-s3/trainingPlatform/model/ba0ba70ebb2c48f69c61240a199f7a24/inference/dataset/21f9bb9e2a39407691bdb18e04e1b672/202208170843490.jpg: "code": 400,
2022-08-30T09:01:17.884:[sagemaker logs]: st-s3/trainingPlatform/model/ba0ba70ebb2c48f69c61240a199f7a24/inference/dataset/21f9bb9e2a39407691bdb18e04e1b672/202208170843490.jpg: "type": "BadRequestException",
2022-08-30T09:01:17.884:[sagemaker logs]: st-s3/trainingPlatform/model/ba0ba70ebb2c48f69c61240a199f7a24/inference/dataset/21f9bb9e2a39407691bdb18e04e1b672/202208170843490.jpg: "message": "Parameter model_name is required."
2022-08-30T09:01:17.884:[sagemaker logs]: st-s3/trainingPlatform/model/ba0ba70ebb2c48f69c61240a199f7a24/inference/dataset/21f9bb9e2a39407691bdb18e04e1b672/202208170843490.jpg: }

My Dockerfile is:

FROM nvidia/cuda:10.1-cudnn7-devel-ubuntu16.04

# NCCL_VERSION=2.4.7, CUDNN_VERSION=7.6.2.24

LABEL maintainer="Amazon AI"
LABEL dlc_major_version="1"
LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true

# Add arguments to achieve the version, python and url

ARG PYTHON=python3
ARG PYTHON_VERSION=3.7.3
ARG OPEN_MPI_VERSION=4.0.1
ARG TS_VERSION="0.3.1"
ARG PT_INFERENCE_URL=https://aws-pytorch-binaries.s3-us-west-2.amazonaws.com/r1.6.0_inference/20200727-223446/b0251e7e070e57f34ee08ac59ab4710081b41918/gpu/torch-1.6.0-cp36-cp36m-manylinux1_x86_64.whl
ARG PT_VISION_URL=https://torchvision-build.s3.amazonaws.com/1.6.0/gpu/torchvision-0.7.0-cp36-cp36m-linux_x86_64.whl

# See http://bugs.python.org/issue19846

ENV LANG C.UTF-8
ENV LD_LIBRARY_PATH /opt/conda/lib/:$LD_LIBRARY_PATH
ENV PATH /opt/conda/bin:$PATH
ENV SAGEMAKER_SERVING_MODULE sagemaker_pytorch_serving_container.serving:main
ENV TEMP=/home/model-server/tmp

RUN apt-get update \
    && apt-get install -y --no-install-recommends software-properties-common \
    && add-apt-repository ppa:openjdk-r/ppa \
    && apt-get update \
    && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
    build-essential \
    ca-certificates \
    cmake \
    curl \
    emacs \
    git \
    jq \
    libgl1-mesa-glx \
    libglib2.0-0 \
    libgomp1 \
    libibverbs-dev \
    libnuma1 \
    libnuma-dev \
    libsm6 \
    libxext6 \
    libxrender-dev \
    openjdk-11-jdk \
    vim \
    wget \
    unzip \
    zlib1g-dev

# https://github.com/docker-library/openjdk/issues/261
# https://github.com/docker-library/openjdk/pull/263/files

RUN keytool -importkeystore -srckeystore /etc/ssl/certs/java/cacerts -destkeystore /etc/ssl/certs/java/cacerts.jks -deststoretype JKS -srcstorepass changeit -deststorepass changeit -noprompt; \
    mv /etc/ssl/certs/java/cacerts.jks /etc/ssl/certs/java/cacerts; \
    /var/lib/dpkg/info/ca-certificates-java.postinst configure;

RUN wget https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-$OPEN_MPI_VERSION.tar.gz \
    && gunzip -c openmpi-$OPEN_MPI_VERSION.tar.gz | tar xf - \
    && cd openmpi-$OPEN_MPI_VERSION \
    && ./configure --prefix=/home/.openmpi \
    && make all install \
    && cd .. \
    && rm openmpi-$OPEN_MPI_VERSION.tar.gz \
    && rm -rf openmpi-$OPEN_MPI_VERSION

ENV PATH="$PATH:/home/.openmpi/bin"
ENV LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/home/.openmpi/lib/"

# Install OpenSSH. Allow OpenSSH to talk to containers without asking for confirmation

RUN apt-get install -y --no-install-recommends \
    openssh-client \
    openssh-server \
    && mkdir -p /var/run/sshd \
    && cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new \
    && echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new \
    && mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config

RUN curl -L -o ~/miniconda.sh https://repo.continuum.io/miniconda/Miniconda3-4.6.14-Linux-x86_64.sh \
    && chmod +x ~/miniconda.sh \
    && ~/miniconda.sh -b -p /opt/conda \
    && rm ~/miniconda.sh

RUN /opt/conda/bin/conda update conda \
    && /opt/conda/bin/conda install -c conda-forge \
    python=$PYTHON_VERSION \
    && /opt/conda/bin/conda install -y \
    cython==0.29.12 \
    ipython==7.7.0 \
    mkl-include==2019.4 \
    mkl==2019.4 \
    numpy==1.19.1 \
    scipy==1.3.0 \
    typing==3.6.4 \
    && /opt/conda/bin/conda clean -ya

RUN conda install -c pytorch magma-cuda101 \
    && conda install -c conda-forge \
    opencv==4.0.1 \
    && conda install -y \
    scikit-learn==0.21.2 \
    pandas==0.25.0 \
    h5py==2.9.0 \
    requests==2.22.0 \
    && conda clean -ya \
    && /opt/conda/bin/conda config --set ssl_verify False \
    && pip install --upgrade pip --trusted-host pypi.org --trusted-host files.pythonhosted.org \
    && ln -s /opt/conda/bin/pip /usr/local/bin/pip3 \
    && pip install packaging==20.4 \
    enum-compat==0.0.3 \
    ruamel-yaml

# Uninstall and re-install torch and torchvision from the PyTorch website

RUN pip install --no-cache-dir -U https://pypi.tuna.tsinghua.edu.cn/packages/5d/5e/35140615fc1f925023f489e71086a9ecc188053d263d3594237281284d82/torch-1.6.0-cp37-cp37m-manylinux1_x86_64.whl#sha256=87d65c01d1b70bb46070824f28bfd93c86d3c5c56b90cbbe836a3f2491d91c76

RUN pip uninstall -y torchvision \
    && pip install --no-deps --no-cache-dir -U https://mirrors.aliyun.com/pypi/packages/4d/b5/60d5eb61f1880707a5749fea43e0ec76f27dfe69391cdec953ab5da5e676/torchvision-0.7.0-cp37-cp37m-manylinux1_x86_64.whl#sha256=0d1a5adfef4387659c7a0af3b72e16caa0c67224a422050ab65184d13ac9fb13

RUN pip uninstall -y model-archiver multi-model-server \
    && pip install captum \
    && pip install torchserve==$TS_VERSION \
    && pip install torch-model-archiver==$TS_VERSION

RUN useradd -m model-server \
    && mkdir -p /home/model-server/tmp /opt/ml/model \
    && chown -R model-server /home/model-server /opt/ml/model

COPY torchserve-entrypoint.py /usr/local/bin/dockerd-entrypoint.py
COPY config.properties /home/model-server

RUN chmod +x /usr/local/bin/dockerd-entrypoint.py

ADD https://raw.githubusercontent.com/aws/deep-learning-containers/master/src/deep_learning_container.py /usr/local/bin/deep_learning_container.py

RUN chmod +x /usr/local/bin/deep_learning_container.py

RUN pip install --no-cache-dir "sagemaker-pytorch-inference>=2"

RUN curl https://aws-dlc-licenses.s3.amazonaws.com/pytorch-1.6.0/license.txt -o /license.txt

RUN conda install -y -c conda-forge "pyyaml>5.4,<5.5"
RUN pip install pillow==8.2.0 "awscli<2"

RUN python3 -m pip install detectron2==0.4 -f \
    https://dl.fbaipublicfiles.com/detectron2/wheels/cu101/torch1.6/index.html

RUN HOME_DIR=/root \
    && curl -o ${HOME_DIR}/oss_compliance.zip https://aws-dlinfra-utilities.s3.amazonaws.com/oss_compliance.zip \
    && unzip ${HOME_DIR}/oss_compliance.zip -d ${HOME_DIR}/ \
    && cp ${HOME_DIR}/oss_compliance/test/testOSSCompliance /usr/local/bin/testOSSCompliance \
    && chmod +x /usr/local/bin/testOSSCompliance \
    && chmod +x ${HOME_DIR}/oss_compliance/generate_oss_compliance.sh \
    && ${HOME_DIR}/oss_compliance/generate_oss_compliance.sh ${HOME_DIR} ${PYTHON} \
    && rm -rf ${HOME_DIR}/oss_compliance*

EXPOSE 8080 8081

ENTRYPOINT ["python", "/usr/local/bin/dockerd-entrypoint.py"]
CMD ["torchserve", "--start", "--ts-config", "/home/model-server/config.properties", "--model-store", "/home/model-server/"]
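Note that this CMD points --model-store at /home/model-server/, while SageMaker extracts model.tar.gz into /opt/ml/model, which matches csaroff's empty-model-store diagnosis above. A hedged sketch of a corrected final line, assuming the .mar sits at the top level of the archive:

# Read models from the directory where SageMaker places the model artifacts.
CMD ["torchserve", "--start", "--ts-config", "/home/model-server/config.properties", "--model-store", "/opt/ml/model"]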