robmarkcole closed this issue 3 years ago.
@robmarkcole I ran into this exact issue recently; this is the error you get when the model store directory contains no models. I believe AWS mounts the model directory at /opt/ml/model.
You can reproduce this error locally by calling the invocations endpoint while your model store directory is empty.
import requests

# POST a sample payload to the local TorchServe invocations endpoint
file_name = "/path/to/file"
with open(file_name, 'rb') as f:
    payload = f.read()

response = requests.post('http://127.0.0.1:8080/invocations/', data=payload)
response.json()
Response:
{'code': 400,
'type': 'BadRequestException',
'message': 'Parameter model_name is required.'}
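A quick way to confirm the model store really is empty is to query TorchServe's management API and list the registered models (this assumes the management port is at its default of 8081):

# List the models registered with the running TorchServe instance
curl http://127.0.0.1:8081/models

# With an empty model store this returns: {"models": []}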
In my case, rather than building the Docker container myself and pointing TorchServe at /opt/ml/model from my Dockerfile, I had tried to use the TorchServe image from Docker Hub (which doesn't seem to use /opt/ml/model as the default model store directory).
@csaroff I believe you have identified the cause: the error is in the Dockerfile, where the contents of the model store are not copied into the container.
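A minimal sketch of that fix (the archive name my_model.mar is a placeholder) is to copy the archive into whatever directory the torchserve --model-store flag points at:

# Hypothetical: put the model archive where torchserve looks for models
COPY my_model.mar /opt/ml/model/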
I had a similar error, and it was because when I tarred the .mar file into the .tar.gz it was under a folder, so when SageMaker uncompressed it the .mar file ended up under a folder too. However, TorchServe needs the *.mar file at the top level.
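For anyone else hitting this, a sketch of the packaging difference (directory and file names are placeholders):

# Wrong: the folder itself is archived, so SageMaker extracts export/model.mar
tar -czf model.tar.gz export/

# Right: -C enters the folder first, so model.mar sits at the archive root
tar -czf model.tar.gz -C export model.mar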
batch transform error:

2022-08-30T09:01:17.792:[sagemaker logs]: MaxConcurrentTransforms=1, MaxPayloadInMB=50, BatchStrategy=MULTI_RECORD
2022-08-30T09:01:17.883:[sagemaker logs]: st-s3/trainingPlatform/model/ba0ba70ebb2c48f69c61240a199f7a24/inference/dataset/21f9bb9e2a39407691bdb18e04e1b672/202208170843490.jpg: ClientError: 400
2022-08-30T09:01:17.883:[sagemaker logs]: .../202208170843490.jpg: Message:
2022-08-30T09:01:17.883:[sagemaker logs]: .../202208170843490.jpg: {
2022-08-30T09:01:17.883:[sagemaker logs]: .../202208170843490.jpg:   "code": 400,
2022-08-30T09:01:17.884:[sagemaker logs]: .../202208170843490.jpg:   "type": "BadRequestException",
2022-08-30T09:01:17.884:[sagemaker logs]: .../202208170843490.jpg:   "message": "Parameter model_name is required."
2022-08-30T09:01:17.884:[sagemaker logs]: .../202208170843490.jpg: }

My Dockerfile is:
FROM nvidia/cuda:10.1-cudnn7-devel-ubuntu16.04

LABEL maintainer="Amazon AI"
LABEL dlc_major_version="1"
LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true

ARG PYTHON=python3
ARG PYTHON_VERSION=3.7.3
ARG OPEN_MPI_VERSION=4.0.1
ARG TS_VERSION="0.3.1"
ARG PT_INFERENCE_URL=https://aws-pytorch-binaries.s3-us-west-2.amazonaws.com/r1.6.0_inference/20200727-223446/b0251e7e070e57f34ee08ac59ab4710081b41918/gpu/torch-1.6.0-cp36-cp36m-manylinux1_x86_64.whl
ARG PT_VISION_URL=https://torchvision-build.s3.amazonaws.com/1.6.0/gpu/torchvision-0.7.0-cp36-cp36m-linux_x86_64.whl

ENV LANG C.UTF-8
ENV LD_LIBRARY_PATH /opt/conda/lib/:$LD_LIBRARY_PATH
ENV PATH /opt/conda/bin:$PATH
ENV SAGEMAKER_SERVING_MODULE sagemaker_pytorch_serving_container.serving:main
ENV TEMP=/home/model-server/tmp

RUN apt-get update \
 && apt-get install -y --no-install-recommends software-properties-common \
 && add-apt-repository ppa:openjdk-r/ppa \
 && apt-get update \
 && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
    build-essential \
    ca-certificates \
    cmake \
    curl \
    emacs \
    git \
    jq \
    libgl1-mesa-glx \
    libglib2.0-0 \
    libgomp1 \
    libibverbs-dev \
    libnuma1 \
    libnuma-dev \
    libsm6 \
    libxext6 \
    libxrender-dev \
    openjdk-11-jdk \
    vim \
    wget \
    unzip \
    zlib1g-dev

RUN keytool -importkeystore -srckeystore /etc/ssl/certs/java/cacerts -destkeystore /etc/ssl/certs/java/cacerts.jks -deststoretype JKS -srcstorepass changeit -deststorepass changeit -noprompt; \
    mv /etc/ssl/certs/java/cacerts.jks /etc/ssl/certs/java/cacerts; \
    /var/lib/dpkg/info/ca-certificates-java.postinst configure;

RUN wget https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-$OPEN_MPI_VERSION.tar.gz \
 && gunzip -c openmpi-$OPEN_MPI_VERSION.tar.gz | tar xf - \
 && cd openmpi-$OPEN_MPI_VERSION \
 && ./configure --prefix=/home/.openmpi \
 && make all install \
 && cd .. \
 && rm openmpi-$OPEN_MPI_VERSION.tar.gz \
 && rm -rf openmpi-$OPEN_MPI_VERSION

ENV PATH="$PATH:/home/.openmpi/bin"
ENV LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/home/.openmpi/lib/"

RUN apt-get install -y --no-install-recommends \
    openssh-client \
    openssh-server \
 && mkdir -p /var/run/sshd \
 && cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new \
 && echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new \
 && mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config

RUN curl -L -o ~/miniconda.sh https://repo.continuum.io/miniconda/Miniconda3-4.6.14-Linux-x86_64.sh \
 && chmod +x ~/miniconda.sh \
 && ~/miniconda.sh -b -p /opt/conda \
 && rm ~/miniconda.sh

RUN /opt/conda/bin/conda update conda \
 && /opt/conda/bin/conda install -c conda-forge \
    python=$PYTHON_VERSION \
 && /opt/conda/bin/conda install -y \
    cython==0.29.12 \
    ipython==7.7.0 \
    mkl-include==2019.4 \
    mkl==2019.4 \
    numpy==1.19.1 \
    scipy==1.3.0 \
    typing==3.6.4 \
 && /opt/conda/bin/conda clean -ya

RUN conda install -c pytorch magma-cuda101 \
 && conda install -c conda-forge opencv==4.0.1 \
 && conda install -y \
    scikit-learn==0.21.2 \
    pandas==0.25.0 \
    h5py==2.9.0 \
    requests==2.22.0 \
 && conda clean -ya \
 && /opt/conda/bin/conda config --set ssl_verify False \
 && pip install --upgrade pip --trusted-host pypi.org --trusted-host files.pythonhosted.org \
 && ln -s /opt/conda/bin/pip /usr/local/bin/pip3 \
 && pip install packaging==20.4 \
    enum-compat==0.0.3 \
    ruamel-yaml

RUN pip install --no-cache-dir -U https://pypi.tuna.tsinghua.edu.cn/packages/5d/5e/35140615fc1f925023f489e71086a9ecc188053d263d3594237281284d82/torch-1.6.0-cp37-cp37m-manylinux1_x86_64.whl#sha256=87d65c01d1b70bb46070824f28bfd93c86d3c5c56b90cbbe836a3f2491d91c76

RUN pip uninstall -y torchvision \
 && pip install --no-deps --no-cache-dir -U https://mirrors.aliyun.com/pypi/packages/4d/b5/60d5eb61f1880707a5749fea43e0ec76f27dfe69391cdec953ab5da5e676/torchvision-0.7.0-cp37-cp37m-manylinux1_x86_64.whl#sha256=0d1a5adfef4387659c7a0af3b72e16caa0c67224a422050ab65184d13ac9fb13

RUN pip uninstall -y model-archiver multi-model-server \
 && pip install captum \
 && pip install torchserve==$TS_VERSION \
 && pip install torch-model-archiver==$TS_VERSION

RUN useradd -m model-server \
 && mkdir -p /home/model-server/tmp /opt/ml/model \
 && chown -R model-server /home/model-server /opt/ml/model

COPY torchserve-entrypoint.py /usr/local/bin/dockerd-entrypoint.py
COPY config.properties /home/model-server

RUN chmod +x /usr/local/bin/dockerd-entrypoint.py

ADD https://raw.githubusercontent.com/aws/deep-learning-containers/master/src/deep_learning_container.py /usr/local/bin/deep_learning_container.py

RUN chmod +x /usr/local/bin/deep_learning_container.py

RUN pip install --no-cache-dir "sagemaker-pytorch-inference>=2"

RUN curl https://aws-dlc-licenses.s3.amazonaws.com/pytorch-1.6.0/license.txt -o /license.txt

RUN conda install -y -c conda-forge "pyyaml>5.4,<5.5"
RUN pip install pillow==8.2.0 "awscli<2"

RUN python3 -m pip install detectron2==0.4 -f \
    https://dl.fbaipublicfiles.com/detectron2/wheels/cu101/torch1.6/index.html

RUN HOME_DIR=/root \
 && curl -o ${HOME_DIR}/oss_compliance.zip https://aws-dlinfra-utilities.s3.amazonaws.com/oss_compliance.zip \
 && unzip ${HOME_DIR}/oss_compliance.zip -d ${HOME_DIR}/ \
 && cp ${HOME_DIR}/oss_compliance/test/testOSSCompliance /usr/local/bin/testOSSCompliance \
 && chmod +x /usr/local/bin/testOSSCompliance \
 && chmod +x ${HOME_DIR}/oss_compliance/generate_oss_compliance.sh \
 && ${HOME_DIR}/oss_compliance/generate_oss_compliance.sh ${HOME_DIR} ${PYTHON} \
 && rm -rf ${HOME_DIR}/oss_compliance*

EXPOSE 8080 8081
ENTRYPOINT ["python", "/usr/local/bin/dockerd-entrypoint.py"]
CMD ["torchserve", "--start", "--ts-config", "/home/model-server/config.properties", "--model-store", "/home/model-server/"]
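Note that this Dockerfile's CMD points --model-store at /home/model-server/, not /opt/ml/model, so the .mar has to actually end up there. One way to sanity-check the built image (the image tag is a placeholder):

# List the model store directory inside the image; the .mar file should appear here
docker run --rm --entrypoint ls my-torchserve-image /home/model-server/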
Having successfully deployed the model and endpoint, at client.invoke_endpoint I receive an error. Appears to be https://github.com/pytorch/serve/issues/631, with an update to the Dockerfile required (add captum).
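If the cause is the same as in that issue, the Dockerfile change is just ensuring captum is installed in the serving image, e.g.:

RUN pip install captum

(The Dockerfile above already includes this line.)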