aws / sagemaker-pytorch-inference-toolkit

Toolkit for inference and serving with PyTorch on SageMaker. Dockerfiles used for building SageMaker PyTorch containers are at https://github.com/aws/deep-learning-containers.
Apache License 2.0

Switching to TorchServe for 1.6.0 inference has caused undocumented breaking changes and regression #85

Closed. setu4993 closed this issue 3 years ago.

setu4993 commented 4 years ago

See https://github.com/aws/sagemaker-python-sdk/issues/1909.

amaharek commented 3 years ago

I enabled debugging in inference.py and found that the torch-model-archiver command being run is:

['torch-model-archiver', '--model-name', 'model', '--handler', '/opt/conda/lib/python3.6/site-packages/sagemaker_pytorch_serving_container/handler_service.py', '--serialized-file', '/opt/ml/model/model.pth', '--export-path', '/.sagemaker/ts/models', '--extra-files', '/opt/ml/model/code/inference.py', '--version', '1']

Note that --extra-files contains only /opt/ml/model/code/inference.py, so nothing else from the model directory ends up in the model archive.
amaharek commented 3 years ago

@setu4993 I have tried this fix and it worked for me. Could you please test it?

def _adapt_to_ts_format(handler_service):
    if not os.path.exists(DEFAULT_TS_MODEL_DIRECTORY):
        os.makedirs(DEFAULT_TS_MODEL_DIRECTORY)

    # Start with the user's entry-point module from the code/ directory.
    extra_files = [
        os.path.join(
            environment.model_dir,
            DEFAULT_TS_CODE_DIR,
            environment.Environment().module_name + ".py",
        )
    ]
    # Add every other top-level file in the model directory so it is packaged
    # into the model archive as well.
    extra_files += [
        os.path.join(environment.model_dir, f)
        for f in os.listdir(environment.model_dir)
        if os.path.isfile(os.path.join(environment.model_dir, f))
    ]
    # The serialized model is already passed via --serialized-file, so drop it
    # from the extra files to avoid packaging it twice.
    extra_files.remove(os.path.join(environment.model_dir, DEFAULT_TS_MODEL_SERIALIZED_FILE))
    logger.info(extra_files)

    model_archiver_cmd = [
        "torch-model-archiver",
        "--model-name",
        DEFAULT_TS_MODEL_NAME,
        "--handler",
        handler_service,
        "--serialized-file",
        os.path.join(environment.model_dir, DEFAULT_TS_MODEL_SERIALIZED_FILE),
        "--export-path",
        DEFAULT_TS_MODEL_DIRECTORY,
        "--extra-files",
        ",".join(extra_files),
        "--version",
        "1",
    ]

    logger.info(model_archiver_cmd)
    subprocess.check_call(model_archiver_cmd)

    _set_python_path()
setu4993 commented 3 years ago

@amaharek : Thank you for identifying what needs to be fixed and actually showing a PoC! From looking at the changes, I think it would work, though I'm not sure how to try it :/.

What I'd like, though, is for this to include not just the other top-level files but everything in environment.model_dir. So, yes, all files, but with the tree structure maintained. I think that would be my ideal solution.

Would you be willing to submit a PR with the changes you've made above to this repo? If yes, that'd be great and I hope the maintainers would be willing to accept it. I'd be happy to build on it to add the functionality I describe above.

ldong87 commented 3 years ago

@amaharek This issue also happens to the just-released PyTorch 1.7.1 inference image: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.7.1-gpu-py36-cu110-ubuntu18.04. I hope the fix can be pushed soon. Thanks.

amaharek commented 3 years ago

> @amaharek This issue also happens to the just-released PyTorch 1.7.1 inference image: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.7.1-gpu-py36-cu110-ubuntu18.04. I hope the fix can be pushed soon. Thanks.

This issue affects all PyTorch inference containers from 1.6.0 onward.

amaharek commented 3 years ago

> @amaharek : Thank you for identifying what needs to be fixed and actually showing a PoC! From looking at the changes, I think it would work, though I'm not sure how to try it :/.
>
> What I'd like, though, is for this to include not just the other top-level files but everything in environment.model_dir. So, yes, all files, but with the tree structure maintained. I think that would be my ideal solution.
>
> Would you be willing to submit a PR with the changes you've made above to this repo? If yes, that'd be great and I hope the maintainers would be willing to accept it. I'd be happy to build on it to add the functionality I describe above.

I understand that you are not sure how to test the above code on your end. Also, you want to include all files in the directory /opt/ml/model/.

To test the above change, I extended the AWS container and created a new one in my account as follows:

FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.6.0-cpu-py36-ubuntu16.04

RUN git clone https://github.com/aws/sagemaker-pytorch-inference-toolkit.git
COPY torchserve.py sagemaker-pytorch-inference-toolkit/src/sagemaker_pytorch_serving_container/torchserve.py 
RUN pip install sagemaker-pytorch-inference-toolkit/.

Please check this sample notebook for more information about extending AWS containers.

The above change ensures that all files in the /opt/ml/model directory are included in the generated .mar file, but it does not descend into subdirectories. As per my tests, it is not required to include all the files in /opt/ml/model/code; the only required file is /opt/ml/model/code/<module_name>.py.
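If descending into subdirectories were needed (as requested above), the collection of extra files could be extended with a recursive walk. A minimal sketch, assuming torch-model-archiver accepts the resulting comma-separated list of file paths (collect_extra_files is a hypothetical helper, not part of the toolkit):

    import os

    def collect_extra_files(model_dir, serialized_file_name):
        """Recursively gather every file under model_dir except the serialized model."""
        serialized_path = os.path.join(model_dir, serialized_file_name)
        extra_files = []
        for root, _dirs, files in os.walk(model_dir):
            for name in files:
                path = os.path.join(root, name)
                if path != serialized_path:
                    extra_files.append(path)
        return extra_files

Note that whether torch-model-archiver preserves the relative directory layout of files passed via --extra-files depends on the TorchServe version, so this only shows how the file list itself could be gathered.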

I have already made a PR and hopefully it will be merged.

Thanks

declark1 commented 3 years ago

Thank you very much @amaharek, your PR changes worked for me. I'm using 1.7.1 and my approach was the following, if it helps anyone else out.

  1. Extended the base 1.7.1 image, uninstalled sagemaker-pytorch-inference and replaced it with a patched version from a local branch on my machine. Simply overwriting the existing torchserve.py works as well.

    Dockerfile:

    FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.7.1-gpu-py36-cu110-ubuntu18.04
    COPY sagemaker-pytorch-inference-toolkit /tmp/sagemaker-pytorch-inference-toolkit
    RUN pip uninstall sagemaker-pytorch-inference -y && \
      pip install --no-cache-dir /tmp/sagemaker-pytorch-inference-toolkit
  2. Pushed the extended image to my private ECR repo

  3. Deployed using this image

    Example:

  model = PyTorchModel(
    name='<model_name>',
    model_data='s3://<s3_bucket>/.../model.tar.gz',
    code_location='s3://<s3_bucket>/.../code',
    role=role,
    entry_point='inference.py',
    framework_version='1.7.1',
    image_uri='xxxxx.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.7.1-gpu-py36-cu110-ubuntu18.04-fixed'
  )
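For completeness, deploying from that model object follows the standard SageMaker Python SDK pattern; the instance type and payload below are only placeholders:

    # Deploy the model built from the patched image to a real-time endpoint.
    predictor = model.deploy(
        initial_instance_count=1,
        instance_type='ml.g4dn.xlarge',  # placeholder; pick an instance type that fits your model
    )

    # 'payload' stands in for whatever input your inference.py expects.
    result = predictor.predict(payload)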

I am now packing all dependencies in the model.tar.gz archive, rather than dealing with source_dir or dependencies parameters, which are confusing and unnecessary.

The structure of model.tar.gz is as follows:

  model.pth
  code/
  code/inference.py
  code/requirements.txt
  code/extra_lib/

The extra_lib directory, which I need to access from my inference.py, is available at /opt/ml/model/code/extra_lib.
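Building an archive with that layout can be done, for example, with Python's tarfile module (the paths are placeholders for your own artifacts):

    import tarfile

    # Package the serialized model and the code/ directory into model.tar.gz,
    # preserving the layout shown above.
    with tarfile.open("model.tar.gz", "w:gz") as tar:
        tar.add("model.pth", arcname="model.pth")
        tar.add("code", arcname="code")  # inference.py, requirements.txt, extra_lib/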

Hopefully this fix gets merged soon so others do not have to spend a ton of time debugging.

amaharek commented 3 years ago

Happy to know that it worked for you @declark1

I think the reason parameters like source_dir and dependencies exist is that, ideally, the output of the training job is the model stored on S3 as model.tar.gz.

When you create the model.tar.gz with the structure above, an extra re-archiving step has to happen before launching the endpoint, which might not be best practice.

lxning commented 3 years ago

This issue was fixed in TorchServe 0.3.1 and toolkit v2.0.5. These fixes are available in the latest SageMaker DLCs.

setu4993 commented 3 years ago

Can confirm that this works for me, closing this issue. Thanks for your help, @lxning!

And thanks @amaharek for the attempted fix.