setu4993 closed this issue 3 years ago.
Enabled debugging in inference.py and found the torch-model-archiver command is:
['torch-model-archiver', '--model-name', 'model', '--handler', '/opt/conda/lib/python3.6/site-packages/sagemaker_pytorch_serving_container/handler_service.py', '--serialized-file', '/opt/ml/model/model.pth', '--export-path', '/.sagemaker/ts/models', '--extra-files', '/opt/ml/model/code/inference.py', '--version', '1']
@setu4993 I have tried this fix and it worked for me; could you please test it?
def _adapt_to_ts_format(handler_service):
    if not os.path.exists(DEFAULT_TS_MODEL_DIRECTORY):
        os.makedirs(DEFAULT_TS_MODEL_DIRECTORY)

    # Always include the user's inference module from the code/ directory.
    extra_files = [
        os.path.join(environment.model_dir, DEFAULT_TS_CODE_DIR, environment.Environment().module_name + ".py")
    ]
    # Add every top-level file in the model directory (subdirectories are skipped).
    extra_files += [
        os.path.join(environment.model_dir, f)
        for f in os.listdir(environment.model_dir)
        if os.path.isfile(os.path.join(environment.model_dir, f))
    ]
    # The serialized model goes in --serialized-file, not --extra-files.
    extra_files.remove(os.path.join(environment.model_dir, DEFAULT_TS_MODEL_SERIALIZED_FILE))
    logger.info(extra_files)

    model_archiver_cmd = [
        "torch-model-archiver",
        "--model-name",
        DEFAULT_TS_MODEL_NAME,
        "--handler",
        handler_service,
        "--serialized-file",
        os.path.join(environment.model_dir, DEFAULT_TS_MODEL_SERIALIZED_FILE),
        "--export-path",
        DEFAULT_TS_MODEL_DIRECTORY,
        "--extra-files",
        ",".join(extra_files),
        "--version",
        "1",
    ]
    logger.info(model_archiver_cmd)

    subprocess.check_call(model_archiver_cmd)
    _set_python_path()
@amaharek: Thank you for identifying what needs to be fixed and actually showing a PoC! From looking at the changes, I think it would work, though I'm not sure how to try it :/.

What I'd like, though, is for this to include not just the other top-level files but everything in environment.model_dir. So, yes, all files, but with the tree structure maintained. I think that would be my ideal solution.

Would you be willing to submit a PR with the changes you've made above to this repo? If yes, that'd be great, and I hope the maintainers would be willing to accept it. I'd be happy to build on it to add the functionality I describe above.
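For reference, recursing into subdirectories would mean swapping os.listdir for os.walk in the snippet above. A rough sketch follows; _collect_extra_files is a made-up helper name, and I haven't verified whether torch-model-archiver preserves the relative layout of paths passed to --extra-files, so treat this as a starting point only:

import os

def _collect_extra_files(model_dir, serialized_file_path):
    """Sketch only: recursively gather every file under model_dir,
    skipping the serialized model (it goes in --serialized-file)."""
    extra_files = []
    for root, _dirs, files in os.walk(model_dir):
        for name in files:
            path = os.path.join(root, name)
            if path != serialized_file_path:
                extra_files.append(path)
    return extra_files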
@amaharek This issue also happens to the just-released pytorch 1.7.1 inference image: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.7.1-gpu-py36-cu110-ubuntu18.04. I hope the fix can be pushed soon. Thanks.
This issue happens to all pytorch inference containers > 1.6.0
I understand that you are not sure how to test the above code on your end, and that you want to include all files in the /opt/ml/model/ directory.

To test the above change, I have extended the AWS container and created a new one in my account as follows:
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.6.0-cpu-py36-ubuntu16.04
RUN git clone https://github.com/aws/sagemaker-pytorch-inference-toolkit.git
COPY torchserve.py sagemaker-pytorch-inference-toolkit/src/sagemaker_pytorch_serving_container/torchserve.py
RUN pip install sagemaker-pytorch-inference-toolkit/.
Please check this sample notebook for more information about extending AWS containers.
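As a quick sanity check (my own suggestion, beyond what's shown in the thread), you can confirm inside the rebuilt container that the patched module is the one actually installed:

# Run inside the extended container; verifies the patched torchserve.py
# (which walks the model directory) replaced the stock one.
import inspect
import sagemaker_pytorch_serving_container.torchserve as torchserve

print(torchserve.__file__)
print("os.listdir" in inspect.getsource(torchserve._adapt_to_ts_format))  # expect True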
The above change ensures all files in the /opt/ml/model directory are included in the .mar file created, but it doesn't go through the subdirectories. As per my tests, it's not required to include all the files in /opt/ml/model/code; the only required file is /opt/ml/model/code/<module_name>.py.
I have already made a PR and hopefully it will be merged.
Thanks
Thank you very much @amaharek, your PR changes worked for me. I'm using 1.7.1 and my approach was the following, if it helps anyone else out.
Extended the base 1.7.1 image, uninstalled sagemaker-pytorch-inference and replaced it with a patched version from a local branch on my machine. Simply overwriting the existing torchserve.py works as well.
Dockerfile:
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.7.1-gpu-py36-cu110-ubuntu18.04
COPY sagemaker-pytorch-inference-toolkit /tmp/sagemaker-pytorch-inference-toolkit
RUN pip uninstall sagemaker-pytorch-inference -y && \
    pip install --no-cache-dir /tmp/sagemaker-pytorch-inference-toolkit
Pushed the extended image to my private ECR repo, then deployed using this image.
Example:
model = PyTorchModel(
    name='<model_name>',
    model_data='s3://<s3_bucket>/.../model.tar.gz',
    code_location='s3://<s3_bucket>/.../code',
    role=role,
    entry_point='inference.py',
    framework_version='1.7.1',
    image_uri='xxxxx.ecr.us-east-1.amazonaws.com/pytorch-inference:1.7.1-gpu-py36-cu110-ubuntu18.04-fixed'
)
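From there, deployment follows the usual SageMaker flow; the instance type below is just a placeholder:

# Deploy the model behind a real-time endpoint (instance type is an example).
predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.g4dn.xlarge',
)

# The payload format depends on what inference.py's input_fn expects.
result = predictor.predict(payload)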
I am now packing all dependencies in the model.tar.gz archive, rather than dealing with the source_dir or dependencies parameters, which are confusing and unnecessary.
The structure of model.tar.gz is as follows:
model.pth
code/
code/inference.py
code/requirements.txt
code/extra_lib/
The extra_lib directory, which I need to access from my inference.py, is available at /opt/ml/model/code/extra_lib.
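If it helps, building an archive with that layout is straightforward with Python's tarfile module (assuming model.pth and code/ exist in the working directory):

import tarfile

# Pack model.pth at the archive root and the code/ directory recursively,
# matching the layout above.
with tarfile.open('model.tar.gz', 'w:gz') as tar:
    tar.add('model.pth')
    tar.add('code')  # includes inference.py, requirements.txt, extra_lib/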
Hopefully this fix gets merged soon so others do not have to spend a ton of time debugging.
Happy to know that it worked for you, @dectl.
I think the reason parameters like source_dir and dependencies are used is that, ideally, the output of the training job is the model stored on S3 as model.tar.gz. When you create the model.tar.gz with the mentioned structure yourself, another step of archiving the model files has to happen before launching the endpoint, which might not be the best practice.
This issue was fixed in TorchServe 0.3.1 and toolkit v2.0.5. These fixes are available in the latest SageMaker DLCs.
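To check whether a given container already carries the fixed versions, something like this inside the container should work (package names assumed from PyPI):

import pkg_resources

# Print installed versions of TorchServe and the PyTorch inference toolkit.
for pkg in ('torchserve', 'sagemaker-pytorch-inference'):
    try:
        print(pkg, pkg_resources.get_distribution(pkg).version)
    except pkg_resources.DistributionNotFound:
        print(pkg, 'not installed')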
Can confirm that this works for me, closing this issue. Thanks for your help, @lxning!
And thanks @amaharek for the attempted fix.
See https://github.com/aws/sagemaker-python-sdk/issues/1909.