aws / sagemaker-pytorch-training-toolkit

Toolkit for running PyTorch training scripts on SageMaker. Dockerfiles used for building SageMaker Pytorch Containers are at https://github.com/aws/deep-learning-containers.
Apache License 2.0

model_fn is not recognized. Sagemaker Studio template for model building, training, and deployment #229

Open babarory opened 3 years ago

babarory commented 3 years ago

Hello everyone, I'm very new to SageMaker and I'm facing a strange issue that I can't solve.

My goal: I have created a CNN that I would like to train, build and deploy in an MLOps pipeline with SageMaker.

First of all, I created a notebook instance in SageMaker in which I created a wasteClassification.ipynb and a train.py file. The train.py file contains my neural network definition, some functions to train and save it, and several overridden functions: model_fn, predict_fn, input_fn. In my wasteClassification.ipynb I was able to create a PyTorch estimator, train the model, deploy the endpoint and make predictions using the invoke_endpoint function without any issues.
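For readers unfamiliar with these hooks, here is a minimal sketch of what the serving functions in a script like train.py might look like. It is not the actual code from this issue: the `build_network` helper, the `model.pth` file name, and the `"inputs"` JSON key are all hypothetical, and the tiny network only stands in for the real CNN. The function names and argument lists follow the handler signatures that the SageMaker PyTorch serving container looks for.

```python
import json
import os


def build_network():
    """Hypothetical stand-in for the CNN defined in train.py."""
    import torch.nn as nn  # imported lazily so the JSON helper works without torch

    return nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(8, 2),
    )


def model_fn(model_dir):
    """Rebuild the network and load the weights saved by the training step."""
    import torch

    model = build_network()
    # "model.pth" is an assumed file name; it must match what training saved.
    state = torch.load(os.path.join(model_dir, "model.pth"), map_location="cpu")
    model.load_state_dict(state)
    return model.eval()


def input_fn(request_body, request_content_type):
    """Deserialize the request payload into a nested list of numbers."""
    if request_content_type != "application/json":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    # The "inputs" key is an assumption made for this sketch.
    return json.loads(request_body)["inputs"]


def predict_fn(input_data, model):
    """Run a forward pass and return the predicted class index per sample."""
    import torch

    with torch.no_grad():
        batch = torch.tensor(input_data, dtype=torch.float32)
        return model(batch).argmax(dim=1).tolist()
```

The serving container imports these functions from the user script at inference time, so the script has to be reachable by the endpoint and not only by the training job.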

After that, I decided to create a pipeline to automate training, building and deployment using the new SageMaker tooling for that. I have created a SageMaker Studio project based on the template "MLOps template for model building, training, and deployment". This template provides two CodeCommit repos: modelbuild and modeldeploy. I simply modified the modelbuild repo, in which I put my train.py script in the folder "/pipelines/abalone/", and I modified the file "pipelines/abalone/pipeline.py", in which I created a PyTorch estimator linked to my train.py script. When the pipeline is launched, I can see in the training job logs that my model is training without any issue, and the final endpoint is created. But when I try to invoke the endpoint (invoke_endpoint), I get an error: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from model with message "Please provide a model_fn implementation." This is strange because I did provide a model_fn implementation in my train.py file...
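One thing that might be worth checking, offered as a guess and not as a confirmed diagnosis: the serving container can only call model_fn if it can find the script that defines it. The sketch below assumes the model artifact has been downloaded locally as a model.tar.gz (the helper names are hypothetical) and simply reports which Python files are packed inside it and whether any of them contains a model_fn definition.

```python
import tarfile


def find_inference_scripts(model_archive_path):
    """Return the sorted paths of all .py files packed inside the archive."""
    with tarfile.open(model_archive_path, "r:gz") as archive:
        return sorted(name for name in archive.getnames() if name.endswith(".py"))


def has_model_fn(model_archive_path):
    """Return True if any packed .py file contains the text 'def model_fn('."""
    with tarfile.open(model_archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            if member.isfile() and member.name.endswith(".py"):
                source = archive.extractfile(member).read().decode("utf-8", "replace")
                if "def model_fn(" in source:
                    return True
    return False
```

If the archive holds only the weights, the script would have to reach the endpoint some other way, for example through the entry point configured on the model object that the deploy step creates. Whether that is what happens in this template is an assumption that the check above can only hint at.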

Do you have any idea how to solve this issue?

Soroush-aali-bagi commented 10 months ago

@babarory Did you find the answer?