aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
https://sagemaker-examples.readthedocs.io
Apache License 2.0

model_fn and input_fn called multiple times #1073

Open aunz opened 4 years ago

aunz commented 4 years ago

I am using the prebuilt SageMaker SKLearn container (https://github.com/aws/sagemaker-scikit-learn-container), version 0.20.0. In the entry_point, I include a script that carries out the batch transform job.

import time

def model_fn(model_dir):
    ...

def input_fn(input_data, content_type):
    ...

def predict_fn(input_data, model):
    '''
    A long-running process to preprocess the data before calling the model
    https://aws.amazon.com/blogs/machine-learning/preprocess-input-data-before-making-predictions-using-amazon-sagemaker-inference-pipelines-and-scikit-learn/
    '''
    time.sleep(60 * 11)  # sleep for 11 minutes to simulate a long-running process
    ...

def output_fn(prediction, accept):
    ...

I noticed that model_fn() was called multiple times in the CloudWatch log:

21:11:43 model_fn called /opt/ml/model 0.3710819465747405
21:11:43 model_fn called /opt/ml/model 0.1368146211634631
21:11:44 model_fn called /opt/ml/model 0.09153953459183728

The input_fn() was also called multiple times:

20:41:31 input_data <class 'str'> application/json 0.3936440317990033 {
20:51:30 input_data <class 'str'> application/json 0.4852180186010707 {
21:01:30 input_data <class 'str'> application/json 0.9954036507047136 {
21:11:30 input_data <class 'str'> application/json 0.0806271844985188 {

More precisely, it's called every 10 minutes.

I used ml.m4.xlarge, BatchStrategy = SingleRecord, and SplitType = None. I also set the environment variable SAGEMAKER_MODEL_SERVER_TIMEOUT = '9999' to overcome the 60s timeout. I expected model_fn and input_fn to be called only once, but they were called multiple times, and in the end the container crashed with "Internal Server Error".
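
For reference, the transform job is set up roughly like this with the SageMaker Python SDK (the bucket paths, role ARN, and entry point name below are placeholders, not the actual values):

from sagemaker.sklearn.model import SKLearnModel

# Placeholder model artifact, role, and entry point script
model = SKLearnModel(
    model_data='s3://my-bucket/model/model.tar.gz',
    role='arn:aws:iam::123456789012:role/MySageMakerRole',
    entry_point='batch_transform_script.py',
    framework_version='0.20.0',
)

transformer = model.transformer(
    instance_count=1,
    instance_type='ml.m4.xlarge',
    strategy='SingleRecord',
    env={'SAGEMAKER_MODEL_SERVER_TIMEOUT': '9999'},  # lift the default 60s timeout
)

transformer.transform(
    data='s3://my-bucket/input/',
    content_type='application/json',
    split_type=None,
)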

I saw a similar issue before (https://github.com/awslabs/amazon-sagemaker-examples/issues/341), where model_fn was called on each invocation. But in this case there is no separate /invocations call; model_fn, input_fn, predict_fn, and output_fn were all called multiple times before the container crashed with the Internal Server Error.
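
A possible mitigation (just a sketch, and assuming the artifact is saved as model.joblib, which is a placeholder name) would be to cache the loaded model at module level so that repeated model_fn calls within the same worker process don't reload it from disk; this wouldn't help if each server worker process loads its own copy:

import os
import joblib

_MODEL_CACHE = None  # module-level cache, reused across model_fn calls in one process

def model_fn(model_dir):
    # Only load the artifact the first time this worker process calls model_fn.
    global _MODEL_CACHE
    if _MODEL_CACHE is None:
        _MODEL_CACHE = joblib.load(os.path.join(model_dir, 'model.joblib'))
    return _MODEL_CACHE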

ikennanwosu commented 4 years ago

How did you resolve this, please? I am getting the same issue.

EKami commented 3 years ago

Same issue here =/

raydazn commented 3 years ago

Same issue here. If model_fn provides the functionality for loading the model, do we need to load it for every batch?

uday1212 commented 1 year ago

Same issue here! Has anyone found a solution to this?

naresh129 commented 1 year ago

How was this issue solved? Same issue here too.

llealgt commented 3 weeks ago

Has anyone found a solution? I'm facing the same issue: the function runs 4 times, seemingly once per available GPU.

HubGab-Git commented 2 weeks ago

Can you show your code? I would like to reproduce it.