aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
https://sagemaker-examples.readthedocs.io
Apache License 2.0

model_fn and input_fn called multiple times #1073

Open aunz opened 4 years ago

aunz commented 4 years ago

I am using the prebuilt SageMaker SKLearn container (https://github.com/aws/sagemaker-scikit-learn-container), version 0.20.0. In the entry_point, I include a script that carries out the batch transform job.

import time

def model_fn(model_dir):
    ...

def input_fn(input_data, content_type):
    ...

def predict_fn(input_data, model):
    '''
    A long-running process to preprocess the data before calling the model
    https://aws.amazon.com/blogs/machine-learning/preprocess-input-data-before-making-predictions-using-amazon-sagemaker-inference-pipelines-and-scikit-learn/
    '''
    time.sleep(60 * 11)  # sleep for 11 minutes to simulate a long-running process
    ...

def output_fn(prediction, accept):
    ...

I noticed that model_fn() was called multiple times in the CloudWatch log:

21:11:43 model_fn called /opt/ml/model 0.3710819465747405
21:11:43 model_fn called /opt/ml/model 0.1368146211634631
21:11:44 model_fn called /opt/ml/model 0.09153953459183728

The input_fn() was also called multiple times:

20:41:31 input_data <class 'str'> application/json 0.3936440317990033 {
20:51:30 input_data <class 'str'> application/json 0.4852180186010707 {
21:01:30 input_data <class 'str'> application/json 0.9954036507047136 {
21:11:30 input_data <class 'str'> application/json 0.0806271844985188 {

More precisely, it's called every 10 minutes.

I used ml.m4.xlarge, BatchStrategy = SingleRecord, and SplitType = None. I also set the environment variable SAGEMAKER_MODEL_SERVER_TIMEOUT = '9999' to overcome the 60s timeout. I expected model_fn and input_fn to be called only once, but they were called multiple times, and in the end the container crashed with "Internal Server Error".
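
For reference, the transform job is set up roughly like this with the SageMaker Python SDK (the bucket paths, role ARN, and entry point name below are placeholders, not the actual values):

from sagemaker.sklearn.model import SKLearnModel

# Placeholder model artifact, role, and entry point script
model = SKLearnModel(
    model_data='s3://my-bucket/model/model.tar.gz',
    role='arn:aws:iam::123456789012:role/MySageMakerRole',
    entry_point='batch_transform_script.py',
    framework_version='0.20.0',
)

transformer = model.transformer(
    instance_count=1,
    instance_type='ml.m4.xlarge',
    strategy='SingleRecord',
    env={'SAGEMAKER_MODEL_SERVER_TIMEOUT': '9999'},  # lift the default 60s timeout
)

transformer.transform(
    data='s3://my-bucket/input/',
    content_type='application/json',
    split_type=None,
)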

I saw a similar issue before (https://github.com/awslabs/amazon-sagemaker-examples/issues/341), where model_fn was called on each invocation. But in this case there is no separate /invocations call; model_fn, input_fn, predict_fn, and output_fn were all called multiple times before the container crashed with the Internal Server Error.
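
A possible mitigation (just a sketch, and assuming the artifact is saved as model.joblib, which is a placeholder name) would be to cache the loaded model at module level so that repeated model_fn calls within the same worker process don't reload it from disk; this wouldn't help if each server worker process loads its own copy:

import os
import joblib

_MODEL_CACHE = None  # module-level cache, reused across model_fn calls in one process

def model_fn(model_dir):
    # Only load the artifact the first time this worker process calls model_fn.
    global _MODEL_CACHE
    if _MODEL_CACHE is None:
        _MODEL_CACHE = joblib.load(os.path.join(model_dir, 'model.joblib'))
    return _MODEL_CACHE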

ikennanwosu commented 4 years ago

How did you resolve this, please? I am getting the same issue.

EKami commented 3 years ago

Same issue here =/

raydazn commented 3 years ago

Same issue here. If model_fn provides the functionality for loading the model, do we need to load it for every batch?

uday1212 commented 1 year ago

Same issue here! Has anyone found a solution to this?

naresh129 commented 1 year ago

How was this issue solved? Same issue here too.

llealgt commented 3 weeks ago

Has anyone found a solution? I'm facing the same issue: the function runs 4 times, seemingly once per available GPU.

HubGab-Git commented 2 weeks ago

Can you show your code? I would like to reproduce it.