aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
https://sagemaker-examples.readthedocs.io
Apache License 2.0

[Please Help!!] Error when hosting tensorflow endpoint in script mode #3000

Open Janelle-He opened 3 years ago

Janelle-He commented 3 years ago

I tried to build my own TensorFlow algorithm in train.py by adapting mnist-2.py (available in amazon-sagemaker-examples/sagemaker-python-sdk/tensorflow_script_mode_training_and_serving/) and passed it as the entry_point using a pre-built deep learning image. The training job completed with a warning, and deploying the model fails with an error.

Below is the main function in train.py:

if __name__ == "__main__":
    args, unknown = _parse_args()

    train_data, train_labels = _load_training_data(args.train)
    eval_data, eval_labels = _load_testing_data(args.train)

    print('Training model for {} epochs and {} batch size..\n\n'.format(args.epochs, args.batch_size))

    classifier = model(train_data, train_labels, eval_data, eval_labels, epochs=args.epochs, batch_size=args.batch_size)

    if args.current_host == args.hosts[0]:
        # save model in Keras HDF5 (.h5) format
        classifier.save(os.path.join(args.sm_model_dir, "nn_model.h5"))

        # save model in TensorFlow SavedModel format
        classifier.save(os.path.join(args.sm_model_dir, "nn_classifier"))
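For reference, _parse_args is adapted from mnist-2.py; roughly a sketch like this (the defaults for epochs and batch_size are placeholders, not my exact values):

import argparse
import json
import os

def _parse_args():
    parser = argparse.ArgumentParser()

    # SageMaker passes --model_dir (an S3 URI) as a hyperparameter automatically
    parser.add_argument("--model_dir", type=str)
    # SM_MODEL_DIR is the local path (/opt/ml/model); anything saved there is uploaded to S3 as model.tar.gz
    parser.add_argument("--sm-model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAINING"))
    parser.add_argument("--hosts", type=list, default=json.loads(os.environ.get("SM_HOSTS")))
    parser.add_argument("--current-host", type=str, default=os.environ.get("SM_CURRENT_HOST"))
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch_size", type=int, default=128)

    return parser.parse_known_args()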

The estimator created in the notebook instance is as follows:

estimator = TensorFlow(entry_point='train.py',
                     role=sagemaker.get_execution_role(),
                     distribution={"parameter_server": {"enabled": True}},
                     image_uri='763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.6.0-cpu-py38-ubuntu20.04',
                     training_steps= 100,
                     evaluation_steps= 100,
                     instance_count=2,
                     instance_type='ml.m5.4xlarge',
                     hyperparameters={
                         'epochs': EPOCHS,
                         'batch_size': BATCH_SIZE
                     })

Here are the problems I encountered:

  1. Warning: no model artifact is saved under path /opt/ml/model. However, I did find a model.tar.gz in the output folder of this training job in S3.

    2021-10-29 23:48:58 Uploading - Uploading generated training model
    2021-10-29 23:48:58 Completed - Training job completed
    2021-10-29 23:48:48,662 sagemaker_tensorflow_container.training INFO master algo-1 is down, stopping parameter server
    2021-10-29 23:48:48,663 sagemaker_tensorflow_container.training WARNING No model artifact is saved under path /opt/ml/model. Your training job will not save any model files to S3. For details of how to construct your training script see: https://sagemaker.readthedocs.io/en/stable/using_tf.html#adapting-your-local-tensorflow-script
    2021-10-29 23:48:48,663 sagemaker-training-toolkit INFO Reporting training SUCCESS

  2. When deploying the estimator after .fit by running

    predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge')
    • It first prints the message below:

      update_endpoint is a no-op in sagemaker>=2. See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.

      I don't understand how update_endpoint() could cause an issue given the endpoint is still at the creating stage, nor how to implement the sagemaker.predictor.Predictor.update_endpoint() mentioned in the reference link in my train.py script or the notebook.
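      From the v2 docs I gather the call would look roughly like the sketch below (the endpoint name and instance settings are just examples), but I don't see where it fits, since the endpoint is still being created:

      from sagemaker.predictor import Predictor

      # hypothetical example: update an endpoint that already exists
      predictor = Predictor(endpoint_name="my-existing-endpoint")
      predictor.update_endpoint(initial_instance_count=1, instance_type="ml.m5.xlarge")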

So I also checked the CloudWatch logs for this endpoint, and they all show the same event:

exec: "serve": executable file not found in $PATH

(Screenshot of the CloudWatch log events showing the error above.)

Can anyone help me with these problems? Many thanks in advance!

samaravazquezperez commented 2 years ago

Having the same issue here. Did you manage to solve it?

Janelle-He commented 2 years ago

> Having the same issue here. Did you manage to solve it?

Nope, I haven't solved it yet....

francomedin commented 1 year ago

Hi guys! I'm having the same problem.

francomedin commented 1 year ago

Hi! I'm back with the solution after experimenting and reading a lot. I decided to try another image_uri, and it works. For training I used the following AWS SageMaker image: 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:2.2.0-gpu-py37-cu102-ubuntu18.04

Training worked and saved the model artifacts; then I loaded them:

from sagemaker.tensorflow import TensorFlowModel

model = TensorFlowModel(
    model_data='s3://...../output/model.tar.gz',
    role=role,
    image_uri='763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:2.2.0-gpu-py37-cu102-ubuntu18.04',
)

As you can see, I changed the image_uri from the training image to the corresponding inference image. After that I was able to create the endpoint. I hope this helps you in your projects!
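Deploying from that model object then worked; roughly like this (the instance type and the input below are just examples):

predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.g4dn.xlarge',  # example GPU instance to match the GPU inference image
)

# example_input is whatever payload your model expects
result = predictor.predict(example_input)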

Tieck-IT commented 1 year ago

@Janelle-He Hi, I hit the same error (today!) and fixed it like this. In my case, I was building the .tar.gz file the wrong way.

import tarfile

# before (error!): mode "w" writes an uncompressed tar even though the file name ends in .tar.gz
with tarfile.open(f"api_version/{api_version}_{postfix}.tar.gz", "w") as f:
    ...

# after (fix): mode "w:gz" produces a real gzip-compressed archive
with tarfile.open(f"api_version/{api_version}_{postfix}.tar.gz", "w:gz", format=tarfile.GNU_FORMAT) as f:
    ...
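For completeness, here is a fuller sketch of the fixed packaging. The local "export/1" path and the numbered "1/" folder inside the archive are just my assumptions about the SavedModel layout the TensorFlow inference container expects; the key change is the "w:gz" mode so the archive is actually gzip-compressed:

import tarfile

# "export/1" is an assumed local directory holding the SavedModel
# (saved_model.pb plus the variables/ directory).
with tarfile.open("model.tar.gz", "w:gz", format=tarfile.GNU_FORMAT) as f:
    # keep the numbered version directory at the top level of the archive
    f.add("export/1", arcname="1")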