NVIDIA-Merlin / Merlin

NVIDIA Merlin is an open source library providing end-to-end GPU-accelerated recommender systems, from feature engineering and preprocessing to training deep learning models and running inference in production.
Apache License 2.0

[BUG] Cannot be deployed to SageMaker Training #1023

Closed liyunrui closed 10 months ago

liyunrui commented 1 year ago

I executed the notebook in [1] in my AWS environment, but got the error below:

2023-06-23 06:01:56,268 sagemaker-training-toolkit ERROR Reporting training FAILURE
2023-06-23 06:01:56,268 sagemaker-training-toolkit ERROR Framework Error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/sagemaker_training/trainer.py", line 99, in train
    entry_point.run(
  File "/usr/local/lib/python3.8/dist-packages/sagemaker_training/entry_point.py", line 93, in run
    install(name=user_entry_point, path=environment.code_dir, capture_error=capture_error)
  File "/usr/local/lib/python3.8/dist-packages/sagemaker_training/entry_point.py", line 118, in install
    entry_point_type = _entry_point_type.get(path, name)
  File "/usr/local/lib/python3.8/dist-packages/sagemaker_training/_entry_point_type.py", line 43, in get
    if name.endswith(".sh"):
AttributeError: 'NoneType' object has no attribute 'endswith'
'NoneType' object has no attribute 'endswith'
2023-06-23 06:01:56,268 sagemaker-training-toolkit ERROR Encountered exit_code 1

2023-06-23 06:02:32 Failed - Training job failed
ProfilerReport-1687499644: Stopping

[1]. https://github.com/NVIDIA-Merlin/Merlin/blob/main/examples/sagemaker-tensorflow/sagemaker-merlin-tensorflow.ipynb

liyunrui commented 1 year ago

New Error

ERROR: Directory with provided SAGEMAKER_TRITON_DEFAULT_MODEL_NAME executor_model does not exist


liyunrui commented 1 year ago

We're running [1] on a SageMaker (SM) notebook. For SM training, everything works as expected. However, executor_model is missing from model.tar.gz.

The archive contains 0_transformworkflow, 1_predicttensorflow, and ensemble_model, but not executor_model.

[1]. https://github.com/NVIDIA-Merlin/Merlin/blob/main/examples/sagemaker-tensorflow/sagemaker-merlin-tensorflow.ipynb
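The check described above can be reproduced locally. This is a minimal sketch, assuming the archive has been downloaded as a local file and that each model sits in its own top-level directory; the helper name top_level_entries is hypothetical and not part of Merlin or the SageMaker SDK.

```python
import tarfile


def top_level_entries(archive_path):
    """Return the sorted top-level names found inside a tar archive."""
    names = set()
    # "r:*" lets tarfile detect the compression (gzip for model.tar.gz).
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar.getmembers():
            # Treat "./foo/bar" and "foo/bar" alike: keep the first real component.
            parts = [p for p in member.name.split("/") if p not in ("", ".")]
            if parts:
                names.add(parts[0])
    return sorted(names)


# Example use (the path is an assumption):
#   entries = top_level_entries("model.tar.gz")
#   print("executor_model" in entries, entries)
```

If executor_model is absent from the returned list while ensemble_model is present, the archive matches the situation reported in this comment.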

oliverholworthy commented 1 year ago

> We have seen 0_transformworkflow 1_predicttensorflow ensemble_model

Please check which version of the merlin-systems package you have installed. Since the 23.02.00 release, the default entrypoint model is called executor_model (it was previously called ensemble_model).

https://nvidia.slack.com/archives/C03TJEJ647J/p1687882184158249?thread_ts=1687876077.088789&cid=C03TJEJ647J
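As a rough illustration of the version check suggested above, the sketch below reads the installed merlin-systems version through the standard library and maps it to the expected entrypoint directory name. The function names are hypothetical, and treating 23.02.00 itself as the first release using executor_model is an assumption based on the wording of this comment.

```python
import re
from importlib import metadata

# Assumption: 23.02.00 is the first release that uses "executor_model".
RENAME_RELEASE = (23, 2, 0)


def parse_release(version):
    """Turn a version string such as '23.02.00' into a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def default_entrypoint_name(version):
    """Return the entrypoint model name expected for a given release."""
    if parse_release(version) >= RENAME_RELEASE:
        return "executor_model"
    return "ensemble_model"


def installed_entrypoint_name(package="merlin-systems"):
    """Look up the installed package version; None if it is not installed."""
    try:
        return default_entrypoint_name(metadata.version(package))
    except metadata.PackageNotFoundError:
        return None
```

The returned name is the value one would expect to match SAGEMAKER_TRITON_DEFAULT_MODEL_NAME, provided the model archive was exported by the same merlin-systems version that is installed where this check runs.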