NVIDIA-Merlin / Merlin

NVIDIA Merlin is an open source library providing end-to-end GPU-accelerated recommender systems, from feature engineering and preprocessing to training deep learning models and running inference in production.
Apache License 2.0

[BUG] Update Training and Serving Merlin on AWS SageMaker using latest merlin image #1039

Open rnyak opened 1 year ago

rnyak commented 1 year ago

Bug description

One of our customers is having trouble reproducing the Training and Serving Merlin on AWS SageMaker example and gets an error (the error message will be provided when available).

The documentation should also be improved and clarified, since it is not clear how one can generate the dataset in Generating Dataset without installing the Merlin libraries, while using the Merlin image.

Steps/Code to reproduce bug

The notebooks should be tested with the latest stable merlin-tensorflow image and updated if required. Currently, the example uses the merlin-tensorflow:22.10 image.

Expected behavior

Environment details

Additional context

edknv commented 1 year ago

I'm working on updating the merlin-tensorflow image to 23.06 here: https://github.com/NVIDIA-Merlin/Merlin/pull/1040.

After bumping the image version to 23.06, updating the processing workflow in train.py to reflect recent changes, and running the updated example on AWS, we are getting an error:

Failed to transform operator <merlin.systems.dag.runtimes.triton.ops.workflow.TransformWorkflowTriton object at 0x7fe7df82a160>
RuntimeError: Failed for execute the inference request. Model '0_transformworkflowtriton' is not ready.

which doesn't tell us much about what is going wrong. I'll try to run the container locally to debug.
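
When debugging a "not ready" error against a locally running container, one option is to ask the server directly whether the model loaded. The following is a minimal sketch, not taken from the example notebooks. It assumes that Triton's HTTP endpoint is reachable at localhost:8000 (the default HTTP port) and that the server exposes the KServe v2 readiness route `/v2/models/<name>/ready`. The model name is copied from the error message above, and the helper names `model_ready_url` and `is_model_ready` are hypothetical.

```python
import urllib.error
import urllib.request


def model_ready_url(base_url: str, model_name: str) -> str:
    """Build the KServe v2 readiness URL for a single model."""
    return f"{base_url.rstrip('/')}/v2/models/{model_name}/ready"


def is_model_ready(base_url: str, model_name: str, timeout: float = 5.0) -> bool:
    """Return True only if the readiness endpoint answers with HTTP 200."""
    url = model_ready_url(base_url, model_name)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, and non-2xx responses
        # (urllib raises HTTPError, a URLError subclass, for those).
        return False


if __name__ == "__main__":
    # Assumption: the container publishes Triton's HTTP port on localhost:8000.
    base = "http://localhost:8000"
    # Model name taken from the error message in this thread.
    name = "0_transformworkflowtriton"
    print(name, "ready" if is_model_ready(base, name) else "NOT ready")
```

If the model reports as not ready, the server's startup log is usually the next place to look, since the readiness endpoint only says that loading failed, not why.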

rnyak commented 1 year ago

thanks @edknv !

wei-m-teh commented 12 months ago

@edknv is there any update on this issue? I am trying to deploy a Merlin model to SageMaker following the example given, and I am running into the same issue.

edknv commented 11 months ago

@wei-m-teh Apologies for the delay. It's in review at the moment, but I updated #1040 with a workaround I found for making the notebook work with the latest 23.08 image.

rnyak commented 11 months ago

@wei-m-teh can you please test this PR at your end? thanks.