tuanavu opened this issue 9 months ago
@tuanavu Thanks for your feedback. We decoupled and reorganized the third-party dependencies that HPS depends on after 23.06. Since all HPS/SOK related libraries are pre-installed, there is no need to set LD_PRELOAD. If you must add custom library paths, it is recommended to set them via LD_LIBRARY_PATH instead. FYI @EmmaQiaoCh
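For example, a custom library directory can be appended to the existing search path before launching the server; a minimal sketch (the directory is just a placeholder):
# Append a custom library directory instead of preloading a specific .so
# (/opt/my_custom_libs is a placeholder path; adjust to your setup)
export LD_LIBRARY_PATH=/opt/my_custom_libs:${LD_LIBRARY_PATH}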
@bashimao Please add your comments about the third-party dependency reorganization.
Hi @yingcanw, following up on this thread: without setting LD_PRELOAD, I got the error below after deploying the model. I used nvcr.io/nvidia/merlin/merlin-tensorflow:23.09. Do you know how to resolve this?
2024-02-06 17:35:54.482183: I tensorflow/cc/saved_model/loader.cc:233] Restoring SavedModel bundle.
2024-02-06 17:35:56.319034: E tensorflow/core/grappler/optimizers/tfg_optimizer_hook.cc:134] tfg_optimizer{any(tfg-consolidate-attrs,tfg-toposort,tfg-shape-inference{graph-version=0},tfg-prepare-attrs-export)} failed: INVALID_ARGUMENT: Unable to find OpDef for Init
While importing function: __inference_call_7323150
when importing GraphDef to MLIR module in GrapplerHook
2024-02-06 17:35:56.991711: I tensorflow/cc/saved_model/loader.cc:217] Running initialization op on SavedModel bundle at path: /tmp/folderEWbC2X/1/model.savedmodel
2024-02-06 17:35:58.817038: E tensorflow/core/grappler/optimizers/tfg_optimizer_hook.cc:134] tfg_optimizer{any(tfg-consolidate-attrs,tfg-toposort,tfg-shape-inference{graph-version=0},tfg-prepare-attrs-export)} failed: INVALID_ARGUMENT: Unable to find OpDef for Init
While importing function: __inference_call_7323150
when importing GraphDef to MLIR module in GrapplerHook
And here's the LD_LIBRARY_PATH that I saw in the container:
LD_LIBRARY_PATH=/usr/local/lib/python3.10/dist-packages/tensorflow:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/lib:/repos/dist/lib:/usr/lib/jvm/default-java/lib:/usr/lib/jvm/default-java/lib/server:/opt/tritonserver/lib:/usr/local/hugectr/lib
Please provide more details on which step in the notebook outputs these error messages.
Sure. Steps to reproduce the behavior:
- Train a TF+SOK model with merlin-tensorflow:23.09 and follow the deployment steps outlined in the HPS TensorFlow Triton deployment demo notebook to export the inference graph with HPS.
- Construct a deployment.yaml and deploy on AWS EKS without setting LD_PRELOAD.
- Check the container log and see this error:
2024-02-06 17:35:54.482183: I tensorflow/cc/saved_model/loader.cc:233] Restoring SavedModel bundle.
2024-02-06 17:35:56.319034: E tensorflow/core/grappler/optimizers/tfg_optimizer_hook.cc:134] tfg_optimizer{any(tfg-consolidate-attrs,tfg-toposort,tfg-shape-inference{graph-version=0},tfg-prepare-attrs-export)} failed: INVALID_ARGUMENT: Unable to find OpDef for Init
While importing function: __inference_call_7323150
when importing GraphDef to MLIR module in GrapplerHook
2024-02-06 17:35:56.991711: I tensorflow/cc/saved_model/loader.cc:217] Running initialization op on SavedModel bundle at path: /tmp/folderEWbC2X/1/model.savedmodel
2024-02-06 17:35:58.817038: E tensorflow/core/grappler/optimizers/tfg_optimizer_hook.cc:134] tfg_optimizer{any(tfg-consolidate-attrs,tfg-toposort,tfg-shape-inference{graph-version=0},tfg-prepare-attrs-export)} failed: INVALID_ARGUMENT: Unable to find OpDef for Init
While importing function: __inference_call_7323150
when importing GraphDef to MLIR module in GrapplerHook
- Send a serving request or run perf_analyzer; the same error still appears.
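For reference, a perf_analyzer check of this kind might look roughly as follows (model name and endpoint are placeholders; adjust --shape if the model has variable-size inputs):
# Send generated requests to the deployed model over HTTP (default port 8000)
perf_analyzer -m hps_tf_triton -u localhost:8000 --concurrency-range 1:4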
Note that the same model can be deployed and tested successfully with merlin-tensorflow:23.02 and 23.06 by setting LD_PRELOAD=/usr/local/lib/python3.8/dist-packages/merlin_hps-1.0.0-py3.8-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so
From the brief reproduction steps you provided, I still can't tell at which specific step you hit these errors. My guess is that you successfully completed model training and created_and_save_inference_graph, and then hit the error in the step "Deploy SavedModel using HPS with Triton TensorFlow Backend".
Since we do not have the same AWS environment, we have not been able to reproduce the issue you encountered; on a local machine (T4/V100, Intel CPU, Ubuntu 22.04 with the 23.12 container) it does not occur. However, one important note: you only need to set LD_PRELOAD when launching tritonserver (Cell 13 in the notebook), as shown below, because it is the registration mechanism for the custom op required by the Triton server. There is no need to set LD_PRELOAD in any other step.
LD_PRELOAD=/usr/local/lib/python3.8/dist-packages/merlin_hps-1.0.0-py3.8-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so tritonserver --model-repository=/hugectr/hps_tf/notebooks/model_repo --backend-config=tensorflow,version=2 --load-model=hps_tf_triton --model-control-mode=explicit
Hi @yingcanw,
The issue seems to circle back to the initial problem discussed in this thread: https://github.com/NVIDIA-Merlin/HugeCTR/issues/440#issue-2119408487. When I launch tritonserver exactly as you describe above, with LD_PRELOAD=/usr/local/lib/python3.10/dist-packages/merlin_hps-1.0.0-py3.10-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so, I encounter a segmentation fault. This error appears to be consistent across the merlin-tensorflow images from 23.08 to 23.12. (With the 23.12 image, the old LD_PRELOAD path pointing to Python 3.8 libraries should no longer be applicable, as you mentioned.) Could you attempt to reproduce this error using the 23.09 container and share any findings?
Thank you for the correction; that was a typo on our side. We upgraded Python to 3.10 starting with 23.08, and we need to update the notebook to modify the Triton server launch command accordingly.
However, I still haven't reproduced the issue you mentioned on 23.09. I want to emphasize the difference from https://github.com/NVIDIA-Merlin/HugeCTR/issues/440#issue-2119408487: users are asked not to set the LD_PRELOAD variable independently (please note the bold part in the log). LD_PRELOAD is used as a prefix to the command that launches the Triton server, not as an environment variable that is exported on its own.
I hope the above makes it clearer how to solve the problem.
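A minimal sketch of that difference, reusing the launch command from the notebook (paths as discussed above for the Python 3.10 images):
# Intended: LD_PRELOAD applies only to the tritonserver process it prefixes
LD_PRELOAD=/usr/local/lib/python3.10/dist-packages/merlin_hps-1.0.0-py3.10-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so \
    tritonserver --model-repository=/hugectr/hps_tf/notebooks/model_repo \
    --backend-config=tensorflow,version=2 --load-model=hps_tf_triton --model-control-mode=explicit

# Not intended: exporting LD_PRELOAD (for example via a Kubernetes env entry) injects the
# library into every process started in the container, which is what this thread advises against
# export LD_PRELOAD=/usr/local/lib/python3.10/dist-packages/merlin_hps-1.0.0-py3.10-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so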
docker run --gpus=all --privileged=true --net=host --shm-size 16g -it -v ${PWD}:/hugectr/ nvcr.io/nvidia/merlin/merlin-tensorflow:23.09 /bin/bash
==================================
== Triton Inference Server Base ==
NVIDIA Release 23.06 (build 62878575)
NOTE: CUDA Forward Compatibility mode ENABLED. Using CUDA 12.1 driver version 530.30.02 with kernel driver version 525.85.12. See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
name@machine:/opt/tritonserver# LD_PRELOAD=/usr/local/lib/python3.10/dist-packages/merlin_hps-1.0.0-py3.10-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so tritonserver --model-repository=/hugectr/hugectr/hps_tf/notebooks/model_repo --backend-config=tensorflow,version=2 --load-model=hps_tf_triton --model-control-mode=explicit
I0207 07:40:33.150916 129 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f9316000000' with size 268435456
I0207 07:40:33.156832 129 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
W0207 07:40:33.906804 129 server.cc:248] failed to enable peer access for some device pairs
W0207 07:40:33.922110 129 model_lifecycle.cc:108] ignore version directory 'hps_tf_triton_sparse_0.model' which fails to convert to integral number
I0207 07:40:33.922153 129 model_lifecycle.cc:462] loading: hps_tf_triton:1
I0207 07:40:34.251125 129 tensorflow.cc:2577] TRITONBACKEND_Initialize: tensorflow
I0207 07:40:34.251157 129 tensorflow.cc:2587] Triton TRITONBACKEND API version: 1.13
I0207 07:40:34.251161 129 tensorflow.cc:2593] 'tensorflow' TRITONBACKEND API version: 1.13
2024-02-07 07:40:37.339528: I tensorflow/cc/saved_model/loader.cc:334] SavedModel load for tags { serve }; Status: success: OK. Took 89817 microseconds.
I0207 07:40:37.340078 129 model_lifecycle.cc:815] successfully loaded 'hps_tf_triton'
I0207 07:40:37.340295 129 server.cc:603]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I0207 07:40:37.340372 129 server.cc:630]
+------------+---------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend    | Path                                                          | Config                                                                                                                                                                            |
+------------+---------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tensorflow | /opt/tritonserver/backends/tensorflow/libtriton_tensorflow.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","version":"2","default-max-batch-size":"4"}}       |
+------------+---------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0207 07:40:37.340439 129 server.cc:673]
+---------------+---------+--------+
| Model         | Version | Status |
+---------------+---------+--------+
| hps_tf_triton | 1       | READY  |
+---------------+---------+--------+
I... ....
I0207 07:40:37.469253 129 grpc_server.cc:2445] Started GRPCInferenceService at 0.0.0.0:8001
I0207 07:40:37.469536 129 http_server.cc:3555] Started HTTPService at 0.0.0.0:8000
I0207 07:40:37.511302 129 http_server.cc:185] Started Metrics Service at 0.0.0.0:8002
Hi @yingcanw,
Quick update: I believe I've figured out the root cause of the seg fault. It appears to be related to pointing the --model-repository flag at a remote S3 bucket. My suspicion is that the underlying issue is in aws-sdk-cpp, based on seeing similar errors, such as "Error: free(): invalid pointer", when manually interacting with S3 objects inside the container.
Steps to reproduce the behavior:
* In the merlin-tensorflow:23.09 container, launch tritonserver with --model-repository pointing at the remote S3 bucket:
name@machine:/opt/tritonserver# LD_PRELOAD=/usr/local/lib/python3.10/dist-packages/merlin_hps-1.0.0-py3.10-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so \
tritonserver --model-repository=s3://model_repo \
--backend-config=tensorflow,version=2 --load-model=$MODEL_ID --model-control-mode=explicit \
--grpc-port=6000 --metrics-port=80 --allow-metrics=true --allow-gpu-metrics=true --strict-readiness=true
2024-02-08 08:25:42.570293: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0
.
I0208 08:25:42.928109 28 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f80ba000000' with size 268435456
I0208 08:25:42.930215 28 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
Segmentation fault (core dumped)
* In contrast, when the S3 model_repo is downloaded to a local directory and used as follows, no segmentation fault occurs and the model starts up normally (a sketch of the download step follows the logs below).
name@machine:/opt/tritonserver# LD_PRELOAD=/usr/local/lib/python3.10/dist-packages/merlin_hps-1.0.0-py3.10-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so \
    tritonserver --model-repository=/tmp/models \
    --backend-config=tensorflow,version=2 --load-model=$MODEL_ID --model-control-mode=explicit \
    --grpc-port=6000 --metrics-port=80 --allow-metrics=true --allow-gpu-metrics=true --strict-readiness=true
2024-02-08 08:27:19.845263: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0
.
I0208 08:27:20.222846 33 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7fcd94000000' with size 268435456
I0208 08:27:20.225011 33 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0208 08:27:20.230161 33 model_lifecycle.cc:462] loading: 1200429:1
I0208 08:27:20.534202 33 tensorflow.cc:2577] TRITONBACKEND_Initialize: tensorflow
I0208 08:27:20.534240 33 tensorflow.cc:2587] Triton TRITONBACKEND API version: 1.13
I0208 08:27:20.534256 33 tensorflow.cc:2593] 'tensorflow' TRITONBACKEND API version: 1.13
Thanks a lot for your update. I think this error output can easily be misunderstood (we have verified that if the LD_PRELOAD parameter is set, circular dependencies will cause a seg fault), but we don't have the same AWS environment to reproduce this problem.
Because of HPS's lazy initialization, HPS is not initialized until the first inference request is processed (and HPS currently does not support parsing embeddings from a remote repository; it would output a "file cannot be opened" error message rather than a seg fault). So I think you may need to submit an issue to tensorflow_backend.
Describe the bug
I've encountered a segmentation fault while deploying a TensorFlow model with Hierarchical Parameter Server (HPS) following the instructions provided in the HPS TensorFlow Triton deployment demo notebook. This issue is consistent across Merlin-TensorFlow images from merlin-tensorflow:23.08 to merlin-tensorflow:23.12, which use Python 3.10. Note that the issue doesn't happen with merlin-tensorflow <= 23.06, which uses Python 3.8.
When deploying in a Kubernetes environment with the environment variable LD_PRELOAD set to /usr/local/lib/python3.10/dist-packages/merlin_hps-1.0.0-py3.10-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so, the Triton inference server container terminates unexpectedly with exit code 139. Trying to import the HPS library within the container also leads to a segmentation fault.
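A quick way to confirm the import-level crash independently of Triton is a bare import inside the container, with the same LD_PRELOAD environment variable exported; a sketch (module name per the HPS TensorFlow plugin):
# With LD_PRELOAD exported as an environment variable, even this bare import segfaults
python3 -c "import hierarchical_parameter_server as hps; print(hps.__file__)"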
Error logs
Error: container triton terminated with exit code 139.
To Reproduce
Steps to reproduce the behavior:
- Train a TF+SOK model with merlin-tensorflow:23.09 and follow the deployment steps outlined in the HPS TensorFlow Triton deployment demo notebook to export the inference graph with HPS.
Environment (please complete the following information):
Additional context
The error suggests there might be an incompatibility with the Python version or a problem with HPS. Any insights or solutions would be greatly appreciated.