NVIDIA-Merlin / HugeCTR

HugeCTR is a high-efficiency GPU framework designed for Click-Through-Rate (CTR) estimation training.
Apache License 2.0
905 stars, 196 forks

[BUG] Seg Fault When Deploying TF+HPS Model with merlin-tensorflow #440

Open tuanavu opened 4 months ago

tuanavu commented 4 months ago

Describe the bug

I've encountered a segmentation fault while deploying a TensorFlow model with Hierarchical Parameter Server (HPS), following the instructions in the HPS TensorFlow Triton deployment demo notebook. The issue occurs consistently across Merlin-TensorFlow images from merlin-tensorflow:23.08 to merlin-tensorflow:23.12, which use Python 3.10.

Note that the issue does not occur with merlin-tensorflow <= 23.06, which uses Python 3.8.

When deploying in a Kubernetes environment with the environment variable LD_PRELOAD set to /usr/local/lib/python3.10/dist-packages/merlin_hps-1.0.0-py3.10-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so, the Triton inference server container terminates unexpectedly with exit code 139. Trying to import the HPS library within the container also leads to a segmentation fault.
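For reference, setting the variable on the Triton container in the deployment manifest is equivalent to, for example (the deployment name below is a placeholder):

kubectl set env deployment/<triton-deployment> \
  LD_PRELOAD=/usr/local/lib/python3.10/dist-packages/merlin_hps-1.0.0-py3.10-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so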

Error logs

$ python3 -c "import hierarchical_parameter_server as hps"
[INFO] hierarchical_parameter_server is imported
2024-02-05 19:42:04.215077: I tensorflow/core/platform/cpu_feature_guard.cc:183] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Segmentation fault (core dumped)

To Reproduce

Steps to reproduce the behavior:

  1. Train a TF+SOK model with merlin-tensorflow:23.09 and follow the deployment steps outlined in the HPS TensorFlow Triton deployment demo notebook to export the inference graph with HPS.
  2. Construct a deployment.yaml and deploy on AWS EKS, setting LD_PRELOAD to /usr/local/lib/python3.10/dist-packages/merlin_hps-1.0.0-py3.10-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so.
  3. Observe the following error: Error: container triton terminated with exit code 139.
  4. SSH into the container (see the kubectl exec sketch after this list) and execute $ python3 -c "import hierarchical_parameter_server as hps" to reproduce the segmentation fault described above.
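For step 4, a shell in the running Triton container can be obtained with kubectl exec, for example (pod and container names are placeholders):

kubectl exec -it <triton-pod> -c <triton-container> -- /bin/bash
# inside the container, the import alone is enough to reproduce the crash
python3 -c "import hierarchical_parameter_server as hps"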

Environment (please complete the following information):

Additional context

The error suggests there might be an incompatibility with the Python version or a problem with HPS itself. Any insights or solutions to this problem would be greatly appreciated.

yingcanw commented 4 months ago

@tuanavu Thanks for your feedback. We decoupled and reorganized the third-party dependencies that HPS relies on after 23.06. Since all HPS/SOK-related libraries are pre-installed, there is no need to set LD_PRELOAD. If you must add custom library paths, it is recommended to use LD_LIBRARY_PATH instead. FYI @EmmaQiaoCh
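For example, a custom library directory can be appended like this (the path shown is purely illustrative):

export LD_LIBRARY_PATH=/opt/custom_libs:${LD_LIBRARY_PATH}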

@bashimao Please add your comments about the third-party dependency reorganization.

tuanavu commented 4 months ago

Hi @yingcanw, following up on this thread: without setting LD_PRELOAD, I got the error below after deploying the model. I used nvcr.io/nvidia/merlin/merlin-tensorflow:23.09; do you know how to resolve this?

2024-02-06 17:35:54.482183: I tensorflow/cc/saved_model/loader.cc:233] Restoring SavedModel bundle.
2024-02-06 17:35:56.319034: E tensorflow/core/grappler/optimizers/tfg_optimizer_hook.cc:134] tfg_optimizer{any(tfg-consolidate-attrs,tfg-toposort,tfg-shape-inference{graph-version=0},tfg-prepare-attrs-export)} failed: INVALID_ARGUMENT: Unable to find OpDef for Init
    While importing function: __inference_call_7323150
    when importing GraphDef to MLIR module in GrapplerHook
2024-02-06 17:35:56.991711: I tensorflow/cc/saved_model/loader.cc:217] Running initialization op on SavedModel bundle at path: /tmp/folderEWbC2X/1/model.savedmodel
2024-02-06 17:35:58.817038: E tensorflow/core/grappler/optimizers/tfg_optimizer_hook.cc:134] tfg_optimizer{any(tfg-consolidate-attrs,tfg-toposort,tfg-shape-inference{graph-version=0},tfg-prepare-attrs-export)} failed: INVALID_ARGUMENT: Unable to find OpDef for Init
    While importing function: __inference_call_7323150
    when importing GraphDef to MLIR module in GrapplerHook

And here is the LD_LIBRARY_PATH that I saw in the container:

LD_LIBRARY_PATH=/usr/local/lib/python3.10/dist-packages/tensorflow:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/lib:/repos/dist/lib:/usr/lib/jvm/default-java/lib:/usr/lib/jvm/default-java/lib/server:/opt/tritonserver/lib:/usr/local/hugectr/lib

yingcanw commented 4 months ago

Please provide more details on which step in the notebook outputs these error messages.

tuanavu commented 4 months ago

Sure. Steps to reproduce the behavior:

  1. Train a TF+SOK model with merlin-tensorflow:23.09 and follow the deployment steps outlined in the HPS TensorFlow Triton deployment demo notebook to export the inference graph with HPS.
  2. Construct a deployment.yaml and deploy on AWS EKS, without setting the LD_PRELOAD
  3. Check the container log and see this error:
    2024-02-06 17:35:54.482183: I tensorflow/cc/saved_model/loader.cc:233] Restoring SavedModel bundle.
    2024-02-06 17:35:56.319034: E tensorflow/core/grappler/optimizers/tfg_optimizer_hook.cc:134] tfg_optimizer{any(tfg-consolidate-attrs,tfg-toposort,tfg-shape-inference{graph-version=0},tfg-prepare-attrs-export)} failed: INVALID_ARGUMENT: Unable to find OpDef for Init
    While importing function: __inference_call_7323150
    when importing GraphDef to MLIR module in GrapplerHook
    2024-02-06 17:35:56.991711: I tensorflow/cc/saved_model/loader.cc:217] Running initialization op on SavedModel bundle at path: /tmp/folderEWbC2X/1/model.savedmodel
    2024-02-06 17:35:58.817038: E tensorflow/core/grappler/optimizers/tfg_optimizer_hook.cc:134] tfg_optimizer{any(tfg-consolidate-attrs,tfg-toposort,tfg-shape-inference{graph-version=0},tfg-prepare-attrs-export)} failed: INVALID_ARGUMENT: Unable to find OpDef for Init
    While importing function: __inference_call_7323150
    when importing GraphDef to MLIR module in GrapplerHook
  4. Try sending serving requests or running perf_analyzer; the same error still appears.

Note that the same model can be deployed and tested successfully with merlin-tensorflow:23.02 and 23.06 by setting LD_PRELOAD=/usr/local/lib/python3.8/dist-packages/merlin_hps-1.0.0-py3.8-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so

yingcanw commented 4 months ago

From the brief reproduction steps you provided, I still cannot tell at which specific step you hit these errors. My guess is that you successfully completed model training and the create-and-save-inference-graph step, and then hit the error in the next step (Deploy SavedModel using HPS with Triton TensorFlow Backend).

Since we do not have the same AWS environment, we have not been able to reproduce the issue on a local machine (T4/V100, Intel CPU, Ubuntu 22.04, 23.12 container). However, there is an important note here: you still only need to set LD_PRELOAD when launching tritonserver (Cell 13 in the notebook), as shown below; this is the registration mechanism for the custom op required by the Triton server. There is no need to set LD_PRELOAD in any other step.

LD_PRELOAD=/usr/local/lib/python3.8/dist-packages/merlin_hps-1.0.0-py3.8-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so tritonserver --model-repository=/hugectr/hps_tf/notebooks/model_repo --backend-config=tensorflow,version=2 --load-model=hps_tf_triton --model-control-mode=explicit

tuanavu commented 4 months ago

Hi @yingcanw,

The issue seems to circle back to the initial problem described in this thread: https://github.com/NVIDIA-Merlin/HugeCTR/issues/440#issue-2119408487. When I set LD_PRELOAD=/usr/local/lib/python3.10/dist-packages/merlin_hps-1.0.0-py3.10-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so while launching tritonserver exactly as you describe above, I encountered a segmentation fault. This error appears to be consistent across the merlin-tensorflow images from versions 23.08 to 23.12. However, with the 23.12 image, the LD_PRELOAD path pointing to the Python 3.8 libraries should no longer be applicable, as you mentioned. Could you attempt to reproduce this error using the 23.09 container and share any findings?

yingcanw commented 4 months ago

Thank you for the correction. There is a typo there: we upgraded to Python 3.10 in 23.08, so we need to update the notebook and modify the Triton server launch command accordingly.

However, I still haven't reproduced the issue you mentioned on 23.09. I want to emphasize the difference from https://github.com/NVIDIA-Merlin/HugeCTR/issues/440#issue-2119408487: users should not set the LD_PRELOAD variable independently (please pay attention to the bold part in the log). LD_PRELOAD is used as a per-process prefix when launching the Triton server, not as an environment variable set for the whole container. In other words, LD_PRELOAD should not be exported independently as an environment variable.

I hope the above information makes it clearer how to solve the problem.
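To illustrate the difference (a minimal sketch, not taken from the notebook; the library path is abbreviated):

# recommended: prefix the launch command, so LD_PRELOAD only applies to the tritonserver process
LD_PRELOAD=/path/to/libhierarchical_parameter_server.so tritonserver --model-repository=<repo> ...

# not recommended: exporting it container-wide makes every process (including python3) preload the library
export LD_PRELOAD=/path/to/libhierarchical_parameter_server.so

For reference, here is a local run on the 23.09 container: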

docker run --gpus=all --privileged=true --net=host --shm-size 16g -it -v ${PWD}:/hugectr/ nvcr.io/nvidia/merlin/merlin-tensorflow:23.09 /bin/bash

==================================
== Triton Inference Server Base ==
==================================

NVIDIA Release 23.06 (build 62878575)

Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License. By pulling and using the container, you accept the terms and conditions of this license: https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED. Using CUDA 12.1 driver version 530.30.02 with kernel driver version 525.85.12. See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

name@machine:/opt/tritonserver# LD_PRELOAD=/usr/local/lib/python3.10/dist-packages/merlin_hps-1.0.0-py3.10-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so tritonserver --model-repository=/hugectr/hugectr/hps_tf/notebooks/model_repo --backend-config=tensorflow,version=2 --load-model=hps_tf_triton --model-control-mode=explicit
I0207 07:40:33.150916 129 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f9316000000' with size 268435456
I0207 07:40:33.156832 129 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
W0207 07:40:33.906804 129 server.cc:248] failed to enable peer access for some device pairs
W0207 07:40:33.922110 129 model_lifecycle.cc:108] ignore version directory 'hps_tf_triton_sparse_0.model' which fails to convert to integral number
I0207 07:40:33.922153 129 model_lifecycle.cc:462] loading: hps_tf_triton:1
I0207 07:40:34.251125 129 tensorflow.cc:2577] TRITONBACKEND_Initialize: tensorflow
I0207 07:40:34.251157 129 tensorflow.cc:2587] Triton TRITONBACKEND API version: 1.13
I0207 07:40:34.251161 129 tensorflow.cc:2593] 'tensorflow' TRITONBACKEND API version: 1.13

2024-02-07 07:40:37.339528: I tensorflow/cc/saved_model/loader.cc:334] SavedModel load for tags { serve }; Status: success: OK. Took 89817 microseconds.
I0207 07:40:37.340078 129 model_lifecycle.cc:815] successfully loaded 'hps_tf_triton'
I0207 07:40:37.340295 129 server.cc:603]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0207 07:40:37.340372 129 server.cc:630]
+------------+----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
| Backend    | Path                                                           | Config                                                                                                                       |
+------------+----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
| tensorflow | /opt/tritonserver/backends/tensorflow/libtriton_tensorflow.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","version":"2","default-max-batch-size":"4"}} |
+------------+----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+

I0207 07:40:37.340439 129 server.cc:673]
+---------------+---------+--------+
| Model         | Version | Status |
+---------------+---------+--------+
| hps_tf_triton | 1       | READY  |
+---------------+---------+--------+

...

I0207 07:40:37.469253 129 grpc_server.cc:2445] Started GRPCInferenceService at 0.0.0.0:8001
I0207 07:40:37.469536 129 http_server.cc:3555] Started HTTPService at 0.0.0.0:8000
I0207 07:40:37.511302 129 http_server.cc:185] Started Metrics Service at 0.0.0.0:8002

tuanavu commented 4 months ago

Hi @yingcanw,

Quick update: I believe I have figured out the root cause of the seg fault. It appears to be related to configuring the --model-repository flag to point to a remote S3 bucket. My suspicion is that the underlying issue is in aws-sdk-cpp; this is based on seeing similar errors, such as "Error: free(): invalid pointer", when manually interacting with S3 objects inside the container.
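For context, pointing Triton directly at the bucket looks roughly like this (the bucket name and path are placeholders):

LD_PRELOAD=/usr/local/lib/python3.10/dist-packages/merlin_hps-1.0.0-py3.10-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so \
  tritonserver --model-repository=s3://<bucket>/<model_repo> \
  --backend-config=tensorflow,version=2 --load-model=$MODEL_ID --model-control-mode=explicit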

Steps to reproduce the behavior:

* When the --model-repository flag points to the remote S3 bucket, tritonserver crashes with a segmentation fault shortly after startup:

2024-02-08 08:25:42.570293: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
I0208 08:25:42.928109 28 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f80ba000000' with size 268435456
I0208 08:25:42.930215 28 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
Segmentation fault (core dumped)


* In contrast, when the S3 model_repo is first downloaded to a local directory (a download sketch follows the log below) and used as follows, no segmentation fault occurs and the model starts normally.

name@machine:/opt/tritonserver# LD_PRELOAD=/usr/local/lib/python3.10/dist-packages/merlin_hps-1.0.0-py3.10-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so \
  tritonserver --model-repository=/tmp/models \
  --backend-config=tensorflow,version=2 --load-model=$MODEL_ID --model-control-mode=explicit \
  --grpc-port=6000 --metrics-port=80 --allow-metrics=true --allow-gpu-metrics=true --strict-readiness=true

2024-02-08 08:27:19.845263: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
I0208 08:27:20.222846 33 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7fcd94000000' with size 268435456
I0208 08:27:20.225011 33 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0208 08:27:20.230161 33 model_lifecycle.cc:462] loading: 1200429:1
I0208 08:27:20.534202 33 tensorflow.cc:2577] TRITONBACKEND_Initialize: tensorflow
I0208 08:27:20.534240 33 tensorflow.cc:2587] Triton TRITONBACKEND API version: 1.13
I0208 08:27:20.534256 33 tensorflow.cc:2593] 'tensorflow' TRITONBACKEND API version: 1.13
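The local copy in /tmp/models was obtained by downloading the repository from S3 first, e.g. with the AWS CLI (bucket and prefix are placeholders):

aws s3 sync s3://<bucket>/<model_repo> /tmp/models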

yingcanw commented 4 months ago

Thanks a lot for your update. I think this error output can easily be misunderstood (we have verified that, if the LD_PRELOAD parameter is set, circular dependencies will cause a seg fault), but we don't have the same AWS environment to reproduce this problem. HPS is initialized lazily, so it is not initialized until the first inference request is processed (and HPS currently does not support parsing embeddings from a remote repo; it would output a "file cannot be opened" error message rather than a seg fault). So I think you may need to submit an issue to tensorflow_backend.