NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
Apache License 2.0

[BUG] Criteo/HugeCTR integration test fails while building the Merlin HugeCTR image #1643

Closed karlhigley closed 2 years ago

karlhigley commented 2 years ago

Describe the bug

The Criteo/HugeCTR integration test fails while building the Merlin HugeCTR image.

Steps/Code to reproduce bug

Run the Merlin HugeCTR image job on Blossom.

Expected behavior

Tests should pass.

Additional context


----------------------------- Captured stderr call -----------------------------
2022-08-11 12:46:20,977 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
  warnings.warn(
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
  warnings.warn(
2022-08-11 12:53:49,162 - distributed.worker - WARNING - Compute Failed
Key:       ('write-processed-5000e7bf28eaf54ea09d7a005af2c277-partition5000e7bf28eaf54ea09d7a005af2c277', "('part_0.parquet',)")
Function:  _write_subgraph
args:      (<merlin.io.dask.DaskSubgraph object at 0x7f047ccab370>, ('part_0.parquet',), '/tmp/pytest-of-root/pytest-4/test_criteo_hugectr0/tests/crit_test/train/', <Shuffle.PER_PARTITION: 0>, <fsspec.implementations.local.LocalFileSystem object at 0x7f101f48b7c0>, ['C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9', 'C10', 'C11', 'C12', 'C13', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21', 'C22', 'C23', 'C24', 'C25', 'C26'], ['I1', 'I2', 'I3', 'I4', 'I5', 'I6', 'I7', 'I8', 'I9', 'I10', 'I11', 'I12', 'I13'], ['label'], 'parquet', 0, False, '')
kwargs:    {}
Exception: "MemoryError('std::bad_alloc: out_of_memory: RMM failure at:/usr/include/rmm/mr/device/pool_memory_resource.hpp:183: Maximum pool size exceeded')"

/usr/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 9 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
mv: cannot stat '*9600.model': No such file or directory
I0811 12:55:07.662640 4614 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f9d96000000' with size 268435456
I0811 12:55:07.663117 4614 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0811 12:55:07.665177 4614 model_repository_manager.cc:1191] loading: criteo:1
I0811 12:55:07.811332 4614 hugectr.cc:1738] TRITONBACKEND_Initialize: hugectr
I0811 12:55:07.811382 4614 hugectr.cc:1745] Triton TRITONBACKEND API version: 1.10
I0811 12:55:07.811395 4614 hugectr.cc:1749] 'hugectr' TRITONBACKEND API version: 1.10
I0811 12:55:07.811407 4614 hugectr.cc:1772] The HugeCTR backend Repository location: /opt/tritonserver/backends/hugectr
I0811 12:55:07.811424 4614 hugectr.cc:1781] The HugeCTR backend configuration: {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","ps":"/tmp/model/ps.json","default-max-batch-size":"4"}}
I0811 12:55:07.811459 4614 hugectr.cc:345] *****Parsing Parameter Server Configuration from /tmp/model/ps.json
I0811 12:55:07.811551 4614 hugectr.cc:366] Support 64-bit keys = 1
I0811 12:55:07.811609 4614 hugectr.cc:591] Model name = criteo
I0811 12:55:07.811622 4614 hugectr.cc:600] Model 'criteo' -> network file = /tmp/model/criteo/1/criteo.json
I0811 12:55:07.811637 4614 hugectr.cc:607] Model 'criteo' -> max. batch size = 64
I0811 12:55:07.811648 4614 hugectr.cc:613] Model 'criteo' -> dense model file = /tmp/model/criteo/1/_dense_9600.model
I0811 12:55:07.811667 4614 hugectr.cc:619] Model 'criteo' -> sparse model files = [/tmp/model/criteo/1/0_sparse_9600.model]
I0811 12:55:07.811679 4614 hugectr.cc:630] Model 'criteo' -> use GPU embedding cache = 1
I0811 12:55:07.811711 4614 hugectr.cc:639] Model 'criteo' -> hit rate threshold = 0.9
I0811 12:55:07.811725 4614 hugectr.cc:647] Model 'criteo' -> per model GPU cache = 0.5
I0811 12:55:07.811749 4614 hugectr.cc:664] Model 'criteo' -> use_mixed_precision = 0
I0811 12:55:07.811761 4614 hugectr.cc:671] Model 'criteo' -> scaler = 1
I0811 12:55:07.811773 4614 hugectr.cc:677] Model 'criteo' -> use_algorithm_search = 1
I0811 12:55:07.811784 4614 hugectr.cc:685] Model 'criteo' -> use_cuda_graph = 1
I0811 12:55:07.811797 4614 hugectr.cc:692] Model 'criteo' -> num. pool worker buffers = 4
I0811 12:55:07.811809 4614 hugectr.cc:700] Model 'criteo' -> num. pool refresh buffers = 1
I0811 12:55:07.811821 4614 hugectr.cc:708] Model 'criteo' -> cache refresh rate per iteration = 0.2
I0811 12:55:07.811847 4614 hugectr.cc:717] Model 'criteo' -> deployed device list = [0]
I0811 12:55:07.811862 4614 hugectr.cc:725] Model 'criteo' -> default value for each table = [0, 0]
I0811 12:55:07.811872 4614 hugectr.cc:733] Model 'criteo' -> maxnum_des_feature_per_sample = 26
I0811 12:55:07.811883 4614 hugectr.cc:741] Model 'criteo' -> refresh_delay = 0
I0811 12:55:07.811893 4614 hugectr.cc:747] Model 'criteo' -> refresh_interval = 0
I0811 12:55:07.811906 4614 hugectr.cc:755] Model 'criteo' -> maxnum_catfeature_query_per_table_per_sample list = [2, 26]
I0811 12:55:07.811920 4614 hugectr.cc:766] Model 'criteo' -> embedding_vecsize_per_table list = [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16]
I0811 12:55:07.811931 4614 hugectr.cc:773] Model 'criteo' -> embedding model names = []
I0811 12:55:07.811941 4614 hugectr.cc:780] Model 'criteo' -> label_dim = 1
I0811 12:55:07.811951 4614 hugectr.cc:785] Model 'criteo' -> the number of slots = 10
I0811 12:55:07.811970 4614 hugectr.cc:806] *****The HugeCTR Backend Parameter Server is creating... *****
terminate called after throwing an instance of 'std::_Nested_exception<HugeCTR::internal_runtime_error>'
  what():  Runtime error: file_stream.is_open() failed: /tmp/model/criteo/1/criteo.json
    Error_t::FileCannotOpen at read_json_file(/hugectr/HugeCTR/include/parser.hpp:41)

karlhigley commented 2 years ago

This seems to have been a transient failure.
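
Should it recur, the log reads as a cascade: the RMM pool overflow aborts the preprocessing write, `mv` then finds no `*9600.model` files, and the HugeCTR backend finally cannot open `/tmp/model/criteo/1/criteo.json`. This reading is an inference from the captured output and has not been confirmed. A minimal pre-flight sketch, assuming the artifact names shown in the backend log, would report the missing files before Triton is started. `missing_model_files` and `EXPECTED` are hypothetical names for illustration and are not part of Merlin, NVTabular, or HugeCTR.

```python
from pathlib import Path
from typing import Iterable, List

# Artifact names copied from the HugeCTR backend log in this issue.
# Assumption: these are the only artifacts the backend needs in the version directory.
EXPECTED = ["criteo.json", "_dense_9600.model", "0_sparse_9600.model"]


def missing_model_files(model_dir: str, names: Iterable[str]) -> List[str]:
    """Return the expected paths under model_dir that do not exist.

    Hypothetical helper. It uses exists() rather than is_file() because
    the log does not say whether the sparse model is a file or a directory.
    """
    root = Path(model_dir)
    return [str(root / name) for name in names if not (root / name).exists()]
```

Calling `missing_model_files("/tmp/model/criteo/1", EXPECTED)` before launching the server and aborting on a non-empty result would make the test fail at the first missing artifact instead of inside the backend's JSON parser.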