NVIDIA-Merlin / HugeCTR

HugeCTR is a high efficiency GPU framework designed for Click-Through-Rate (CTR) estimating training
Apache License 2.0

[BUG]Can NOT run wdl_parquet.py: CUDNN_STATUS_MAPPING_ERROR #389

Closed butterluo closed 9 months ago

butterluo commented 1 year ago

Describe the bug

When I run samples/wdl/wdl_parquet.py according to its Readme.md, an error is thrown:

terminate called after throwing an instance of 'std::_Nested_exception'
what(): Runtime error: CUDNN_STATUS_MAPPING_ERROR
cudnnSetStream(cudnn_handle_, current_stream) at set_stream(/home/SRC/MLSys/RecsysAdSrch/Merlin/HugeCTR/HugeCTR/include/gpu_resource.hpp:79)
[b:117125] Process received signal
[b:117125] Signal: Aborted (6)
[b:117125] Signal code: (-6)
[b:117125] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fb889e50090]
[b:117125] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fb889e5000b]
[b:117125] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fb889e2f859]
[b:117125] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7fb879f4c911]
[b:117125] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7fb879f5838c]
[b:117125] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa9369)[0x7fb879f57369]
[b:117125] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x2a1)[0x7fb879f57d21]
[b:117125] [ 7] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x10bef)[0x7fb879ea3bef]
[b:117125] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x12a)[0x7fb879ea45aa]
[b:117125] [ 9] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR11GPUResource10set_streamERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEi+0x28e)[0x7fb87b7c8b9c]
[b:117125] [10] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR13StreamContextD1Ev+0x37)[0x7fb87bd4bfa7]
[b:117125] [11] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR25StreamContextScheduleable3runESt10shared_ptrINS_11GPUResourceEEbb+0x548)[0x7fb87bd4b2aa]
[b:117125] [12] /usr/local/hugectr/lib/libhuge_ctr_shared.so(+0x1cae798)[0x7fb87bd4b798]
[b:117125] [13] /usr/local/hugectr/lib/libhuge_ctr_shared.so(+0x1caeae2)[0x7fb87bd4bae2]
[b:117125] [14] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZNKSt8functionIFvP11CUstream_stEEclES1_+0x4d)[0x7fb87b7cbdaf]
[b:117125] [15] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR12GraphWrapper7captureESt8functionIFvP11CUstream_stEES3_+0xe8)[0x7fb87b7cb8dc]
[b:117125] [16] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR8Pipeline9run_graphEv+0xe7)[0x7fb87bd4b8d9]
[b:117125] [17] /usr/local/hugectr/lib/libhuge_ctr_shared.so(+0x1d73666)[0x7fb87be10666]
[b:117125] [18] /usr/lib/x86_64-linux-gnu/libgomp.so.1(GOMP_parallel+0x46)[0x7fb8086038e6]
[b:117125] [19] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR5Model5trainEb+0x500)[0x7fb87be05fa0]
[b:117125] [20] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR5Model3fitEiiiiiNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x26f5)[0x7fb87be03df3]
[b:117125] [21] /usr/local/hugectr/lib/hugectr.so(+0x3e42a3)[0x7fb8895d22a3]
[b:117125] [22] /usr/local/hugectr/lib/hugectr.so(+0x4cbc73)[0x7fb8896b9c73]
[b:117125] [23] /usr/local/hugectr/lib/hugectr.so(+0x48873e)[0x7fb88967673e]
[b:117125] [24] /usr/local/hugectr/lib/hugectr.so(+0x433e38)[0x7fb889621e38]
[b:117125] [25] /usr/local/hugectr/lib/hugectr.so(+0x43420e)[0x7fb88962220e]
[b:117125] [26] /usr/local/hugectr/lib/hugectr.so(+0x35e190)[0x7fb88954c190]
[b:117125] [27] python3(PyCFunction_Call+0x59)[0x5f5b39]
[b:117125] [28] python3(_PyObject_MakeTpCall+0x296)[0x5f6706]
[b:117125] [29] python3[0x50ba83]
[b:117125] End of error message
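
For readers triaging a similar crash, the backtrace above can be split into frames with a short script. This is only a sketch: the `Frame` type, `parse_backtrace`, and `frames_in` are hypothetical helper names, not part of HugeCTR, and the regular expression assumes the `[ N] module(symbol+0xOFF)[0xADDR]` layout seen in this particular log.

```python
import re
from typing import List, NamedTuple


class Frame(NamedTuple):
    index: int   # frame number printed in brackets, e.g. 9
    module: str  # path of the binary or shared library
    symbol: str  # mangled symbol, or "" when only an offset was printed
    offset: str  # offset inside the symbol or module, or "" if absent


# Matches entries such as
#   [ 9] /path/lib.so(_ZN7HugeCTR13StreamContextD1Ev+0x37)[0x7fb87bd4bfa7]
#   [29] python3[0x50ba83]
# The "[b:117125]" prefix is skipped because it is not purely numeric.
FRAME_RE = re.compile(
    r"\[\s*(?P<index>\d+)\]\s+"
    r"(?P<module>[^\s(\[]+)"
    r"(?:\((?P<symbol>[^+()]*)\+(?P<offset>0x[0-9a-fA-F]+)\))?"
    r"\[0x[0-9a-fA-F]+\]"
)


def parse_backtrace(text: str) -> List[Frame]:
    """Extract every frame from the log, whether or not it has line breaks."""
    return [
        Frame(int(m["index"]), m["module"], m["symbol"] or "", m["offset"] or "")
        for m in FRAME_RE.finditer(text)
    ]


def frames_in(frames: List[Frame], needle: str) -> List[Frame]:
    """Keep only frames whose module path contains `needle`."""
    return [f for f in frames if needle in f.module]
```

Calling `frames_in(parse_backtrace(log_text), "libhuge_ctr_shared.so")` on the log above would keep only the frames inside the HugeCTR shared library, which makes the path from `Model::fit` down to `GPUResource::set_stream` easier to follow.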

To Reproduce

Steps to reproduce the behavior:

  1. Build HugeCTR training from source (from tag v23.02.00) according to https://nvidia-merlin.github.io/HugeCTR/main/hugectr_contributor_guide.html
  2. Run /wdl_parquet.py according to samples/wdl/README.md

Additional context

Below is the call stack when the error happens. I found that cudnnSetStream() is called many times, but it only throws an error during graph capture.

CallStack:
libc.so.6!raise (Unknown Source:0)
libc.so.6!abort (Unknown Source:0)
libstdc++.so.6![Unknown/Just-In-Time compiled code] (Unknown Source:0)
libstdc++.so.6!__gxx_personality_v0 (Unknown Source:0)
libgcc_s.so.1![Unknown/Just-In-Time compiled code] (Unknown Source:0)
libgcc_s.so.1!_Unwind_Resume (Unknown Source:0)
libhuge_ctr_shared.so!HugeCTR::GPUResource::set_stream(HugeCTR::GPUResource const this, const std::string & name, int priority) (/home/SRC/MLSys/RecsysAdSrch/Merlin/HugeCTR/HugeCTR/include/gpu_resource.hpp:79)
libhuge_ctr_shared.so!HugeCTR::StreamContext::~StreamContext(HugeCTR::StreamContext const this) (/home/SRC/MLSys/RecsysAdSrch/Merlin/HugeCTR/HugeCTR/include/gpu_resource.hpp:114)
libhuge_ctr_shared.so!HugeCTR::StreamContextScheduleable::run(HugeCTR::StreamContextScheduleable const this, std::shared_ptr gpu, bool use_graph, bool prepare_resource) (/home/SRC/MLSys/RecsysAdSrch/Merlin/HugeCTR/HugeCTR/src/pipeline.cpp:72)
libhuge_ctr_shared.so!HugeCTR::Pipeline::<lambda(cudaStream_t)>::operator()(cudaStream_t) const(const HugeCTR::Pipeline::<lambda(cudaStream_t)> const closure) (/home/SRC/MLSys/RecsysAdSrch/Merlin/HugeCTR/HugeCTR/src/pipeline.cpp:112)
libhuge_ctr_shared.so!std::_Function_handler<void(CUstream_st), HugeCTR::Pipeline::run_graph()::<lambda(cudaStream_t)> >::_M_invoke(const std::_Any_data &, CUstream_st &&)(const std::_Any_data & functor, args#0) (/usr/include/c++/9/bits/std_function.h:300)
libhuge_ctr_shared.so!std::function<void (CUstream_st)>::operator()(CUstream_st) const(const std::function<void(CUstream_st)> const this, args#0) (/usr/include/c++/9/bits/std_function.h:688)
libhuge_ctr_shared.so!HugeCTR::GraphWrapper::capture(std::function<void (CUstream_st)>, CUstream_st)(HugeCTR::GraphWrapper const this, 
std::function<void(CUstream_st)> workload, cudaStream_t stream) (/home/SRC/MLSys/RecsysAdSrch/Merlin/HugeCTR/HugeCTR/src/graph_wrapper.cpp:31)
libhuge_ctr_shared.so!HugeCTR::Pipeline::run_graph(HugeCTR::Pipeline const this) (/home/SRC/MLSys/RecsysAdSrch/Merlin/HugeCTR/HugeCTR/src/pipeline.cpp:118)
libhuge_ctr_shared.so!HugeCTR::Model::_ZN7HugeCTR5Model5trainEb._omp_fn.1(void)() (/home/SRC/MLSys/RecsysAdSrch/Merlin/HugeCTR/HugeCTR/src/pybind/model.cpp:2393)
libgomp.so.1!GOMP_parallel (Unknown Source:0)
libhuge_ctr_shared.so!HugeCTR::Model::train(HugeCTR::Model const this, bool is_first_batch) (/home/SRC/MLSys/RecsysAdSrch/Merlin/HugeCTR/HugeCTR/src/pybind/model.cpp:2387)
libhuge_ctr_shared.so!HugeCTR::Model::fit(HugeCTR::Model * const this, int num_epochs, int max_iter, int display, int eval_interval, int snapshot, std::string snapshot_prefix) (/home/SRC/MLSys/RecsysAdSrch/Merlin/HugeCTR/HugeCTR/src/pybind/model.cpp:2185)
hugectr.so!pybind11::cpp_function::cpp_function<void, HugeCTR::Model, int, int, int, int, int, std::cxx11::basic_string<char, std::char_traits, std::allocator >, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v>(void (HugeCTR::Model::)(int, int, int, int, int, std::__cxx11::basic_string<char, std::char_traits, std::allocator >), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&)::{lambda(HugeCTR::Model, int, int, int, int, int, std::cxx11::basic_string<char, std::char_traits, std::allocator >)#1}::operator()(HugeCTR::Model*, int, int, int, int, int, std::cxx11::basic_string<char, std::char_traits, std::allocator >) const(const pybind11::cpp_function::<lambda(HugeCTR::Model, int, int, int, int, int, std::__cxx11::basic_string<char, std::char_traits, std::allocator >)> const 
this, HugeCTR::Model c, args#0, args#1, args#2, args#3, args#4, args#5) (/home/SRC/MLSys/RecsysAdSrch/Merlin/HugeCTR/build/_deps/pybind11_sources-src/include/pybind11/pybind11.h:110) hugectr.so!pybind11::detail::argument_loader<HugeCTR::Model, int, int, int, int, int, std::cxx11::basic_string<char, std::char_traits, std::allocator > >::call_impl<void, pybind11::cpp_function::cpp_function<void, HugeCTR::Model, int, int, int, int, int, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v>(void (HugeCTR::Model::*)(int, int, int, int, int, std::cxx11::basic_string<char, std::char_traits, std::allocator >), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&)::{lambda(HugeCTR::Model, int, int, int, int, int, std::__cxx11::basic_string<char, std::char_traits, std::allocator >)#1}&, 0ul, 1ul, 2ul, 3ul, 4ul, 5ul, 6ul, pybind11::detail::void_type>(pybind11::cpp_function::cpp_function<void, HugeCTR::Model, int, int, int, int, int, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v>(void (HugeCTR::Model::)(int, int, int, int, int, std::cxx11::basic_string<char, std::char_traits, std::allocator >), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&)::{lambda(HugeCTR::Model, int, int, int, int, int, std::__cxx11::basic_string<char, std::char_traits, std::allocator >)#1}&, std::integer_sequence<unsigned 
long, 0ul, 1ul, 2ul, 3ul, 4ul, 5ul, 6ul>, pybind11::detail::void_type&&) &&(pybind11::detail::argument_loader<HugeCTR::Model, int, int, int, int, int, std::cxx11::basic_string<char, std::char_traits, std::allocator > > const this, pybind11::cpp_function::<lambda(HugeCTR::Model, int, int, int, int, int, std::cxx11::basic_string<char, std::char_traits, std::allocator >)> & f) (/home/SRC/MLSys/RecsysAdSrch/Merlin/HugeCTR/build/_deps/pybind11_sources-src/include/pybind11/cast.h:1448) hugectr.so!pybind11::detail::argument_loader<HugeCTR::Model*, int, int, int, int, int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > >::call<void, pybind11::detail::void_type, pybind11::cpp_function::cpp_function<void, HugeCTR::Model, int, int, int, int, int, std::cxx11::basic_string<char, std::char_traits, std::allocator >, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v>(void (HugeCTR::Model::)(int, int, int, int, int, std::__cxx11::basic_string<char, std::char_traits, std::allocator >), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&)::{lambda(HugeCTR::Model, int, int, int, int, int, std::cxx11::basic_string<char, std::char_traits, std::allocator >)#1}&>(pybind11::cpp_function::cpp_function<void, HugeCTR::Model, int, int, int, int, int, std::cxx11::basic_string<char, std::char_traits, std::allocator >, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v>(void (HugeCTR::Model::)(int, int, int, int, int, std::__cxx11::basic_string<char, std::char_traits, std::allocator >), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::arg_v 
const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&)::{lambda(HugeCTR::Model, int, int, int, int, int, std::cxx11::basic_string<char, std::char_traits, std::allocator >)#1}&) &&(pybind11::detail::argument_loader<HugeCTR::Model, int, int, int, int, int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > > const this, pybind11::cpp_function::<lambda(HugeCTR::Model*, int, int, int, int, int, std::cxx11::basic_string<char, std::char_traits, std::allocator >)> & f) (/home/SRC/MLSys/RecsysAdSrch/Merlin/HugeCTR/build/_deps/pybind11_sources-src/include/pybind11/cast.h:1422) hugectr.so!pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<void, HugeCTR::Model, int, int, int, int, int, std::cxx11::basic_string<char, std::char_traits, std::allocator >, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v>(void (HugeCTR::Model::)(int, int, int, int, int, std::__cxx11::basic_string<char, std::char_traits, std::allocator >), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&)::{lambda(HugeCTR::Model, int, int, int, int, int, std::cxx11::basic_string<char, std::char_traits, std::allocator >)#1}, void, HugeCTR::Model, int, int, int, int, int, std::cxx11::basic_string<char, std::char_traits, std::allocator >, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v>(pybind11::cpp_function::initialize<void, HugeCTR::Model, int, int, int, int, int, std::cxx11::basic_string<char, std::char_traits, std::allocator >, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, 
pybind11::arg_v, pybind11::arg_v>(void (HugeCTR::Model::)(int, int, int, int, int, std::cxx11::basic_string<char, std::char_traits, std::allocator >), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&)::{lambda(HugeCTR::Model, int, int, int, int, int, std::__cxx11::basic_string<char, std::char_traits, std::allocator >)#1}&&, void ()(HugeCTR::Model, int, int, int, int, int, std::__cxx11::basic_string<char, std::char_traits, std::allocator >), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&)::{lambda(pybind11::detail::function_call&)#3}::operator()(pybind11::detail::function_call&) const(const pybind11::cpp_function::<lambda(pybind11::detail::function_call&)> const this, pybind11::detail::function_call & call) (/home/SRC/MLSys/RecsysAdSrch/Merlin/HugeCTR/build/_deps/pybind11_sources-src/include/pybind11/pybind11.h:248) hugectr.so!pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<void, HugeCTR::Model, int, int, int, int, int, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v>(void (HugeCTR::Model::*)(int, int, int, int, int, std::cxx11::basic_string<char, std::char_traits, std::allocator >), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&)::{lambda(HugeCTR::Model, int, int, int, int, int, std::__cxx11::basic_string<char, std::char_traits, std::allocator >)#1}, void, 
HugeCTR::Model, int, int, int, int, int, std::cxx11::basic_string<char, std::char_traits, std::allocator >, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v>(pybind11::cpp_function::initialize<void, HugeCTR::Model, int, int, int, int, int, std::cxx11::basic_string<char, std::char_traits, std::allocator >, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v>(void (HugeCTR::Model::)(int, int, int, int, int, std::__cxx11::basic_string<char, std::char_traits, std::allocator >), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&)::{lambda(HugeCTR::Model, int, int, int, int, int, std::cxx11::basic_string<char, std::char_traits, std::allocator >)#1}&&, void ()(HugeCTR::Model, int, int, int, int, int, std::cxx11::basic_string<char, std::char_traits, std::allocator >), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&)() (/home/SRC/MLSys/RecsysAdSrch/Merlin/HugeCTR/build/_deps/pybind11_sources-src/include/pybind11/pybind11.h:223) hugectr.so!pybind11::cpp_function::dispatcher(PyObject self, PyObject args_in, PyObject * kwargs_in) (/home/SRC/MLSys/RecsysAdSrch/Merlin/HugeCTR/build/_deps/pybind11_sources-src/include/pybind11/pybind11.h:939) PyCFunction_Call (Unknown Source:0) _PyObject_MakeTpCall (Unknown Source:0) [Unknown/Just-In-Time compiled code] (Unknown Source:0) _PyEval_EvalFrameDefault (Unknown Source:0) _PyEval_EvalCodeWithName (Unknown 
Source:0)
PyEval_EvalCode (Unknown Source:0)
[Unknown/Just-In-Time compiled code] (Unknown Source:0)
PyRun_SimpleFileExFlags (Unknown Source:0)
Py_RunMain (Unknown Source:0)
Py_BytesMain (Unknown Source:0)
libc.so.6!__libc_start_main (Unknown Source:0)

JacoCheung commented 1 year ago

Did you build with -DCMAKE_BUILD_TYPE=Debug or Release?

zpcalan commented 1 year ago

Same issue encountered here. The cuDNN API documentation says:

CUDNN_STATUS_MAPPING_ERROR
Mismatch between the user stream and the cuDNN handle context.

But I checked the code and added some extra logs. current_stream is created right after cudnn_handle_ is created, so I can't find any reason why they would not match.

Code in HugeCTR/include/gpu_resource.hpp:

void set_stream(const std::string& name, int priority = 0) {
  // Get and create stream here.
  cudaStream_t current_stream =
      stream_event_manager_.get_stream(name, cudaStreamNonBlocking, priority);
  stream_name_ = name;
  HCTR_LIB_THROW(curandSetStream(replica_variant_curand_generator_, current_stream));
  HCTR_LIB_THROW(cublasSetStream(cublas_handle_, current_stream));
  // The reported CUDNN_STATUS_MAPPING_ERROR is thrown by this call.
  HCTR_LIB_THROW(cudnnSetStream(cudnn_handle_, current_stream));
}

JacoCheung commented 1 year ago

Hi @zpcalan, we encountered this cuDNN error when building in Debug mode before the 22.05 release, and we fixed it in 22.05. So I'd like to know how you built the source code.

Can you try the latest image (22.06), or build HugeCTR in Release mode with the 22.02 release?

zpcalan commented 1 year ago

> Hi @zpcalan, we encountered this cuDNN error when building in Debug mode before the 22.05 release, and we fixed it in 22.05. So I'd like to know how you built the source code.
>
> Can you try the latest image (22.06), or build HugeCTR in Release mode with the 22.02 release?

Do you mean 23.06? I was using the 23.02 image and its built-in HugeCTR when I encountered this issue, so I did not build the source code myself. But never mind: I changed some hyper-parameters, such as the slot size array and the workspace size, and the error disappeared. It could be a GPU memory issue.
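
To make the memory hypothesis concrete, here is a rough back-of-the-envelope sketch of how slot sizes relate to an embedding workspace budget. Everything in it is an assumption for illustration: the helper names are hypothetical, the table is assumed to hold float32 values sharded evenly across GPUs, and optimizer state and HugeCTR's real allocation logic are ignored.

```python
from typing import Sequence

BYTES_PER_FLOAT32 = 4
BYTES_PER_MB = 1024 ** 2


def embedding_table_mb(slot_sizes: Sequence[int], embedding_vec_size: int) -> float:
    """Approximate size in MB of one embedding table with float32 weights.

    Assumes one row per key and that the total key count is the sum of the
    per-slot sizes. Optimizer state is not counted.
    """
    rows = sum(slot_sizes)
    return rows * embedding_vec_size * BYTES_PER_FLOAT32 / BYTES_PER_MB


def fits_workspace(
    slot_sizes: Sequence[int],
    embedding_vec_size: int,
    workspace_mb_per_gpu: float,
    num_gpus: int = 1,
) -> bool:
    """Check the estimate against a per-GPU workspace budget.

    Assumes the table is split evenly across `num_gpus` devices.
    """
    per_gpu_mb = embedding_table_mb(slot_sizes, embedding_vec_size) / num_gpus
    return per_gpu_mb <= workspace_mb_per_gpu
```

If an estimate like this comes out close to or above the configured workspace size, shrinking the slot sizes or enlarging the workspace, as described above, is a plausible thing to try. It does not confirm that memory pressure was the actual cause of the cuDNN error.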

JacoCheung commented 1 year ago

Glad to hear that. Out of curiosity, can you help confirm whether this error still happens in 23.06 if you don't change any hyper-parameters?

zpcalan commented 1 year ago

OK, I'll try it later.