Describe the bug
Because max_buffer_size is stored as a 32-bit integer in the following code snippet, it can easily overflow when the batch size is large: https://github.com/NVIDIA-Merlin/HugeCTR/blob/772fd505e3652ce143be2ee83f025dc36cf16e89/HugeCTR/embedding/common.cpp#L386-L399
The overflow can result in a negative value being passed to the HugeCTR memory allocator. cudaMalloc takes its size argument as an unsigned integer, so the negative value is interpreted as an enormous allocation request, which instantly triggers an OOM.
Expected behavior
This value should be stored as a 64-bit integer.
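The failure mode described above can be reproduced without a GPU. The following is a minimal sketch, not the HugeCTR code itself: the batch size, the per-sample element count, and all function names are hypothetical and chosen only so that the product exceeds the 32-bit range. It assumes a two's-complement target with a 64-bit size_t, which is the usual case on CUDA hosts.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iostream>

// Hypothetical sizes for illustration only; they are not taken from HugeCTR.
constexpr std::int64_t kBatchSize = 1 << 20;       // 1,048,576 samples
constexpr std::int64_t kElementsPerSample = 3000;  // assumed per-sample element count

// The true size, computed and stored in 64 bits: 3,145,728,000.
constexpr std::int64_t buffer_size_64() { return kBatchSize * kElementsPerSample; }

// Mimics storing the same value in a 32-bit int. The product is computed in
// 64 bits and then narrowed, because overflowing an int multiplication
// directly would be undefined behavior. On two's-complement targets the
// narrowed value wraps around to a negative number.
constexpr std::int32_t buffer_size_32() {
  return static_cast<std::int32_t>(buffer_size_64());
}

// cudaMalloc takes its size as size_t, so a negative signed value is
// converted to a very large unsigned one before the allocation is attempted.
constexpr std::size_t as_allocation_size(std::int64_t n) {
  return static_cast<std::size_t>(n);
}

// Prints the 32-bit value and the allocation size each variant would request.
inline void print_demo() {
  std::cout << "value stored in 32 bits: " << buffer_size_32() << '\n'
            << "bytes requested (32):    " << as_allocation_size(buffer_size_32()) << '\n'
            << "bytes requested (64):    " << as_allocation_size(buffer_size_64()) << '\n';
  assert(buffer_size_32() < 0);  // the 32-bit variant has wrapped negative
}
```

Calling print_demo() from main shows the 32-bit variant requesting an allocation many orders of magnitude larger than the 64-bit one, which is consistent with the immediate OOM reported here. Storing the size in a 64-bit type keeps the intended value intact.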