NVIDIA-Merlin / HugeCTR

HugeCTR is a high efficiency GPU framework designed for Click-Through-Rate (CTR) estimating training
Apache License 2.0
946 stars 200 forks source link

[BUG] EmbeddingCollection Wgrad buffer sizes can overflow a 32 bit integer. #382

Closed zpzim closed 1 year ago

zpzim commented 1 year ago

Describe the bug

Because max_buffer_size is stored as a 32-bit integer in the following code snippet: https://github.com/NVIDIA-Merlin/HugeCTR/blob/772fd505e3652ce143be2ee83f025dc36cf16e89/HugeCTR/embedding/common.cpp#L386-L399

It can easily lead to an overflow of the value when the batch size is large. This can result in a negative value being passed to the HugeCTR memory allocator, which gets interpreted as an unsigned integer by cudaMalloc, instantly triggering an OOM.

Expected behavior This value should be stored as a 64-bit integer.

zpzim commented 1 year ago

Actually this same issue is present for the AllreduceWgradInitializer::init_data() method.

https://github.com/NVIDIA-Merlin/HugeCTR/blob/772fd505e3652ce143be2ee83f025dc36cf16e89/HugeCTR/embedding/common.cpp#L496

shijieliu commented 1 year ago

Thanks for the feedback! We will check and fix it.

RayWang96 commented 1 year ago

The fix is in the main branch, @zpzim could you verify it? Thanks.