Non-OK-status: GpuLaunchKernel error during distributed training of a large model

I am trying to train a 3-billion-parameter model on two A100 GPUs using nvidia-tensorflow 1.15 (21.07-tf1-py3) with tf.distribute.MirroredStrategy and a batch size of 24.

The error message is:

2023-06-03 07:27:26.364872: F tensorflow/core/kernels/concat_lib_gpu_impl.cu.cc:161] Non-OK-status: GpuLaunchKernel( concat_variable_kernel<T, IntType, true>, config.block_count, config.thread_per_block, smem_usage, gpu_device.stream(), input_ptrs, output_scan, static_cast(output->dimension(0)), static_cast(output->dimension(1)), output->data()) status: Internal: invalid configuration argument

The error appears only when the model is large and distributed training is used together: the same model trains successfully on a single GPU with a batch_size of 12, and a 1.5B-parameter model trains successfully on two GPUs.
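
One rough sanity check on the size threshold, offered as an assumption rather than a diagnosis: if the distributed path builds a single flat buffer holding one value per parameter (I have not confirmed this in the TensorFlow source), then the 3B model would exceed the signed 32-bit index range while the 1.5B model would not. The sketch below only does that arithmetic. The helper name exceeds_int32 is hypothetical, and the parameter counts are the rounded figures from this report, not exact values.

```python
# Hypothetical sanity check: compare element counts against the signed
# 32-bit limit. This is plain arithmetic, not a TensorFlow API call.
INT32_MAX = 2**31 - 1  # 2,147,483,647


def exceeds_int32(num_elements: int) -> bool:
    """Return True if a flat buffer with this many elements cannot be indexed by a signed 32-bit integer."""
    return num_elements > INT32_MAX


# Rounded parameter counts taken from the report (assumed, not exact).
configs = {
    "3B model (fails on two GPUs)": 3_000_000_000,
    "1.5B model (trains on two GPUs)": 1_500_000_000,
}

for name, count in configs.items():
    print(f"{name}: {count:,} elements, exceeds int32: {exceeds_int32(count)}")
```

This matches the observed split between the failing and working model sizes, but it does not show that the index range is what makes the kernel launch configuration invalid.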

I understand that using TensorFlow for training large models may not be the best option, but at present, I need to address this issue.
