I am attempting to train a model with 3 billion parameters on two A100 GPUs using nvidia-tensorflow 1.15 (21.07-tf1-py3), with a batch size of 24 and tf.distribute.MirroredStrategy.
The error message is:
This seems to be an issue that only occurs when the model is large enough and distributed training is used: the model trains successfully on a single GPU with a batch size of 12, and also on two GPUs when the model is reduced to 1.5B parameters.
I understand that using TensorFlow for training large models may not be the best option, but at present, I need to address this issue.
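For reference, this is roughly how the training is set up. It is only a minimal sketch of the configuration described above: the tiny Dense model and random data are placeholders for the real ~3B-parameter model and input pipeline, which are not shown here.

```python
import numpy as np
import tensorflow as tf

# Both GPUs are picked up automatically; this prints 2 on the two-A100 machine.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Tiny stand-in for the real ~3B-parameter model, which is built and
    # compiled the same way inside strategy.scope().
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(1024, activation="relu", input_shape=(1024,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")

# Global batch size of 24, i.e. 12 examples per GPU replica.
x = np.random.rand(240, 1024).astype(np.float32)
y = np.random.rand(240, 1).astype(np.float32)
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(24)

model.fit(dataset, epochs=1, steps_per_epoch=10)
```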