I was able to train the Mixtral code successfully on a single node with 8 GPUs by reducing the model size, but when I switched to multiple nodes, the loss no longer decreases per iteration the way it did in the single-node setup. Is there something wrong with my multi-node configuration?