Thanks a lot for sharing the code. I'm trying to reproduce the pre-training on CC12M using 4 H100 GPUs, but the loss becomes negative after training for a while (see the screenshot). Have you also observed this? Thanks in advance!
Yes, this is normal. In short, this "loss" is only a term that facilitates computing the gradient estimator in compositional optimization; it is not the value of the loss function itself, so it can become negative.
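To make the distinction concrete, here is a minimal toy sketch. It is not the code from this repository: the objective `F(w) = exp(-g(w))`, the inner function `g`, and all names below are assumptions chosen only for illustration. The logged term is built so that its gradient matches the gradient of the true objective, while its value is a different quantity that can be negative even when the true objective is positive.

```python
import math

def inner(w, xs):
    """Inner function g(w): mean of w * x over the batch (hypothetical)."""
    return sum(w * x for x in xs) / len(xs)

def inner_grad(xs):
    """dg/dw for the inner function above (it is linear in w)."""
    return sum(xs) / len(xs)

def true_loss(w, xs):
    """Toy compositional objective F(w) = f(g(w)) with f(u) = exp(-u)."""
    return math.exp(-inner(w, xs))

def surrogate(w, xs, u):
    """Logged term f'(u) * g(w), where u is a fixed estimate of g(w).

    u is treated as a constant (as if detached from the graph), so only
    g(w) contributes to the gradient with respect to w.
    """
    return -math.exp(-u) * inner(w, xs)

def surrogate_grad(xs, u):
    """Gradient of the surrogate with respect to w, holding u constant."""
    return -math.exp(-u) * inner_grad(xs)

xs = [0.5, 1.0, 1.5]
w = 2.0
# Assumption: the estimate u is exact here; in practice it would be a
# running estimate of the inner function that is updated during training.
u = inner(w, xs)

# Central finite difference of the true objective, for comparison.
h = 1e-6
fd_grad = (true_loss(w + h, xs) - true_loss(w - h, xs)) / (2 * h)

print("true loss:        ", true_loss(w, xs))
print("logged surrogate: ", surrogate(w, xs, u))
print("surrogate gradient:", surrogate_grad(xs, u))
print("true gradient (fd):", fd_grad)
```

In this toy setup the true loss is positive and the logged surrogate is negative, yet both gradients agree, which is why the sign of the logged value alone does not indicate a problem.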