Hm, those values do look unusual. I'll have to find time to rebuild my own env with modern package versions and check if I can reproduce this. In the meantime:
Your data is wrapping around at roughly step 650k, as the card is too fast :). The canonical fix would be to generate more data, but with the-pile taken offline, this also would not be a straightforward comparison. Maybe generating more data with OSCAR would do the trick. Alternatively, you can shuffle, which will at least fix the bump.
A potential culprit on untested GPUs like yours is the torch.compile settings. You could try training with compilation turned off to check if anything is different.
If it's none of these, then it is likely a problem with a more modern version of some package. This will be a bit harder to fix.
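On the compilation point above: in a Hydra-style setup like this one, turning compilation off should just be another command-line override. As a sketch only, assuming the flag is named impl.compile_torch (the exact name is an assumption; check the files under cramming/config/impl in your version), the default command from this thread would become:
python pretrain.py name=amp_b8192_cb_o4_final arch=crammed-bert train=bert-o4 data=pile-readymade impl.compile_torch=False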
Hi Jonas,
I encountered the same issue when executing the default pretraining command on my NVIDIA 4090 GPU.
Following your guidance, I disabled torch.compile, which resulted in a more promising loss curve, culminating in a final loss of 2.111. However, the curve still exhibits periodic spikes that appear unusual.
Given these circumstances, do you have any insights into what might be causing these issues? Also, if it's not too much trouble, would it be possible for you to share an example of the expected loss curve for comparison? I greatly appreciate your assistance and support in resolving these problems.
Hi,
I took your advice and disabled torch.compile, leaving all other settings unchanged. The resulting loss curve is very similar to what @thuwzt reported, and GLUE performance improved to 0.79.
Based on your paper, the MLM loss should be at least below 1.9, which means there still exist some problems. Could you share a tested env with all the package versions?
Thank you very much for the support!
Hi, just a temporary notice: things are moving slowly, but they are moving. I was able to make some time to find the cause of this myself, and I am finally running my own tests against modern torch versions.
There is also some parallel investigation happening upstream here: https://github.com/pytorch/pytorch/issues/96693
Ok, I am now able to say a bit more. I think there are multiple things coming together here.
My reference run has torch.compile turned on and the original hand-made compile settings (which I've now re-enabled as defaults in the repo). The model report is here: https://api.wandb.ai/links/jonasgeiping/u6uu6cpp, and it was run with a standard PyTorch environment (which I've included here: https://github.com/JonasGeiping/cramming/blob/main/environment.yml).
That run uses the bookcorpus-wikipedia data, where 1.8-1.9 MLM is reachable.
I'm still waiting for a few more models to queue and run, and might have more answers about torch.compile then, but for now, this might be helpful.
Thanks for your help! The suggestions are helpful; I can now achieve a loss of 1.971 on an A5500 GPU in 24 hours. Using dataset "e9f3c90fb38fb46185ad86ed3b69b9d5" and seed=32, the final GLUE score is 80.3. The loss curve with torch.compile turned on is not very stable, but I guess that's normal, given that the log you shared has a similar loss curve?
Yeah, this is ok. There should also be compile settings where the curve is smoother and still fast, but I have not fully identified them yet. It's lower priority though, now that we're sure that everything does work, in principle.
P.S.: Make sure to report GLUE as the average over the 5-trial medians of all downstream tasks in the end.
Closing this for now then, feel free to reopen if any questions come up!
Hi Jonas, I noticed in the log you shared that the microbatch_size is 512, but in the experiment run with the following command, the microbatch_size is 128. Is this a bug?
python pretrain.py name=amp_b8192_cb_o4_final arch=crammed-bert train=bert-o4 data=pile-readymade
Only a bug in the sense that the documentation should be clearer about it.
The default mbs is set to 128 so the code runs immediately on most GPUs, but in general, I don't know what GPUs will run this code, and I am always under the assumption that people will set an mbs that saturates their card, to get the most out of it. For the A6000 card I used for this run, 512 is close to saturating the GPU.
The true batch size is defined in train.batch_size and is independent of the MBS, so changes to impl.microbatch_size, like the other settings in impl, only affect the implementation of the recipe, not the recipe itself. The recipe is still being run correctly; the available card is just not used as optimally as it could have been.
All that being said, the docs should say just this much more clearly...
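To make that concrete, a sketch assuming the same override syntax as the command above: a larger card could be saturated by overriding the micro-batch size, for example
python pretrain.py name=amp_b8192_cb_o4_final arch=crammed-bert train=bert-o4 data=pile-readymade impl.microbatch_size=512
train.batch_size stays at its recipe value (8192 here), so only the throughput of the implementation changes.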
Hi Jonas, thank you for your reply. I would like to confirm one more thing: do microbatch_size=128 and microbatch_size=512 actually only affect the training speed? Since the tokens consumed per update are specified by batch_size=8192, the final loss between the two should be similar?
Yes, if you run for a fixed number of tokens. By default the code runs the 24h cramming setting, where a more efficient use of the GPU does lead to improvements.
Hi Jonas, thank you again for your response. I would like to ask another question: is it possible to set a maximum number of updates? I want to compare the results of different methods when consuming the same number of tokens.
Just set a very large budget, a finite number of train.steps, and switch the scheduler to a non-budget version (by removing budget- from the name of the scheduler). Regarding our discussion of batch size, you also have to make sure to compare with equal MBS, because train.steps counts micro-batch steps (but you could also simply divide out the change).
With these small tweaks you can run for a fixed token budget.
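A sketch of what such a run might look like, assuming the scheduler is selected via train.scheduler and the default name carries the budget- prefix (the run name, scheduler name, step count, and budget value below are placeholder assumptions, not tested settings):
python pretrain.py name=fixed_token_run arch=crammed-bert train=bert-o4 data=pile-readymade budget=10000 train.steps=600000 train.scheduler=one-cycle impl.microbatch_size=128
Pinning impl.microbatch_size to the same value across the methods being compared keeps train.steps equivalent to the same number of tokens.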
Thank you very much, I will try this.
Hi,
Thank you for this amazing repository. I am trying to replicate your model by running the default pretraining command from the README. The only change I made to that command is adding 'budget=24'.
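(Assuming this refers to the same default command discussed earlier in this thread, the full invocation would be:
python pretrain.py name=amp_b8192_cb_o4_final arch=crammed-bert train=bert-o4 data=pile-readymade budget=24)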
I train the model for 24 hrs on 1 A100 40G GPU, but the average GLUE is only 0.73; based on your paper, I assume it should be somewhere between 0.792 (A4000) and 0.804 (A6000). The installation of the repository was done in a fresh conda environment, and I only made three changes to the code, which are the changes mentioned in #38, #44, and the wandb configs.
Below is the attached wandb log of the pre-training loss; the loss ends at 2.973 and the curve does not look right.
Could you guide me on what might be the problem? I am happy to provide any further information you need.
Thanks so much for the help!