Hm, those values do look unusual. I'll have to find time to rebuild my own env with modern package versions and check if I can reproduce this. In the meantime:
Your data is wrapping around at roughly step 650k, as the card is too fast :). The canonical fix would be to generate more data, but with the-pile taken offline, this also would not be a straightforward comparison. Maybe generating more data with OSCAR would do the trick. Alternatively, you can shuffle, which will at least fix the bump.
A potential culprit on untested GPUs like yours is the torch.compile settings. You could try training with compilation turned off to check if anything is different.
If it's none of these, then it is likely a problem with a more modern version of some package. This will be a bit harder to fix.
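On the compilation point above: in a Hydra-style setup like this one, turning compilation off should just be another command-line override. As a sketch only, assuming the flag is named impl.compile_torch (the exact name is an assumption; check the files under cramming/config/impl in your version), the default command from this thread would become:
python pretrain.py name=amp_b8192_cb_o4_final arch=crammed-bert train=bert-o4 data=pile-readymade impl.compile_torch=False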
Hi Jonas,
I encountered the same issue when executing the default pretraining command on my NVIDIA 4090 GPU.
Following your guidance, I disabled torch.compile, which resulted in a more promising loss curve, culminating in a final loss of 2.111. However, the curve still exhibits periodic spikes that appear unusual.
Given these circumstances, do you have any insights into what might be causing these issues? Also, if it's not too much trouble, would it be possible for you to share an example of the expected loss curve for comparison? I greatly appreciate your assistance and support in resolving these problems.
Hi,
I took your advice and disabled torch.compile, leaving all other settings unchanged. The resulting loss curve is very similar to what @thuwzt reported, and GLUE performance improved to 0.79.
Based on your paper, the MLM loss should be at least below 1.9, which means there still exist some problems. Could you share a tested env with all the package versions?
Thank you very much for the support!
Hi, just a temporary notice: things are moving slowly, but they are moving. I was able to make some time to find the cause of this myself, and I am finally running my own tests against modern torch versions.
There is also some parallel investigation happening upstream here: https://github.com/pytorch/pytorch/issues/96693
Ok, I am now able to say a bit more. I think there are multiple things coming together here.
My reference run has torch.compile turned on and the original hand-made compile settings (which I've now re-enabled as defaults in the repo). The model report is here: https://api.wandb.ai/links/jonasgeiping/u6uu6cpp, and it was run with a standard PyTorch environment (which I've included here: https://github.com/JonasGeiping/cramming/blob/main/environment.yml).
That run uses the bookcorpus-wikipedia data, where 1.8-1.9 MLM is reachable.
I'm still waiting for a few more models to queue and run, and might have more answers about torch.compile then, but for now, this might be helpful.
Thanks for your help! The suggestions are helpful; I can now achieve a loss of 1.971 on an A5500 GPU in 24 hours. Using dataset "e9f3c90fb38fb46185ad86ed3b69b9d5" and seed=32, the final GLUE score is 80.3. The loss curve with torch.compile turned on is not very stable, but I guess that's normal, given that the log you shared has a similar loss curve?
Yeah, this is ok. There should also be compile settings where the curve is smoother and still fast, but I have not fully identified them yet. It's lower priority though, now that we're sure that everything does work, in principle.
P.S.: Make sure to report GLUE as the average over the 5-trial medians of all downstream tasks in the end.
Closing this for now then, feel free to reopen if any questions come up!
Hi Jonas, I noticed in the log you shared that the microbatch_size is 512, but in the experiment run with the following command, the microbatch_size is 128. Is this a bug?
python pretrain.py name=amp_b8192_cb_o4_final arch=crammed-bert train=bert-o4 data=pile-readymade
Only a bug in the sense that the documentation should be clearer about it.
The default mbs is set to 128 so the code runs immediately on most GPUs, but in general, I don't know what GPUs will run this code, and I am always under the assumption that people will set an mbs that saturates their card, to get the most out of it. For the A6000 card I used for this run, 512 is close to saturating the GPU.
The true batch size is defined in train.batch_size and is independent of the MBS, so changes to impl.microbatch_size, like the other settings in impl, only affect the implementation of the recipe, not the recipe itself. The recipe is still being run correctly; the available card is just not used as optimally as it could have been.
All that being said, the docs should say just this much more clearly...
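To make that concrete, a sketch assuming the same override syntax as the command above: a larger card could be saturated by overriding the micro-batch size, for example
python pretrain.py name=amp_b8192_cb_o4_final arch=crammed-bert train=bert-o4 data=pile-readymade impl.microbatch_size=512
train.batch_size stays at its recipe value (8192 here), so only the throughput of the implementation changes.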
Hi Jonas, thank you for your reply. I would like to confirm one more thing: do microbatch_size=128 and microbatch_size=512 actually only affect the training speed? Since the tokens consumed per update are specified by batch_size=8192, the final loss between the two should be similar?
Yes, if you run for a fixed number of tokens. By default the code runs the 24h cramming setting, where a more efficient use of the GPU does lead to improvements.
Hi Jonas, thank you again for your response. I would like to ask another question: is it possible to set a maximum number of updates? I want to compare the results of different methods when consuming the same number of tokens.
Just set a very large budget, a finite number of train.steps, and switch the scheduler to a non-budget version (by removing budget- from the name of the scheduler). Regarding our discussion of batch size, you also have to make sure to compare with equal MBS, because train.steps counts micro-batch steps (but you could also simply divide out the change).
With these small tweaks you can run for a fixed token budget.
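A sketch of what such a run might look like, assuming the scheduler is selected via train.scheduler and the default name carries the budget- prefix (the run name, scheduler name, step count, and budget value below are placeholder assumptions, not tested settings):
python pretrain.py name=fixed_token_run arch=crammed-bert train=bert-o4 data=pile-readymade budget=10000 train.steps=600000 train.scheduler=one-cycle impl.microbatch_size=128
Pinning impl.microbatch_size to the same value across the methods being compared keeps train.steps equivalent to the same number of tokens.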
Thank you very much, I will try this.
Hi,
Thank you for this amazing repository. I am trying to replicate your model by running the default pretraining command from the README. The only change I made to that command is adding 'budget=24'.
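(Assuming this refers to the same default command discussed earlier in this thread, the full invocation would be:
python pretrain.py name=amp_b8192_cb_o4_final arch=crammed-bert train=bert-o4 data=pile-readymade budget=24)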
I train the model for 24 hrs on 1 A100 40G GPU, but the average GLUE is only 0.73; based on your paper, I assume it should be somewhere between 0.792 (A4000) and 0.804 (A6000). The installation of the repository was done in a fresh conda environment, and I only made three changes to the code, which are the changes mentioned in #38, #44, and the wandb configs.
Below is the attached wandb log of the pre-training loss; the loss ends at 2.973 and the curve does not look right.
Could you guide me on what might be the problem? I am happy to provide any further information you need.
Thanks so much for the help!