epfLLM / Megatron-LLM

distributed trainer for LLMs

llama2 & vocabulary padding (making embedding layer sizes divisible by 128) #50

Closed — andreaskoepf closed this 1 year ago

andreaskoepf commented 1 year ago

In weights2megatron.py, args.make_vocab_size_divisible_by = 1 is set for llama models, and this setting seems to remain in effect all the way through the conversion to HF (here are the args of a 70b). This differs from the falcon models, which actually use a vocabulary padded with dummy tokens, and that padding also carries over to the Hugging Face export, e.g. see OpenAssistant/falcon-40b-megacode2-oasst/blob/main/config.json#L24.

Normally the vocabulary size is padded to a value divisible by 128 to improve model efficiency. Does this padding not have a beneficial effect for llama2? Was there a good reason not to use the Megatron default of 128?
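For reference, a minimal sketch of how Megatron-style vocab padding is typically computed (the helper name and signature below are illustrative, not taken from weights2megatron.py; Megatron-LM's own logic also folds in the tensor-model-parallel size, which I assume applies here as well):

```python
def padded_vocab_size(orig_vocab_size: int,
                      make_vocab_size_divisible_by: int = 128,
                      tensor_model_parallel_size: int = 1) -> int:
    """Pad the vocabulary so the embedding row count is a multiple of
    make_vocab_size_divisible_by * tensor_model_parallel_size
    (sketch of Megatron-style padding, not the repo's actual code)."""
    multiple = make_vocab_size_divisible_by * tensor_model_parallel_size
    return ((orig_vocab_size + multiple - 1) // multiple) * multiple


# llama2's tokenizer has 32000 tokens, which is already a multiple of 128:
print(padded_vocab_size(32000, make_vocab_size_divisible_by=1))    # 32000
print(padded_vocab_size(32000, make_vocab_size_divisible_by=128))  # 32000
# padding only kicks in once tensor parallelism is factored in, e.g. TP=8:
print(padded_vocab_size(32000, 128, tensor_model_parallel_size=8)) # 32768
```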

AleHD commented 1 year ago

I don't think there was any specific reason for that; it was probably left over from early weight-loading tests. Apparently it now actually breaks something else. I will run local tests to make sure setting it to 128 doesn't break anything, in which case I would be happy to stick with 128.