Closed andreaskoepf closed 1 year ago
I don't think there was any specific reason for that; it was probably left over from early weight-loading tests. Apparently it now actually breaks something else. I will run local tests to make sure setting it to 128 doesn't break anything, in which case I would be happy to stick with 128.
In weights2megatron.py for llama models,
`args.make_vocab_size_divisible_by = 1`
is set, and this setting seems to remain in effect all the way through conversion to hf (here are args of a 70b). This differs from falcon models, which actually use a padded vocabulary with dummy tokens; that padding also carries over to the huggingface export, e.g. see OpenAssistant/falcon-40b-megacode2-oasst/blob/main/config.json#L24. Normally the size is padded to a value divisible by 128 to improve model efficiency. Does this padding not have a beneficial effect for llama2? Was there a good reason not to use the megatron default value of 128?