Add a `pad-vocab-size-to` argument so the user can specify the desired tokenizer vocabulary size, instead of relying on the automatic computation of a vocab size compatible with `make_vocab_size_divisible_by` and the tensor parallelism value.
We also took advantage of this new feature to add tests and an assert verifying that input ids cannot fall outside the admitted vocabulary range.
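A minimal sketch of the intended behaviour, assuming hypothetical names `pad_vocab_size_to`, `make_vocab_size_divisible_by`, and `tensor_model_parallel_size` on an `args` namespace; the actual code in the repository may differ:

```python
from argparse import Namespace


def vocab_size_with_padding(orig_vocab_size, args):
    """Return the padded vocab size: either the user-specified value from
    --pad-vocab-size-to, or the next multiple of
    make_vocab_size_divisible_by * tensor_model_parallel_size."""
    if args.pad_vocab_size_to is not None:
        # User-specified size must hold the original vocabulary and stay
        # compatible with tensor parallelism.
        assert args.pad_vocab_size_to >= orig_vocab_size, (
            f"--pad-vocab-size-to ({args.pad_vocab_size_to}) is smaller than "
            f"the tokenizer vocab size ({orig_vocab_size})")
        assert args.pad_vocab_size_to % args.tensor_model_parallel_size == 0, (
            "--pad-vocab-size-to must be divisible by the tensor parallel size")
        return args.pad_vocab_size_to

    # Automatic behaviour: round up to the next multiple.
    multiple = args.make_vocab_size_divisible_by * args.tensor_model_parallel_size
    return ((orig_vocab_size + multiple - 1) // multiple) * multiple


def check_input_ids(input_ids, padded_vocab_size):
    """Assert that every token id lies in the admitted range [0, padded_vocab_size)."""
    assert min(input_ids) >= 0 and max(input_ids) < padded_vocab_size, (
        "Input ids must be within the padded vocabulary range")


# Example: without --pad-vocab-size-to, 50257 is rounded up to 50432
# (the next multiple of 128 * 2 = 256).
args = Namespace(pad_vocab_size_to=None, make_vocab_size_divisible_by=128,
                 tensor_model_parallel_size=2)
print(vocab_size_with_padding(50257, args))  # 50432
```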
Looks good to me. At some point we should probably factor out all of the pool/process launch stuff into a common place, but that's for when things have calmed down a bit.