For certain vocabs, padding and unknown tokens are not necessary. This PR lets users define a valid tokenizer that encodes and decodes sequences even without those tokens. NOTE: Although these tokens are no longer required, they are still useful and should be included where possible.
The primary reason for this PR is to allow the AA20_ONLY vocab to be used for downstream tasks, such as generating valid new sequences or defining a kernel for use with a Gaussian process regressor.
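To illustrate the intended behaviour, here is a minimal sketch of a tokenizer whose padding and unknown tokens are optional. The class `SimpleTokenizer`, its parameters, the `AA20` list, and the `<pad>`/`<unk>` token strings are all hypothetical stand-ins for illustration; they are not the classes or names changed in this PR.

```python
from typing import List, Optional, Sequence

# Hypothetical stand-in for the AA20_ONLY vocab: the 20 standard amino acids.
AA20 = list("ACDEFGHIKLMNPQRSTVWY")


class SimpleTokenizer:
    """Illustrative character-level tokenizer (an assumption, not the PR's code)."""

    def __init__(
        self,
        vocab: Sequence[str],
        pad_token: Optional[str] = None,
        unk_token: Optional[str] = None,
    ) -> None:
        # Special tokens are optional, but if given they must be in the vocab.
        for name, tok in (("pad_token", pad_token), ("unk_token", unk_token)):
            if tok is not None and tok not in vocab:
                raise ValueError(f"{name} {tok!r} is not in the vocab")
        self.id_to_token = list(vocab)
        self.token_to_id = {tok: i for i, tok in enumerate(self.id_to_token)}
        self.pad_token = pad_token
        self.unk_token = unk_token

    def encode(self, sequence: str, max_length: Optional[int] = None) -> List[int]:
        ids = []
        for ch in sequence:
            if ch in self.token_to_id:
                ids.append(self.token_to_id[ch])
            elif self.unk_token is not None:
                ids.append(self.token_to_id[self.unk_token])
            else:
                # Without an unknown token, out-of-vocab characters are an error.
                raise ValueError(f"{ch!r} is not in the vocab and no unk_token is set")
        if max_length is not None and len(ids) < max_length:
            if self.pad_token is None:
                raise ValueError("padding requested but no pad_token is set")
            ids += [self.token_to_id[self.pad_token]] * (max_length - len(ids))
        return ids

    def decode(self, ids: Sequence[int]) -> str:
        # Padding (if defined) is dropped; every other token is emitted as-is.
        tokens = (self.id_to_token[i] for i in ids)
        return "".join(t for t in tokens if t != self.pad_token)


# A vocab with no special tokens still round-trips valid sequences.
plain = SimpleTokenizer(AA20)
print(plain.decode(plain.encode("ACDY")))

# With special tokens, unknown characters and padding are handled gracefully.
full = SimpleTokenizer(["<pad>", "<unk>"] + AA20, pad_token="<pad>", unk_token="<unk>")
print(full.encode("AX", max_length=4))
```

In this sketch, a tokenizer without special tokens fails loudly on out-of-vocab input or a padding request rather than silently producing invalid ids, which is one reasonable way to keep such a tokenizer safe to use.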
Changes
Fix pylon errors in tokenizers and vocabs
More accurate and useful vocab types
Remove the constraint that padding and unknown tokens be present in every vocab (previously required for tokenizers to work)