ayushkarnawat / profit

Exploring evolutionary protein fitness landscapes
MIT License

Remove vocab constraints for tokenizer #98

Closed ayushkarnawat closed 4 years ago

ayushkarnawat commented 4 years ago

What does this PR do?

For certain vocabs, a padding token and an unknown token are not necessary. With this change, users can define a valid tokenizer that encodes and decodes sequences even when the vocab contains neither token. NOTE: Although these tokens are no longer required, they are still useful and should be included where possible.

The primary reason for this PR is to allow the AA20_ONLY vocab to be used for downstream tasks, such as generating valid new sequences or defining a kernel for use with a Gaussian process regressor.
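To illustrate the behaviour described above, here is a minimal sketch of a tokenizer whose padding and unknown tokens are optional. It is not the actual `profit` implementation: the class `SimpleTokenizer`, its parameters, and the vocab `AA20_ONLY_EXAMPLE` (including its token ordering) are hypothetical names assumed for this example only.

```python
from typing import Dict, List, Optional

# Hypothetical vocab: the 20 standard amino acids, no special tokens.
# The index order here is an assumption, not the ordering used in `profit`.
AA20_ONLY_EXAMPLE: Dict[str, int] = {
    aa: idx for idx, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")
}


class SimpleTokenizer:
    """Illustrative tokenizer with optional pad/unk tokens (hypothetical)."""

    def __init__(
        self,
        vocab: Dict[str, int],
        pad_token: Optional[str] = None,
        unk_token: Optional[str] = None,
    ) -> None:
        # Special tokens are optional, but any that are given must be in the vocab.
        for token in (pad_token, unk_token):
            if token is not None and token not in vocab:
                raise ValueError(f"Special token {token!r} is not in the vocab.")
        self.vocab = vocab
        self.pad_token = pad_token
        self.unk_token = unk_token
        self.inverse_vocab = {idx: tok for tok, idx in vocab.items()}

    def encode(self, sequence: str) -> List[int]:
        ids = []
        for char in sequence:
            if char in self.vocab:
                ids.append(self.vocab[char])
            elif self.unk_token is not None:
                # Fall back to the unknown token only when one is defined.
                ids.append(self.vocab[self.unk_token])
            else:
                raise KeyError(f"{char!r} is not in the vocab and no unk_token is set.")
        return ids

    def decode(self, ids: List[int]) -> str:
        return "".join(self.inverse_vocab[idx] for idx in ids)


# A tokenizer built from a vocab without padding or unknown tokens.
tokenizer = SimpleTokenizer(AA20_ONLY_EXAMPLE)
encoded = tokenizer.encode("ACDY")
decoded = tokenizer.decode(encoded)
```

In this sketch, a character outside the vocab raises an error when no unknown token is set, which is one reason the special tokens remain useful even though they are optional.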

Changes