ayushkarnawat / profit

Exploring evolutionary protein fitness landscapes
MIT License

<pad> and <unk> tokens should not be required #96

Closed ayushkarnawat closed 4 years ago

ayushkarnawat commented 4 years ago

Instead of using a hack to prevent certain vocab items from ever being predicted, it might be better to remove them from the dictionary altogether. This is not only more robust, but also leads to more accurate predictions, since we no longer have to zero out those entries in the predicted distributions after the fact.
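A minimal sketch of the idea (all names here are hypothetical, not the actual `profit` API): build the dictionary only from tokens observed in the training data, skipping special tokens like `<pad>` and `<unk>` entirely, so the model's output layer never has entries for them in the first place.

```python
# Hypothetical sketch: construct a vocab that excludes special tokens
# instead of masking them out at prediction time.
SPECIAL_TOKENS = {"<pad>", "<unk>"}


def build_vocab(sequences):
    """Map each token seen in the training sequences to an integer id,
    dropping special tokens so the model can never predict them."""
    tokens = sorted({tok for seq in sequences for tok in seq} - SPECIAL_TOKENS)
    return {tok: idx for idx, tok in enumerate(tokens)}


vocab = build_vocab([["M", "K", "<pad>"], ["K", "V", "<unk>"]])
# vocab now contains only the amino-acid tokens actually seen in training,
# e.g. {"K": 0, "M": 1, "V": 2}
```

Because the output dimension matches this reduced vocab, the softmax mass is distributed only over valid tokens and no post-hoc filtering is needed.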

_Originally posted by @ayushkarnawat in https://github.com/ayushkarnawat/profit/pull/95#discussion_r399429514_

ayushkarnawat commented 4 years ago

It is important to note that, to generate plausible sequences compatible with how the tokenizer/vocab works, we remove the vocab items not found in the original training dataset. This helps "steer" the softmax predictions so that they pick a vocab item from among those in the original dataset.
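The "steering" described above can be sketched as a logit mask (this is an illustrative implementation, not code from the repo): entries for vocab items absent from the training dataset are set to negative infinity, so they receive exactly zero probability after the softmax.

```python
import numpy as np


def steer_softmax(logits, allowed_ids):
    """Softmax restricted to allowed_ids: disallowed entries get
    probability exactly 0, so the argmax always lands in the
    training-set vocabulary."""
    masked = np.full_like(logits, -np.inf, dtype=float)
    masked[allowed_ids] = logits[allowed_ids]
    # Subtract the max over allowed entries for numerical stability.
    exps = np.exp(masked - masked[allowed_ids].max())
    return exps / exps.sum()


logits = np.array([2.0, 0.5, 1.0, 3.0])
probs = steer_softmax(logits, allowed_ids=[0, 2])
# probs[1] and probs[3] are exactly 0; probs sums to 1
```

Removing the items from the dictionary altogether (as proposed in this issue) makes this mask unnecessary, since the disallowed entries never exist in the output layer.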