pascalnotin closed this issue 10 months ago
We could use the ESM tokenizer? Unless people have strong preferences.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
```
OK, sounds good to me @jamaliki -- but this reminds me that we will need to be careful with preprocessing, since they seem to have kept indeterminate AAs (X, B, Z) and rare AAs (e.g., selenocysteine U) as part of the tokenizer vocabulary (https://huggingface.co/facebook/esm2_t33_650M_UR50D/resolve/main/vocab.txt) -- we may want to remove these (we can discuss separately).
Yes, we should discuss this. It is quite easy to change as well.
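For reference, a minimal sketch of the kind of preprocessing filter discussed above, assuming we simply drop sequences containing the indeterminate (X, B, Z) or rare (U) residues before tokenization; the function names here are illustrative, not part of the codebase:

```python
# Sketch of a preprocessing filter for non-standard amino acids.
# Assumes the policy is to drop whole sequences containing indeterminate
# (X, B, Z) or rare (U) residues before tokenization; names are illustrative.

STANDARD_AAS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 canonical amino acids

def is_standard(sequence: str) -> bool:
    """Return True if the sequence uses only the 20 canonical residues."""
    return set(sequence.upper()) <= STANDARD_AAS

def filter_sequences(sequences):
    """Keep only sequences free of indeterminate/rare residues."""
    return [s for s in sequences if is_standard(s)]

if __name__ == "__main__":
    seqs = ["MKTAYIAK", "MKXBZ", "MUSEQ"]
    print(filter_sequences(seqs))  # only "MKTAYIAK" survives
```

An alternative policy would be to map the ambiguous codes to their most common resolution (e.g. B to D/N) rather than dropping sequences, which is the trade-off worth discussing separately.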
Closing this issue post merge. Thank you @jamaliki !
Using the HF tokenizer to facilitate compatibility with the base model class