OpenBioML / protein-lm-scaling

Instantiate AA tokenizer (HF) #3

Closed (pascalnotin closed 10 months ago)

pascalnotin commented 11 months ago

Use a Hugging Face (HF) tokenizer to facilitate compatibility with the base model class.

jamaliki commented 11 months ago

We could use the ESM tokenizer? Unless people have strong preferences.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")

pascalnotin commented 11 months ago

Ok, sounds good to me @jamaliki. This reminds me that we will need to be careful with preprocessing: they seem to have kept indeterminate amino acids (X, B, Z) and rare amino acids (e.g. selenocysteine, U) in the tokenizer vocabulary (https://huggingface.co/facebook/esm2_t33_650M_UR50D/resolve/main/vocab.txt), and we may want to remove these (we can discuss separately).
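
As a rough illustration of the preprocessing concern above, the sketch below flags sequences containing residues outside the 20 canonical amino acids. The names `CANONICAL_AAS`, `noncanonical_residues`, and `filter_sequences` are hypothetical and not taken from the repository, and dropping such sequences (rather than masking or keeping them) is only one possible choice; the thread leaves that decision open.

```python
# Hypothetical preprocessing sketch; names are illustrative, not from the repo.
CANONICAL_AAS = frozenset("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def noncanonical_residues(sequence):
    """Return the set of characters in `sequence` that are not canonical amino acids."""
    return set(sequence.upper()) - CANONICAL_AAS


def filter_sequences(sequences):
    """Keep only sequences made up entirely of canonical amino acids."""
    return [seq for seq in sequences if not noncanonical_residues(seq)]


if __name__ == "__main__":
    # Toy sequences: the second contains X, the third contains U and B.
    examples = ["MKTAYIAKQR", "MKXAYIAKQR", "MUTAYIAKQB"]
    print(filter_sequences(examples))
```

Filtering at the sequence level like this leaves the tokenizer vocabulary untouched, so it could be combined with any tokenizer choice.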

jamaliki commented 11 months ago

Yes, we should discuss this. It is quite easy to change as well.
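
One way such a change might look, sketched in plain Python on a toy vocabulary: drop the unwanted residue tokens and reassign ids contiguously. The helper `prune_vocab` and the toy token list are hypothetical; this is not the repository's implementation and does not use the Hugging Face API.

```python
# Hypothetical sketch: prune unwanted residue tokens from a vocabulary list.
UNWANTED_TOKENS = frozenset({"X", "B", "Z", "U"})  # residues named in the discussion above


def prune_vocab(tokens, unwanted=UNWANTED_TOKENS):
    """Return a token -> id mapping with unwanted tokens removed and ids reassigned contiguously."""
    kept = [tok for tok in tokens if tok not in unwanted]
    return {tok: idx for idx, tok in enumerate(kept)}


if __name__ == "__main__":
    # Toy vocabulary for illustration only; not the actual ESM vocab file.
    toy_vocab = ["<pad>", "<unk>", "A", "C", "X", "U", "G"]
    print(prune_vocab(toy_vocab))
```

Reassigning ids is only safe when the model is trained from scratch; if pretrained embeddings were reused, the new ids would no longer line up with the original embedding rows.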

pascalnotin commented 10 months ago

Closing this issue post-merge. Thank you @jamaliki!