agemagician / ProtTrans

ProtTrans provides state-of-the-art pre-trained language models for proteins. ProtTrans was trained on thousands of GPUs from Summit and hundreds of Google TPUs using transformer models.
Academic Free License v3.0

ProtElectra Tokenizer #68

Closed · CeciLyu closed this issue 2 years ago

CeciLyu commented 2 years ago

Hi Rostlab,

Thank you for providing the pretrained models! I am trying to use the ProtElectra model to extract features. However, I am confused by the tokenizer output.

```python
from transformers import AutoTokenizer

test_seq = ['[PAD],[UNK],[CLS],[SEP],[MASK],L,A,G,V,E,S,I,K,R,D,T,P,N,Q,F,Y,M,H,C,W,X,U,B,Z,O']
test_seq[0] = test_seq[0].replace(',', ' ')[1:2*len(test_seq[0])]

tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_electra_discriminator_bfd", local_files_only=True)
print(tokenizer(test_seq))
```

This prints:

```
'input_ids': [2, 1, 1, 1, 2, 3, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3]
```

I don't understand why different amino acids get the same id. Am I using the tokenizer incorrectly?

Thanks, Suyue

mheinzinger commented 2 years ago

Hi Suyue,

Hm, from my end your code snippet works as expected if I use ElectraTokenizer, so maybe try the model-specific ElectraTokenizer instead of the generic AutoTokenizer. Alternatively/additionally, try re-downloading the vocab.
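As a quick sanity check, you can also dump the token-to-id mapping the tokenizer actually loaded (a minimal illustrative sketch; `get_vocab()` is the standard `transformers` accessor, the loop is just for inspection):

```python
# Print the loaded vocabulary sorted by id; every amino acid
# should map to its own distinct id (UNK is id 1 in this vocab).
from transformers import ElectraTokenizer

tokenizer = ElectraTokenizer.from_pretrained(
    "Rostlab/prot_electra_discriminator_bfd", do_lower_case=False
)
for token, idx in sorted(tokenizer.get_vocab().items(), key=lambda kv: kv[1]):
    print(idx, token)
```

If many amino acids collapse to id 1 here as well, the local vocab file is the culprit.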

```python
from transformers import ElectraTokenizer

vocab = ElectraTokenizer.from_pretrained("Rostlab/prot_electra_generator_bfd", do_lower_case=False)
test_seq = ['[PAD],[UNK],[CLS],[SEP],[MASK],L,A,G,V,E,S,I,K,R,D,T,P,N,Q,F,Y,M,H,C,W,X,U,B,Z,O']
test_seq[0] = test_seq[0].replace(',', ' ')[1:2*len(test_seq[0])]
print(vocab(test_seq))
```

Output:

```
{'input_ids': [[2, 1, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 3]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
```
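For completeness, here is a minimal sketch of the feature-extraction step the original question was building toward, assuming the public Rostlab/prot_electra_discriminator_bfd checkpoint and the standard `transformers` ElectraModel API (the example sequence and the printed shape are illustrative, not from this thread):

```python
import torch
from transformers import ElectraTokenizer, ElectraModel

tokenizer = ElectraTokenizer.from_pretrained(
    "Rostlab/prot_electra_discriminator_bfd", do_lower_case=False
)
model = ElectraModel.from_pretrained("Rostlab/prot_electra_discriminator_bfd")
model.eval()

# Amino acids must be space-separated, matching the training format.
sequence = "M K T A Y I A K Q R"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One embedding per token, including the [CLS]/[SEP] specials.
embeddings = outputs.last_hidden_state
print(embeddings.shape)  # e.g. torch.Size([1, 12, 1024])
```

The per-residue features are the rows of `last_hidden_state` between the special tokens; drop the first and last positions if you only want the amino acids.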