Closed CeciLyu closed 2 years ago
Hi Suyue,
Hm, from my end your code snippet works as expected if I use ElectraTokenizer, so maybe try the specific tokenizer class instead of the generic AutoTokenizer. Alternatively (or additionally), try re-downloading the vocab file.
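For intuition on the all-1 output: a word-level tokenizer simply looks each whitespace-split token up in its vocab and falls back to the [UNK] id for anything it does not find, so a stale or corrupt vocab file maps everything to 1. A minimal sketch (not the actual transformers internals; the toy vocab and the `encode` helper below are made up for illustration):

```python
# Toy word-level vocab; ids chosen to mirror the ProtElectra special tokens
# ([UNK] = 1, [CLS] = 2, [SEP] = 3). G and V are deliberately left out.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "L": 5, "A": 6}

def encode(text, vocab, unk_id=1):
    # Wrap the sequence in [CLS] ... [SEP], like the real tokenizer does,
    # and map every token missing from the vocab to the [UNK] id.
    ids = [vocab["[CLS]"]]
    ids += [vocab.get(tok, unk_id) for tok in text.split()]
    ids.append(vocab["[SEP]"])
    return ids

print(encode("L A G V", vocab))  # -> [2, 5, 6, 1, 1, 3]
```

If the vocab that AutoTokenizer loads is broken or empty, every amino acid falls through to the [UNK] branch, which matches the stream of 1s in the discriminator output below.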
from transformers import ElectraTokenizer

vocab = ElectraTokenizer.from_pretrained("Rostlab/prot_electra_generator_bfd", do_lower_case=False)
test_seq = ['[PAD],[UNK],[CLS],[SEP],[MASK],L,A,G,V,E,S,I,K,R,D,T,P,N,Q,F,Y,M,H,C,W,X,U,B,Z,O']
test_seq[0] = test_seq[0].replace(',', ' ')[1:2*len(test_seq[0])]
print(vocab(test_seq))

Output:

{'input_ids': [[2, 1, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 3]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
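One editorial side note on the snippet itself (pure Python, no transformers needed): the slice `[1:2*len(test_seq[0])]` drops the leading `[`, mutilating `[PAD]` into `PAD]`, which is why even the working ElectraTokenizer output starts with several [UNK] ids (1) before the recognized specials. The end index `2*len(...)` exceeds the string length, so it just means "to the end". A shortened sequence shows the effect:

```python
# Same preprocessing as in the issue, on a shortened sequence string.
test_seq = ['[PAD],[UNK],[CLS],[SEP],[MASK],L,A,G,V,E']
s = test_seq[0].replace(',', ' ')[1:2*len(test_seq[0])]
# The leading '[' is gone, so '[PAD]' is no longer a recognizable special token.
print(s)  # -> 'PAD] [UNK] [CLS] [SEP] [MASK] L A G V E'
```

Dropping the slice entirely (`test_seq[0].replace(',', ' ')`) keeps `[PAD]` intact and avoids the spurious leading [UNK]s.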
Hi Rostlab,
Thank you for providing the pretrained models! I am trying to use the ProtElectra model to extract features. However, I am confused by the tokenizer output.
from transformers import AutoTokenizer

test_seq = ['[PAD],[UNK],[CLS],[SEP],[MASK],L,A,G,V,E,S,I,K,R,D,T,P,N,Q,F,Y,M,H,C,W,X,U,B,Z,O']
test_seq[0] = test_seq[0].replace(',', ' ')[1:2*len(test_seq[0])]
tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_electra_discriminator_bfd", local_files_only=True)
print(tokenizer(test_seq))
Output (input_ids only):

'input_ids': [2, 1, 1, 1, 2, 3, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3]
I don't understand why different amino acids get the same id. Am I using the tokenizer incorrectly?
Thanks, Suyue