agemagician / ProtTrans

ProtTrans provides state-of-the-art pre-trained language models for proteins. ProtTrans was trained on thousands of GPUs from Summit and hundreds of Google TPUs using transformer models.
Academic Free License v3.0

ProtElectra Tokenizer #68

Closed · CeciLyu closed this issue 2 years ago

CeciLyu commented 2 years ago

Hi Rostlab,

Thank you for providing the pretrained models! I am trying to use the ProtElectra model to extract features. However, I am confused by the tokenizer output.

```python
from transformers import AutoTokenizer

test_seq = ['[PAD],[UNK],[CLS],[SEP],[MASK],L,A,G,V,E,S,I,K,R,D,T,P,N,Q,F,Y,M,H,C,W,X,U,B,Z,O']
test_seq[0] = test_seq[0].replace(',', ' ')[1:2*len(test_seq[0])]

tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_electra_discriminator_bfd", local_files_only=True)
print(tokenizer(test_seq))
```

This prints:

```
'input_ids': [2, 1, 1, 1, 2, 3, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3]
```

I don't understand why different amino acids get the same id. Am I using the tokenizer incorrectly?

Thanks, Suyue

mheinzinger commented 2 years ago

Hi Suyue,

Hm, from my end your code snippet works as expected if I use ElectraTokenizer, so maybe try the model-specific ElectraTokenizer instead of the generic AutoTokenizer. Alternatively/additionally, try re-downloading the vocab.
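As a quick sanity check, you can also dump the token-to-id mapping the tokenizer actually loaded (a minimal illustrative sketch; `get_vocab()` is the standard `transformers` accessor, the loop is just for inspection):

```python
# Print the loaded vocabulary sorted by id; every amino acid
# should map to its own distinct id (UNK is id 1 in this vocab).
from transformers import ElectraTokenizer

tokenizer = ElectraTokenizer.from_pretrained(
    "Rostlab/prot_electra_discriminator_bfd", do_lower_case=False
)
for token, idx in sorted(tokenizer.get_vocab().items(), key=lambda kv: kv[1]):
    print(idx, token)
```

If many amino acids collapse to id 1 here as well, the local vocab file is the culprit.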

```python
from transformers import ElectraTokenizer

vocab = ElectraTokenizer.from_pretrained("Rostlab/prot_electra_generator_bfd", do_lower_case=False)
test_seq = ['[PAD],[UNK],[CLS],[SEP],[MASK],L,A,G,V,E,S,I,K,R,D,T,P,N,Q,F,Y,M,H,C,W,X,U,B,Z,O']
test_seq[0] = test_seq[0].replace(',', ' ')[1:2*len(test_seq[0])]
print(vocab(test_seq))
```

Output:

```
{'input_ids': [[2, 1, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 3]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
```
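For completeness, here is a minimal sketch of the feature-extraction step the original question was building toward, assuming the public Rostlab/prot_electra_discriminator_bfd checkpoint and the standard `transformers` ElectraModel API (the example sequence and the printed shape are illustrative, not from this thread):

```python
import torch
from transformers import ElectraTokenizer, ElectraModel

tokenizer = ElectraTokenizer.from_pretrained(
    "Rostlab/prot_electra_discriminator_bfd", do_lower_case=False
)
model = ElectraModel.from_pretrained("Rostlab/prot_electra_discriminator_bfd")
model.eval()

# Amino acids must be space-separated, matching the training format.
sequence = "M K T A Y I A K Q R"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One embedding per token, including the [CLS]/[SEP] specials.
embeddings = outputs.last_hidden_state
print(embeddings.shape)  # e.g. torch.Size([1, 12, 1024])
```

The per-residue features are the rows of `last_hidden_state` between the special tokens; drop the first and last positions if you only want the amino acids.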