Tiiiger / bert_score

BERT score for text generation
MIT License

Is the tokenizer max length correct? #182

Open ruiguo-bio opened 2 months ago

ruiguo-bio commented 2 months ago

If I use the distilbert-base-uncased model with transformers version 4.40, the tokenizer ends up with model_max_length = 1000000000000000019884624838656 at utils.py line 216:

DistilBertTokenizer(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True), added_tokens_decoder={ 0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), }
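For context, that huge number is int(1e30), the placeholder value transformers falls back to when no model_max_length is found in the tokenizer config, rather than a real limit. Below is a minimal sketch (not bert_score's actual code; the fallback logic is my own assumption) of how one could detect the sentinel and fall back to the positional-embedding limit from the model config:

```python
from transformers import AutoConfig, AutoTokenizer

# transformers uses int(1e30) as a placeholder when the tokenizer config
# does not specify model_max_length; int(1e30) == 1000000000000000019884624838656.
VERY_LARGE_INTEGER = int(1e30)

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)

max_length = tokenizer.model_max_length
if max_length >= VERY_LARGE_INTEGER:
    # Hypothetical fallback: use the model's positional-embedding limit
    # (512 for DistilBERT) instead of the unbounded placeholder.
    max_length = config.max_position_embeddings

print(max_length)  # 512
```

Whether bert_score should apply such a fallback internally, or truncate inputs at a fixed length, is up to the maintainers; the sketch only illustrates where the surprising value comes from.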

hemengjita commented 1 month ago

Same question here, haha.