MinishLab / model2vec

Distill a Small Static Model from any Sentence Transformer
https://minishlab.github.io/
MIT License

Fix distill model bos and eos token #78

Closed zechengz closed 1 month ago

zechengz commented 1 month ago

The bos_token_id and eos_token_id appear to be reversed in the create_output_embeddings_from_model_name function within model2vec/distill/inference.py.

When testing with the following code:

from transformers import AutoModel, AutoTokenizer
from model2vec.distill import distill_from_model
model_name = "bert-base-uncased"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
m2v_model = distill_from_model(model=model, tokenizer=tokenizer, pca_dims=256)
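
For reference, encoding any string with the BERT tokenizer shows which special token ids bracket a sequence; the first id plays the bos role and the last the eos role (a quick sanity check, assuming the setup above):

>>> ids = tokenizer.encode("hello")
>>> ids[0], ids[-1]
(101, 102)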

Before the fix, bos_token_id is 102 and eos_token_id is 101, but for the BERT tokenizer the correct bos_token_id and eos_token_id should be:

>>> tokenizer.cls_token_id
101
>>> tokenizer.sep_token_id
102
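
The fix swaps the two ids back. As a minimal sketch of the intended mapping (not the actual implementation in model2vec/distill/inference.py; the helper name is hypothetical), a BERT-style tokenizer that defines no explicit bos/eos tokens should fall back to cls for bos and sep for eos:

from transformers import AutoTokenizer

# Hypothetical helper illustrating the intended mapping, not model2vec's code:
# fall back to cls_token_id for bos and sep_token_id for eos when the tokenizer
# does not define bos/eos tokens (as with bert-base-uncased).
def resolve_bos_eos(tokenizer):
    bos = tokenizer.bos_token_id if tokenizer.bos_token_id is not None else tokenizer.cls_token_id
    eos = tokenizer.eos_token_id if tokenizer.eos_token_id is not None else tokenizer.sep_token_id
    return bos, eos

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(resolve_bos_eos(tokenizer))  # (101, 102)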