The bos_token_id and eos_token_id appear to be reversed in the create_output_embeddings_from_model_name function within model2vec/distill/inference.py.
I tested with the following code:
from transformers import AutoModel, AutoTokenizer
from model2vec.distill import distill_from_model
model_name = "bert-base-uncased"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
m2v_model = distill_from_model(model=model, tokenizer=tokenizer, pca_dims=256)
Before the fix, bos_token_id is 102 and eos_token_id is 101, but for the BERT tokenizer the correct bos_token_id and eos_token_id should be 101 ([CLS]) and 102 ([SEP]), respectively.
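As a quick sanity check of the expected ordering, here is a minimal sketch that does not touch model2vec itself. The helper name special_token_ids and the stub tokenizer are hypothetical and only for illustration; it assumes that for a BERT-style tokenizer the sequence starts with [CLS] and ends with [SEP], which is how the ids are mapped here.

```python
from types import SimpleNamespace


def special_token_ids(tokenizer):
    """Return (bos_token_id, eos_token_id) for a BERT-style tokenizer.

    Hypothetical helper, not part of model2vec: BERT has no dedicated
    BOS/EOS tokens, so [CLS] is assumed to open the sequence and [SEP]
    to close it.
    """
    bos_token_id = tokenizer.cls_token_id  # [CLS] opens the sequence
    eos_token_id = tokenizer.sep_token_id  # [SEP] closes the sequence
    return bos_token_id, eos_token_id


# Stand-in for the bert-base-uncased tokenizer, where [CLS] is 101 and [SEP] is 102.
stub_tokenizer = SimpleNamespace(cls_token_id=101, sep_token_id=102)

bos_token_id, eos_token_id = special_token_ids(stub_tokenizer)
print(bos_token_id, eos_token_id)
```

With the real tokenizer loaded in the snippet above, the same attributes (tokenizer.cls_token_id and tokenizer.sep_token_id) could be compared against whatever the distillation code computes.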