UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

unsupervised learning -tsda #894

Open ReySadeghi opened 3 years ago

ReySadeghi commented 3 years ago

Hi, I used the TSDAE method to pretrain a BERT model on a corpus of sentences and I got this error:

RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)

I then ran CUDA_LAUNCH_BLOCKING=1 python [YOUR_PROGRAM] to trace the error and got this:

RuntimeError: CUDA error: device-side assert triggered

any help?

nreimers commented 3 years ago

Looks like some issue with CUDA. Don't know how to fix it

kwang2049 commented 3 years ago

Hi ReySadeghi, could you please run on CPU and see whether there is still a problem?
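
A minimal way to force such a CPU-only run, as a generic sketch (the environment variable and the device argument are standard PyTorch / sentence-transformers options, not something specific to this issue):

import os
os.environ['CUDA_VISIBLE_DEVICES'] = ''  # hide all GPUs; must be set before CUDA is initialized

# Alternatively, build the SentenceTransformer explicitly on the CPU:
# model = SentenceTransformer(modules=[word_embedding_model, pooling_model], device='cpu')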

ReySadeghi commented 3 years ago

> Hi ReySadeghi, could you please run on CPU and see whether there is still a problem?

Hi, in one case I tried it and got this error: IndexError: list index out of range

and in the other cases I tried, the RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED still remains.

kwang2049 commented 3 years ago

Could you please paste here the whole training script and also the whole log?

ReySadeghi commented 3 years ago

> Could you please paste here the whole training script and also the whole log?

training script:

from sentence_transformers import SentenceTransformer, LoggingHandler
from sentence_transformers import models, util, datasets, evaluation, losses
from torch.utils.data import DataLoader

import nltk

vocab = []
with open('vocab30k.txt', mode='r', encoding="utf8", errors='ignore') as file2:
    for line2 in file2:
        line2 = line2.split('\n')[0]
        line2 = line2.strip()
        vocab.append(line2)

vocab = vocab[:10000]

model_name = 'HooshvareLab/bert-fa-base-uncased'
word_embedding_model = models.Transformer(model_name, max_seq_length=250)

word_embedding_model.tokenizer.add_tokens(vocab)
word_embedding_model.auto_model.resize_token_embeddings(len(word_embedding_model.tokenizer))

pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=False,
                               pooling_mode_cls_token=True,
                               pooling_mode_max_tokens=False)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

train_sentences = []
with open('fa5M_shuffeled.txt', mode='r', encoding="utf8", errors='ignore') as file2:
    for line2 in file2:
        line2 = line2.split('\n')[0]
        line2 = line2.strip()
        train_sentences.append(line2)

train_sentences = train_sentences[:2000000]

train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)
train_loss = losses.DenoisingAutoEncoderLoss(model, decoder_name_or_path=model_name, tie_encoder_decoder=True)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    weight_decay=0,
    scheduler='constantlr',
    optimizer_params={'lr': 3e-5},
    show_progress_bar=True
)

My CUDA version: 11.3

The error:

lib/python3.7/site-packages/pandas/compat/__init__.py:120: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
  warnings.warn(msg)
Some weights of the model checkpoint at HooshvareLab/bert-fa-base-uncased were not used when initializing BertLMHeadModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']

nreimers commented 3 years ago

Does it work when you use bert-base-uncased?

Also check that you have recent versions of PyTorch and transformers.
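
A quick way to check the installed versions (a small sketch, nothing specific to this issue):

import torch
import transformers
import sentence_transformers

print('torch:', torch.__version__)
print('transformers:', transformers.__version__)
print('sentence-transformers:', sentence_transformers.__version__)
print('CUDA available:', torch.cuda.is_available())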

ReySadeghi commented 3 years ago

> Does it work when you use bert-base-uncased?
>
> Also check that you have recent versions of PyTorch and transformers.

I edited it; actually, the model name is 'HooshvareLab/bert-fa-base-uncased'.

kwang2049 commented 3 years ago

Thanks for reporting this issue! We have located the bug: when one adds tokens to the encoder's lookup table, the _tie_encoder_decoder_weights function ties the weights between encoder and decoder and thus resets the encoder's lookup table to the original one (since the decoder is initialized from the original checkpoint). We have found the solution and will fix it soon. The future version will initialize the decoder from encoder.config._name_or_path if tie_encoder_decoder=True and will contain more checks.
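
To illustrate the mismatch (a hedged sketch of the failure mode, not the library's actual fix): the tokenizer keeps emitting ids for the added tokens, but once the tied decoder restores an embedding table of the original size, those ids index past its end, which shows up as the srcIndex < srcSelectDimSize assert on GPU or an index error on CPU.

from transformers import AutoTokenizer, AutoModel

model_name = 'HooshvareLab/bert-fa-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name)

original_vocab_size = encoder.get_input_embeddings().weight.shape[0]
tokenizer.add_tokens(['<new_word_1>', '<new_word_2>'])
encoder.resize_token_embeddings(len(tokenizer))

# If weight tying later brings the embedding table back to original_vocab_size,
# any id >= original_vocab_size produced by the tokenizer is out of range.
new_id = tokenizer.convert_tokens_to_ids('<new_word_2>')
print(new_id, '>=', original_vocab_size)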

ReySadeghi commented 3 years ago

> Thanks for reporting this issue! We have located the bug: when one adds tokens to the encoder's lookup table, the _tie_encoder_decoder_weights function ties the weights between encoder and decoder and thus resets the encoder's lookup table to the original one (since the decoder is initialized from the original checkpoint). We have found the solution and will fix it soon. The future version will initialize the decoder from encoder.config._name_or_path if tie_encoder_decoder=True and will contain more checks.

Thanks. Please inform me when the bug is fixed.

kwang2049 commented 3 years ago


> Thanks. Please inform me when the bug is fixed.

Hi, ReySadeghi. The bug has been fixed since this commit: https://github.com/UKPLab/sentence-transformers/commit/022b2ddb790a45be821066f7ff35f4b375a6cd97. So please git clone the latest version and pip install -e . to try it :).

ReySadeghi commented 3 years ago

@kwang2049 Hi, I tried the latest version. Running on CPU is OK, but on GPU I got this error:

Traceback (most recent call last):
  File "finetune_tsda.py", line 53, in <module>
    show_progress_bar=True
  File "/usr/local/lib/python3.7/site-packages/sentence_transformers/SentenceTransformer.py", line 567, in fit
    loss_value = loss_model(features, labels)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/sentence_transformers/losses/DenoisingAutoEncoderLoss.py", line 90, in forward
    reps = self.encoder(source_features)['sentence_embedding']  # (bsz, hdim)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/sentence_transformers/models/Transformer.py", line 38, in forward
    output_states = self.auto_model(**trans_features, return_dict=False)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 981, in forward
    return_dict=return_dict,
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 575, in forward
    output_attentions,
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 461, in forward
    past_key_value=self_attn_past_key_value,
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 394, in forward
    output_attentions,
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 253, in forward
    mixed_query_layer = self.query(hidden_states)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 91, in forward
    return F.linear(input, self.weight, self.bias)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/functional.py", line 1676, in linear
    output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

I then tried "CUDA_LAUNCH_BLOCKING=1 python3.7 script.py" for a more detailed stack trace and got:

] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [171,0,0], thread: [125,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [171,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [171,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Epoch:   0%|          | 0/6 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "finetune_tsda.py", line 53, in <module>
    show_progress_bar=True
  File "/usr/local/lib/python3.7/site-packages/sentence_transformers/SentenceTransformer.py", line 567, in fit
    loss_value = loss_model(features, labels)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/sentence_transformers/losses/DenoisingAutoEncoderLoss.py", line 90, in forward
    reps = self.encoder(source_features)['sentence_embedding']  # (bsz, hdim)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/sentence_transformers/models/Transformer.py", line 38, in forward
    output_states = self.auto_model(**trans_features, return_dict=False)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 969, in forward
    past_key_values_length=past_key_values_length,
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 204, in forward
    embeddings = inputs_embeds + token_type_embeddings
RuntimeError: CUDA error: device-side assert triggered

kwang2049 commented 3 years ago

> @kwang2049 Hi, I tried the latest version. Running on CPU is OK, but on GPU I got this error: [same stack traces as in the previous comment]

Are you using the same script? Please try the code below:

from sentence_transformers import SentenceTransformer
from sentence_transformers import models, datasets, losses
from torch.utils.data import DataLoader

model_name = 'HooshvareLab/bert-fa-base-uncased'
word_embedding_model = models.Transformer(model_name, max_seq_length=250)

existing_word = list(word_embedding_model.tokenizer.vocab.keys())[1000]
vocab = ['<new_word_1>', '<new_word_2>', '<سلامسلام>', existing_word]

print('Before:', word_embedding_model.auto_model.embeddings)
word_embedding_model.tokenizer.add_tokens(vocab)
word_embedding_model.auto_model.resize_token_embeddings(len(word_embedding_model.tokenizer))
print('Now:', word_embedding_model.auto_model.embeddings)

pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode_mean_tokens=False, pooling_mode_cls_token=True, pooling_mode_max_tokens=False)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

train_sentences=[
    'A sentence containing <new_word_1> and <new_word_2>.', 
    'A sentence containing only <new_word_2>.', 
    'A sentence containing <سلامسلام>', 
    f'A sentence containing {existing_word}'
]

train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)
train_loss = losses.DenoisingAutoEncoderLoss(model, decoder_name_or_path=model_name, tie_encoder_decoder=True)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    weight_decay=0,
    scheduler='constantlr',
    optimizer_params={'lr': 3e-5},
    show_progress_bar=True
)

This works fine on my server. If this does not work on your side, then I think it is either because you have the wrong version of the SBERT repo (I pass the test above using sentence-transformers==1.1.1) or because of a CUDA problem.

And if this also works on your side, then I think it is related to one of the new words/tokens. You can locate it like this: iterate over all the new words, create a sentence containing each of them, and fit the TSDAE model on each. Your machine may throw an exception at a certain point; if that happens, please tell us which token it is.
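
A rough sketch of that search loop, assuming the same model, model_name, and new-token list (here called vocab) as in the training script above:

from torch.utils.data import DataLoader
from sentence_transformers import datasets, losses

for token in vocab:
    sentence = f'A sentence containing {token}.'
    train_dataset = datasets.DenoisingAutoEncoderDataset([sentence])
    train_dataloader = DataLoader(train_dataset, batch_size=1, shuffle=False)
    train_loss = losses.DenoisingAutoEncoderLoss(model, decoder_name_or_path=model_name,
                                                 tie_encoder_decoder=True)
    try:
        model.fit(train_objectives=[(train_dataloader, train_loss)],
                  epochs=1, show_progress_bar=False)
    except RuntimeError as exc:
        print(f'Failed on token {token!r}: {exc}')
        break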

ReySadeghi commented 3 years ago

Yes, I used the latest version of SBERT and the same script, but I still get the error.

I also got this warning; could it cause the problem?

/lib/python3.7/site-packages/pandas/compat/__init__.py:120: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.

kwang2049 commented 3 years ago

> Yes, I used the latest version of SBERT and the same script, but I still get the error.
>
> I also got this warning; could it cause the problem?
>
> /lib/python3.7/site-packages/pandas/compat/__init__.py:120: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.

Could you please run the code snippet mentioned above? Your warning seems to have nothing to do with the SBERT repo, since the pandas package is not required.

ReySadeghi commented 3 years ago

Yeah, it's solved. Sorry, the latest version hadn't been installed correctly. Thanks!

ReySadeghi commented 3 years ago

@nreimers Does the code support running on multiple GPUs?

ReySadeghi commented 3 years ago

@kwang2049 @nreimers Hi, I ran the code snippet mentioned above to add 10k new tokens. After 1 epoch of training, when I want to use the saved model to vectorize sentences, I get this error:

AssertionError: Non-consecutive added token '#سلام' found. Should have index 100005 but has index 100006 in saved vocabulary.

ReySadeghi commented 3 years ago

@nreimers Hi, I used the TSDAE code to train my model, but it doesn't give me any information about the training loss during training.

nreimers commented 3 years ago

The training loss is not computed and plotted during training.
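
If you need a rough progress signal anyway, one workaround (a sketch, not a library feature) is to wrap the loss module so every call made by fit() also prints its value:

import torch.nn as nn

class LoggingLoss(nn.Module):
    """Wraps any sentence-transformers loss and prints its value every `log_every` steps."""
    def __init__(self, loss, log_every=100):
        super().__init__()
        self.loss = loss
        self.log_every = log_every
        self.step = 0

    def forward(self, sentence_features, labels):
        value = self.loss(sentence_features, labels)
        self.step += 1
        if self.step % self.log_every == 0:
            print(f'step {self.step}: loss {value.item():.4f}')
        return value

# train_loss = LoggingLoss(losses.DenoisingAutoEncoderLoss(model, decoder_name_or_path=model_name,
#                                                          tie_encoder_decoder=True))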

kwang2049 commented 3 years ago

> @kwang2049 @nreimers Hi, I ran the code snippet mentioned above to add 10k new tokens. After 1 epoch of training, when I want to use the saved model to vectorize sentences, I get this error:
>
> AssertionError: Non-consecutive added token '#نوید_افکاری' found. Should have index 100005 but has index 100006 in saved vocabulary.

Hi @ReySadeghi, I cannot reproduce it: I found that the SBERT checkpoint with added tokens loads successfully. Before a more detailed conversation, could you please run this check (to see whether the assertion error still occurs without TSDAE):

from sentence_transformers import SentenceTransformer
from sentence_transformers import models

model_name = 'HooshvareLab/bert-fa-base-uncased'
word_embedding_model = models.Transformer(model_name, max_seq_length=250)

existing_word = list(word_embedding_model.tokenizer.vocab.keys())[1000]
vocab = ['<new_word_1>', '<new_word_2>', '<سلامسلام>', existing_word, '<new_subword111>', '<new_subword222>']

print('Before:', word_embedding_model.auto_model.embeddings)
word_embedding_model.tokenizer.add_tokens(vocab)
word_embedding_model.auto_model.resize_token_embeddings(len(word_embedding_model.tokenizer))
print('Now:', word_embedding_model.auto_model.embeddings)

pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode_mean_tokens=False, pooling_mode_cls_token=True, pooling_mode_max_tokens=False)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

train_sentences=[
    'A sentence containing <new_word_1> and <new_word_2>.', 
    'A sentence containing only <new_word_2>.', 
    'A sentence containing <سلامسلام>', 
    f'A sentence containing {existing_word}',
    'A sentence containing <new_subword111>xxx, my<new_subword222>yyyu'
]

model.save('sbert_tokens_added')
model = SentenceTransformer('sbert_tokens_added')
print([model[0].tokenizer.tokenize(sentence) for sentence in train_sentences])

If running this new snippet also reports the error, I think it might be related to your transformers version. And if this works well, you can change the vocab variable above into your new token list and try again.

ReySadeghi commented 3 years ago

> @kwang2049 @nreimers Hi, I ran the code snippet mentioned above to add 10k new tokens. After 1 epoch of training, when I want to use the saved model to vectorize sentences, I get this error: AssertionError: Non-consecutive added token '#نوید_افکاری' found. Should have index 100005 but has index 100006 in saved vocabulary.

> Hi @ReySadeghi, I cannot reproduce it: [code snippet and instructions quoted from the previous comment] If running this new snippet also reports the error, I think it might be related to your transformers version. And if this works well, you can change the vocab variable above into your new token list and try again.

I tried this and it was OK. Actually, I think the problem was due to some tokens that weren't valid UTF-8; when I removed them, the problem was solved.
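
For reference, a hedged sketch of that kind of pre-filtering (reading the vocabulary file strictly as UTF-8, skipping lines that fail to decode, and dropping duplicates before add_tokens); the file name follows the earlier script:

clean_vocab, seen = [], set()
with open('vocab30k.txt', 'rb') as f:
    for raw_line in f:
        try:
            token = raw_line.decode('utf-8').strip()
        except UnicodeDecodeError:
            continue  # skip entries that are not valid UTF-8
        if token and token not in seen:
            seen.add(token)
            clean_vocab.append(token)

word_embedding_model.tokenizer.add_tokens(clean_vocab)
word_embedding_model.auto_model.resize_token_embeddings(len(word_embedding_model.tokenizer))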