ReySadeghi opened this issue 3 years ago
It looks like an issue with CUDA. I don't know how to fix it.
Hi ReySadeghi, could you please run on CPU and see whether there is still a problem?
Hi, in one case I tried it and got this error: IndexError: list index out of range
In the other cases I tried, RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED still remains.
Could you please paste here the whole training script and also the whole log?
training script:
from sentence_transformers import SentenceTransformer, LoggingHandler
from sentence_transformers import models, util, datasets, evaluation, losses
from torch.utils.data import DataLoader
import nltk
# Read the new tokens (one per line) and keep the first 10k
vocab = []
with open('vocab30k.txt', mode='r', encoding="utf8", errors='ignore') as file2:
    for line2 in file2:
        line2 = line2.split('\n')[0]
        line2 = line2.strip()
        vocab.append(line2)
vocab = vocab[:10000]
model_name = 'HooshvareLab/bert-fa-base-uncased'
word_embedding_model = models.Transformer(model_name, max_seq_length=250)
# Add the new tokens and grow the embedding matrix to the new vocabulary size
word_embedding_model.tokenizer.add_tokens(vocab)
word_embedding_model.auto_model.resize_token_embeddings(len(word_embedding_model.tokenizer))
# [CLS]-token pooling on top of the encoder
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode_mean_tokens=False, pooling_mode_cls_token=True, pooling_mode_max_tokens=False)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
# Read the training corpus (one sentence per line) and keep the first 2M sentences
train_sentences = []
with open('fa5M_shuffeled.txt', mode='r', encoding="utf8", errors='ignore') as file2:
    for line2 in file2:
        line2 = line2.split('\n')[0]
        line2 = line2.strip()
        train_sentences.append(line2)
train_sentences = train_sentences[:2000000]
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)
train_loss = losses.DenoisingAutoEncoderLoss(model, decoder_name_or_path=model_name, tie_encoder_decoder=True)
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    weight_decay=0,
    scheduler='constantlr',
    optimizer_params={'lr': 3e-5},
    show_progress_bar=True
)
My CUDA version: 11.3
The error:
lib/python3.7/site-packages/pandas/compat/__init__.py:120: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
  warnings.warn(msg)
Some weights of the model checkpoint at HooshvareLab/bert-fa-base-uncased were not used when initializing BertLMHeadModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
Does it work when you use bert-base-uncased?
Also check that you have recent versions of PyTorch and transformers.
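To answer the version question quickly, the installed versions can be read from package metadata without importing the packages themselves. This is a minimal sketch that is not part of either library. It assumes the PyPI distribution names torch, transformers and sentence-transformers. It uses importlib.metadata, which needs Python 3.8+; on the Python 3.7 seen in this thread, the importlib_metadata backport provides the same version function and PackageNotFoundError. The helper name installed_versions is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional


def installed_versions(dists: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if it is not installed."""
    result: Dict[str, Optional[str]] = {}
    for name in dists:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


if __name__ == "__main__":
    # Distribution names are assumptions; adjust if your packages are named differently.
    for name, ver in installed_versions(["torch", "transformers", "sentence-transformers"]).items():
        print(f"{name}: {ver or 'not installed'}")
```

Reading metadata instead of importing avoids triggering CUDA initialization just to print a version string.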
I edited it; the actual model name is 'HooshvareLab/bert-fa-base-uncased'.
Thanks for reporting this issue!
We have located the bug: when tokens are added to the encoder's lookup table, the _tie_encoder_decoder_weights function ties the weights between the encoder and the decoder and thus reverts the encoder's lookup table to the original one (since the decoder is initialized from the original checkpoint). We have found a solution and will fix it soon. The future version will initialize the decoder from encoder.config._name_or_path if tie_encoder_decoder=True, and will include more checks.
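To make the mechanism concrete, here is a toy model in plain Python (no PyTorch) in which an embedding table is a list of rows. It only illustrates the explanation above, under the assumption that tying means the encoder ends up sharing the decoder's table. Resizing the encoder's table for the added tokens and then tying it to a decoder table built from the original checkpoint leaves the new token IDs pointing past the end of the shared table. All class and function names are hypothetical.

```python
from typing import List


class ToyEmbedding:
    """Toy lookup table: one row (a single float) per token ID."""

    def __init__(self, num_rows: int) -> None:
        self.rows: List[float] = [float(i) for i in range(num_rows)]

    def resize(self, num_rows: int) -> None:
        # Append zero rows for newly added tokens (the role resize_token_embeddings plays).
        self.rows.extend(0.0 for _ in range(num_rows - len(self.rows)))

    def lookup(self, token_id: int) -> float:
        if not 0 <= token_id < len(self.rows):
            raise IndexError(f"token id {token_id} is outside a table of {len(self.rows)} rows")
        return self.rows[token_id]


class ToyModel:
    def __init__(self, embedding: ToyEmbedding) -> None:
        self.embedding = embedding


def tie_encoder_to_decoder(encoder: ToyModel, decoder: ToyModel) -> None:
    # After tying, both models share ONE table object, and it is the decoder's.
    encoder.embedding = decoder.embedding


ORIGINAL_VOCAB = 100
encoder = ToyModel(ToyEmbedding(ORIGINAL_VOCAB))
encoder.embedding.resize(ORIGINAL_VOCAB + 4)      # 4 added tokens get IDs 100..103
decoder = ToyModel(ToyEmbedding(ORIGINAL_VOCAB))  # built from the original checkpoint
tie_encoder_to_decoder(encoder, decoder)

print(len(encoder.embedding.rows))  # 100: the resize has been undone
try:
    encoder.embedding.lookup(102)   # ID of an added token
except IndexError as err:
    print("lookup failed:", err)
```

In this toy, building the decoder with ORIGINAL_VOCAB + 4 rows before tying makes the lookup of ID 102 succeed.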
Thanks, please let me know when the bug is fixed.
Hi ReySadeghi, the bug has been fixed as of this commit: https://github.com/UKPLab/sentence-transformers/commit/022b2ddb790a45be821066f7ff35f4b375a6cd97
Please git clone the latest version and run pip install -e . to try it :)
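After pip install -e . it can help to confirm that Python actually imports the cloned copy rather than an older install in site-packages. This is a minimal standard-library sketch, and the helper name module_origin is hypothetical. For an editable install, the printed path should point inside the cloned repository; it returns None if the module cannot be found at all.

```python
import importlib.util
from typing import Optional


def module_origin(name: str) -> Optional[str]:
    """Return the file a top-level module would be imported from, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    print(module_origin("sentence_transformers"))
```

find_spec locates the module without executing it, so this check works even if importing the package would fail.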
@kwang2049 Hi, I tried the latest version. Running on CPU is OK, but on GPU I got this error:
Traceback (most recent call last):
  File "finetune_tsda.py", line 53, in <module>
    show_progress_bar=True
  File "/usr/local/lib/python3.7/site-packages/sentence_transformers/SentenceTransformer.py", line 567, in fit
    loss_value = loss_model(features, labels)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/sentence_transformers/losses/DenoisingAutoEncoderLoss.py", line 90, in forward
    reps = self.encoder(source_features)['sentence_embedding'] # (bsz, hdim)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/sentence_transformers/models/Transformer.py", line 38, in forward
    output_states = self.auto_model(**trans_features, return_dict=False)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 981, in forward
    return_dict=return_dict,
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 575, in forward
    output_attentions,
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 461, in forward
    past_key_value=self_attn_past_key_value,
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 394, in forward
    output_attentions,
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 253, in forward
    mixed_query_layer = self.query(hidden_states)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 91, in forward
    return F.linear(input, self.weight, self.bias)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/functional.py", line 1676, in linear
    output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)
Then I tried "CUDA_LAUNCH_BLOCKING=1 python3.7 script.py" to get a more detailed stack trace and got:
] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [171,0,0], thread: [125,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [171,0,0], thread: [126,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [171,0,0], thread: [127,0,0] Assertion srcIndex < srcSelectDimSize failed.
Epoch: 0%| | 0/6 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "finetune_tsda.py", line 53, in <module>
    show_progress_bar=True
  File "/usr/local/lib/python3.7/site-packages/sentence_transformers/SentenceTransformer.py", line 567, in fit
    loss_value = loss_model(features, labels)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/sentence_transformers/losses/DenoisingAutoEncoderLoss.py", line 90, in forward
    reps = self.encoder(source_features)['sentence_embedding'] # (bsz, hdim)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/sentence_transformers/models/Transformer.py", line 38, in forward
    output_states = self.auto_model(**trans_features, return_dict=False)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 969, in forward
    past_key_values_length=past_key_values_length,
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 204, in forward
    embeddings = inputs_embeds + token_type_embeddings
RuntimeError: CUDA error: device-side assert triggered
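The assertion srcIndex < srcSelectDimSize in indexSelectLargeIndex means an index-select kernel was asked for a row beyond the end of the tensor it indexes. In a BERT forward pass, a typical culprit is a token ID that is greater than or equal to the number of rows in the embedding table. A cheap CPU-side check before training is to compare every ID the tokenizer produces with the table size. This is a framework-agnostic sketch, and the function name out_of_range_ids is hypothetical. With the models in this thread, the ID lists could come from a tokenizer call's input_ids and the size from model.get_input_embeddings().num_embeddings; treat that wiring as an assumption to verify.

```python
from typing import Iterable, List, Tuple


def out_of_range_ids(batches: Iterable[List[int]], num_embeddings: int) -> List[Tuple[int, int]]:
    """Return (sequence_index, token_id) pairs whose ID cannot be looked up in a table of num_embeddings rows."""
    bad: List[Tuple[int, int]] = []
    for seq_idx, ids in enumerate(batches):
        for token_id in ids:
            if token_id < 0 or token_id >= num_embeddings:
                bad.append((seq_idx, token_id))
    return bad


if __name__ == "__main__":
    # Toy data: a table of 100 rows, and one sequence that uses an added token (ID 102).
    sequences = [[2, 15, 40, 3], [2, 102, 3]]
    print(out_of_range_ids(sequences, num_embeddings=100))  # [(1, 102)]
```

Running this check on CPU gives a readable answer. On GPU, the same mistake only surfaces as the asynchronous device-side assert shown above.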
Are you using the same script? Please try the code below:
from sentence_transformers import SentenceTransformer
from sentence_transformers import models, datasets, losses
from torch.utils.data import DataLoader
model_name = 'HooshvareLab/bert-fa-base-uncased'
word_embedding_model = models.Transformer(model_name, max_seq_length=250)
existing_word = list(word_embedding_model.tokenizer.vocab.keys())[1000]
vocab = ['<new_word_1>', '<new_word_2>', '<سلامسلام>', existing_word]
print('Before:', word_embedding_model.auto_model.embeddings)
word_embedding_model.tokenizer.add_tokens(vocab)
word_embedding_model.auto_model.resize_token_embeddings(len(word_embedding_model.tokenizer))
print('Now:', word_embedding_model.auto_model.embeddings)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode_mean_tokens=False, pooling_mode_cls_token=True, pooling_mode_max_tokens=False)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
train_sentences=[
'A sentence containing <new_word_1> and <new_word_2>.',
'A sentence containing only <new_word_2>.',
'A sentence containing <سلامسلام>',
f'A sentence containing {existing_word}'
]
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)
train_loss = losses.DenoisingAutoEncoderLoss(model, decoder_name_or_path=model_name, tie_encoder_decoder=True)
model.fit(
train_objectives=[(train_dataloader, train_loss)],
epochs=3,
weight_decay=0,
scheduler='constantlr',
optimizer_params={'lr': 3e-5},
show_progress_bar=True
)
This works fine on my server. If it does not work on your side, I think the cause is either a wrong version of the SBERT repo (the test above passes for me with sentence-transformers==1.1.1) or a CUDA problem.
If it does work on your side, then I think the problem is related to a specific new word/token. To locate it, iterate over all the new words, create a sentence containing each one, and fit the TSDAE model on it. If an exception is thrown at some point, please tell us which word triggered it.
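The per-word search described above can be wrapped in a small driver. In this sketch, try_word is a hypothetical callable you would write yourself, for example one that builds a one-sentence DenoisingAutoEncoderDataset containing the word and runs a short model.fit. The driver only records which words make it raise, and both helper names are hypothetical.

```python
from typing import Callable, Dict, Iterable


def find_failing_words(words: Iterable[str], try_word: Callable[[str], object]) -> Dict[str, str]:
    """Call try_word on each word; return {word: error description} for the ones that raise."""
    failures: Dict[str, str] = {}
    for word in words:
        try:
            try_word(word)
        except Exception as err:  # record any failure and keep going
            failures[word] = f"{type(err).__name__}: {err}"
    return failures


if __name__ == "__main__":
    # Stand-in for a real training run: pretend that IDs of 100 or more break the model.
    fake_ids = {"<new_word_1>": 98, "<new_word_2>": 102}

    def fake_try(word: str) -> None:
        if fake_ids[word] >= 100:
            raise IndexError(f"id {fake_ids[word]} out of range")

    print(find_failing_words(fake_ids, fake_try))  # {'<new_word_2>': 'IndexError: id 102 out of range'}
```

One design caveat: after a CUDA device-side assert, the CUDA context is generally unusable for the rest of the process. On GPU, each word would therefore need a fresh process. Running the search on CPU avoids this, because there the failure is raised as an ordinary Python exception.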
Yes, I used the latest version of SBERT and the same script, but I still got the error!
I also got this warning; could it be causing the problem?
/lib/python3.7/site-packages/pandas/compat/__init__.py:120: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
Could you please run the code snippet mentioned above? Your warning seems to have nothing to do with the SBERT repo, since the pandas package is not required.
Yeah, it's solved. Sorry, the latest version hadn't been installed properly. Thanks!
@nreimers Does the code support running on multiple GPUs?
@kwang2049 @nreimers Hi, I ran the code snippet above to add 10k new tokens. After 1 epoch of training, when I tried to use the saved model to encode sentences, I got this error:
AssertionError: Non-consecutive added token '#سلام' found. Should have index 100005 but has index 100006 in saved vocabulary.
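The message suggests that, when the tokenizer is reloaded, each saved added token is expected to receive the next free index (the base vocabulary size, then +1, and so on). A single saved token that is not actually added again on reload would then shift every later index by one. Here it is assumed that the cause is a token that already exists in the vocabulary. The sketch below is a simplified, hypothetical model of that consistency check on a token-to-index mapping such as the one saved in added_tokens.json; it is not the transformers implementation.

```python
from typing import Dict, Optional, Set


def first_non_consecutive(added: Dict[str, int], base_vocab: Set[str]) -> Optional[str]:
    """Replay added tokens in index order; return a message at the first gap, else None.

    Simplified model (assumption): a token already in base_vocab, or already added,
    is skipped on reload, so it does not consume a new index.
    """
    known = set(base_vocab)
    next_index = len(base_vocab)
    for token, index in sorted(added.items(), key=lambda item: item[1]):
        if index != next_index:
            return (f"Non-consecutive added token {token!r} found. "
                    f"Should have index {next_index} but has index {index}")
        if token not in known:
            known.add(token)
            next_index += 1
    return None


if __name__ == "__main__":
    base = {"[PAD]", "[UNK]", "hello"}                 # toy base vocabulary of size 3
    saved = {"<new_1>": 3, "hello": 4, "<new_2>": 5}   # "hello" duplicates a base token
    print(first_non_consecutive(saved, base))
```

In the toy run, the duplicated token consumes index 4 at save time but not on reload, so the next token is reported as having index 5 where 4 was expected. This is the same "off by one" shape as the error above.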
@nreimers Hi, I tried the TSDAE code to train my model, but it doesn't give me any information about the training loss during training.
Train loss is not computed & plotted during training
Hi @ReySadeghi, I cannot reproduce it: the SBERT checkpoint with added tokens loads successfully for me. Before going into more detail, could you please run this check (to see whether the assertion error still occurs without TSDAE):
from sentence_transformers import SentenceTransformer
from sentence_transformers import models
model_name = 'HooshvareLab/bert-fa-base-uncased'
word_embedding_model = models.Transformer(model_name, max_seq_length=250)
existing_word = list(word_embedding_model.tokenizer.vocab.keys())[1000]
vocab = ['<new_word_1>', '<new_word_2>', '<سلامسلام>', existing_word, '<new_subword111>', '<new_subword222>']
print('Before:', word_embedding_model.auto_model.embeddings)
word_embedding_model.tokenizer.add_tokens(vocab)
word_embedding_model.auto_model.resize_token_embeddings(len(word_embedding_model.tokenizer))
print('Now:', word_embedding_model.auto_model.embeddings)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode_mean_tokens=False, pooling_mode_cls_token=True, pooling_mode_max_tokens=False)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
train_sentences=[
'A sentence containing <new_word_1> and <new_word_2>.',
'A sentence containing only <new_word_2>.',
'A sentence containing <سلامسلام>',
f'A sentence containing {existing_word}',
'A sentence containing <new_subword111>xxx, my<new_subword222>yyyu'
]
model.save('sbert_tokens_added')
model = SentenceTransformer('sbert_tokens_added')
print([model[0].tokenizer.tokenize(sentence) for sentence in train_sentences])
If this new snippet also reports the error, I think it might be related to your transformers version. If it works, replace the vocab variable above with your new token list and try again.
I tried this and it was OK. I think the problem was actually caused by some tokens that weren't valid UTF-8; when I removed them, the problem was solved.
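The training script earlier in this thread reads vocab30k.txt with errors='ignore', which silently drops undecodable bytes instead of flagging the affected lines. This is a minimal sketch of a stricter loader. It assumes that lines which are not valid UTF-8 should be skipped and reported, and that empty and duplicate entries should be dropped as well. The function name load_clean_vocab is hypothetical.

```python
import os
import tempfile
from typing import List, Set, Tuple


def load_clean_vocab(path: str) -> Tuple[List[str], List[int]]:
    """Read one token per line; return (tokens, 1-based numbers of lines that are not valid UTF-8)."""
    tokens: List[str] = []
    bad_lines: List[int] = []
    seen: Set[str] = set()
    with open(path, "rb") as fh:  # binary mode so decoding errors are caught per line
        for lineno, raw in enumerate(fh, start=1):
            try:
                token = raw.decode("utf-8").strip()
            except UnicodeDecodeError:
                bad_lines.append(lineno)
                continue
            if token and token not in seen:  # drop empty lines and duplicates
                seen.add(token)
                tokens.append(token)
    return tokens, bad_lines


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile("wb", delete=False, suffix=".txt") as tmp:
        tmp.write("سلام\n".encode("utf-8") + b"\xff\xfe broken\n" + "سلام\nword\n".encode("utf-8"))
    tokens, bad = load_clean_vocab(tmp.name)
    os.remove(tmp.name)
    print(tokens, bad)  # ['سلام', 'word'] [2]
```

Reporting the bad line numbers, instead of silently dropping bytes, makes it easy to inspect and fix the offending entries in the source file.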
Hi, I used the TSDAE method to pretrain a BERT model on a corpus of sentences and got this error:
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)
Then I used CUDA_LAUNCH_BLOCKING=1 python [YOUR_PROGRAM] to trace the error and got this:
RuntimeError: CUDA error: device-side assert triggered
Any help?