UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

Error with STS dataloader #1251

Closed · lambdaofgod closed this 2 years ago

lambdaofgod commented 2 years ago

I'm getting an error when I try to run the training_stsbenchmark_bilstm.py example.

When I inspect what the dataloader yields with next(iter(train_dataloader)), I get:

TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'sentence_transformers.readers.InputExample.InputExample'>

Full traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-97295ce92ad5> in <module>
----> 1 for b in train_dataloader:
      2     print(b)
      3     break

~/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py in __next__(self)
    515             if self._sampler_iter is None:
    516                 self._reset()
--> 517             data = self._next_data()
    518             self._num_yielded += 1
    519             if self._dataset_kind == _DatasetKind.Iterable and \

~/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py in _next_data(self)
    555     def _next_data(self):
    556         index = self._next_index()  # may raise StopIteration
--> 557         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    558         if self._pin_memory:
    559             data = _utils.pin_memory.pin_memory(data)

~/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     45         else:
     46             data = self.dataset[possibly_batched_index]
---> 47         return self.collate_fn(data)

~/.local/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py in default_collate(batch)
     83         return [default_collate(samples) for samples in transposed]
     84 
---> 85     raise TypeError(default_collate_err_msg_format.format(elem_type))
TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'sentence_transformers.readers.InputExample.InputExample'>

Running the whole script, I get the following:


---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-19-1f07fcdd242f> in <module>
      1 logging.info("Warmup-steps: {}".format(warmup_steps))
      2 # Train the model
----> 3 model.fit(train_objectives=[(train_dataloader, train_loss)],
      4           evaluator=evaluator,
      5           epochs=num_epochs,

~/.local/lib/python3.8/site-packages/sentence_transformers/SentenceTransformer.py in fit(self, train_objectives, evaluator, epochs, steps_per_epoch, scheduler, warmup_steps, optimizer_class, optimizer_params, weight_decay, evaluation_steps, output_path, save_best_model, max_grad_norm, use_amp, callback, show_progress_bar, checkpoint_path, checkpoint_save_steps, checkpoint_save_total_limit)
    703                         skip_scheduler = scaler.get_scale() != scale_before_step
    704                     else:
--> 705                         loss_value = loss_model(features, labels)
    706                         loss_value.backward()
    707                         torch.nn.utils.clip_grad_norm_(loss_model.parameters(), max_grad_norm)

~/.local/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    887             result = self._slow_forward(*input, **kwargs)
    888         else:
--> 889             result = self.forward(*input, **kwargs)
    890         for hook in itertools.chain(
    891                 _global_forward_hooks.values(),

~/.local/lib/python3.8/site-packages/sentence_transformers/losses/CosineSimilarityLoss.py in forward(self, sentence_features, labels)
     37 
     38     def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor):
---> 39         embeddings = [self.model(sentence_feature)['sentence_embedding'] for sentence_feature in sentence_features]
     40         output = self.cos_score_transformation(torch.cosine_similarity(embeddings[0], embeddings[1]))
     41         return self.loss_fct(output, labels.view(-1))

~/.local/lib/python3.8/site-packages/sentence_transformers/losses/CosineSimilarityLoss.py in <listcomp>(.0)
     37 
     38     def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor):
---> 39         embeddings = [self.model(sentence_feature)['sentence_embedding'] for sentence_feature in sentence_features]
     40         output = self.cos_score_transformation(torch.cosine_similarity(embeddings[0], embeddings[1]))
     41         return self.loss_fct(output, labels.view(-1))

~/.local/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    887             result = self._slow_forward(*input, **kwargs)
    888         else:
--> 889             result = self.forward(*input, **kwargs)
    890         for hook in itertools.chain(
    891                 _global_forward_hooks.values(),

~/.local/lib/python3.8/site-packages/torch/nn/modules/container.py in forward(self, input)
    117     def forward(self, input):
    118         for module in self:
--> 119             input = module(input)
    120         return input
    121 

~/.local/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    887             result = self._slow_forward(*input, **kwargs)
    888         else:
--> 889             result = self.forward(*input, **kwargs)
    890         for hook in itertools.chain(
    891                 _global_forward_hooks.values(),

~/.local/lib/python3.8/site-packages/sentence_transformers/models/LSTM.py in forward(self, features)
     30         sentence_lengths = torch.clamp(features['sentence_lengths'], min=1)
     31 
---> 32         packed = nn.utils.rnn.pack_padded_sequence(token_embeddings, sentence_lengths, batch_first=True, enforce_sorted=False)
     33         packed = self.encoder(packed)
     34         unpack = nn.utils.rnn.pad_packed_sequence(packed[0], batch_first=True)[0]

~/.local/lib/python3.8/site-packages/torch/nn/utils/rnn.py in pack_padded_sequence(input, lengths, batch_first, enforce_sorted)
    243 
    244     data, batch_sizes = \
--> 245         _VF._pack_padded_sequence(input, lengths, batch_first)
    246     return _packed_sequence_init(data, batch_sizes, sorted_indices, None)
    247 

RuntimeError: 'lengths' argument should be a 1D CPU int64 tensor, but got 1D cuda:0 Long tensor

Environment:

sentence-transformers==2.1.0
torch==1.8.0+cu111

milmin commented 2 years ago

I get exactly the same error with sentence-transformers==2.1.0 when I run the very basic example from the docs:

from sentence_transformers import SentenceTransformer, InputExample
from torch.utils.data import DataLoader

model = SentenceTransformer('distilbert-base-nli-mean-tokens')
train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
    InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
next(iter(train_dataloader))

Error:

TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'sentence_transformers.readers.InputExample.InputExample'>

milmin commented 2 years ago

Any news on this issue?

ncoop57 commented 2 years ago

@milmin and @lambdaofgod I had a similar issue with the dataloader. SentenceTransformer overwrites the default collator inside fit() (https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/SentenceTransformer.py#L629) so that batching works with the InputExample class; that is why iterating the DataLoader directly raises this error. If you want to test that your data pipeline is working, set the model's smart_batching_collate function as your DataLoader's collate_fn.
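
For reference, a minimal sketch of that test, reusing milmin's snippet from above (the model name, texts, and batch size are just the ones from that snippet; smart_batching_collate is accessed as a method on the model):

    from sentence_transformers import SentenceTransformer, InputExample
    from torch.utils.data import DataLoader

    model = SentenceTransformer('distilbert-base-nli-mean-tokens')
    train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
                      InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]

    # Pass the model's collator so the DataLoader knows how to batch InputExample objects
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16,
                                  collate_fn=model.smart_batching_collate)

    # Now iterating works: smart_batching_collate returns (features, labels)
    features, labels = next(iter(train_dataloader))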

lambdaofgod commented 2 years ago

@ncoop57 I think that solves this problem, thanks.

To also fix the RuntimeError: 'lengths' argument should be a 1D CPU int64 tensor, but got 1D cuda:0 Long tensor, we additionally need to change the LSTM class, as this issue suggests.
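
Concretely, a sketch of that change (this assumes the fix is to hand pack_padded_sequence a CPU lengths tensor, which matches the error message; the surrounding lines are from LSTM.py's forward in the traceback above):

    # sentence_transformers/models/LSTM.py, inside forward():
    sentence_lengths = torch.clamp(features['sentence_lengths'], min=1)

    # pack_padded_sequence requires a 1D CPU int64 lengths tensor,
    # so move the lengths off the GPU before packing
    packed = nn.utils.rnn.pack_padded_sequence(
        token_embeddings, sentence_lengths.cpu(),
        batch_first=True, enforce_sorted=False)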

Lurrobert commented 2 years ago

> smart_batching_collate

Thanks! That helped!

Working code:

    train_dataloader = DataLoader(
        train_dataset,
        shuffle=True,
        batch_size=train_batch_size,
        collate_fn=model.smart_batching_collate
    )

carlosandrefernandes commented 1 year ago

> smart_batching_collate

https://www.kaggle.com/code/rahulseetharaman/siamese-bert/notebook?scriptVersionId=102434046