beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use: evaluate your models across 15+ diverse IR datasets.
http://beir.ai
Apache License 2.0

Training a T5 model #51

Open pritamdeka opened 2 years ago

pritamdeka commented 2 years ago

Hi @NThakur20 I was wondering whether we can train a T5 model, since when I loaded a T5 model from HF I ran into an error.

thakur-nandan commented 2 years ago

Hi @pritamdeka,

Could you give a bit more context on the points below:

  1. Which model exactly did you use? A link to the model, if available.
  2. Which code did you use to train the model?
  3. What error did you encounter during training?
  4. If possible, a sample code snippet to reproduce your error.

Thank you!

Kind Regards, Nandan Thakur

pritamdeka commented 2 years ago

Hi @NThakur20, thanks for the reply.

I used the train_sbert_BM25_hardnegs.py file for training on the SciFact dataset. I made a few changes for the T5 model, such as changing the following line:

word_embedding_model = models.Transformer(model_name, max_seq_length=300)

to

word_embedding_model = models.T5.T5(model_name, max_seq_length=300)

Also, the model I used is castorini/monot5-base-msmarco.
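For reference, the script builds the model roughly like this (reconstructed from memory, so the exact lines may differ slightly); the T5 line above was my only change to this block:

from sentence_transformers import SentenceTransformer, models

model_name = "castorini/monot5-base-msmarco"

# Setup as in train_sbert_BM25_hardnegs.py (reconstructed; standard SBERT composition):
word_embedding_model = models.Transformer(model_name, max_seq_length=300)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# My change: replace the first line with the legacy T5 module
# word_embedding_model = models.T5.T5(model_name, max_seq_length=300)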

The error is something like this:

Epoch:   0% 0/1 [00:00<?, ?it/s]
Iteration:   0% 0/1121 [00:00<?, ?it/s]
Epoch:   0% 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/content/beir/examples/retrieval/training/train_sbert_BM25_hardnegs.py", line 132, in <module>
    use_amp=True)
  File "/usr/local/lib/python3.7/dist-packages/beir/retrieval/train.py", line 148, in fit
    callback=callback, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/sentence_transformers/SentenceTransformer.py", line 682, in fit
    data = next(data_iterator)
  File "/usr/local/lib/python3.7/dist-packages/sentence_transformers/datasets/NoDuplicatesDataLoader.py", line 41, in __iter__
    yield self.collate_fn(batch) if self.collate_fn is not None else batch
  File "/usr/local/lib/python3.7/dist-packages/sentence_transformers/SentenceTransformer.py", line 534, in smart_batching_collate
    tokenized = self.tokenize(texts[idx])
  File "/usr/local/lib/python3.7/dist-packages/sentence_transformers/SentenceTransformer.py", line 311, in tokenize
    return self._first_module().tokenize(texts)
  File "/usr/local/lib/python3.7/dist-packages/sentence_transformers/models/T5.py", line 55, in tokenize
    return self.tokenizer.encode(self.task_identifier+text)
TypeError: can only concatenate str (not "list") to str
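If it helps, my reading of the traceback is that the newer sentence-transformers collate function passes a list of strings (one batch column) to tokenize(), while the legacy models.T5 module builds task_identifier + text for a single string, hence the TypeError. A tiny sketch of the failing operation (the prefix value is just a placeholder):

# Sketch of the failing concatenation from the traceback above.
# "query: " is a placeholder; the real prefix comes from the T5 module's task_identifier.
task_identifier = "query: "
texts = ["first input text", "second input text"]  # what tokenize() actually receives

try:
    task_identifier + texts  # str + list
except TypeError as err:
    print(err)  # can only concatenate str (not "list") to str

So it looks like the legacy T5 module predates the current batched tokenize() interface, which would explain the mismatch.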

rahmanidashti commented 1 year ago

Have you found a solution for this? Are there any T5-based models currently implemented in BEIR?
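For what it's worth, T5-based dense encoders released in sentence-transformers format (for example the GTR family) seem to load through BEIR's standard SentenceBERT wrapper. A rough, untested sketch (the model name and dot-product scoring are just my guesses for a reasonable setup, assuming a sentence-transformers version recent enough to load T5 encoders):

from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Download and load a BEIR dataset (SciFact here, to match the thread).
dataset = "scifact"
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# One example of a T5-based dense encoder on the HF Hub (assumption: loadable
# as a sentence-transformers model by the installed version).
model = DRES(models.SentenceBERT("sentence-transformers/gtr-t5-base"), batch_size=16)
retriever = EvaluateRetrieval(model, score_function="dot")

results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)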