UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

RerankingEvaluator taking too much time #1528

Open chaalic opened 2 years ago

chaalic commented 2 years ago

Hi,

I am currently working on finetuning "distiluse-base-multilingual-cased-v1", using MultipleNegativesRankingLoss and RerankingEvaluator, over a dataset of 700k (query, sentence) pairs. I'm facing a problem with the evaluator: it takes too long to evaluate approximately 8000 unique evaluation pairs, even though I am using a GPU for the task. Is this normal behaviour?

Thank you for your help !

nreimers commented 2 years ago

Runtime depends on the number of queries and the number of docs per query. If you have 8k queries with 100 docs each, the evaluator must encode 800k texts, which takes quite some time.

chaalic commented 2 years ago

Thank you for your answer. However, I only have one query per document in the evaluation set, so I am not sure I understand the reason behind this. I have one other question, please: is there a way to view the loss during the training of the model?

Thank you once again :)

nreimers commented 2 years ago

Loss during training is not supported yet. But you could create your own loss class that prints the loss.
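A minimal sketch of that idea: wrap the training loss in a small nn.Module that delegates to it and prints the value every few steps. `LoggingLoss` and `log_every` are hypothetical names, not part of the library; the wrapper is then passed to `model.fit` in place of the loss itself.

```python
import torch.nn as nn
from sentence_transformers import SentenceTransformer, losses


class LoggingLoss(nn.Module):
    """Hypothetical wrapper that prints the loss value during training."""

    def __init__(self, loss, log_every=100):
        super().__init__()
        self.loss = loss          # wrapped loss, e.g. MultipleNegativesRankingLoss
        self.log_every = log_every
        self.step = 0

    def forward(self, sentence_features, labels):
        loss_value = self.loss(sentence_features, labels)
        self.step += 1
        if self.step % self.log_every == 0:
            print(f"step {self.step}: loss = {loss_value.item():.4f}")
        return loss_value


model = SentenceTransformer("distiluse-base-multilingual-cased-v1")
train_loss = LoggingLoss(losses.MultipleNegativesRankingLoss(model))
# model.fit(train_objectives=[(train_dataloader, train_loss)], ...)
```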

One query per document does not make sense for the RerankingEvaluator.

The RerankingEvaluator expects a query and a list of candidates, e.g. 20 candidates that are related to the query. It will then re-rank these 20 candidates and check at which position the relevant document is.
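For illustration, a sketch of the expected input format: each sample pairs one query with multiple candidates, split into relevant (positive) and irrelevant (negative) texts. The texts and the `name` below are made up.

```python
from sentence_transformers.evaluation import RerankingEvaluator

samples = [
    {
        "query": "How do I reset my password?",
        "positive": ["You can reset your password from the account settings page."],
        "negative": [
            "Our office is open Monday to Friday.",
            "Shipping usually takes 3-5 business days.",
            # ... typically around 20 candidates per query
        ],
    },
    # one entry per evaluation query
]

evaluator = RerankingEvaluator(samples, name="dev-rerank")
# model.evaluate(evaluator), or pass evaluator=evaluator to model.fit(...)
```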

eliasws commented 2 years ago

@chaalic Perhaps the InformationRetrievalEvaluator would work better for your case?
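A minimal sketch of how that evaluator could be set up for one relevant document per query; the ids and texts below are illustrative only.

```python
from sentence_transformers.evaluation import InformationRetrievalEvaluator

queries = {"q1": "How do I reset my password?"}  # query_id -> query text
corpus = {
    "d1": "You can reset your password from the account settings page.",
    "d2": "Our office is open Monday to Friday.",
}  # doc_id -> document text
relevant_docs = {"q1": {"d1"}}  # query_id -> set of relevant doc_ids

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="dev-ir")
# model.fit(..., evaluator=evaluator, evaluation_steps=5000)
```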