embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

mteb.load_results raises pydantic error when trying to load all tasks #1197

Closed: j0ma closed this 1 week ago

j0ma commented 2 weeks ago

Hello, and thanks for a great library!

I was trying to get the results from all tasks for some analysis work I'm doing, but ran into an error when loading them.

Error description

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "./mteb-upstream-repo/mteb/load_results/load_results.py", line 177, in load_results
    _results = [MTEBResults.from_disk(f) for f in task_json_files]
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./mteb-upstream-repo/mteb/load_results/load_results.py", line 177, in <listcomp>
    _results = [MTEBResults.from_disk(f) for f in task_json_files]
                ^^^^^^^^^^^^^^^^^^^^^^^^
  File "./mteb-upstream-repo/mteb/load_results/mteb_results.py", line 281, in from_disk
    raise e
  File "./mteb-upstream-repo/mteb/load_results/mteb_results.py", line 278, in from_disk
    obj = cls.model_validate(data)
          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jonne/miniconda3/envs/bayeseval/lib/python3.11/site-packages/pydantic/main.py", line 568, in model_validate
    return cls.__pydantic_validator__.validate_python(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 1 validation error for MTEBResults
evaluation_time
  Field required [type=missing, input_value={'dataset_revision': 'a75...ame': 'BrightRetrieval'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.8/v/missing

Seems like it has to do with the BrightRetrieval task.
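To narrow it down, I ran a rough diagnostic over the cached result files. This is only a sketch: the cache path is the one from the warnings below, and the import path follows the traceback above, so adjust both if your layout differs.

import pathlib

import pydantic

from mteb.load_results.mteb_results import MTEBResults

# Walk the cached results and report which task result files fail validation.
cache = pathlib.Path.home() / ".cache" / "mteb" / "results" / "results"
for task_json in sorted(cache.rglob("*.json")):
    if task_json.name == "model_meta.json":
        continue  # model metadata, not task results
    try:
        MTEBResults.from_disk(task_json)
    except pydantic.ValidationError as e:
        print(task_json, "->", e.errors()[0]["msg"])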

I also see a bunch of warnings for other tasks:

Already up to date.
WARNING:mteb.load_results.load_results:model_meta.json not found in /home/jonne/.cache/mteb/results/results/LLM2Vec-Sheared-Llama-unsupervised/no_revision_available, extracting model_name and revision from the path
WARNING:mteb.load_results.mteb_results:Main score max_ap not found in scores
WARNING:mteb.load_results.mteb_results:Main score max_ap not found in scores
WARNING:mteb.load_results.mteb_results:Main score max_ap not found in scores
WARNING:mteb.load_results.mteb_results:Main score cosine_spearman not found in scores
WARNING:mteb.load_results.load_results:model_meta.json not found in /home/jonne/.cache/mteb/results/results/voyage-lite-02-instruct/no_revision_available, extracting model_name and revision from the path
WARNING:mteb.load_results.mteb_results:Main score max_ap not found in scores
WARNING:mteb.load_results.mteb_results:Main score max_ap not found in scores
WARNING:mteb.load_results.mteb_results:Main score max_ap not found in scores
WARNING:mteb.load_results.mteb_results:Main score cosine_spearman not found in scores
WARNING:mteb.load_results.load_results:model_meta.json not found in /home/jonne/.cache/mteb/results/results/rubert-tiny-turbo/8ce0cf757446ce9bb2d5f5a4ac8103c7a1049054, extracting model_name and revision from the path
WARNING:mteb.load_results.load_results:model_meta.json not found in /home/jonne/.cache/mteb/results/results/LLM2Vec-Sheared-Llama-supervised/no_revision_available, extracting model_name and revision from the path
WARNING:mteb.load_results.mteb_results:Main score max_ap not found in scores
WARNING:mteb.load_results.mteb_results:Main score max_ap not found in scores
WARNING:mteb.load_results.mteb_results:Main score max_ap not found in scores
WARNING:mteb.load_results.mteb_results:Main score cosine_spearman not found in scores
WARNING:mteb.load_results.load_results:model_meta.json not found in /home/jonne/.cache/mteb/results/results/LLM2Vec-Mistral-unsupervised/no_revision_available, extracting model_name and revision from the path
WARNING:mteb.load_results.mteb_results:Main score max_ap not found in scores
WARNING:mteb.load_results.mteb_results:Main score max_ap not found in scores
WARNING:mteb.load_results.mteb_results:Main score max_ap not found in scores
WARNING:mteb.load_results.mteb_results:Main score cosine_spearman not found in scores
WARNING:mteb.load_results.load_results:model_meta.json not found in /home/jonne/.cache/mteb/results/results/titan-embed-text-v1/no_revision_available, extracting model_name and revision from the path
WARNING:mteb.load_results.load_results:model_meta.json not found in /home/jonne/.cache/mteb/results/results/gte-Qwen2-7B-instruct/no_revision_available, extracting model_name and revision from the path

Version information

Version: installed from git commit aa5479da71a40b545dd339d345101d3a02e688c3

> pip freeze
mteb @ git+https://github.com/embeddings-benchmark/mteb.git@aa5479da71a40b545dd339d345101d3a02e688c3

How to reproduce:

python -c 'import mteb;mteb.load_results(require_model_meta=False, validate_and_filter=False, models=None, tasks=None)'

Any idea what is going wrong? Happy to help fix this if it's not too complicated!
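For now I can work around it by excluding BrightRetrieval. A sketch below; I'm assuming load_results accepts the task objects returned by mteb.get_tasks(), so adjust if it expects task names instead:

import mteb

# Load results for every task except the one whose files fail validation.
tasks = [t for t in mteb.get_tasks() if t.metadata.name != "BrightRetrieval"]
results = mteb.load_results(require_model_meta=False, tasks=tasks)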

asparius commented 2 weeks ago

@j0ma thanks for the report. This error is due to the BrightRetrieval results missing evaluation_time, which is a required field in the MTEBResults model. I have opened an issue in the results repository about this; it should be resolved once the corrected results are uploaded.
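For illustration, this failure mode is generic pydantic behaviour. A minimal standalone sketch (a toy stand-in, not the actual MTEBResults model) reproduces the same "Field required" error:

from pydantic import BaseModel, ValidationError

# Toy stand-in for MTEBResults: a field with no default is required.
class Result(BaseModel):
    task_name: str
    evaluation_time: float  # the field missing from the BrightRetrieval files

try:
    Result.model_validate({"task_name": "BrightRetrieval"})
except ValidationError as e:
    print(e)  # "evaluation_time  Field required [type=missing, ...]"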

KennethEnevoldsen commented 2 weeks ago

re: the warnings:

This is essentially a backward-compatibility issue: we allow loading historical results files, but some of them lack certain scores (e.g. max_ap). The warnings are there to inform you that those results files are incomplete.

The most recent update adds a require_model_meta=True flag, which by default ignores runs without a model_meta.json file. This resolves the majority of the issues (probably all of them, I believe).
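For example, keeping the default should skip those runs. A usage sketch based on the flag as described above:

import mteb

# Default behaviour: runs without a model_meta.json are ignored, which also
# silences the "model_meta.json not found" warnings shown above.
results = mteb.load_results(require_model_meta=True)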

If that is not the case, we might also consider adding a flag for ignoring incomplete results files.

edit: we have also added tests to ensure that this should not be an issue going forward (though we will naturally update them if needed)

Muennighoff commented 1 week ago

Verified that python -c 'import mteb;mteb.load_results(require_model_meta=False, validate_and_filter=False, models=None, tasks=None)' now works on my end with the latest mteb. Thanks for raising this issue!