embeddings-benchmark / arena

Code for the MTEB Arena
https://hf.co/spaces/mteb/arena

huggingface_hub.errors.HFValidationError: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: 'mteb/index_stackexchange_### model a: bm25'. #33

Open Muennighoff opened 3 months ago

Muennighoff commented 3 months ago

Not sure what happened but saw this in the logs:

se.py", line 458, in result
2024-08-05 21:48:44 | ERROR | stderr |     return self.__get_result()
2024-08-05 21:48:44 | ERROR | stderr |   File "/env/lib/conda/gritkto/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
2024-08-05 21:48:44 | ERROR | stderr |     raise self._exception
2024-08-05 21:48:44 | ERROR | stderr |   File "/env/lib/conda/gritkto/lib/python3.10/concurrent/futures/thread.py", line 58, in run
2024-08-05 21:48:44 | ERROR | stderr |     result = self.fn(*self.args, **self.kwargs)
2024-08-05 21:48:44 | ERROR | stderr |   File "/data/niklas/arena/models.py", line 226, in retrieve
2024-08-05 21:48:44 | ERROR | stderr |     index = self.load_bm25_index(model_name, corpus)
2024-08-05 21:48:44 | ERROR | stderr |   File "/data/niklas/arena/models.py", line 164, in load_bm25_index
2024-08-05 21:48:44 | ERROR | stderr |     index.load_index()
2024-08-05 21:48:44 | ERROR | stderr |   File "/data/niklas/arena/retrieval/bm25_index.py", line 47, in load_index
2024-08-05 21:48:44 | ERROR | stderr |     self._create_index()
2024-08-05 21:48:44 | ERROR | stderr |   File "/data/niklas/arena/retrieval/bm25_index.py", line 35, in _create_index
2024-08-05 21:48:44 | ERROR | stderr |     retriever.save_to_hub(repo_id=f"mteb/{self.repo_name}", token=hf_token, corpus=passages)
2024-08-05 21:48:44 | ERROR | stderr |   File "/env/lib/conda/gritkto/lib/python3.10/site-packages/bm25s/hf.py", line 255, in save_to_hub
2024-08-05 21:48:44 | ERROR | stderr |     repo_url = api.create_repo(
2024-08-05 21:48:44 | ERROR | stderr |   File "/env/lib/conda/gritkto/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
2024-08-05 21:48:44 | ERROR | stderr |     validate_repo_id(arg_value)
2024-08-05 21:48:44 | ERROR | stderr |   File "/env/lib/conda/gritkto/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 160, in validate_repo_id
2024-08-05 21:48:44 | ERROR | stderr |     raise HFValidationError(
2024-08-05 21:48:44 | ERROR | stderr | huggingface_hub.errors.HFValidationError: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: 'mteb/index_stackexchange_### model a: bm25'.

isaac-chung commented 3 months ago

Looking at where self.repo_name is defined: https://github.com/embeddings-benchmark/arena/blob/64a8780d596018912905523406621eed62a9a417/retrieval/bm25_index.py#L16 Maybe model_name contains spaces (and, in this case, '#' and ':'), none of which are allowed in a repo id?

Muennighoff commented 3 months ago

I think the problem is that the model name sometimes arrives as "### model a: bm25" rather than "bm25", which triggers this error. I'm not sure exactly when that happens.
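
One way to guard against this is to strip the display prefix before the name is used to build the repo id. This is a rough sketch, assuming the prefix always has the form "### model a: " or "### model b: ". The function name normalize_model_name is hypothetical and does not exist in the arena code.

```python
import re

# Matches a leading display label such as "### model a: " or "### Model B: ".
# The exact label format is an assumption based on the string in the error.
_DISPLAY_PREFIX = re.compile(r"^\s*#+\s*model\s+[ab]\s*:\s*", re.IGNORECASE)


def normalize_model_name(model_name: str) -> str:
    """Hypothetical helper: drop the arena display prefix, if present."""
    return _DISPLAY_PREFIX.sub("", model_name).strip()


if __name__ == "__main__":
    print(normalize_model_name("### model a: bm25"))
    print(normalize_model_name("bm25"))
```

Names without the prefix pass through unchanged, so the call should be safe to apply unconditionally at the point where model_name enters retrieve.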

isaac-chung commented 3 months ago

Maybe we can directly feed bm25 as the model_name here?

    def retrieve(self, query, corpus, model_name, topk=1):
        corpus_format = CORPUS_TO_FORMAT[corpus]

        if "BM25" in model_name:
            index = self.load_bm25_index(model_name, corpus)