embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

unable to push mmteb scores to model meta due to file size #1368

Open bwanglzu opened 1 week ago

bwanglzu commented 1 week ago

I patched the model meta with the latest results using mteb create_meta --results_folder results/{my model}/{my revision} --output_path model_card.md --from_existing jina_embeddings-v3.md. This produces a README of 28.9 MB, and I can no longer publish the scores to Hugging Face. Have you encountered this error?

jina-embeddings-v3|pr/62 ⇒ git push origin pr/62:refs/pr/62
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 8 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 2.72 MiB | 6.58 MiB/s, done.
Total 3 (delta 1), reused 0 (delta 0), pack-reused 0 (from 0)
remote: -------------------------------------------------------------------------
remote: Your push was rejected because it contains files larger than 10 MiB.
remote: Please use https://git-lfs.github.com/ to store large files.
remote: See also: https://hf.co/docs/hub/repositories-getting-started#terminal
remote: 
remote: Offending files:
remote:   - README.md (ref: refs/pr/62)
remote: -------------------------------------------------------------------------
To https://huggingface.co/jinaai/jina-embeddings-v3
 ! [remote rejected] pr/62 -> refs/pr/62 (pre-receive hook declined)
error: failed to push some refs to 'https://huggingface.co/jinaai/jina-embeddings-v3'
Samoed commented 1 week ago

You can try removing the indentation from the result files, as in https://github.com/embeddings-benchmark/results/blob/main/reduce_large_json_files.py, or you can install git-lfs.
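This is not the exact script linked above, but a minimal sketch of the same idea: re-serializing a JSON file without indentation or extra whitespace, which can shrink large result files considerably. The function name compact_json_file is hypothetical.

```python
import json
from pathlib import Path


def compact_json_file(path: Path) -> int:
    """Re-serialize a JSON file without indentation.

    Returns the number of bytes saved.
    """
    original = path.read_text(encoding="utf-8")
    data = json.loads(original)
    # separators=(",", ":") drops the spaces json.dumps adds by default
    compact = json.dumps(data, separators=(",", ":"), ensure_ascii=False)
    path.write_text(compact, encoding="utf-8")
    return len(original.encode("utf-8")) - len(compact.encode("utf-8"))
```

This only changes the serialization, not the content, so the parsed scores stay identical.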

KennethEnevoldsen commented 1 week ago

Yea, the output of the create_meta CLI gets quite large when working with a lot of datasets. Instead, it might be better to simply submit the results to:

https://github.com/embeddings-benchmark/results

The metadata on HF is also not really intended to include >100 datasets.