embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316

Custom Model - Expected 2D array, got scalar array instead #120

Closed · afurkank closed this issue 1 year ago

afurkank commented 1 year ago

Hi,

I tried to write a script to evaluate Google's "universal sentence encoder 4" embedding model on the STS22 dataset.

I took the "run_array_openaiv2.py" script and changed it.

I can attach the whole script if you want, but here is what the model interface looks like:

import os
import pickle

# `model` is the TF Hub module loaded earlier in the script
# (https://tfhub.dev/google/universal-sentence-encoder/4).

class google_universal_encoder():
    def __init__(self, engine, task_name=None, save_emb=False, **kwargs) -> None:
        self.engine = engine
        self.task_name = task_name
        self.save_emb = save_emb
        self.base_path = f"embeddings/{engine.split('/')[-1]}/"

    def encode(self, sentences, **kwargs):
        fin_embeddings = []
        embedding_path = f"{self.base_path}/{self.task_name}_{sentences[0][:10]}_{sentences[-1][-10:]}.pickle"
        if sentences and os.path.exists(embedding_path):
            # Reuse cached embeddings for this batch if they were saved before.
            loaded = pickle.load(open(embedding_path, "rb"))
            fin_embeddings = loaded["fin_embeddings"]
        else:
            for sentence in sentences:
                if not sentence:
                    sentence = " "
                # The hub module takes a batch of strings and returns one vector per string.
                response = model([sentence])
                fin_embeddings.append(response)
            # Drop the batch dimension so each entry is a single embedding vector.
            fin_embeddings = [arr[0] for arr in fin_embeddings]

        if fin_embeddings and self.save_emb:
            dump = {"fin_embeddings": fin_embeddings}
            pickle.dump(dump, open(embedding_path, "wb"))
        assert len(sentences) == len(fin_embeddings)
        return fin_embeddings
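
For completeness, the rest of the script just hands an instance of this class to MTEB, roughly like this (a minimal sketch of the standard MTEB API; the engine string and output folder are placeholders I chose):

from mteb import MTEB

# Single task here, so the task name can be passed straight through for the cache path.
use_model = google_universal_encoder(
    engine="google/universal-sentence-encoder-4",
    task_name="STS22",
    save_emb=True,
)

evaluation = MTEB(tasks=["STS22"], task_langs=["en"])
evaluation.run(use_model, output_folder="results/universal-sentence-encoder-4")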

I didn't write any tokenization code because the model can embed the sentences as they are; it appears to tokenize them itself (https://github.com/tensorflow/hub/blob/master/examples/colab/semantic_similarity_with_tf_hub_universal_encoder.ipynb).
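
For reference, loading and calling the model looks roughly like this (a sketch; the module handle is the one shown in the log below):

import tensorflow_hub as hub

# The module accepts raw strings directly; no tokenization on my side.
model = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

embeddings = model(["The quick brown fox jumps over the lazy dog.", "I am a sentence."])
print(embeddings.shape)  # (2, 512): one 512-dimensional vector per sentence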

The script produces the evaluation results and saves them as a JSON file; however, it also prints an error like this:

PS C:\Users\furkan\Desktop\Embedding Comparison> & C:/Users/furkan/anaconda3/python.exe "c:/Users/furkan/Desktop/Embedding Comparison/universal_encode_script.py"
2023-07-17 20:44:12.468570: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE SSE2 SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
module https://tfhub.dev/google/universal-sentence-encoder/4 loaded
Running task:  STS22
─────────────────────────────────────────────────────────────────────────────────────────────── Selected tasks  ────────────────────────────────────────────────────────────────────────────────────────────────
STS
    - STS22, p2p, crosslingual 18 pairs

Task: STS22, split: test, language: en. Running...
Exception ignored in: <function AtomicFunction.__del__ at 0x000001AF8F7EBEE0>
Traceback (most recent call last):
  File "C:\Users\furkan\anaconda3\lib\site-packages\tensorflow\python\eager\polymorphic_function\atomic_function.py", line 218, in __del__
TypeError: 'NoneType' object is not subscriptable
Exception ignored in: <function AtomicFunction.__del__ at 0x000001AF8F7EBEE0>
Traceback (most recent call last):
  File "C:\Users\furkan\anaconda3\lib\site-packages\tensorflow\python\eager\polymorphic_function\atomic_function.py", line 218, in __del__
TypeError: 'NoneType' object is not subscriptable

Is there anything I'm missing? The error message doesn't mean much to me. I would appreciate any help.

Muennighoff commented 1 year ago

Maybe try the script at https://huggingface.co/vprelovac/universal-sentence-encoder-4; they already evaluated that model on a bunch of MTEB tasks.
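
If you'd rather keep your own wrapper, the "Expected 2D array, got scalar array instead" message in the title suggests the evaluator wants a plain 2D numpy array back from encode(); one way to ensure that at the end of encode() (a sketch, not the script from that page):

import numpy as np

# Convert the list of per-sentence embeddings (TF tensors) into one
# (num_sentences, embedding_dim) numpy array before returning it to MTEB.
return np.asarray([emb.numpy() if hasattr(emb, "numpy") else np.asarray(emb) for emb in fin_embeddings])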