ShishirPatil / gorilla

Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls)
https://gorilla.cs.berkeley.edu/
Apache License 2.0

The bm25 and gpt-index scripts? #58

Closed zhilizju closed 1 year ago

zhilizju commented 1 year ago
          For the different retrievers, we use bm25 (https://en.wikipedia.org/wiki/Okapi_BM25); gpt-index simply uses `Davinci v1` from OpenAI to embed all the documents and does a simple cosine-similarity match at inference time. For oracle, we just provide the ground-truth answer to Gorilla. Hope this helps, and let me know if there are any further questions!

Originally posted by @tianjunz in https://github.com/ShishirPatil/gorilla/issues/21#issuecomment-1567800181

Would you be willing to release the bm25 and gpt-index scripts to help the community reproduce the experimental results?

ShishirPatil commented 1 year ago

Hey @zhilizju thanks for raising this. What exactly are you referring to? Like how to build and use a retriever?

zhilizju commented 1 year ago

Yes, I attempted to reproduce the results for BM25 and GPT-Index. When examining the Huggingface dataset, 11 of the 904 instances in my reproduced BM25 retrieval results did not match those in the file 'questions_huggingface_bm25.jsonl'. For GPT-Index, I noticed in the issue thread that you used Davinci v1, which I also adopted: 'text-search-davinci-query-001' for queries and 'text-search-davinci-doc-001' for API docs, with a cosine-similarity match. Yet the results diverged considerably from 'questions_huggingface_gpt_index.jsonl': 328 of the 904 instances were different. The results I reproduced actually seem closer to the oracle. Of course, this does not affect the conclusions drawn in the paper, but I hope to follow your nice work, so I am eager to reproduce your results accurately.

ShishirPatil commented 1 year ago

Thank you for your kind words, and 893 out of 904 is a good match :) But yeah, you should get a 100% match; I think others have been able to reproduce it. For BM25, which variant are you using? We used Okapi BM25 (BM25Okapi from rank_bm25). For GPT-Index, we got the embedding for each of the documents (APIs in our case) listed here and then did a cosine-similarity match. Note that even for OpenAI's embeddings, we didn't use the text-search models; we just got the embeddings and compared them for each query. Can you try using text-embedding-ada-002-v2 to get the embeddings and then do a simple top-1 cosine-similarity search? Let me know how it goes or if you run into any issues.
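
(For readers following along: a minimal sketch of the suggested pipeline: embed the documents and each query with text-embedding-ada-002, then take the top-1 cosine match. This is an illustration, not the repository's actual retriever code; it assumes the pre-1.0 openai client, and the documents and query below are placeholders.)

import numpy as np
import openai

def embed(text):
    # Pre-1.0 openai client; returns one embedding vector per input string.
    resp = openai.Embedding.create(input=text, model="text-embedding-ada-002")
    return np.array(resp['data'][0]['embedding'])

docs = ["placeholder API doc 1", "placeholder API doc 2"]  # stand-ins for real API docs
doc_embs = np.stack([embed(d) for d in docs])
q = embed("placeholder user query")
sims = doc_embs @ q / (np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(q))
print(docs[int(np.argmax(sims))])  # top-1 cosine match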

zhilizju commented 1 year ago

For BM25, I use BM25Okapi from rank_bm25, too.
Code:

import json
from rank_bm25 import BM25Okapi

def load_data(file_name):
    # Read a .jsonl file into a list of dicts, one per line.
    with open(file_name, 'r') as f:
        return [json.loads(line) for line in f]

def process_json_data(data):
    # Serialize each API entry back to a JSON string to form the corpus.
    return [json.dumps(item) for item in data]

def init_bm25_model(texts):
    # Whitespace tokenization, matching the corpus preprocessing.
    tokenized_corpus = [doc.split(" ") for doc in texts]
    return BM25Okapi(tokenized_corpus)

def search(query, bm25, texts):
    # Return the single highest-scoring document for the query.
    tokenized_query = query.split(" ")
    return bm25.get_top_n(tokenized_query, texts, n=1)[0]

data = load_data('data/api/huggingface_api.jsonl')
texts = process_json_data(data)
bm25 = init_bm25_model(texts)

query_data = load_data('eval/eval-data/questions/huggingface/questions_huggingface_bm25.jsonl')
for item in query_data:
    query = item['text']
    best_doc = search(query, bm25, texts)
    print(best_doc)
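
To count the mismatches, I compare each reproduced top-1 document against the reference file. A rough sketch, continuing from the script above; the key 'api_data' is a hypothetical stand-in, since I am not sure of the exact field name in the reference schema:

mismatches = 0
for item in query_data:
    best_doc = search(item['text'], bm25, texts)
    # 'api_data' is a hypothetical key for the reference top-1 document.
    if best_doc != item.get('api_data'):
        mismatches += 1
print(f"{mismatches} of {len(query_data)} top-1 results differ")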

For GPT-Index, I tried text-embedding-ada-002-v2. The result is indeed closer to 'questions_huggingface_gpt_index.jsonl' than with text-search-davinci-doc-001, but out of 904 instances, 123 still differ. The code is below.

Obtain the embeddings of the queries and API docs:

import json
import pickle
import openai

openai.api_key = '****'

def load_data(file_name):
    # Read a .jsonl file into a list of dicts, one per line.
    with open(file_name, 'r') as f:
        return [json.loads(line) for line in f]

def process_json_data(data):
    # Serialize each API entry back to a JSON string.
    return [json.dumps(item) for item in data]

def text_to_embedding(text):
    # Embed a single string with the OpenAI embeddings endpoint.
    response = openai.Embedding.create(
        input=text,
        model="text-embedding-ada-002"
    )
    return response['data'][0]['embedding']

def store_text_and_embeddings(texts, embeddings, filename):
    assert len(texts) == len(embeddings), "The length of texts and embeddings must be the same"
    data = {
        'texts': texts,
        'embeddings': embeddings
    }
    with open(filename, 'wb') as f:
        pickle.dump(data, f)

# Embed the queries.
query_data = load_data('eval/eval-data/questions/huggingface/questions_huggingface_gpt_index.jsonl')
queries = [item['text'] for item in query_data]
query_embeddings = [text_to_embedding(q) for q in queries]
store_text_and_embeddings(queries, query_embeddings, 'ada_query_texts_and_embeddings.pkl')

# Embed the API docs.
API_data = load_data('data/api/huggingface_api.jsonl')
API_texts = process_json_data(API_data)
API_embeddings = [text_to_embedding(t) for t in API_texts]
store_text_and_embeddings(API_texts, API_embeddings, 'ada_huggingface_api_texts_and_embeddings.pkl')
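
A side note: since Embedding.create also accepts a list of inputs, the per-item calls above could be batched to cut down on API round-trips. A sketch, assuming the same pre-1.0 openai client:

def texts_to_embeddings(texts, batch_size=100):
    # Send inputs in batches; each response item carries an 'index' field,
    # which we sort by to preserve input order.
    embeddings = []
    for i in range(0, len(texts), batch_size):
        response = openai.Embedding.create(
            input=texts[i:i + batch_size],
            model="text-embedding-ada-002"
        )
        embeddings.extend(d['embedding'] for d in sorted(response['data'], key=lambda d: d['index']))
    return embeddings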

Calculate similarity:

import pickle
from numpy import dot
from numpy.linalg import norm

def cosine_similarity(query_embedding, text_embeddings):
    # Cosine similarity between one query embedding and every document embedding.
    query_norm = norm(query_embedding)
    similarities = []
    for text_embedding in text_embeddings:
        cosine_sim = dot(query_embedding, text_embedding) / (query_norm * norm(text_embedding))
        similarities.append(cosine_sim)
    return similarities

def find_most_similar_text(query_embedding, texts, text_embeddings):
    # Top-1 retrieval: return the document with the highest similarity.
    similarities = cosine_similarity(query_embedding, text_embeddings)
    max_similarity_index = similarities.index(max(similarities))
    return texts[max_similarity_index]

def load_texts_and_embeddings(filename):
    with open(filename, 'rb') as f:
        data = pickle.load(f)
    return data['texts'], data['embeddings']

texts, text_embeddings = load_texts_and_embeddings('ada_huggingface_api_texts_and_embeddings.pkl')
queries, query_embeddings = load_texts_and_embeddings('ada_query_texts_and_embeddings.pkl')

for query, query_embedding in zip(queries, query_embeddings):
    similar_text = find_most_similar_text(query_embedding, texts, text_embeddings)
    print(similar_text)
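
(One design note: because cosine similarity is scale-invariant, the document matrix can be normalized once up front, so each query's top-1 search becomes a single matrix-vector product reused across all queries. A minimal numpy sketch, continuing from the script above and equivalent to the loop up to floating-point rounding:)

import numpy as np

E = np.asarray(text_embeddings)                        # (num_docs, dim)
E_unit = E / np.linalg.norm(E, axis=1, keepdims=True)  # normalize once
for q in query_embeddings:
    q = np.asarray(q)
    idx = int(np.argmax(E_unit @ (q / np.linalg.norm(q))))
    print(texts[idx])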

I don't know why the results still differ. Could you give me some advice on reducing the difference? Perhaps you use different data preprocessing?

ShishirPatil commented 1 year ago

Hey @zhilizju, I just merged #61, where we release our retriever code. We were able to verify a 100% match. Can you try this? Thanks!

zhilizju commented 1 year ago

Great! Thank you very much! It seems that some parts of the code were missing, so I added them and reproduced the experiment. However, I still found some differences in the results: for BM25, 63 out of 904 differ, and for GPT-Index, 99 out of 904 differ.

ShishirPatil commented 1 year ago

Hey @zhilizju, do you mind sharing the code you found missing as a PR? :) We'd welcome contributions!

zhilizju commented 1 year ago

Of course, but I have been busy with the rebuttal and paper submission lately. I will submit a PR (pull request) after a while. Thank you once again.

ShishirPatil commented 1 year ago

Thanks @zhilizju and good luck with the submissions :) Will close this for now!