Closed zhilizju closed 1 year ago
Hey @zhilizju thanks for raising this. What exactly are you referring to? Like how to build and use a retriever?
Yes, I attempted to reproduce the BM25 and GPT-Index results. For the Huggingface dataset, 11 of the 904 instances in my reproduced BM25 retrieval results did not match those in 'questions_huggingface_bm25.jsonl'. For GPT-Index, I saw in the issue thread that you used Davinci v1, so I adopted it too: 'text-search-davinci-query-001' for the queries and 'text-search-davinci-doc-001' for the API docs. I matched by cosine similarity, yet the results diverged considerably from those in 'questions_huggingface_gpt_index.jsonl': 328 of the 904 instances were different. My reproduced results seem to be closer to the oracle. Of course, this does not affect the conclusions drawn in the paper. However, I would like to build on your nice work, so I am eager to reproduce your results accurately.
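For anyone checking a reproduction the same way, a minimal sketch of counting mismatches between two retrieval runs might look like the following. It assumes each run has already been reduced to a list of retrieved documents in the same query order. The helper names and the `field` parameter are hypothetical, since the layout of the released JSONL files is not described in this thread.

```python
import json


def load_field(path, field):
    """Read one field from every non-empty line of a JSONL file.

    `field` is an assumption: pass whichever key holds the retrieved document.
    """
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line)[field] for line in f if line.strip()]


def mismatched_indices(reference, reproduced):
    """Return the positions where the two retrieval results disagree."""
    if len(reference) != len(reproduced):
        raise ValueError("result lists must have the same length")
    return [i for i, (a, b) in enumerate(zip(reference, reproduced)) if a != b]


# Toy data standing in for the released and the reproduced retrieval results.
reference = ["api_a", "api_b", "api_c"]
reproduced = ["api_a", "api_x", "api_c"]

diff = mismatched_indices(reference, reproduced)
print(f"{len(diff)} of {len(reference)} differ at positions {diff}")
```

Printing the positions, not just the count, makes it easier to inspect the individual queries that disagree.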
Thank you for your kind words, and 893 out of 904 is a good match :) But yeah, you should get a 100% match - I think others have been able to reproduce it. For BM25, which variant of BM25 are you using? We used Okapi BM25 (BM25Okapi from rank_bm25). So we got the embedding for each of the documents (APIs in our case) listed here and then did a cosine similarity match. Similarly, even for OpenAI's embeddings, we didn't use text-search; instead we got the embeddings and compared them for each query. Can you try using text-embedding-ada-002-v2 to get the embeddings and then do a simple top-1 cosine similarity search? Let me know how it goes or if you run into any issues.
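A minimal sketch of the top-1 cosine similarity search described above, assuming the query and document embeddings are already available as lists of vectors. The function name `top1_cosine` and the toy vectors are illustrative only, and this is not the retriever code from the repository.

```python
import numpy as np


def top1_cosine(query_embeddings, doc_embeddings):
    """Return, for each query, the index of the most cosine-similar document."""
    q = np.asarray(query_embeddings, dtype=float)
    d = np.asarray(doc_embeddings, dtype=float)
    # Normalize rows so that a plain dot product equals cosine similarity.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    # Similarity matrix has shape (num_queries, num_docs).
    return np.argmax(q @ d.T, axis=1)


# Toy 2-D vectors standing in for real embeddings.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[2.0, 0.1], [0.1, 3.0]]

best = top1_cosine(queries, docs)
print(best.tolist())
```

`np.argmax` returns the first maximum when scores tie, so two implementations that break ties differently can disagree on a few queries even with identical embeddings.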
For BM25, I use BM25Okapi from rank_bm25, too.
Code:
import json
from rank_bm25 import BM25Okapi

def load_data(file_name):
    # Read a JSONL file: one JSON object per line.
    with open(file_name, 'r') as f:
        data = [json.loads(line) for line in f]
    return data

def process_json_data(data):
    # Serialize each API entry back to a JSON string and use it as the document text.
    texts = []
    for item in data:
        text = json.dumps(item)
        texts.append(text)
    return texts

def init_bm25_model(texts):
    # Whitespace tokenization; the BM25 index is built over the tokenized corpus.
    tokenized_corpus = [doc.split(" ") for doc in texts]
    bm25 = BM25Okapi(tokenized_corpus)
    return bm25

def search(query, bm25, texts):
    # Return the single highest-scoring document for the query.
    tokenized_query = query.split(" ")
    best_doc = bm25.get_top_n(tokenized_query, texts, n=1)[0]
    return best_doc

data = load_data('data/api/huggingface_api.jsonl')
texts = process_json_data(data)
bm25 = init_bm25_model(texts)
query_data = load_data('eval/eval-data/questions/huggingface/questions_huggingface_bm25.jsonl')
for item in query_data:
    query = item['text']
    best_doc = search(query, bm25, texts)
    print(best_doc)
For GPT-index, I tried text-embedding-ada-002-v2. The result is indeed closer to 'questions_huggingface_gpt_index.jsonl' than text-search-davinci-doc-001. Out of 904, there are still 123 different ones. The code is below:
Obtain the embeddings of query and API:
import json
import pickle
import openai

openai.api_key = '****'

def load_data(file_name):
    # Read a JSONL file: one JSON object per line.
    with open(file_name, 'r') as f:
        data = [json.loads(line) for line in f]
    return data

def process_json_data(data):
    # Serialize each API entry back to a JSON string and use it as the document text.
    texts = []
    for item in data:
        text = json.dumps(item)
        texts.append(text)
    return texts

def text_to_embedding(text):
    # openai.Embedding.create is the legacy (pre-1.0) openai-python interface.
    response = openai.Embedding.create(
        input=text,
        model="text-embedding-ada-002"
    )
    embeddings = response['data'][0]['embedding']
    return embeddings

def store_text_and_embeddings(texts, embeddings, filename):
    assert len(texts) == len(embeddings), "The length of texts and embeddings must be the same"
    data = {
        'texts': texts,
        'embeddings': embeddings
    }
    with open(filename, 'wb') as f:
        pickle.dump(data, f)

# Embed the queries.
query_data = load_data('eval/eval-data/questions/huggingface/questions_huggingface_gpt_index.jsonl')
querys = []
for item in query_data:
    query = item['text']
    querys.append(query)
embeddings = []
for query in querys:
    print(query)
    embedding = text_to_embedding(query)
    embeddings.append(embedding)
store_text_and_embeddings(querys, embeddings, 'ada_query_texts_and_embeddings.pkl')

# Embed the API docs with the same model.
API_data = load_data('data/api/huggingface_api.jsonl')
API_texts = process_json_data(API_data)
API_embeddings = [text_to_embedding(text) for text in API_texts]
store_text_and_embeddings(API_texts, API_embeddings, 'ada_huggingface_api_texts_and_embeddings.pkl')
Calculate similarity:
import pickle
from numpy import dot
from numpy.linalg import norm

def cosine_similarity(query_embedding, text_embeddings):
    # Cosine similarity between one query vector and every document vector.
    query_norm = norm(query_embedding)
    similarities = []
    for text_embedding in text_embeddings:
        text_norm = norm(text_embedding)
        cosine_sim = dot(query_embedding, text_embedding) / (query_norm * text_norm)
        similarities.append(cosine_sim)
    return similarities

def find_most_similar_texts(query_embedding, texts, text_embeddings):
    # Top-1 retrieval: return the document with the highest similarity.
    similarities = cosine_similarity(query_embedding, text_embeddings)
    max_similarity_index = similarities.index(max(similarities))
    return texts[max_similarity_index]

def load_texts_and_embeddings(filename):
    with open(filename, 'rb') as f:
        data = pickle.load(f)
    return data['texts'], data['embeddings']

texts, text_embeddings = load_texts_and_embeddings('ada_huggingface_api_texts_and_embeddings.pkl')
querys, query_embeddings = load_texts_and_embeddings('ada_query_texts_and_embeddings.pkl')
for i in range(len(querys)):
    query = querys[i]
    query_embedding = query_embeddings[i]
    similar_text = find_most_similar_texts(query_embedding, texts, text_embeddings)
    print(similar_text)
I don't know why the results still differ. Can you give me some advice on how to reduce the difference? Perhaps your data preprocessing is different?
Hey @zhilizju I just merged #61 where we release our retriever code. We were able to verify 100% match. Can you try this? Thanks!
Great! Thank you very much! It seems that some parts of the code were missing, so I added them and reproduced the experiment. However, I still found some differences in the results. For BM25, out of 904, there are 63 differences, and for GPT-index, out of 904, there are 99 differences.
Hey @zhilizju do you mind sharing the code that you found missing as a PR :) Would welcome contributions!
Of course, but I have been busy with the rebuttal and paper submission lately. I will submit a PR (pull request) after a while. Thank you once again.
Thanks @zhilizju and good luck with the submissions :) Will close this for now!
Originally posted by @tianjunz in https://github.com/ShishirPatil/gorilla/issues/21#issuecomment-1567800181
Would you be willing to release the bm25 and gpt-index scripts to help the community reproduce the experimental results?