Open yogeswarl opened 1 year ago
@yogeswarl thanks for creating this issue.
Please note that we need ChatGPT to generate the query reformulations. So the last sentence, "ChatGPT can perform suggesting documents for these queries", is not correct, to my understanding. Right?
Basically, we ask ChatGPT in these ways:
1- Here is the query q, please give us 10 reformulations/paraphrases of it.
2- Here is the query q and all its relevant documents, give us 10 reformulations/paraphrases of the query.
It's like using T5 once it is trained and we ask it for predictions.
Pointer 2 is correct. This is what we will be doing! Our T5, once trained, will be fed relevant documents and will generate queries. That is what we will infer from ChatGPT as well.
For the comparison, please do these variations:
1- [like an expander] Here is the query q, please give us 10 reformulations/paraphrases of it.
2- [like pretrained T5] Here are all relevant documents of the query q, give us 10 reformulations/paraphrases of the query.
3- [like fine-tuned T5] Here is the query q and all its relevant documents, give us 10 reformulations/paraphrases of the query.
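The three variations above could be turned into prompt templates roughly as follows. This is only a sketch: the function name `build_prompt`, the variation labels, and the exact prompt wording are assumptions for illustration, not the code used in this issue.

```python
def build_prompt(variation, query=None, documents=None, n=10):
    """Build the prompt text for one of the three comparison settings.

    The variation names are hypothetical labels for this sketch:
      "expander"      -> query only
      "pretrained_t5" -> relevant documents only
      "finetuned_t5"  -> query plus relevant documents
    """
    docs = "\n".join(documents or [])
    ask = f"Give us {n} reformulations/paraphrases of the query."
    if variation == "expander":
        return f"Here is the query: {query}\n{ask}"
    if variation == "pretrained_t5":
        return f"Here are all relevant documents of a query:\n{docs}\n{ask}"
    if variation == "finetuned_t5":
        return (f"Here is the query: {query}\n"
                f"Here are all its relevant documents:\n{docs}\n{ask}")
    raise ValueError(f"unknown variation: {variation}")
```

Keeping the three settings in one function makes it easier to check that the only difference between runs is what ChatGPT is allowed to see.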
Thank you.
Understood.
Hello Dr. @hosseinfani, I have written a function to test out ChatGPT's capabilities. I got a paid edition and I am still getting server errors. This issue needs to be handled gracefully, but it will occur every 15 minutes due to server overload. Do you have any suggestions on how to handle it?
Update: the above issue has been solved by using the retrying package.
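For readers hitting the same server errors, the retry idea can be sketched with the standard library alone. Note that this is not the retrying package mentioned above, just a hand-rolled stand-in with exponential backoff; the decorator name `retry_on_error` and its default values are assumptions.

```python
import time
from functools import wraps


def retry_on_error(max_attempts=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Retry the wrapped call on any exception, doubling the wait each time.

    `sleep` is injectable so the backoff can be tested without waiting.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    # Give up and re-raise once the attempt budget is spent.
                    if attempt == max_attempts:
                        raise
                    sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
        return wrapper
    return decorator
```

Wrapping the function that calls the API with `@retry_on_error()` lets a long prediction run survive a temporary overload instead of crashing partway through.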
Hello @hosseinfani, here are the stats and graphical representation of GPT on msmarco.passage. I can run one more because the predictions take too long to complete. Some stats:
| query_category | query_length | mean_map |
| --- | --- | --- |
| paraphrase_poor_gpt_query_mean_length | 40.421 | 0.12944940000000002 |
| paraphrase_poor_refined_query_mean_length | 43.193 | 0.45138459999999997 |
| paraphrase_poor_original_query_mean_length | 37.004 | 0.0359629 |
| paraphrase_somewhat_gpt_query_mean_length | 40.398 | 0.4748991 |
| paraphrase_somewhat_refined_query_mean_length | 42.762 | 0.7004480999999999 |
| paraphrase_somewhat_original_query_mean_length | 34.964 | 0.29953789999999997 |
| paraphrase_relevant_gpt_query_mean_length | 41.969 | 0.7646921 |
| paraphrase_relevant_refined_query_mean_length | 42.971 | 0.8539836000000001 |
| paraphrase_relevant_original_query_mean_length | 39.078 | 0.8028525 |
| finetune_poor_gpt_query_mean_length | 60.017 | 0.5468902000000001 |
| finetune_poor_refined_query_mean_length | 43.193 | 0.45138459999999997 |
| finetune_poor_original_query_mean_length | 37.004 | 0.0359629 |
| finetune_somewhat_gpt_query_mean_length | 58.303 | 0.8195252000000001 |
| finetune_somewhat_refined_query_mean_length | 42.762 | 0.7004480999999999 |
| finetune_somewhat_original_query_mean_length | 34.964 | 0.29953789999999997 |
| finetune_relevant_gpt_query_mean_length | 56.367 | 0.8946335000000001 |
| finetune_relevant_refined_query_mean_length | 42.971 | 0.8539836000000001 |
| finetune_relevant_original_query_mean_length | 39.078 | 0.8028525 |
| infer_poor_gpt_query_mean_length | 55.03 | 0.6592458 |
| infer_poor_refined_query_mean_length | 43.193 | 0.45138459999999997 |
| infer_poor_original_query_mean_length | 37.004 | 0.0359629 |
| infer_somewhat_gpt_query_mean_length | 53.818 | 0.8387605000000001 |
| infer_somewhat_refined_query_mean_length | 42.762 | 0.7004480999999999 |
| infer_somewhat_original_query_mean_length | 34.964 | 0.29953789999999997 |
| infer_relevant_gpt_query_mean_length | 51.396 | 0.9012704 |
| infer_relevant_refined_query_mean_length | 42.971 | 0.8539836000000001 |
| infer_relevant_original_query_mean_length | 39.078 | 0.8028525 |
Some graphical representation:
I am running another set of poor, somewhat, and relevant categories for user reformulation on aol title url.
@yogeswarl can you please explain what the categories are and put a paragraph of analysis here?
We had 3 thresholds: "poor" where the original queries scored from 0 to 0.24, "somewhat" = 0.25 to 0.49, "relevant" = 0.5 to 1.0. I went with the 3 categories you asked for:
ChatGPT as an inference model -> pass only the documents
ChatGPT as a paraphrase model -> pass only the queries
ChatGPT as a finetuning model -> pass docs and query, tab-separated
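As a sketch of the bucketing described above: the stated ranges leave small gaps (for example between 0.24 and 0.25), so the function below assumes half-open intervals with cut points at 0.25 and 0.5. The name `map_category` and that boundary handling are assumptions, not the thread's actual code.

```python
def map_category(original_map):
    """Assign a query to a category from its original MAP score.

    Assumed boundaries: [0, 0.25) poor, [0.25, 0.5) somewhat, [0.5, 1.0] relevant.
    """
    if not 0.0 <= original_map <= 1.0:
        raise ValueError("MAP must lie in [0, 1]")
    if original_map < 0.25:
        return "poor"
    if original_map < 0.5:
        return "somewhat"
    return "relevant"
```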
The inference and finetuned ChatGPT variants performed better than T5 and the original queries. Two issues arise with ChatGPT. One is the time to run the model: it runs through an inference API, so it is painstakingly slow. One prediction takes approximately 2-3 s according to the tqdm library, but T5 does one prediction in under a second.
The stats are also posted in the above comment, with the mean query length and mean MAP.
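Per-category statistics of this kind could be computed along the following lines. This is a sketch under assumptions: the row layout `(category, query_text, map_score)` and the function name `summarize` are hypothetical, and query length is taken to be the character count, which the thread does not confirm.

```python
from collections import defaultdict
from statistics import mean


def summarize(rows):
    """Return {category: (mean query length, mean MAP)}.

    `rows` is an iterable of (category, query_text, map_score) tuples.
    Length is measured in characters here, which is an assumption.
    """
    lengths = defaultdict(list)
    scores = defaultdict(list)
    for category, query, score in rows:
        lengths[category].append(len(query))
        scores[category].append(score)
    return {c: (mean(lengths[c]), mean(scores[c])) for c in lengths}
```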
There should be only one barplot for this. I am going to make it more optimized
@yogeswarl thank you. Can you find the research paper for ChatGPT? We need to see why it's better than T5. Is it due to architecture or training dataset or ...
@hosseinfani https://arxiv.org/pdf/2005.14165.pdf here is the paper. I will give a quick summary about this by tomorrow evening.
From the paper, I was able to work out why ChatGPT is better, plus a few more things I did that were not considered in ChatGPT.
One option I can think of here is to cap the number of words ChatGPT can see (i.e. 512 words) and compare both with respect to T5.
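A minimal sketch of that cap, assuming a plain whitespace split counts as "words". The helper name `truncate_words` is hypothetical, and a tokenizer-based cutoff like the one T5 applies would cut at a different point.

```python
def truncate_words(text, max_words=512):
    """Keep only the first `max_words` whitespace-separated words of `text`."""
    words = text.split()
    return " ".join(words[:max_words])
```

Applying this to every document before it goes into the prompt would give ChatGPT roughly the same input budget as T5 in the comparison.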
One problem with ChatGPT is that its output length cannot be limited as precisely as T5's. Its maximum output length is always much greater than the mean length of both the T5 and the original queries. @hosseinfani, should I redo this for fine-tune and inference?
Made plots much smaller and cleaner
This idea involves asking ChatGPT to generate relevant queries based on the documents we feed it.
We will sample about 10,000 documents that meet the following criteria from the refined queries
Based on this data, we will compare how well ChatGPT can perform suggesting documents for these queries.