GanjinZero / RRHF

[NIPS2023] RRHF & Wombat

Wombat-7B, Wombat-7B-gpt4, and ChatGPT comparison on the Vicuna test set, evaluated by GPT-4 #18

Open onlyfish79 opened 1 year ago

onlyfish79 commented 1 year ago
  1. Wombat-7B vs. ChatGPT on the Vicuna test set, scored by GPT-4:
Wombat-7B: 599.0  (average score: 7.5)
ChatGPT: 710.5    (average score: 8.9)
wombat-7b / gpt35 = 84.31%
  2. Wombat-7B-gpt4 vs. ChatGPT on the Vicuna test set, scored by GPT-4:
Wombat-7B-gpt4: 577.0  (average score: 7.2)
ChatGPT: 734.5         (average score: 9.2)
wombat-7b-gpt4 / gpt35 = 78.56%
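Recomputing the relative scores from the reported GPT-4 totals (a quick sanity check, not part of the original post):

```python
# Relative score of each Wombat variant vs. ChatGPT on the Vicuna test
# set, derived from the GPT-4 totals reported above.
wombat_7b, chatgpt_vs_7b = 599.0, 710.5
wombat_7b_gpt4, chatgpt_vs_gpt4 = 577.0, 734.5

print(f"wombat-7b / chatgpt      = {wombat_7b / chatgpt_vs_7b:.2%}")
print(f"wombat-7b-gpt4 / chatgpt = {wombat_7b_gpt4 / chatgpt_vs_gpt4:.2%}")
```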

Wombat-7B and Wombat-7B-gpt4 were recovered using the script recover_wombat_7b.sh.

According to the above results, Wombat-7B performs better than Wombat-7B-gpt4. Does this result meet your expectations?

GanjinZero commented 1 year ago

Yes, it does meet our expectations, and we observe a similar score for Wombat-7B-gpt4 vs. ChatGPT. The reason is that Wombat-7B uses 5 responses per query to train RRHF. Although Wombat-7B-gpt4 uses better responses, it only has 2 responses per query. We think more diverse responses are the most important factor in training RRHF.
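RRHF learns from several ranked responses per query, which is why the number of candidates matters. A minimal pure-Python sketch of an RRHF-style objective for one query (`logprobs`, `rewards`, and `sft_index` are hypothetical inputs; the repo's actual implementation differs):

```python
def rrhf_loss(logprobs, rewards, sft_index):
    """Sketch of an RRHF-style objective for one query.

    logprobs:  length-normalized log-probability the model assigns to
               each candidate response (hypothetical inputs).
    rewards:   reward score of each candidate response.
    sft_index: index of the best (highest-reward) response.
    """
    n = len(rewards)
    # Ranking term: penalize every pair where the model scores a
    # lower-reward response above a higher-reward one.
    rank_loss = sum(
        max(0.0, logprobs[i] - logprobs[j])
        for i in range(n)
        for j in range(n)
        if rewards[i] < rewards[j]
    )
    # SFT term: maximize the likelihood of the best response.
    sft_loss = -logprobs[sft_index]
    return rank_loss + sft_loss
```

With 5 candidates per query the ranking term sees up to 10 pairs of comparisons, versus a single pair with only 2 candidates, which is one way to see why more responses per query provide a stronger training signal.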

GanjinZero commented 1 year ago

Another possible reason is that Wombat-7B uses responses sampled from its initial checkpoint, while Wombat-7B-gpt4 does not. Since RRHF tries to improve the model based on itself, not using responses from the initial checkpoint worsens RRHF's performance.
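The candidate-set assembly this comment describes can be sketched as follows (the function name and mixing policy are assumptions for illustration, not the repo's actual code):

```python
def build_candidates(initial_model_samples, external_responses, k=5):
    """Assemble the k candidate responses RRHF ranks for one query.

    Mixing in samples from the model's own initial checkpoint (as
    Wombat-7B did) keeps the candidate set diverse and grounded in
    the model's own behavior; using only external responses (as
    Wombat-7B-gpt4 did) loses that self-improvement signal.
    """
    candidates = list(external_responses)
    for sample in initial_model_samples:
        if len(candidates) >= k:
            break
        candidates.append(sample)
    return candidates[:k]
```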

onlyfish79 commented 1 year ago

Understood, thank you for your response. May I ask about the upcoming roadmap for RRHF?

GanjinZero commented 1 year ago

> Understood, thank you for your response. May I ask about the upcoming roadmap for RRHF?

Chain-of-thought reasoning, and scaling to 13B, 30B, and 65B LLaMA / Alpaca.