Open valdesguefa opened 1 year ago
We use two methods for evaluation: exact match and ChatGPT evaluation. When evaluating rejections, since LLMs sometimes do not completely follow our requirements for the rejection text, we need to use ChatGPT to determine whether the model has rejected or not.
Since ChatGPT is GPT-3.5-Turbo, the Rej and Rej* results for ChatGPT should be the same, or am I wrong?
(1) All of the ChatGPT results in this paper use the gpt-3.5-turbo API.
(2) Rej is measured by exact match. If the span "insufficient information" is contained in the generation, the generation is regarded as rejecting.
(3) Rej* is measured by ChatGPT. Although the instruction asks LLMs to generate 'I can not answer the question because of the insufficient information in documents.' when the document does not contain the answer, LLMs do not always follow the instruction and may generate unexpected rejection sentences such as "The document does not provide information about xxx."
In this case, we use ChatGPT to determine whether the generation can be regarded as rejecting.
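As a minimal sketch of the exact-match side (Rej), assuming the substring check described above; the sample generations here are hypothetical, not from the paper:

```python
def is_reject_exact(generation: str) -> bool:
    """Exact-match check: the generation counts as a rejection
    if it contains the span 'insufficient information'."""
    return "insufficient information" in generation.lower()

# Hypothetical sample generations for illustration.
generations = [
    "I can not answer the question because of the insufficient information in documents.",
    "The document does not provide information about the release date.",  # missed by exact match
    "The answer is 2019.",
]

# Rej = fraction of generations flagged by the exact-match rule.
rej = sum(is_reject_exact(g) for g in generations) / len(generations)
print(f"Rej (exact match): {rej:.2f}")
```

The second generation illustrates exactly the gap being discussed: it is a rejection in spirit, but the exact-match rule misses it, which is why a ChatGPT judge is used for Rej*.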
Thank you for the clarification.
In this case, if a model has Rej = 35% and Rej* = 45%, can we say that its rejection rate is 35% + 45% = 80%?
No. Both Rej and Rej* are rejection rates, but they are obtained in different ways: exact match and ChatGPT, respectively.
If the generation contains "insufficient information", ChatGPT will also consider it a rejection. Isn't there a risk that ChatGPT counts the Rej cases in Rej* as well?
Yes, so Rej* is higher than Rej. Rej may miss some rejection generations. We use ChatGPT to obtain a more precise rejection rate, i.e., Rej*, although human evaluation would be better still.
In the article it says that gpt-3.5-turbo is used to measure the rejection rate. What explains this difference in results for ChatGPT, given that it is also used as the judge?