chen700564 / RGB


rejection rate of chatGPT #3

Open valdesguefa opened 1 year ago

valdesguefa commented 1 year ago

In the article it says that gpt-3.5-turbo is used to measure the rejection rate. What explains this difference in results for ChatGPT, given that it is used as the reference?

chen700564 commented 1 year ago

We use two evaluation methods: exact match and ChatGPT evaluation. When evaluating rejections, since LLMs sometimes do not completely follow our requirements for the rejection text, we need to use ChatGPT to determine whether the model has rejected or not.

valdesguefa commented 1 year ago

Since ChatGPT uses GPT-3.5-Turbo, the Rej and Rej* results for ChatGPT should be the same, or am I wrong?


chen700564 commented 1 year ago

(1) All mentions of ChatGPT in this paper refer to the gpt-3.5-turbo API. (2) Rej is measured by exact match: if the span "insufficient information" is contained in the generation, the generation is regarded as a rejection. (3) Rej* is measured by ChatGPT. Although the instruction asks LLMs to generate 'I can not answer the question because of the insufficient information in documents.' when the document does not contain the answer, LLMs do not always follow the instruction and may generate unexpected rejection sentences such as "The document does not provide information about xxx." In that case, we use ChatGPT to determine whether the generation can be regarded as a rejection.
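For readers landing on this issue, the two checks can be sketched roughly as follows. The exact-match rule is taken directly from the explanation above; the judge prompt is only a hypothetical stand-in, since the paper's actual ChatGPT evaluation prompt is not quoted in this thread.

```python
def is_rejection_exact(generation: str) -> bool:
    """Rej: exact match -- the generation counts as a rejection
    if it contains the span 'insufficient information'."""
    return "insufficient information" in generation.lower()


def build_judge_prompt(generation: str) -> str:
    """Rej*: a hypothetical prompt one could send to gpt-3.5-turbo to
    judge whether a free-form generation should count as a rejection."""
    return (
        "Does the following answer refuse to answer because the provided "
        "documents lack the needed information? Reply yes or no.\n\n"
        f"Answer: {generation}"
    )


# The exact-match rule catches the instructed rejection sentence...
print(is_rejection_exact(
    "I can not answer the question because of the insufficient "
    "information in documents."))  # True
# ...but misses a paraphrased rejection, which is why Rej* exists.
print(is_rejection_exact(
    "The document does not provide information about xxx."))  # False
```

This shows why the two numbers can differ for the same model: Rej only sees the literal span, while Rej* lets a judge model recognize paraphrased refusals.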

valdesguefa commented 1 year ago

Thank you for the clarification.


valdesguefa commented 12 months ago

In this case, if a model has Rej = 35% and Rej* = 45%, can we say that its rejection rate is 35 + 45 = 80%?

chen700564 commented 12 months ago

No. Both Rej and Rej* are rejection rates, but they are obtained in different ways: exact match and ChatGPT evaluation, respectively.

valdesguefa commented 12 months ago

If the generation contains "insufficient information", ChatGPT will consider the generation a rejection. Isn't there a risk that ChatGPT will also count the Rej cases in Rej*?

chen700564 commented 12 months ago

Yes, so Rej* is higher than Rej. Rej may miss some rejection generations. We use ChatGPT to obtain a more precise rejection rate, i.e. Rej*, although human evaluation would be better still.