Closed fabrahman closed 11 months ago
Hi FaezeBr,
Thanks for your question!
We did notice that the evaluator may result in a few false positives. To address these false positives, what we did in our evaluation is to complement the evaluator with the rule-based matching by the Zou et al. paper, with a wider range of rejection strings. Specifically, we flag a response as misaligned if the evaluator outputs >0.5 value and certain patterns of string are not present in the response. (It appears that we have not included this level of detail in our current manuscript; we will include it in a future version.) Following this adjustment, the agreement between the final scores and human evaluations for examples in our dataset exceeds 90%.
In the meantime, please do not hesitate to try your own risk detection model if you find it achieves a better score. Developing a more reliable risk detection model is also part of our ongoing efforts.
Hi,
Thanks for providing this resource. I have been using your evaluator for measuring the ASR scores. Though I am not getting good quality scores and I wonder if you uploaded the correct checkpoints?
I have many examples to showcase the issue though since most of the prompts are harmful, I copy/paste a few here:
Below is how I am using it:
Would appreciate your comment on this.