Open laphang opened 8 months ago
Hi! There's Real Toxicity in the big-refactor (soon to be main) branch which evaluates the generations with the Perspective API (need a key but it's free) using a custom metric.py
. Think you should also be able to modify this to pass the model outputs to an arbitrary classifier rather than the API and/or also use a different dataset in the task yaml.
Thanks for the response, I'll keen an eye on the big-refactor getting merged into main!
Hi @laphang, did you come up with code to implement this? I'm exactly in the same position at the moment. Thanks!
Hi @danihinjos, no I didn't end up implementing this unfortunately.
I did try the real toxicity task, but found the perspective api too slow.
As an aside, I did gain more confidence in this current method after reading the Orca 2 paper, where they discuss using toxigen both for discriminative evaluation (the classification used in the current lm-eval toxigen task) and generative evaluation. And that models that were good at discriminative evaluation tended to generate less toxic outputs. https://www.microsoft.com/en-us/research/publication/orca-2-teaching-small-language-models-how-to-reason/
But it would be good to be able to explicitly do the generative evaluation too.
Agree, I would also find an implementation of toxigen super useful. Did you find another way outside of lm-evaluation-harness to run toxigen (relatively) easily?
@Thartvigsen wondering if you've got an implementation of evaluating causal models' toxicity (as opposed to their use for classifying toxicity which you contributed) using Toxigen, possibly?
@haileyschoelkopf ah sorry I missed this. The ToxiGen dataset just contains sentences w/ binary labels (hate vs. not hate) so I don't think it can be directly used for eval unless the data are restructured into pairs of hateful/non-hateful sentences? The score could be whether a causal model finds the hateful sentences much less likely than non-hateful sentences (and by how much). I don't have that implemented already, but it seems relatively straightforward.
OR if "evaluate causal models' toxicity w/ ToxiGen" means "run a model trained on ToxiGen on the outputs of a causal model"? This is how others evaluate language models w/ ToxiGen. They just use our RoBERTA model or HateBERT model, both of which were finetuned on ToxiGen.
@haileyschoelkopf The ToxiGen dataset contains responses with labels and prompts to generate these responses. As @laphang presented, the generation-based toxicity evaluation was used in the Llama 2 paper. Also, it seems popular.
I would like to evaluate the toxicity of the generations of my fine tuned models. I'm interested in using toxigen which seems popular (eg used in the Llama 2 paper).
However, looking at the current toxigen task, it seems to measure how well the LLM performs as a classifier when using the prompt in doc_to_tex()? (is my understanding of that correct?) https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/toxigen.py
However, here's the relevant section (Appendix A.4.7) from the Llama 2 paper (which follows on from Section 4.1) https://arxiv.org/pdf/2307.09288.pdf
They generate text from the LLM using the prompts from (a revised version of) the toxigen dataset, and then use a roberta classifier trained on the toxigen dataset to evaluate whether the generated outputs are toxic.
eg. the lead author of the toxigen paper has uploaded this toxigen roberta classifier https://huggingface.co/tomh/toxigen_roberta
1) Assuming my understanding of the way the toxigen task in eval harness is currently working is correct, I was wondering if the task could be modified (or a new task created) which used a toxigen roberta classifier to measure whether the LLM generations are toxic?
2) Alternatively, are there any other tasks that are useful in terms of measuring the toxicity of LLM generations?