EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

toxigen task measures toxicity classification rather than whether generations are toxic? #974

Open laphang opened 8 months ago

laphang commented 8 months ago

I would like to evaluate the toxicity of the generations of my fine-tuned models. I'm interested in using ToxiGen, which seems popular (e.g. it was used in the Llama 2 paper).

However, looking at the current toxigen task, it seems to measure how well the LLM performs as a classifier when prompted via doc_to_text() (is my understanding of that correct?): https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/toxigen.py

For comparison, here's the relevant section (Appendix A.4.7) from the Llama 2 paper (which follows on from Section 4.1): https://arxiv.org/pdf/2307.09288.pdf

Toxicity. To measure the degree of generation of toxic language and hate speech across different groups, we use ToxiGen (Hartvigsen et al., 2022), a dataset that contains implicitly toxic and benign sentences mentioning 13 minority groups. We adopt a revised version of the dataset from Hosseini et al. (2023) that reduces noise by filtering out prompts for which annotators disagree on the target demographic group. We then use the default ToxiGen classifier tuned on RoBERTa (Liu et al., 2019) to measure the toxicity of generations of each of the LLMs.

They generate text from the LLM using prompts from (a revised version of) the ToxiGen dataset, and then use a RoBERTa classifier trained on ToxiGen to evaluate whether the generated outputs are toxic.

E.g., the lead author of the ToxiGen paper has uploaded this ToxiGen RoBERTa classifier: https://huggingface.co/tomh/toxigen_roberta
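
To make concrete what I mean by generative evaluation, here's a rough, untested sketch: generate continuations from ToxiGen prompts and classify them with that RoBERTa model. The generator model, the prompts, and the classifier's label names below are placeholders I'd still need to verify against the model card.

```python
# Rough sketch of a generate-then-classify ToxiGen evaluation (untested).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # placeholder causal LM to evaluate
toxicity_clf = pipeline("text-classification", model="tomh/toxigen_roberta")

prompts = ["placeholder ToxiGen prompt 1", "placeholder ToxiGen prompt 2"]

generations = []
for prompt in prompts:
    out = generator(prompt, max_new_tokens=50, do_sample=True)[0]["generated_text"]
    generations.append(out[len(prompt):])  # keep only the continuation

scores = toxicity_clf(generations, truncation=True)
# Assumes "LABEL_1" is the toxic class; check the classifier's config/model card.
toxic_rate = sum(s["label"] == "LABEL_1" for s in scores) / len(scores)
print(f"Fraction of generations judged toxic: {toxic_rate:.3f}")
```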

1) Assuming my understanding of how the toxigen task in the eval harness currently works is correct, could the task be modified (or a new task created) to use a ToxiGen RoBERTa classifier to measure whether the LLM generations are toxic?

2) Alternatively, are there any other tasks that are useful for measuring the toxicity of LLM generations?

baberabb commented 8 months ago

Hi! There's Real Toxicity in the big-refactor (soon to be main) branch, which evaluates the generations with the Perspective API (you need a key, but it's free) using a custom metric.py. I think you should also be able to modify this to pass the model outputs to an arbitrary classifier rather than the API, and/or use a different dataset in the task yaml.
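
As a hedged sketch of that classifier swap (this is not the harness's actual metric interface, just the scoring step a custom metric.py could call instead of the API; the function name and label mapping are assumptions):

```python
# Sketch: replace the Perspective API call with a local ToxiGen RoBERTa classifier.
from transformers import pipeline

_toxicity_clf = pipeline("text-classification", model="tomh/toxigen_roberta")

def toxicity_score(generation: str) -> float:
    """Return an approximate toxicity probability in [0, 1] for one generation."""
    result = _toxicity_clf(generation, truncation=True)[0]
    # Assumes "LABEL_1" is the toxic class; verify against the model card.
    return result["score"] if result["label"] == "LABEL_1" else 1.0 - result["score"]
```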

laphang commented 8 months ago

Thanks for the response, I'll keep an eye out for the big-refactor getting merged into main!

danihinjos commented 5 months ago

Hi @laphang, did you come up with code to implement this? I'm exactly in the same position at the moment. Thanks!

laphang commented 5 months ago

Hi @danihinjos, no I didn't end up implementing this unfortunately.

I did try the Real Toxicity task, but found the Perspective API too slow.

As an aside, I did gain more confidence in the current method after reading the Orca 2 paper, where they discuss using ToxiGen both for discriminative evaluation (the classification used in the current lm-eval toxigen task) and generative evaluation, and found that models that did well on the discriminative evaluation tended to generate less toxic outputs. https://www.microsoft.com/en-us/research/publication/orca-2-teaching-small-language-models-how-to-reason/

But it would be good to be able to explicitly do the generative evaluation too.

tea-shaped commented 3 months ago

Agreed, I would also find an implementation of generative ToxiGen evaluation super useful. Did you find another way, outside of lm-evaluation-harness, to run ToxiGen (relatively) easily?

haileyschoelkopf commented 3 months ago

@Thartvigsen wondering if you've got an implementation of evaluating causal models' toxicity with ToxiGen (as opposed to its use for classifying toxicity, which you contributed), possibly?

Thartvigsen commented 1 month ago

@haileyschoelkopf ah sorry I missed this. The ToxiGen dataset just contains sentences w/ binary labels (hate vs. not hate), so I don't think it can be directly used for eval unless the data are restructured into pairs of hateful/non-hateful sentences? The score could be whether a causal model finds the hateful sentences much less likely than the non-hateful sentences (and by how much). I don't have that implemented already, but it seems relatively straightforward.
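
Something like the following, untested sketch is what I mean (GPT-2 and the example sentences are just placeholders for a real causal LM and the restructured ToxiGen splits):

```python
# Untested sketch: compare a causal LM's average per-token log-likelihood on
# hateful vs. benign ToxiGen sentences.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder causal LM to be evaluated
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def avg_log_likelihood(text: str) -> float:
    """Average per-token log-likelihood of `text` under the causal LM."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels=input_ids, the returned loss is the mean NLL per token.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return -loss.item()

hateful = ["placeholder hateful sentence"]  # stand-in for the ToxiGen hate split
benign = ["placeholder benign sentence"]    # stand-in for the ToxiGen benign split

gap = (sum(map(avg_log_likelihood, benign)) / len(benign)
       - sum(map(avg_log_likelihood, hateful)) / len(hateful))
print(f"Mean log-likelihood gap (benign - hateful): {gap:.3f}")
```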

Or does "evaluate causal models' toxicity w/ ToxiGen" mean "run a model trained on ToxiGen on the outputs of a causal model"? This is how others evaluate language models w/ ToxiGen: they just use our RoBERTa model or HateBERT model, both of which were fine-tuned on ToxiGen.

ZeguanXiao commented 2 weeks ago

@haileyschoelkopf The ToxiGen dataset contains labeled responses along with the prompts used to generate them. As @laphang described, generation-based toxicity evaluation was used in the Llama 2 paper, and it seems to be a popular approach.