facebookresearch / ResponsibleNLP

Repository for research in the field of Responsible NLP at Meta.

Perturbator training using PANDA dataset #10

Open efatmae opened 1 year ago

efatmae commented 1 year ago

Hi Rebecca,

I've been working on replicating your paper "Perturbation Augmentation for Fairer NLP" (2022) by fine-tuning a T5 model on the PANDA dataset, and I have a few questions:

1) I was wondering whether you use a task prefix (similar to "summarize:") before tokenization. For example, do you prepend text built from the "selected_Word" and "perturbation category" columns to the text in the "original" column, like this:

"perturbate his to Women: he has his work done"

Or do you just use the text in the "original" column as-is? But then how do you specify which category to change? For example, for a sentence like "she likes her African-American dialect", how would you specify whether to change the gender or the ethnicity mentioned in the text? (A minimal sketch of the first option is below.)
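
To make the first option concrete, here is a minimal sketch of how I am currently formatting inputs, assuming a HuggingFace T5 setup. The prefix format (`perturbate {word} to {category}: `) is my own guess, mirroring T5's "summarize:" convention, not something taken from the paper:

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")

def build_input(original, selected_word, category):
    # Hypothetical control prefix: the exact format used for PANDA
    # training is what I am asking about.
    return f"perturbate {selected_word} to {category}: {original}"

text = build_input("he has his work done", "his", "Women")
encoded = tokenizer(text, truncation=True, max_length=128, return_tensors="pt")
```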

2) Do you pre-process the data in PANDA before using it? I use standard pre-processing: removing numbers, quote marks, new lines, etc. (sketched below). Is there any extra preprocessing needed?
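
For reference, this is roughly what my preprocessing looks like; the specific cleaning steps are my own choices, not something described in the paper:

```python
import re

def preprocess(text):
    # My current cleaning steps: drop digits, quote marks, and
    # newlines, then collapse repeated whitespace.
    text = re.sub(r"\d+", "", text)
    text = re.sub(r'["\u201c\u201d]', "", text)
    text = text.replace("\n", " ")
    return re.sub(r"\s+", " ", text).strip()

print(preprocess('He said "42 lines"\nof text.'))  # -> He said lines of text.
```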

3) You mentioned in the paper that you compared the BART model's performance to a seq-2-seq model, but you did not give the training parameters of the seq-2-seq model. Can you share those here?

4) The ROUGE-Lsum score of my T5 model is 92, but when I apply the model to data outside the test set, the performance is not good (my evaluation setup is sketched below). Was that also your experience?
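
For context, this is a sketch of how I compute that score, using the HuggingFace `evaluate` library (my setup, not the paper's):

```python
import evaluate

rouge = evaluate.load("rouge")
# predictions = model's generated perturbations; references = gold
# "perturbed" texts from the PANDA test split (placeholders here).
predictions = ["she has her work done"]
references = ["she has her work done"]
scores = rouge.compute(predictions=predictions, references=references)
print(scores["rougeLsum"] * 100)  # rougeLsum is returned in [0, 1]
```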

5) Can you share the code that you used to train the seq-2-seq model so I can compare it to my code?

Best, Fatma

EricMichaelSmith commented 1 year ago

(pinging @Rebecca-Qian about the above)