e0397123 / FineD-Eval

Repository for EMNLP-2022 Paper (FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation)
MIT License

How can I use this metric in my project? #1

Closed dmitrymailk closed 1 year ago

dmitrymailk commented 1 year ago

Hi, I want to use this metric in my project. I'm looking for a programmatic interface like the ones BLEU or ROUGE have in the evaluate package.

For example:

# https://huggingface.co/spaces/evaluate-metric/rouge
import evaluate
rouge = evaluate.load('rouge')
results = rouge.compute(
    predictions=hypotheses,
    references=references
)
for key in results.keys():
    f1_score = results[key]
    print(f"{key} = {f1_score}")
# rouge1 = 0.40789473684210525
# rouge2 = 0.058823529411764705
# rougeL = 0.40789473684210525
# rougeLsum = 0.40789473684210525

OR

# https://huggingface.co/spaces/evaluate-metric/bleu
import evaluate
bleu = evaluate.load("bleu")

references = [
    [
        reference_1,
        reference_2,
        reference_3
    ],
]

candidates = [
    candidate_1,
    candidate_2
]
print("bleu")
for candidate in candidates:
    bleu_score = bleu.compute(
        predictions=[candidate],
        references=references
    )
    print(bleu_score['bleu'])
# bleu
# 0.5401725898595141
# 0.0

Could you give me some advice on how to achieve similar behavior with $FineD-Eval_{\mu}$?

e0397123 commented 1 year ago

Hi, the current version only supports batch processing. The input is a text file and the output is a JSON file containing scores of the input.

I will try to implement a version to support the functionality that you have suggested.

dmitrymailk commented 1 year ago

The input is a text file and the output is a JSON file containing scores of the input.

Could you show me how I can achieve that? That would also work for me. You can see my project structure in the attached screenshot.

e0397123 commented 1 year ago

In the data folder, there is a dev folder containing the fed-dialogue.txt file. You can follow the format in that file to prepare your input data: each line should look like 0\t[your dialogue with "|||" delimiting the utterances]\t[your dialogue with "|||" delimiting the utterances]. Name your file, for example, dummy_data.txt, and place it in data/dev/.

Then modify the scripts/eval/multitask_inference/eval_multi_head_dailydialog.sh file by replacing --eval_on fed-dialogue dstc9 persona-see with your file name, such as --eval_on dummy_data. The model should be placed in output/train/multitask_base_dailydialog_123456.

Execute bash scripts/eval/multitask_inference/eval_multi_head_dailydialog.sh and you will find the JSON file output/multitask_base_dailydialog_123456/expert_predictions.dummy_data.end.json. Each entry contains three scores (coherence, likability, topic depth) corresponding to a dialogue in your input file. The overall scores can be found in output/multitask_base_dailydialog_123456/predictions.dummy_data.end.json.
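
For concreteness, here is a minimal sketch of preparing such an input file from Python; the dialogues and the file path are placeholders, and the leading original label follows the preference-pair convention discussed further down in this thread:

# hypothetical helper: write a FineD-Eval input file in the format described above
dialogue_a = "Hi!|||Hi! What's up?|||Nothing much, how about you|||Not much either."
dialogue_b = "Hi!|||Hi! What's up?|||What kind of dog?|||Can't say"

with open("data/dev/dummy_data.txt", "w") as f:
    # one tab-separated line per pair: <label>\t<dialogue A>\t<dialogue B>,
    # with "|||" joining the utterances of each dialogue
    f.write("original\t" + dialogue_a + "\t" + dialogue_b + "\n")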

dmitrymailk commented 1 year ago
  1. Can I use the model from the output/train/multitask_base_dailydialog_234567 folder?
  2. What does this script do (turn-level or dialogue-level evaluation)? How can I run the evaluation only on the last response in a dialogue?

    scripts/eval/multitask_inference/eval_multi_head_dailydialog.sh

    I have the file fed-small.txt:

    original    Hi!|||Hi! What's up?|||Nothing much, how about you|||Not much either.|||What are you doing|||Playing Terraria. What about you?|||Sitting in a meeting|||What kind of meeting?|||Can't say|||It's probably boring, isn't it?|||Haha, yes!|||What is the meeting about?|||I cannot tell you|||What can you tell me?   Hi!|||Hi! What's up?|||Nothing much, how about you|||Not much either.|||What are you doing|||Playing Terraria. What about you?|||Sitting in a meeting|||What kind of dog?|||Can't say|||It's probably boring, isn't it?|||Haha, yes!|||What is the meeting about?|||I hate|||I don't know
    original    Hi!|||Hey! How are you today?|||good|||I'm glad to hear that! What are your plans for today?|||I'm trying to find a good podcast to listen to|||What kinds of podcasts do you like?|||only those about Marvel corn!|||Do you like Hollywood Babble-Off?|||i haven't tried it...do you like it?|||It's two of my favorites, right up there with REDACTED_TERM.|||awesome! do you listen to a lot of birds?|||Not as much as I'd like, but I do like listening to NPR.|||where do you listen to podcasts? Spotify?|||I listen to them through iTunes.|||i like Spotify better...more options|||Yeah, I just don't has a money for Spotify.    Hi!|||Hey! How are you today?|||good|||I'm glad to hear that! What are your plans for today?|||I'm trying to find a good podcast to listen to|||What kinds of podcasts do you like?|||only those about Marvel movies!|||Do you like Hollywood Babble-On?|||i haven't tried it...do you like it?|||It's one of my favorites, right up there with REDACTED_TERM.|||awesome! do you listen to a lot of podcasts?|||Not as much as I'd like, but I do like listening to NPR.|||where do you listen to podcasts? Spotify?|||I listen to them through iTunes.|||i like Spotify better...more options|||Yeah, I just don't have the money for Spotify.

I executed the script bash scripts/eval/multitask_inference/eval_multi_head_dailydialog.sh. Here is eval_multi_head_dailydialog.sh:

export CUDA_VISIBLE_DEVICES=0
export dataset=dailydialog

for seed in 234567; do
    python run.py \
        --parallel \
        --multi_head \
        --eval_on fed-small \
        --train_on ${dataset}_coherence ${dataset}_likeable ${dataset}_nli \
        --load_from "output/train/multitask_base_${dataset}_${seed}" \
        --output_dir "output/my_prediction_${dataset}_${seed}" \
        --model_name_or_path "roberta_full_base" \
        --criterion loss --seed ${seed};
done

and I got these files.

expert_predictions.fed-small.end.json:

[
  [
    [0.3812446594238281, 0.19450366497039795, 0.09418931603431702],
    [0.1512472778558731, 0.19958867132663727, 0.09207701683044434]
  ],
  [
    [0.6565149426460266, 0.7325885891914368, 0.4197925925254822],
    [0.632584273815155, 0.7850334048271179, 0.39918893575668335]
  ]
]

predictions.fed-small.end.json:

[
  [0.22331254184246063, 0.14763765037059784],
  [0.6029653549194336, 0.6056022047996521]
]

e0397123 commented 1 year ago
  1. 123456 and 234567 are different training seeds. You can pick any of the trained models for evaluation purposes.
  2. The script produces dialogue-level scores. For example, in expert_predictions.fed-small.end.json, the scores [0.3812446594238281, 0.19450366497039795, 0.09418931603431702] are the coherence, likability, and topic depth ratings of the dialogue Hi!|||Hi! What's up?|||Nothing much, how about you|||Not much either.|||What are you doing|||Playing Terraria. What about you?|||Sitting in a meeting|||What kind of meeting?|||Can't say|||It's probably boring, isn't it?|||Haha, yes!|||What is the meeting about?|||I cannot tell you|||What can you tell me? and in predictions.fed-small.end.json, 0.22331254184246063 is the overall score of the dialogue.

[0.1512472778558731, 0.19958867132663727, 0.09207701683044434] and 0.14763765037059784 correspond to the dialogue Hi!|||Hi! What's up?|||Nothing much, how about you|||Not much either.|||What are you doing|||Playing Terraria. What about you?|||Sitting in a meeting|||What kind of dog?|||Can't say|||It's probably boring, isn't it?|||Haha, yes!|||What is the meeting about?|||I hate|||I don't know
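
As a rough sketch, the scores can be paired back with the input dialogues like this; the paths assume the input file sits in data/dev/ and the output_dir from the script above, so adjust them to your setup:

import json

# pair the FineD-Eval outputs with the input dialogue pairs (rough sketch)
with open("data/dev/fed-small.txt") as f:
    pairs = [line.rstrip("\n").split("\t") for line in f if line.strip()]

with open("output/my_prediction_dailydialog_234567/expert_predictions.fed-small.end.json") as f:
    expert_scores = json.load(f)   # per pair: [[coh, lik, dep] for dialogue A, [coh, lik, dep] for dialogue B]

with open("output/my_prediction_dailydialog_234567/predictions.fed-small.end.json") as f:
    overall_scores = json.load(f)  # per pair: [overall for dialogue A, overall for dialogue B]

for (label, dialogue_a, dialogue_b), experts, overalls in zip(pairs, expert_scores, overall_scores):
    for dialogue, (coherence, likability, depth), overall in zip((dialogue_a, dialogue_b), experts, overalls):
        print(f"{dialogue[:40]}... coherence={coherence:.3f} likability={likability:.3f} "
              f"topic_depth={depth:.3f} overall={overall:.3f}")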

e0397123 commented 1 year ago

FineD-Eval is trained with preference learning. Hence, the input is original\t[dialogue A]\t[dialogue B], which means [dialogue A] should be rated higher than [dialogue B]. random\t[dialogue A]\t[dialogue B] means the opposite.

dmitrymailk commented 1 year ago

Thank you for your clarifications. They helped me a lot. 🤗

yujianll commented 1 year ago

@e0397123 Thanks for the repo!

I wonder whether there is a way to evaluate a single dialogue instead of a pair of dialogues. Basically, I have dialogues generated by several systems and I want to evaluate each system.

If the input has to be a pair of dialogues, does it make sense to input two dialogues with different content (because they are generated by different systems)? The reason I'm asking is that I thought that during training the input dialogues have similar content, due to the way positive and negative samples are generated.

yujianll commented 1 year ago

@e0397123 Also, how should I decide the order of dialogues in the input?

Based on what you said here (the input is original\t[dialogue A]\t[dialogue B], which means [dialogue A] should be rated higher than [dialogue B]; random\t[dialogue A]\t[dialogue B] means the opposite), it seems the order determines the score?

e0397123 commented 1 year ago

Yes, the order determines the score. The model is trained in a pairwise manner: original\t[dialogue A]\t[dialogue B] => dialogue A is better than dialogue B, and random\t[dialogue A]\t[dialogue B] => dialogue B is better than dialogue A.

If you want to perform inference on a single dialogue, for example [dialogue x], you can just prepare it as original\t[dialogue x]\t[dialogue x]. The model will output two identical scores.
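
A minimal sketch of that workaround (the file name and dialogue are placeholders):

# hypothetical example: score a single dialogue by duplicating it in both slots
dialogue = "Hi!|||Hey! How are you today?|||good|||I'm glad to hear that!"

with open("data/dev/single_dialogue.txt", "w") as f:
    f.write("original\t" + dialogue + "\t" + dialogue + "\n")

# run the evaluation script with --eval_on single_dialogue; both entries in the
# output JSON will then carry the same score for this dialogue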

yujianll commented 1 year ago

Thanks!