Hi, the current version only supports batch processing. The input is a text file and the output is a JSON file containing scores of the input.
I will try to implement a version to support the functionality that you have suggested.
> The input is a text file and the output is a JSON file containing scores of the input.
Could you show me how I can achieve that? That would also be acceptable for me. You can see my project structure in the attached screenshot.
In the data folder, there is a dev folder containing the fed-dialogue.txt file. You can follow the format in that file to prepare your input data: one line should be something like 0\t[your dialogue with "|||" delimiting the utterances]\t[your dialogue with "|||" delimiting the utterances]. Name your file, for example, dummy_data.txt, and place it into data/dev/.
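For example, a script like the following (a minimal sketch; the dialogues and values below are made-up placeholders) would produce a file in that format:

import os

# Sketch: write an input file in the format described above. Each line has
# three tab-separated fields: a label, dialogue A, and dialogue B, with
# "|||" joining the utterances inside each dialogue.
dialogue_a = ["Hi!", "Hello, how are you?", "Great, thanks!"]  # placeholder
dialogue_b = ["Hi!", "Hello, how are you?", "Potato."]  # placeholder

os.makedirs("data/dev", exist_ok=True)
with open("data/dev/dummy_data.txt", "w", encoding="utf-8") as f:
    f.write("\t".join(["0", "|||".join(dialogue_a), "|||".join(dialogue_b)]) + "\n")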
Then modify the scripts/eval/multitask_inference/eval_multi_head_dailydialog.sh file by replacing --eval_on fed-dialogue dstc9 persona-see with your file name, such as --eval_on dummy_data. The model should be placed in output/train/multitask_base_dailydialog_123456. Execute bash scripts/eval/multitask_inference/eval_multi_head_dailydialog.sh and you will find the JSON file output/multitask_base_dailydialog_123456/expert_predictions.dummy_data.end.json. Each entry contains three scores (coherence, likability, topic depth) corresponding to a dialogue in your input file. The overall scores can be found in output/multitask_base_dailydialog_123456/predictions.dummy_data.end.json.
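To read the scores back programmatically, here is a minimal sketch (assuming the output paths above and the JSON layout just described):

import json

out_dir = "output/multitask_base_dailydialog_123456"

# Per-dimension scores: one entry per input line; each entry holds
# [coherence, likability, topic depth] for dialogue A and for dialogue B.
with open(f"{out_dir}/expert_predictions.dummy_data.end.json") as f:
    expert = json.load(f)

# Overall scores: one [score_A, score_B] pair per input line.
with open(f"{out_dir}/predictions.dummy_data.end.json") as f:
    overall = json.load(f)

for i, (pair, (score_a, score_b)) in enumerate(zip(expert, overall)):
    print(f"line {i}: A={pair[0]} overall={score_a}; B={pair[1]} overall={score_b}")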
Should the model be placed in the output/train/multitask_base_dailydialog_234567 folder? What does the scripts/eval/multitask_inference/eval_multi_head_dailydialog.sh script do (turn-level evaluation or dialogue-level)? And how can I run evaluation only on the last utterance in a dialogue?
I have a file fed-small.txt:
original Hi!|||Hi! What's up?|||Nothing much, how about you|||Not much either.|||What are you doing|||Playing Terraria. What about you?|||Sitting in a meeting|||What kind of meeting?|||Can't say|||It's probably boring, isn't it?|||Haha, yes!|||What is the meeting about?|||I cannot tell you|||What can you tell me? Hi!|||Hi! What's up?|||Nothing much, how about you|||Not much either.|||What are you doing|||Playing Terraria. What about you?|||Sitting in a meeting|||What kind of dog?|||Can't say|||It's probably boring, isn't it?|||Haha, yes!|||What is the meeting about?|||I hate|||I don't know
original Hi!|||Hey! How are you today?|||good|||I'm glad to hear that! What are your plans for today?|||I'm trying to find a good podcast to listen to|||What kinds of podcasts do you like?|||only those about Marvel corn!|||Do you like Hollywood Babble-Off?|||i haven't tried it...do you like it?|||It's two of my favorites, right up there with REDACTED_TERM.|||awesome! do you listen to a lot of birds?|||Not as much as I'd like, but I do like listening to NPR.|||where do you listen to podcasts? Spotify?|||I listen to them through iTunes.|||i like Spotify better...more options|||Yeah, I just don't has a money for Spotify. Hi!|||Hey! How are you today?|||good|||I'm glad to hear that! What are your plans for today?|||I'm trying to find a good podcast to listen to|||What kinds of podcasts do you like?|||only those about Marvel movies!|||Do you like Hollywood Babble-On?|||i haven't tried it...do you like it?|||It's one of my favorites, right up there with REDACTED_TERM.|||awesome! do you listen to a lot of podcasts?|||Not as much as I'd like, but I do like listening to NPR.|||where do you listen to podcasts? Spotify?|||I listen to them through iTunes.|||i like Spotify better...more options|||Yeah, I just don't have the money for Spotify.
I executed the script with bash scripts/eval/multitask_inference/eval_multi_head_dailydialog.sh. Contents of eval_multi_head_dailydialog.sh:
export CUDA_VISIBLE_DEVICES=0
export dataset=dailydialog
# Run multi-head inference on fed-small with the model trained using seed 234567.
for seed in 234567; do
    python run.py \
        --parallel \
        --multi_head \
        --eval_on fed-small \
        --train_on ${dataset}_coherence ${dataset}_likeable ${dataset}_nli \
        --load_from "output/train/multitask_base_${dataset}_${seed}" \
        --output_dir "output/my_prediction_${dataset}_${seed}" \
        --model_name_or_path "roberta_full_base" \
        --criterion loss --seed ${seed};
done
and I got these files. expert_predictions.fed-small.end.json:
[
[
[0.3812446594238281, 0.19450366497039795, 0.09418931603431702],
[0.1512472778558731, 0.19958867132663727, 0.09207701683044434]
],
[
[0.6565149426460266, 0.7325885891914368, 0.4197925925254822],
[0.632584273815155, 0.7850334048271179, 0.39918893575668335]
]
]
predictions.fed-small.end.json:
[
[0.22331254184246063, 0.14763765037059784],
[0.6029653549194336, 0.6056022047996521]
]
In expert_predictions.fed-small.end.json, the scores [0.3812446594238281, 0.19450366497039795, 0.09418931603431702]
are the coherence, likability, and topic depth ratings of the dialogue Hi!|||Hi! What's up?|||Nothing much, how about you|||Not much either.|||What are you doing|||Playing Terraria. What about you?|||Sitting in a meeting|||What kind of meeting?|||Can't say|||It's probably boring, isn't it?|||Haha, yes!|||What is the meeting about?|||I cannot tell you|||What can you tell me?
and in predictions.fed-small.end.json, 0.22331254184246063 is the overall score of that dialogue. [0.1512472778558731, 0.19958867132663727, 0.09207701683044434] and 0.14763765037059784
correspond to the dialogue Hi!|||Hi! What's up?|||Nothing much, how about you|||Not much either.|||What are you doing|||Playing Terraria. What about you?|||Sitting in a meeting|||What kind of dog?|||Can't say|||It's probably boring, isn't it?|||Haha, yes!|||What is the meeting about?|||I hate|||I don't know
FineD-Eval is trained with preference learning. Hence, the input is original\t[dialogue A]\t[dialogue B], which means [dialogue A] should be rated higher than [dialogue B]; random\t[dialogue A]\t[dialogue B] means the opposite.
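For instance, with made-up placeholder dialogues, the two line types would be built like this:

# Sketch: the two line types used in preference training (placeholder dialogues).
better = ["Hi!", "How are you?", "Fine, thanks!"]
worse = ["Hi!", "How are you?", "I hate"]

# "original": the first dialogue should be rated higher than the second.
line_original = "\t".join(["original", "|||".join(better), "|||".join(worse)])
# "random": the opposite; the second dialogue should be rated higher.
line_random = "\t".join(["random", "|||".join(worse), "|||".join(better)])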
Thank you for your clarifications. It helped me a lot. 🤗
@e0397123 Thanks for the repo!
I wonder if there is a way to evaluate a single dialogue instead of a pair of dialogues? Basically, I have dialogues generated by several systems and I want to evaluate each system.
If the input has to be a pair of dialogues, does it make sense to input two dialogues with different contents (because they are generated by different systems)? I'm asking because I thought that during training the input dialogues have similar content, due to how the positive and negative samples are generated.
@e0397123 Also, how should I decide the order of the dialogues in the input? Based on what you said here:
> the input is original\t[dialogue A]\t[dialogue B], which means [dialogue A] should be rated higher than [dialogue B]. random\t[dialogue A]\t[dialogue B] means the opposite.
It seems the order determines the score?
Yes, the order determines the score. The model is trained in a pairwise manner: original\t[dialogue A]\t[dialogue B] => dialogue A is better than dialogue B, and random\t[dialogue A]\t[dialogue B] => dialogue B is better than dialogue A.
If you want to perform inference on a single dialogue, for example [dialogue x], you can just prepare it as original\t[dialogue x]\t[dialogue x]. The model will output two identical scores.
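As a concrete sketch (the file name and dialogue below are placeholders):

# Score a single dialogue by pairing it with itself, as described above;
# the model will then output two identical scores for it.
utterances = ["Hi!", "Hey! How are you today?", "good"]  # placeholder dialogue
line = "\t".join(["original", "|||".join(utterances), "|||".join(utterances)])

with open("data/dev/single_dialogue.txt", "w", encoding="utf-8") as f:
    f.write(line + "\n")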
Thanks!
Hi, I want to use this metric in my project. I'm looking for a programmatic interface like the ones BLEU or ROUGE have in the evaluate package, something along the lines of the usual evaluate API:
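import evaluate

# Standard usage of the Hugging Face evaluate package for BLEU...
bleu = evaluate.load("bleu")
print(bleu.compute(predictions=["hello there"], references=[["hello there"]]))

# ...or for ROUGE.
rouge = evaluate.load("rouge")
print(rouge.compute(predictions=["hello there"], references=["hello there"]))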
Could you give me advice on how I can achieve similar behavior with $FineD-Eval_{mu}$?