IBM / multidoc2dial

MultiDoc2Dial: Modeling Dialogues Grounded in Multiple Documents
Apache License 2.0

How to generate the prediction file for the shared task? #7

Open DannyLuo-zp opened 2 years ago

DannyLuo-zp commented 2 years ago

Hi!

I am working on the shared task and am wondering whether there is an existing script in the repo to generate files like the sample predictions file included in the sharedtask folder?

Thanks a lot!

songfeng commented 2 years ago

Hi,

Thank you for the question!

We don't provide a script for generating the prediction file, since there is no restriction on the final output format; it can differ from one submission to another.

However, given the exact output of run_eval_rag_e2e.sh from the baseline code in this repo, there should be a "qid.txt" and a "predictions.txt" that map to "$split.source" line by line, so we could do something like this:

```python
import json

out = []
# predictions.txt and the id file are line-aligned with $split.source,
# so zipping them pairs each query id with its generated utterance.
# Adjust the id file name to whatever your run of run_eval_rag_e2e.sh produces.
with open('predictions.txt') as fp_p, open('ids.txt') as fp_id:
    for id_, text in zip(fp_id, fp_p):
        out.append({'id': id_.strip(), 'utterance': text.strip()})

with open('output.json', 'w') as fp_out:
    json.dump(out, fp_out, indent=4)
```
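With the file names as above, output.json ends up as a JSON array of {"id": ..., "utterance": ...} entries aligned with "$split.source"; the exact file names your run produces may differ, so adjust the paths accordingly.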

Does it make sense?

Thanks, Song

DannyLuo-zp commented 2 years ago

Hi Song,

Thanks so much for the reply! I truly appreciate your help!

It makes sense to me now how to generate a custom prediction file from the output produced by run_eval_rag_e2e.sh. Another quick question: it seems that the shared-task input file mdd_dev_pub.json I downloaded from the competition has a slightly different format from what the baseline model takes (namely, there are no references or da keys in a turn). Is the current baseline script compatible with evaluating on that input and generating predictions, or is it something we have to customize?

Thanks, Danny

songfeng commented 2 years ago

Hi Danny,

The files (e.g., mdd_dev_pub.json) provided on the leaderboard website are meant for evaluation or test time, when annotations such as da and references are not available, so only the conversational utterances are provided as input. The current baseline model does not use da or references to predict the utterance.

However, a model could make use of those annotations during training; it would then also need to predict them at test time, along with or before generating the utterance.
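For illustration, here is a minimal sketch of keeping only the conversational utterances; the file names and the exact JSON layout are assumptions for the example, not the shared-task spec.

```python
import json

# Illustrative sketch only: eval-time input keeps the conversational
# utterances and drops annotation keys such as "da" and "references".
ANNOTATION_KEYS = ('da', 'references')

with open('dialogues_with_annotations.json') as fp:  # hypothetical input file
    dialogues = json.load(fp)

for dial in dialogues:                    # assumed: a list of dialogues
    for turn in dial.get('turns', []):    # assumed: each has a "turns" list
        for key in ANNOTATION_KEYS:
            turn.pop(key, None)           # keep only utterance-level fields

with open('dialogues_eval_style.json', 'w') as fp:
    json.dump(dialogues, fp, indent=4)
```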

Let me know if there is any question!

Thanks~

-Song

DannyLuo-zp commented 2 years ago

Hi Song,

That all makes sense! Thanks for your help!

Best, Danny

DannyLuo-zp commented 2 years ago

Hi Song,

Can I ask another question? Based on my current understanding, run_eval_rag_re.sh does not output grounding predictions (as needed for the shared task), but only retrieval results at the document level. How can I generate grounding predictions at the token level using this baseline model? Thanks a lot!

Best, Danny

songfeng commented 2 years ago

Hi Danny,

It might help to refer to Section 2.2.1 in the paper.

As indicated in the data processing script, we can set $task to either grounding or generation, where grounding corresponds to the task of predicting the grounding span.
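As a rough illustration of the difference (this is not the repo's preprocessing code, and the field names here are assumptions), the two settings of $task change only what the target sequence is:

```python
# Illustrative sketch only; the actual preprocessing is done by the data
# processing script in this repo.
def build_example(dialogue_history, grounding_span, agent_utterance, task):
    """Return a (source, target) pair for the seq2seq model.

    task == "grounding":  the target is the grounding span to be generated.
    task == "generation": the target is the agent utterance to be generated.
    """
    source = " || ".join(dialogue_history)  # assumed way of flattening history
    if task == "grounding":
        return source, grounding_span
    if task == "generation":
        return source, agent_utterance
    raise ValueError(f"unknown task: {task}")
```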

Feel free to ping if there is any question! Thanks.

-Song

DannyLuo-zp commented 2 years ago

Hi Song,

Thanks for your reply! Yes, I have fine-tuned the model on both tasks. Could you clarify one more thing for me? To reproduce the results of Table 4 (evaluation results of Task I, the grounding span generation task) in your paper, I should still set the argument --eval_mode to e2e instead of retrieval (I think this was what confused me before). Let me know if this is right! Thanks so much!

Best, Danny

songfeng commented 2 years ago

Hi Danny,

For evaluation, --eval_mode selects the evaluation metrics, not the task. e2e corresponds to text generation metrics such as sacrebleu, while retrieval corresponds to retrieval metrics such as recall@n at the passage or document level.
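To make the distinction concrete, here is a rough sketch of the two metric families; this is not the repo's evaluation code, just an illustration.

```python
import sacrebleu

def bleu_score(hypotheses, references):
    # e2e-style metric: sacrebleu over generated utterances, with one
    # reference per hypothesis (hence a single reference stream).
    return sacrebleu.corpus_bleu(hypotheses, [references]).score

def recall_at_n(retrieved_ids, gold_ids, n):
    # retrieval-style metric: fraction of queries whose gold passage or
    # document id appears among the top-n retrieved ids.
    hits = sum(gold in retrieved[:n]
               for retrieved, gold in zip(retrieved_ids, gold_ids))
    return hits / len(gold_ids)
```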

It might help to emphasize that the tasks in our MultiDoc2Dial paper are "predicting" the grounding content or the utterance using the same approach, a "retriever-reader" model. Even though the prediction is of the grounding span, the approach still uses a BART model (see the RAG paper) to generate the span rather than retrieve it.

Thanks, Song

DannyLuo-zp commented 2 years ago

Hi Song,

Thanks! That makes sense, especially your clarification that the baseline is "predicting" the grounding content with a generator instead of predicting (start, end) indices over actual document spans. Really appreciate your help!!

Best, Danny