Open DannyLuo-zp opened 2 years ago
Hi,
Thank you for the question!
We don't provide a script for generating the prediction file in that format, as there is no restriction on the final output format; it can differ from one submission to another.
However, in the case of the exact output produced by `run_eval_rag_e2e.sh` from the baseline code in this repo, there will be a `qid.txt` and a `predictions.txt` that map to `$split.source` by line number. Then we could do something like this:
```python
import json

out = []
# The two files are aligned with $split.source by line number.
with open('predictions.txt') as fp_p, open('qid.txt') as fp_id:
    for id_, text in zip(fp_id, fp_p):
        out.append({'id': id_.strip(), 'utterance': text.strip()})
with open('output.json', 'w') as fp_out:
    json.dump(out, fp_out, indent=4)
```
Does it make sense?
Thanks, Song
Hi Song,
Thanks so much for the reply! I truly appreciate your help!
It makes sense to me now how to generate a custom prediction file using the output produced by `run_eval_rag_e2e.sh`.
Another quick question I have: it seems the shared task input file `mdd_dev_pub.json` I downloaded from the competition is in a slightly different format from what the baseline model takes (namely, there are no `references` or `da` keys in a turn). I am wondering whether the current baseline script is compatible with this input for evaluation and prediction, or whether that is something we have to customize ourselves.
Thanks, Danny
Hi Danny,
The files (e.g. `mdd_dev_pub.json`) provided at the leaderboard website are meant for evaluation or test time, when annotations such as `da` and `references` are not available. So only the conversational utterances are provided as input. The current baseline model does not use `da` or `references` to predict `utterance`.
However, a model could utilize those annotations during training in certain ways; it would then also need to predict them at test time, along with or before generating `utterance`.
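To make the format difference concrete, here is a minimal sketch of converting training-format data into test-time-format input by stripping the annotation keys mentioned above. The `turns` key and overall dialogue layout are my assumptions about the data structure, not taken from the repo; only the field names `da`, `references`, and `utterance` come from this thread.

```python
# Annotation fields that are present at training time but absent in the
# public evaluation/test files (assumed dialogue layout with a "turns" list).
ANNOTATION_KEYS = {"da", "references"}

def strip_annotations(dialogue):
    """Return a copy of a dialogue with training-only annotation keys
    removed from each turn, mimicking the test-time input format."""
    return {
        **dialogue,
        "turns": [
            {k: v for k, v in turn.items() if k not in ANNOTATION_KEYS}
            for turn in dialogue.get("turns", [])
        ],
    }
```

Running this over a training file would yield input shaped like `mdd_dev_pub.json`, with only the conversational utterances left in each turn.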
Let me know if there is any question!
Thanks~
-Song
Hi Song,
That all makes sense! Thanks for your help!
Best, Danny
Hi Song,
Can I ask another question? Based on my current understanding, `run_eval_rag_re.sh` will not output a `grounding` prediction (as needed in the shared task), but only retrieval results at the document level. I am wondering how to generate a grounding prediction at the token level using this baseline model? Thanks a lot!
Best, Danny
Hi Song,
Thanks for your reply! Yes, I have fine-tuned the model on both tasks. Could you clarify one more thing for me? To reproduce the results of Table 4 (evaluation results of Task I on the grounding span generation task) in your paper, should I still set the argument `--eval_mode` to `e2e` instead of `retrieval`? (I think this was what confused me before.) Let me know if this is right! Thanks so much!
Best, Danny
Hi Danny,
For the evaluation, `--eval_mode` selects the evaluation metrics, not the task. For evaluation metrics, `e2e` corresponds to text generation metrics such as `sacrebleu`, while `retrieval` corresponds to retrieval metrics such as `recall@n` at the passage or document level.
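As an illustration of the retrieval-side metric, `recall@n` here is the fraction of queries whose gold passage appears among the top-n retrieved passages. The function below is a hypothetical sketch of that definition, not code from the repo:

```python
def recall_at_n(retrieved, gold, n):
    """recall@n: fraction of queries whose gold passage id appears in the
    top-n ranked list. `retrieved` holds one ranked id list per query;
    `gold` holds the matching gold passage ids."""
    hits = sum(1 for ranked, g in zip(retrieved, gold) if g in ranked[:n])
    return hits / len(gold)
```

The same computation applies at the document level by using document ids in place of passage ids.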
What might be helpful to emphasize is that both tasks in our MultiDoc2Dial paper are about "predicting" the grounding content or the utterance with the same retriever-reader approach. Even though the prediction target is the grounding `span`, the approach still uses a BART model (see the RAG paper) to generate the span, not to retrieve it.
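Since the baseline generates the span text rather than predicting (start, end) indices, one post-hoc way to recover character offsets is to search for the generated string in the retrieved passage. This is a best-effort sketch of my own, not part of the baseline; it fails when the generated text does not occur verbatim in the passage:

```python
def locate_span(passage, generated_span):
    """Best-effort mapping of a generated span string back to
    (start, end) character offsets in a passage. Returns None when the
    generated text has no exact match in the passage."""
    start = passage.find(generated_span)
    if start == -1:
        return None
    return (start, start + len(generated_span))
```

A fuzzy-matching fallback would be needed in practice, since generated spans can diverge slightly from the source text.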
Thanks, Song
Hi Song,
Thanks! That makes sense, especially your clarification that the baseline is "predicting" the grounding content by a generator instead of predicting the (start,end) indices on actual document spans. Really appreciate your help!!
Best, Danny
Hi!
I am working on the shared task and am wondering whether there is an existing script in the repo to generate files like the sample predictions file included in the sharedtask folder?
Thanks a lot!