I ran into the same problem. It seems that this code is incomplete: it lacks some preprocessing code (extracting the dataset examples and setting up the output metric) for the WikiQA and TREC-QA tasks. I found similar preprocessing code in the repo https://github.com/tahmedge/CETE-LREC/blob/master/CETE%20Fine-Tuning/HuggingFacePytorchTransformer/examples/utils_glue.py#L390 — maybe it can be reused here.
@liudonglei Hi, thanks a lot for your reply. I see, I'll try it later.
Hi @ryanpram and @liudonglei !
1) We have written in the README that our patch can be used with any target dataset (e.g. Wiki-QA or TREC-QA) as long as it is formatted similarly to ASNQ, where a single .tsv
file contains <Question> <TAB> <Candidate> <TAB> <Label>
per line of the file. Additional DataProcessors can be added for different input formats; a minimal reading sketch is shown below.
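Just as a rough illustration (read_asnq_tsv is a hypothetical helper, not code from this repo's DataProcessors), loading such a .tsv file into parallel lists could look like:

def read_asnq_tsv(path):
    # Hypothetical helper for illustration only: reads <Question> <TAB> <Candidate> <TAB> <Label>
    # per line and returns parallel lists; the repo's own DataProcessors may differ.
    questions, candidates, labels = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 3:
                continue  # skip malformed lines
            questions.append(parts[0])
            candidates.append(parts[1])
            labels.append(int(parts[2]))
    return questions, candidates, labels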
2) For computing MAP and MRR, you can use the following two functions, which take as input the list of questions, the list of labels and the list of predictions (note that questions contains a repeated question entry for each of its answer candidates):
'''
questions : list of questions in the dataset
answers : list of answers in the dataset
labels : list of 0/1 labels indicating whether the answer is correct for the question
predictions : list of probability scores from a QA model for question-answer pairs
'''
def mean_average_precision(questions, labels, predictions):
    question_results = {}
    # Aggregate (prediction, label) tuples for each question over all of its answer candidates
    for row in zip(questions, predictions, labels):
        if row[0] not in question_results:
            question_results[row[0]] = []
        question_results[row[0]].append((row[1], row[2]))
    sum_AP = 0.0
    for q in question_results:
        # Sort candidates by prediction score, highest first
        _scores, _labels = zip(*sorted(question_results[q], reverse=True))
        if sum(_labels) == 0: continue   # All incorrect answers for a question
        if len(_labels) == 0: continue   # No candidate answer for a question
        if len(_labels) == sum(_labels): continue   # All correct answers for a question
        sum_question_AP_at_k = num_correct_at_k = position = 0
        while position < len(_labels):
            correct_or_incorrect = (_labels[position] == 1)
            num_correct_at_k += correct_or_incorrect
            sum_question_AP_at_k += correct_or_incorrect * num_correct_at_k / (position + 1)
            position += 1
        sum_AP += (sum_question_AP_at_k / num_correct_at_k)
    MAP = sum_AP / len(question_results)
    return MAP
def mean_reciprocal_rank(questions, labels, predictions):
    question_results = {}
    # Aggregate (prediction, label) tuples for each question over all of its answer candidates
    for row in zip(questions, predictions, labels):
        if row[0] not in question_results:
            question_results[row[0]] = []
        question_results[row[0]].append((row[1], row[2]))
    sum_RR = 0.0
    for q in question_results:
        # Sort candidates by prediction score, highest first
        _scores, _labels = zip(*sorted(question_results[q], reverse=True))
        if sum(_labels) == 0: continue   # All incorrect answers for a question
        if len(_labels) == 0: continue   # No candidate answer for a question
        if len(_labels) == sum(_labels): continue   # All correct answers for a question
        # Reciprocal rank of the first correct answer
        for idx, label in enumerate(_labels, 1):
            if label == 1:
                sum_RR += 1.0 / idx
                break
    MRR = sum_RR / len(question_results)
    return MRR
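For example, with made-up toy values (not from any dataset), both functions can be called on the flat parallel lists like this:

# Toy data: two questions, each with three scored answer candidates (values invented for illustration)
questions   = ["q1", "q1", "q1", "q2", "q2", "q2"]
labels      = [0, 1, 0, 1, 0, 1]
predictions = [0.2, 0.9, 0.4, 0.3, 0.8, 0.6]

print("MAP:", mean_average_precision(questions, labels, predictions))  # ~0.79
print("MRR:", mean_reciprocal_rank(questions, labels, predictions))    # 0.75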
Were you able to get this code to run?
Hi,
How did you get the MAP and MRR scores reported in the paper? The metric provided in the code is only simple accuracy.
Thanks