PlusLabNLP / DEGREE

Code for our NAACL-2022 paper DEGREE: A Data-Efficient Generation-Based Event Extraction Model.
Apache License 2.0

How to calculate "trigger_cls"? #6

Closed OStars closed 1 year ago

OStars commented 1 year ago

Hello, I found that your code only reports a trigger_id evaluation metric, but the paper reports trigger_cls. I tried to calculate trigger_cls like this:

match_tri_id = 0    # predicted spans that match a gold trigger span
match_tri_type = 0  # matched spans whose event type is also correct

for pred in pred_tris:
    id_flag = False
    type_flag = False
    for gold_span in self.trigger_span:
        # pred / gold_span format: (start, end, event_type)
        if gold_span[0] == pred[0] and gold_span[1] == pred[1]:
            id_flag = True
            if gold_span[2] == pred[2]:
                type_flag = True
            break
    match_tri_id += int(id_flag)
    match_tri_type += int(type_flag)

But the trigger_cls score I get is still identical to trigger_id. How can I calculate trigger_cls correctly?

ej0cl6 commented 1 year ago

Thanks for your interest in our work.

Please notice that the scores reported in the training script are just for selecting the best checkpoint.

To get the real scores after training the model, you have to use either eval_end2endEE.py or eval_pipelineEE.py for the evaluation. Both scripts contain a cal_scores function that computes trigger_id and trigger_cls. Please refer to https://github.com/PlusLabNLP/DEGREE/blob/666dd8907717d1cb0ea3692867cd7404e892ce54/degree/eval_pipelineEE.py#L121
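For reference, here is a minimal sketch of how trigger identification (span match) and trigger classification (span + event-type match) F1 are typically computed. The function names (`f1`, `score_triggers`) and the tuple format are illustrative assumptions, not DEGREE's actual `cal_scores` code:

```python
def f1(match, n_pred, n_gold):
    """Micro F1 from match / predicted / gold counts."""
    p = match / n_pred if n_pred else 0.0
    r = match / n_gold if n_gold else 0.0
    return 2 * p * r / (p + r) if (p + r) else 0.0

def score_triggers(gold_triggers, pred_triggers):
    """Return (trigger_id F1, trigger_cls F1).

    Each trigger is a (start, end, event_type) tuple.
    trigger_id  counts predictions whose span matches a gold span;
    trigger_cls additionally requires the event type to match.
    """
    gold_spans = {(s, e) for s, e, _ in gold_triggers}
    gold_typed = set(gold_triggers)
    match_id = sum(1 for s, e, _ in pred_triggers if (s, e) in gold_spans)
    match_cls = sum(1 for t in pred_triggers if t in gold_typed)
    n_pred, n_gold = len(pred_triggers), len(gold_triggers)
    return f1(match_id, n_pred, n_gold), f1(match_cls, n_pred, n_gold)
```

With one correct span but a wrong event type, trigger_id and trigger_cls diverge, which is the behavior the full evaluation scripts expose.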

OStars commented 1 year ago

Thanks a lot! But I am still confused about why we need to run the other scripts (eval_end2endEE.py or eval_pipelineEE.py). Is there any difference between the real scores and the scores reported in the training script?

OStars commented 1 year ago

I think I have an idea. I noticed that the generate_data_degree_xxx.py script also samples negative examples for the dev and test sets, which means only part of each set is evaluated during training. So we can either run the evaluation scripts to get the real scores, or skip negative sampling for the dev and test sets in generate_data_degree_xxx.py, which would greatly increase training time. Is that right?
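The sampling idea described above can be sketched as follows. This is only an illustration of the concept, under the assumption that DEGREE builds one prompt per (example, event type) pair; the function name and sampling scheme are hypothetical, not the repo's actual implementation:

```python
import random

def sample_event_types(gold_types, all_types, n_negative, seed=0):
    """Keep all gold event types plus a few sampled negative types.

    During training-time (internal) evaluation, prompts are built only
    for these sampled types rather than for every event type, so the
    resulting scores differ from a full evaluation over all_types.
    """
    rng = random.Random(seed)
    negatives = [t for t in all_types if t not in gold_types]
    sampled = rng.sample(negatives, min(n_negative, len(negatives)))
    return list(gold_types) + sampled
```

Because the negative types never seen at internal evaluation time can still be predicted (or missed) in the full evaluation, the scripts' scores are the ones to report.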

ej0cl6 commented 1 year ago

Yes. To reduce training time, we use the internal (sampled) evaluation only for selecting the best checkpoint.