alan-turing-institute / ARC-MTQE

Critical Error Detection for Machine Translation

Evaluate predictions from the models #65

Closed by joannacknight 6 months ago

joannacknight commented 6 months ago

We can already make predictions using an existing checkpoint. These predictions are the model's raw outputs, before the sigmoid activation function is applied. We need to add functionality to turn these outputs into binary predictions and to calculate MCC and the other metrics.
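The conversion described above could be sketched as follows — a minimal example, not code from this repo, assuming the raw outputs are logits and using a default binarisation threshold of 0.5 (the function name `logits_to_binary` is hypothetical):

```python
import numpy as np

def logits_to_binary(logits, threshold=0.5):
    """Apply the sigmoid to raw model outputs and binarise at `threshold`.

    Returns (binary predictions, sigmoid scores).
    """
    scores = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    return (scores >= threshold).astype(int), scores

# Example: negative logits fall below 0.5 after the sigmoid, so they map to 0.
preds, scores = logits_to_binary([-2.0, 0.0, 3.0])
```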

joannacknight commented 6 months ago

What do we want as the output of the evaluation script?

The predictions files will contain an individual prediction for each record (in the dev or test dataset, as appropriate), both as a raw logit and as a score with the activation function applied.

Should the evaluation just output the MCC, precision, recall, accuracy, and F1?

joannacknight commented 6 months ago

As discussed this morning, evaluation includes the metrics MCC, precision, recall, accuracy and F1. Will add plots at a later stage if we want them for the report.

@radka-j - the code is ready for your review, then can merge into main, close this down and move onto issue #60
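The five metrics agreed above could be computed from the binary confusion counts as in this sketch (not the repo's implementation — `scikit-learn`'s `matthews_corrcoef`, `precision_recall_fscore_support` etc. would do the same job; this version is spelled out to show the definitions):

```python
import math

def evaluate(y_true, y_pred):
    """Compute MCC, precision, recall, accuracy and F1 for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(y_true)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # MCC is undefined when any marginal count is zero; return 0.0 in that case.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"MCC": mcc, "precision": precision, "recall": recall,
            "accuracy": accuracy, "F1": f1}
```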

joannacknight commented 6 months ago

Some extra functionality I think we need - to discuss:

radka-j commented 6 months ago

We also need to run the evaluation for different binarisation thresholds, picking the best threshold on the validation data.
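Picking the threshold could look something like this sketch — a grid search over candidate thresholds that maximises MCC on held-out validation scores (the function names and the 0.01-step grid are assumptions, not the project's actual approach):

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0.0 if undefined)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def pick_threshold(scores, labels, candidates=None):
    """Return the binarisation threshold that maximises MCC on validation data."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]  # 0.01 .. 0.99
    return max(candidates,
               key=lambda t: mcc(labels, [int(s >= t) for s in scores]))
```

The chosen threshold would then be fixed and applied unchanged to the test set, so the test metrics are not tuned on test data.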

joannacknight commented 6 months ago

We have a script to do this now for a threshold of 0.5.

I've created #84 to follow on from this to evaluate over different thresholds.

I'll log progress of generating test results in #60