Models trained with subset of ChemProt data unpredictably make large quantities of spurious predictions

serenalotreck commented 1 year ago

I'm running a quick analysis to evaluate the effect of training corpus size on model performance on a fixed test set. The analysis is performed as follows:

Choose a set of 30 test docs and 20 val docs
Choose an initial training set of 150 docs
Over n (here I've been using 7, to get to a 500 doc train set) iterations, add 50 more docs to the training set
For each of those training sets, train a model and evaluate on the test and validation sets

Observed behavior: For some of the models, there is near-0 NER performance, and 0 relation performance -- but this doesn't correlate with training set size. Additionally, results on validation set as reported from the model a re completely different than those obtained with allennlp evaluate

An example run's performances (calculated externally to the model with my own code, but I get basically the same results with allennlp evaluate):

    rel_F1  docnum
0   0.264448    150
1   0.380308    200
2   0.364521    250
3   0.000000    300
4   0.459839    350
5   0.394745    400
6   0.000000    450
7   0.427195    500

Reported validation set performance (best_validation_MEAN__relation_f1 from metrics.json in the model forlder) for the 0 models is ~0.4, which is on par with the rest of the models. However, if I call allennlp evaluate on the dev set, I also get and F1 of 0.

Other observations:

When I look at the prediction files, it looks like the cause of the poor entity performance is that the model is making insane numbers of spurious entity predictions, almost predicting an entity on every single word of each doc, and for relations, is that all docs by 1 have no relation predictions at all
Which number of documents results in 0 performance changes when I re-run the analysis selecting new docs, but there is usually one or two that have 0 performance

Do you have any intuition for what might be going on here? To me it seems like it's possibly something in allennlp that fails catastrophically on smaller numbers of documents in an unpredictable manner, but I'd love to know your thoughts.

EDIT: on closer inspection, it looks like the model is predicting an entity on every possible span

dwadden commented 1 year ago

Hmm, this is bizarre. So, it sounds like there are at least two issues:

Training goes really badly sometimes.
The evaluation code doesn't work.

Is that right? Unfortunately, I can't offer much help; at this point I'm trying to maintain DyGIE to make sure it can do the stuff from the original paper, but that's about it. AllenNLP has also been retired at this point. Two ideas:

Have you monitored the gradient norms and loss function during training? If you see big spikes in the runs that lead to bad performance, you can try to figure out what's causing the spikes. Is it a particular training instance, for example?
For allennlp evaluate, it sounds like the results disagree with what you're getting when you write the eval code yourself. Are you able to localize a bug in the eval code?

serenalotreck commented 1 year ago

Thanks for your thoughts!

I haven't located a bug in my own eval code; and I feel pretty confidently that it's related to model training, for three reasons: (1) I've tested my eval code pretty extensively, (2) I only see the issue on ChemProt, and when I look at the prediction files that I feed to the eval code, they look totally bonkers, with a prediction on every possible span, so the eval results align with what I would expect from prediction files like those, and (3) the results between the DyGIE++ metrics.json for the dev set and allennlp evaluate on the dev set are totally different (~0.4 reported with model training and 0 reported from allennlp evaluate).

I'll take a look at the gradient norms and loss functions, I haven't done that yet, and I have a feeling that you're totally right about there being some particular training instance (or set of instances) that cause the issue, since it doesn't happen every time I run the analysis.

Thanks again, I'll let you know what I figure out!

dwadden commented 1 year ago

Sounds good, good luck debugging!

dwadden / dygiepp

Models trained with subset of ChemProt data unpredictably make large quantities of spurious predictions #114