Machine-Learning-for-Medical-Language / cnlp_transformers

Transformers for Clinical NLP
https://cnlp-transformers.readthedocs.io/en/stable/
Apache License 2.0

Make sure eval_one_score metric is saved for Trainer #190

Closed by mikix 8 months ago

mikix commented 1 year ago

transformers.Trainer was erroring out because it was looking for the eval_one_score metric, which we never saved for it.

This sets it in the generated metrics for Trainer to find.

Traceback I was seeing

At the end of training:

Traceback (most recent call last):
  File "/programs/x86_64-linux/python/3.9.16/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/programs/x86_64-linux/python/3.9.16/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/lib/python3.9/site-packages/cnlpt/train_system.py", line 556, in <module>
    main()
  File "/root/lib/python3.9/site-packages/cnlpt/train_system.py", line 466, in main
    trainer.train(
  File "/root/lib/python3.9/site-packages/transformers/trainer.py", line 1553, in train
    return inner_training_loop(
  File "/root/lib/python3.9/site-packages/transformers/trainer.py", line 1927, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/root/lib/python3.9/site-packages/transformers/trainer.py", line 2265, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/root/lib/python3.9/site-packages/transformers/trainer.py", line 2377, in _save_checkpoint
    metric_value = metrics[metric_to_check]
KeyError: 'eval_one_score'
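
The failing lookup on the last line of the traceback can be reproduced without transformers. The sketch below is an illustration, not the actual Trainer or cnlpt code: it assumes the run selects its best checkpoint by a metric named one_score and that evaluation metrics are stored under an eval_ prefix. The helper names (add_eval_prefix, metric_for_checkpoint, compute_metrics_with_one_score), the task names, and the averaging rule used for one_score are all hypothetical.

```python
# Illustrative sketch only: these helpers mimic the dictionary lookup shown in
# the traceback. They are not the real transformers or cnlpt implementations.


def add_eval_prefix(metrics, prefix="eval"):
    """Prefix metric names with 'eval_' unless they already carry it."""
    return {
        key if key.startswith(f"{prefix}_") else f"{prefix}_{key}": value
        for key, value in metrics.items()
    }


def metric_for_checkpoint(metrics, metric_for_best_model):
    """Look up the metric used to choose the best checkpoint."""
    metric_to_check = metric_for_best_model
    if not metric_to_check.startswith("eval_"):
        metric_to_check = f"eval_{metric_to_check}"
    # Raises KeyError if the metric was never added to the metrics dict.
    return metrics[metric_to_check]


def compute_metrics_with_one_score(task_scores):
    """Hypothetical compute_metrics: per-task scores plus their mean as one_score."""
    metrics = dict(task_scores)
    metrics["one_score"] = sum(task_scores.values()) / len(task_scores)
    return metrics


# Made-up per-task scores, purely for illustration.
task_scores = {"negation_f1": 0.5, "temporal_f1": 1.0}

# Without one_score, the lookup fails in the same way as the traceback.
try:
    metric_for_checkpoint(add_eval_prefix(task_scores), "one_score")
except KeyError as err:
    print("missing metric:", err)

# With one_score included in the generated metrics, the lookup succeeds.
fixed = add_eval_prefix(compute_metrics_with_one_score(task_scores))
print(metric_for_checkpoint(fixed, "one_score"))
```

The point of the sketch is only that the selected metric has to be present in the metrics dictionary by the time the checkpoint is saved. How one_score should actually be computed is a separate question for this code base.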

Is this the correct fix?

I'm not super familiar with this code path, but this seemed to be appropriate.

tmills commented 1 year ago

I think this will fix the exception (and produce reasonable behavior), but something else weird is going on: we're probably saving the model twice, because saving happens both inside the Trainer API and in our own code. We'll dig into this a little bit.

tmills commented 8 months ago

I'm checking in on old PRs. I wonder if this is still an issue in v0.7.0? @mikix do you still remember the setup that caused this error? I'd like to replicate it with a public dataset so I can track it down.

mikix commented 8 months ago

I'm sorry, I don't remember what I was working on when I hit this :frowning_face: