Unable to train constraint model

Aatlantise commented 3 years ago

Hello,

Thank you for sharing your work. I am trying to train the model myself but am stuck at the constrained model step:

python run.py --save models/oie_model --mode resume --model_str bert-base-cased --task oie --epochs 16 --gpus 1 --batch_size 16 --optimizer adam --lr 5e-06 --iterative_layers 2 --checkpoint models/warmup_oie_model/epoch=15_eval_acc=0.485.ckpta --constraints posm_hvc_hvr_hve --save_k 3 --accumulate_grad_batches 2 --gradient_clip_val 1 --multi_opt --lr 2e-5 --wreg 1 --cweights 3_3_3_3 --val_check_interval 0.1

It looks like the model doesn't train in 'resume' mode. The validation sanity check passes, but once trainer.fit() is called, it exits immediately with the following log output:

Validation sanity check: 100%|##########################################| 5/5 [00:00<00:00,  7.83it/s]
Results: {'eval_f1': 0.094, 'eval_auc': 0.0362, 'eval_lastf1': 0.094}
Training: 0it [00:00, ?it/s]

The TFevents file is uninformative:

Processing event files... (this can take a few minutes)
======================================================================

These tags are in events.out.tfevents.1634007290.34a1b7eb1401.3870.0:
audio -
histograms -
images -
scalars -
tensor -
======================================================================

Event statistics for events.out.tfevents.1634007290.34a1b7eb1401.3870.0:
audio -
graph -
histograms -
images -
scalars -
sessionlog:checkpoint -
sessionlog:start -
sessionlog:stop -
tensor -
======================================================================

With --mode train_test, the same command and parameters train successfully. I'd appreciate any help I can get.

Thank you!

Aatlantise commented 3 years ago

Not sure, but this may be related....

On line 81 of run.py, the trainer is created with arguments from hparams rather than from loaded_hparams, which holds the updated arguments and is what the model is declared with.

Is this discrepancy intended?
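
One way to check whether the discrepancy matters is to diff the two namespaces before the trainer is built. The following is only a sketch: diff_hparams is a hypothetical helper, not something in run.py, and the example values are made up for illustration. It assumes both hparams and loaded_hparams are argparse.Namespace objects.

```python
from argparse import Namespace


def diff_hparams(cli: Namespace, loaded: Namespace) -> dict:
    """Map each key whose value differs to a (cli_value, loaded_value) pair.

    Hypothetical helper for illustration; keys present on only one side
    are reported with the placeholder string '<missing>'.
    """
    cli_vars, loaded_vars = vars(cli), vars(loaded)
    differences = {}
    for key in sorted(set(cli_vars) | set(loaded_vars)):
        cli_value = cli_vars.get(key, "<missing>")
        loaded_value = loaded_vars.get(key, "<missing>")
        if cli_value != loaded_value:
            differences[key] = (cli_value, loaded_value)
    return differences


# Made-up values, only to show the shape of the result.
hparams = Namespace(epochs=16, lr=2e-5, mode="resume")
loaded_hparams = Namespace(
    epochs=16, lr=5e-6, mode="resume", constraints="posm_hvc_hvr_hve"
)

for name, (cli_value, loaded_value) in diff_hparams(hparams, loaded_hparams).items():
    print(f"{name}: hparams={cli_value!r} loaded_hparams={loaded_value!r}")
```

Any key printed here is one where the trainer and the model would see different settings.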

123zzw commented 3 years ago

Hello, I also encountered the same problem: training did not resume after using --mode train_test.

Aatlantise commented 3 years ago

I'm also confused about how the rescore model is produced. It looks like the rescore model is required for the final model (to recalculate confidences and thus AUC), but only the warmup, constraint, and conjunction models seem to be trained in your tutorial.

Any insight would be appreciated!

Aatlantise commented 3 years ago

I've discovered this was an error on my part. I was resuming from a warmup model of my own, which had already been trained for 16 epochs. Calling resume with --epochs 16 therefore ended training immediately, because the checkpoint was already at epoch 15 (the last of 16 zero-indexed epochs).

Calling the same command with --epochs 20, I was able to continue training.
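
For anyone hitting the same thing, here is a rough sanity check that can be run before resuming. It is a sketch under assumptions: epochs_remaining is a hypothetical helper, and it assumes the checkpoint file name contains epoch=<N> with N being the zero-indexed epoch that was last completed, as the file name in this thread suggests. It does not read anything from the checkpoint itself.

```python
import re


def epochs_remaining(checkpoint_path: str, max_epochs: int) -> int:
    """Return how many epochs are left when resuming from a checkpoint.

    Assumption: the file name contains 'epoch=<N>', where N is the
    zero-indexed epoch that was last completed.
    """
    match = re.search(r"epoch=(\d+)", checkpoint_path)
    if match is None:
        raise ValueError(f"no 'epoch=<N>' found in {checkpoint_path!r}")
    completed = int(match.group(1)) + 1  # epoch 15 means 16 epochs are done
    return max(0, max_epochs - completed)


# Path copied from the command at the top of this thread.
ckpt = "models/warmup_oie_model/epoch=15_eval_acc=0.485.ckpta"

for max_epochs in (16, 20):
    left = epochs_remaining(ckpt, max_epochs)
    print(f"--epochs {max_epochs}: {left} epoch(s) left to train")
```

If the result is 0, the value passed to --epochs has to be raised for resume to do any training.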

Perhaps @123zzw, you are having the same issue?

I'm closing this issue, but I would still welcome any insight regarding the rescore model. :)

dair-iitd / openie6

Unable to train constraint model #12