Closed Aatlantise closed 3 years ago
Not sure, but this may be related....
In line 81 of run.py, trainer is called with arguments from hparams, instead of loaded_hparams with the updated arguments, with which the model is declared.
Is this discrepancy intended?
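One way to check whether the discrepancy actually matters is to print which arguments differ between the two objects. This is a minimal sketch, assuming both hparams and loaded_hparams are argparse.Namespace objects; the helper name differing_keys and the example values are hypothetical and not taken from run.py.

```python
from argparse import Namespace

MISSING = "<missing>"  # placeholder shown when a key exists on one side only


def differing_keys(hparams: Namespace, loaded_hparams: Namespace) -> dict:
    """Map each differing key to (value in hparams, value in loaded_hparams)."""
    left, right = vars(hparams), vars(loaded_hparams)
    diff = {}
    for key in sorted(left.keys() | right.keys()):
        a = left.get(key, MISSING)
        b = right.get(key, MISSING)
        if a != b:
            diff[key] = (a, b)
    return diff


# Illustrative values only; the real objects come from run.py.
hparams = Namespace(epochs=16, lr=2e-5, wreg=1)
loaded_hparams = Namespace(epochs=16, lr=5e-6)

for key, (cli_value, loaded_value) in differing_keys(hparams, loaded_hparams).items():
    print(f"{key}: hparams={cli_value!r} loaded_hparams={loaded_value!r}")
```

If nothing is printed for the real objects, the trainer and the model see the same arguments and the discrepancy is harmless.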
Hello,
Thank you for sharing your work. I am trying to train the model myself but am stuck at the constrained model step:
python run.py --save models/oie_model --mode resume --model_str bert-base-cased --task oie --epochs 16 --gpus 1 --batch_size 16 --optimizer adam --lr 5e-06 --iterative_layers 2 --checkpoint models/warmup_oie_model/epoch=15_eval_acc=0.485.ckpta --constraints posm_hvc_hvr_hve --save_k 3 --accumulate_grad_batches 2 --gradient_clip_val 1 --multi_opt --lr 2e-5 --wreg 1 --cweights 3_3_3_3 --val_check_interval 0.1
It looks like the model doesn't train under 'resume' mode. The validation sanity check passes, but as soon as trainer.fit() is called, it exits immediately with the following log output:
Validation sanity check: 100%|##########################################| 5/5 [00:00<00:00, 7.83it/s]
Results: {'eval_f1': 0.094, 'eval_auc': 0.0362, 'eval_lastf1': 0.094}
Training: 0it [00:00, ?it/s]
The TFevents file offers only the following, which is uninformative:
Processing event files... (this can take a few minutes)
======================================================================
These tags are in events.out.tfevents.1634007290.34a1b7eb1401.3870.0:
audio -
histograms -
images -
scalars -
tensor -
======================================================================
Event statistics for events.out.tfevents.1634007290.34a1b7eb1401.3870.0:
audio -
graph -
histograms -
images -
scalars -
sessionlog:checkpoint -
sessionlog:start -
sessionlog:stop -
tensor -
======================================================================
With --mode train_test, the same command and parameters train successfully. I'd appreciate any help I can get. Thank you!
Hello, I also encountered the same problem: training did not resume after I used --mode train_test.
I'm also confused on how the rescore model is produced. It looks like the rescore model is required for the final model (to recalculate confidences and thus AUC), but only the warmup, constraint, and conjunction models seem to be trained in your tutorial.
Any insight would be appreciated!
I've discovered this was an error on my part. I was resuming from a warmup model of my own, which had already been trained for 16 epochs. Calling resume with --epochs 16 thus quit without training, because the checkpoint was already at epoch 15 (the 16th epoch, counting from zero).
Calling the same command with --epochs 20, I was able to continue training.
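For anyone hitting the same thing, the arithmetic can be sanity-checked before launching a resume run. This is a rough sketch, assuming the epoch=N part of the checkpoint filename is the zero-based index of the last finished epoch and that resuming continues at epoch N + 1. Both helper names are hypothetical and not part of run.py.

```python
import re
from typing import Optional


def epoch_from_checkpoint_name(path: str) -> Optional[int]:
    """Extract N from an 'epoch=N' fragment in the filename, or None if absent."""
    match = re.search(r"epoch=(\d+)", path)
    return int(match.group(1)) if match else None


def epochs_left(last_finished_epoch: int, requested_epochs: int) -> int:
    """Epochs that would still run, assuming zero-based epoch indices."""
    return max(0, requested_epochs - (last_finished_epoch + 1))


checkpoint = "models/warmup_oie_model/epoch=15_eval_acc=0.485.ckpta"
last_epoch = epoch_from_checkpoint_name(checkpoint)

for requested in (16, 20):
    print(f"--epochs {requested}: {epochs_left(last_epoch, requested)} epoch(s) left to train")
# --epochs 16: 0 epoch(s) left to train
# --epochs 20: 4 epoch(s) left to train
```

With 0 epochs left, the sanity check still runs but the training loop has nothing to do, which matches the "Training: 0it" line in the log.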
Perhaps @123zzw, you are having the same issue?
I'm closing this issue, but I would still welcome any insight regarding the rescore model. :)