Closed RichJackson closed 3 years ago
The code seems to break because the model gets an accuracy of 0.0 at the end of the 1st epoch. I will push corrected code to avoid this. But I am concerned about the 0% accuracy in the first place. Are you using your own custom data?
Hi there. Thanks for helping me out!
I'm trying to reproduce the models as described in the original paper, so I am following the instructions in the README and therefore using the original data.
I'm running with the command:
python run.py --save models/warmup_oie_model --mode train_test --model_str bert-base-cased --task oie --epochs 30 --gpus 1 --batch_size 24 --optimizer adamW --lr 2e-05 --iterative_layers 2
Note: I had to make a few fixes to requirements.txt in order to get the code to run. My current environment looks like:
absl-py==0.9.0
aiohttp==3.7.4.post0
alabaster==0.7.12
allennlp===0.9.0-unreleased
astroid==1.6.6
async-timeout==3.0.1
attrs==21.2.0
Babel==2.9.1
backcall==0.2.0
bleach==3.3.0
blis==0.4.1
boto3==1.10.45
botocore==1.13.45
cached-property==1.5.2
cachetools==4.1.1
catalogue==1.0.0
certifi==2020.6.20
cffi==1.14.5
chardet==3.0.4
click==7.1.2
codecov==2.1.11
colorama==0.4.4
conllu==1.3.1
coverage==5.5
cryptography==3.4.7
cycler==0.10.0
cymem==2.0.3
decorator==4.4.2
docopt==0.6.2
docutils==0.15.2
editdistance==0.5.3
en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.0/en_core_web_sm-2.3.0.tar.gz
filelock==3.0.12
flaky==3.6.1
Flask==1.1.1
Flask-Cors==3.0.8
ftfy==5.6
future==0.18.2
gevent==1.4.0
google-auth==1.18.0
google-auth-oauthlib==0.4.1
greenlet==1.1.0
grpcio==1.30.0
h5py==2.10.0
idna==2.10
imageio==2.8.0
imagesize==1.2.0
importlib-metadata==1.7.0
ipdb==0.13.9
ipython==7.16.1
ipython-genutils==0.2.0
isort==5.8.0
itsdangerous==2.0.1
jedi==0.17.1
jeepney==0.6.0
Jinja2==3.0.1
jmespath==0.10.0
joblib==0.15.1
jsonnet @ file:///home/conda/feedstock_root/build_artifacts/jsonnet_1606064680848/work
jsonpickle==1.2
keyring==23.0.1
kiwisolver==1.3.1
lazy-object-proxy==1.6.0
livereload==2.6.3
Markdown==3.2.2
MarkupSafe==2.0.1
matplotlib==3.1.2
matplotlib-inline==0.1.2
mccabe==0.6.1
more-itertools==8.8.0
multidict==5.1.0
murmurhash==1.0.2
mypy==0.521
nltk==3.5
numpy==1.19.0
numpydoc==0.9.2
oauthlib==3.1.0
overrides==3.1.0
packaging==20.4
pandas==1.0.5
parsimonious==0.8.1
parso==0.7.0
pexpect==4.8.0
pickleshare==0.7.5
Pillow==7.1.2
pkginfo==1.7.0
plac==1.1.3
pluggy==0.13.1
preshed==3.0.2
prompt-toolkit==3.0.5
protobuf==3.12.2
ptyprocess==0.6.0
py==1.10.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.20
Pygments==2.6.1
pylint==1.9.4
pypandoc==1.5
pyparsing==2.4.7
pytest==5.3.2
pytest-cov==2.12.1
python-dateutil==2.8.1
pytorch-lightning==0.7.6
pytorch-pretrained-bert==0.6.2
pytorch-transformers @ file:///home/user/openie6/imojie/pytorch_transformers
pytz==2020.1
PyYAML==5.3.1
readme-renderer==29.0
regex==2020.6.8
requests==2.24.0
requests-oauthlib==1.3.0
requests-toolbelt==0.9.1
responses==0.10.9
rfc3986==1.5.0
rsa==4.6
s3transfer==0.2.1
sacremoses==0.0.43
scikit-learn==0.23.1
scipy==1.5.0
SecretStorage==3.3.1
sentencepiece==0.1.91
six==1.15.0
snowballstemmer==2.1.0
spacy==2.3.0
Sphinx==2.3.1
sphinx-autobuild==2021.3.14
sphinx-rtd-theme==0.5.2
sphinxcontrib-applehelp==1.0.2
sphinxcontrib-devhelp==1.0.2
sphinxcontrib-htmlhelp==2.0.0
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.3
sphinxcontrib-serializinghtml==1.1.5
sqlparse==0.3.0
srsly==1.0.2
tensorboard==2.2.2
tensorboard-plugin-wit==1.7.0
tensorboardX==1.9
thinc==7.4.1
threadpoolctl==2.1.0
tokenizers==0.5.2
toml==0.10.2
toolz==0.11.1
torch==1.6.0
torchtext==0.7.0
tornado==6.1
tqdm==4.47.0
traitlets==4.3.3
transformers==2.6.0
twine==3.4.1
typed-ast==1.0.4
typing-extensions==3.10.0.0
Unidecode==1.1.1
urllib3==1.25.9
wasabi==0.7.0
wcwidth==0.2.5
webencodings==0.5.1
Werkzeug==1.0.1
wget==3.2
word2number==1.1
wrapt==1.12.1
yarl==1.6.3
zenodo-get==1.3.0
zipp==3.1.0
zope.event==4.5.0
zope.interface==5.4.0
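As an aside, drift between an environment like the one above and the repo's pinned requirements.txt is easy to miss by eye. A minimal, generic sketch for diffing two `name==version` lists (this is just an illustration, not part of the repo; file contents are assumptions):

```python
# Compare two requirements-style version lists and report mismatches.
# Illustrative only -- adjust inputs to the repo's actual requirements.txt.

def parse_reqs(lines):
    """Parse 'name==version' lines into a dict; skip comments and URL-style entries."""
    pins = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line or "@" in line:
            continue
        name, _, version = line.partition("==")
        pins[name.lower()] = version
    return pins

def diff_reqs(installed_lines, required_lines):
    """Return {name: (installed_version_or_None, required_version)} for mismatches."""
    installed = parse_reqs(installed_lines)
    required = parse_reqs(required_lines)
    return {
        name: (installed.get(name), version)
        for name, version in required.items()
        if installed.get(name) != version
    }

if __name__ == "__main__":
    installed = ["torch==1.6.0", "transformers==2.6.0"]
    required = ["torch==1.6.0", "transformers==2.7.0"]
    print(diff_reqs(installed, required))  # {'transformers': ('2.6.0', '2.7.0')}
```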
I can also confirm that training for longer still results in an eval_f1 of 0:
Validation sanity check: 100%|██████████| 2/2 [00:00<00:00, 2.82it/s]
Results: {'eval_f1': 0, 'eval_auc': 0, 'eval_lastf1': 0}
Epoch 1: 100%|██████████| 3827/3827 [11:57<00:00, 5.33it/s, loss=1.060, v_num=train.part]
Results: {'eval_f1': 0, 'eval_auc': 0, 'eval_lastf1': 0}
Epoch 2: 100%|█████████▉| 3826/3827 [12:40<00:00, 5.03it/s, loss=1.088, v_num=train.part]
Results: {'eval_f1': 0, 'eval_auc': 0, 'eval_lastf1': 0}
Epoch 3: 100%|█████████▉| 3826/3827 [12:55<00:00, 4.94it/s, loss=0.817, v_num=train.part]
Results: {'eval_f1': 0, 'eval_auc': 0, 'eval_lastf1': 0}
Epoch 4: 100%|█████████▉| 3826/3827 [13:02<00:00, 4.89it/s, loss=0.832, v_num=train.part]
Results: {'eval_f1': 0, 'eval_auc': 0, 'eval_lastf1': 0}
Epoch 5: 100%|█████████▉| 3826/3827 [13:09<00:00, 4.84it/s, loss=0.641, v_num=train.part]
Results: {'eval_f1': 0, 'eval_auc': 0, 'eval_lastf1': 0}
Epoch 6: 100%|█████████▉| 3826/3827 [13:14<00:00, 4.82it/s, loss=0.651, v_num=train.part]
Results: {'eval_f1': 0, 'eval_auc': 0, 'eval_lastf1': 0}
Epoch 7: 100%|█████████▉| 3826/3827 [13:11<00:00, 4.83it/s, loss=0.536, v_num=train.part]
Results: {'eval_f1': 0, 'eval_auc': 0, 'eval_lastf1': 0}
Epoch 8: 100%|█████████▉| 3826/3827 [13:18<00:00, 4.79it/s, loss=0.480, v_num=train.part]
Results: {'eval_f1': 0, 'eval_auc': 0, 'eval_lastf1': 0}
Epoch 9: 100%|█████████▉| 3826/3827 [13:16<00:00, 4.80it/s, loss=0.393, v_num=train.part]
Results: {'eval_f1': 0, 'eval_auc': 0, 'eval_lastf1': 0}
Epoch 10: 100%|█████████▉| 3826/3827 [13:13<00:00, 4.82it/s, loss=0.438, v_num=train.part]
Results: {'eval_f1': 0, 'eval_auc': 0, 'eval_lastf1': 0}
Epoch 11: 100%|█████████▉| 3826/3827 [13:20<00:00, 4.78it/s, loss=0.388, v_num=train.part]
Results: {'eval_f1': 0, 'eval_auc': 0, 'eval_lastf1': 0}
Epoch 12: 100%|█████████▉| 3826/3827 [13:16<00:00, 4.81it/s, loss=0.369, v_num=train.part]
Results: {'eval_f1': 0, 'eval_auc': 0, 'eval_lastf1': 0}
Epoch 13: 100%|█████████▉| 3826/3827 [13:18<00:00, 4.79it/s, loss=0.295, v_num=train.part]
Results: {'eval_f1': 0, 'eval_auc': 0, 'eval_lastf1': 0}
Epoch 14: 100%|█████████▉| 3826/3827 [13:20<00:00, 4.78it/s, loss=0.319, v_num=train.part]
Results: {'eval_f1': 0, 'eval_auc': 0, 'eval_lastf1': 0}
Epoch 15: 100%|█████████▉| 3826/3827 [13:16<00:00, 4.80it/s, loss=0.256, v_num=train.part]
Results: {'eval_f1': 0, 'eval_auc': 0, 'eval_lastf1': 0}
Epoch 16: 100%|█████████▉| 3826/3827 [13:17<00:00, 4.80it/s, loss=0.263, v_num=train.part]
Results: {'eval_f1': 0, 'eval_auc': 0, 'eval_lastf1': 0}
Epoch 17: 100%|█████████▉| 3826/3827 [13:20<00:00, 4.78it/s, loss=0.241, v_num=train.part]
Results: {'eval_f1': 0, 'eval_auc': 0, 'eval_lastf1': 0}
Epoch 18: 100%|█████████▉| 3826/3827 [13:21<00:00, 4.77it/s, loss=0.185, v_num=train.part]
Results: {'eval_f1': 0, 'eval_auc': 0, 'eval_lastf1': 0}
Epoch 19: 100%|█████████▉| 3826/3827 [13:21<00:00, 4.77it/s, loss=0.184, v_num=train.part]
Results: {'eval_f1': 0, 'eval_auc': 0, 'eval_lastf1': 0}
Epoch 20: 100%|█████████▉| 3826/3827 [13:22<00:00, 4.77it/s, loss=0.230, v_num=train.part]
Results: {'eval_f1': 0, 'eval_auc': 0, 'eval_lastf1': 0}
Epoch 21: 100%|█████████▉| 3826/3827 [13:21<00:00, 4.77it/s, loss=0.225, v_num=train.part]
Results: {'eval_f1': 0, 'eval_auc': 0, 'eval_lastf1': 0}
Epoch 22: 100%|█████████▉| 3826/3827 [13:19<00:00, 4.78it/s, loss=0.218, v_num=train.part]
Results: {'eval_f1': 0, 'eval_auc': 0, 'eval_lastf1': 0}
Epoch 23: 100%|█████████▉| 3826/3827 [13:22<00:00, 4.77it/s, loss=0.147, v_num=train.part]
Results: {'eval_f1': 0, 'eval_auc': 0, 'eval_lastf1': 0}
Epoch 24: 100%|█████████▉| 3826/3827 [13:23<00:00, 4.76it/s, loss=0.217, v_num=train.part]
Results: {'eval_f1': 0, 'eval_auc': 0, 'eval_lastf1': 0}
Epoch 25: 100%|█████████▉| 3826/3827 [13:21<00:00, 4.77it/s, loss=0.162, v_num=train.part]
Results: {'eval_f1': 0, 'eval_auc': 0, 'eval_lastf1': 0}
Epoch 26: 100%|█████████▉| 3826/3827 [13:21<00:00, 4.78it/s, loss=0.143, v_num=train.part]
Results: {'eval_f1': 0, 'eval_auc': 0, 'eval_lastf1': 0}
Epoch 27: 100%|█████████▉| 3826/3827 [13:22<00:00, 4.77it/s, loss=0.141, v_num=train.part]
Results: {'eval_f1': 0, 'eval_auc': 0, 'eval_lastf1': 0}
Epoch 28: 100%|█████████▉| 3826/3827 [13:22<00:00, 4.77it/s, loss=0.114, v_num=train.part]
Results: {'eval_f1': 0, 'eval_auc': 0, 'eval_lastf1': 0}
Epoch 29: 100%|█████████▉| 3826/3827 [13:20<00:00, 4.78it/s, loss=0.133, v_num=train.part]
Results: {'eval_f1': 0, 'eval_auc': 0, 'eval_lastf1': 0}
Epoch 30: 100%|█████████▉| 3826/3827 [13:21<00:00, 4.77it/s, loss=0.120, v_num=train.part]
Results: {'eval_f1': 0, 'eval_auc': 0, 'eval_lastf1': 0}
Epoch 30: 100%|██████████| 3827/3827 [13:21<00:00, 4.77it/s, loss=0.120, v_num=train.part]
Testing: 100%|██████████| 27/27 [00:02<00:00, 12.09it/s]
Results: {'eval_f1': 0, 'eval_auc': 0, 'eval_lastf1': 0}
--------------------------------------------------------------------------------
TEST RESULTS
{'eval_auc': 0, 'eval_f1': 0, 'eval_lastf1': 0, 'test_acc': 0}
--------------------------------------------------------------------------------
Testing: 100%|██████████| 27/27 [00:02<00:00, 10.44it/s]
The loss seems to be decreasing as expected, so perhaps there's a problem with the evaluation code? Any help would be greatly appreciated!
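For what it's worth, a score of exactly 0 for every metric at every epoch, while the loss falls normally, often means the scorer is receiving no extractions (or none that match the gold format) rather than genuinely poor ones: set-based precision and recall both collapse to zero when the prediction set is empty. A toy illustration of that behaviour (this is not the repo's actual scorer, just a sketch of the general failure mode):

```python
def f1(predicted, gold):
    """Toy set-based F1 over (subject, relation, object) triples.

    Empty or fully mismatched predictions score exactly 0.0 -- the same
    signature as an eval pipeline that never sees the model's output.
    """
    if not predicted or not gold:
        return 0.0
    tp = len(set(predicted) & set(gold))
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = [("Obama", "born in", "Hawaii")]
print(f1([], gold))    # 0.0 -- empty predictions
print(f1(gold, gold))  # 1.0 -- perfect match
```

So one thing worth checking is whether the extractions written during evaluation actually reach (and parse in) the scoring step.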
I just ran a quick test with your pretrained model (i.e. just the evaluation part without training)
python run.py --save models/warmup_oie_model --mode test --model_str bert-base-cased --task oie --epochs 30 --gpus 1 --batch_size 24 --optimizer adamW --lr 2e-05 --iterative_layers 2
This also results in an eval_f1 of 0.0. However, running with --mode predict
seems to give the expected results, which suggests the evaluation code isn't working correctly?
Hello, I have re-run the steps from the README in a fresh environment (Installation / Download Resources / Testing warmup model) and I am able to replicate the scores perfectly in test mode. I checked the important libraries in your current environment and they seem to match. What was the issue you found with the original requirements.txt? What exactly did you have to change? That may give some insight into this.
Closing this now. Not quite sure what I was doing wrong, but I can confirm the code is performing as expected now.
Hi there. The warmup training seems to fail. Error below