Babelscape / rebel

REBEL is a seq2seq model that simplifies Relation Extraction (EMNLP 2021).

Replicating REBEL from BART and some issues #59

Closed jefflink closed 1 year ago

jefflink commented 1 year ago

Hi, thank you for the very interesting work that you have done! I'm trying to replicate your training process based on train.py and the default_model configuration, in order to reach the same state as your released REBEL model. However, I ran into some issues and would like to seek your help.

/python3.7/site-packages/pytorch_lightning/plugins/native_amp.py:65: FutureWarning: Non-finite norm encountered in torch.nn.utils.clip_grad_norm_; continuing anyway. Note that the default behavior will change in a future release to error out if a non-finite total norm is encountered. At that point, setting error_if_nonfinite=false will be required to retain the old behavior.
  torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=grad_clip_val, norm_type=norm_type)
Epoch 8:  95%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏     | 752/790 [02:51<00:08,  4.39it/s, loss=0.316, v_num=dgll]
processed 300 sentences with 3515 relations; found: 987 relations; correct: 520.
Epoch 8, global step 845: val_F1_micro reached 23.17290 (best 23.17290)

Epoch 9:  95%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████      | 752/790 [02:51<00:08,  4.38it/s, loss=1.71, v_num=dgll]
processed 300 sentences with 3515 relations; found: 5986 relations; correct: 1.
Epoch 9, global step 939: val_F1_micro was not in top 3

Epoch 10:  95%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏     | 752/790 [02:50<00:08,  4.40it/s, loss=1.44, v_num=dgll]
processed 300 sentences with 3515 relations; found: 0 relations; correct: 0.
Epoch 10, global step 1033: val_F1_micro was not in top 3

Epoch 11:  95%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏     | 752/790 [02:49<00:08,  4.44it/s, loss=1.31, v_num=dgll]
processed 300 sentences with 3515 relations; found: 1800 relations; correct: 0.
Epoch 11, global step 1127: val_F1_micro was not in top 3
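
For reference, those val_F1_micro numbers follow from standard micro precision/recall/F1 over the counts in the lines above (a generic sketch, not the repo's actual scoring code; small step/rounding differences aside):

```python
# Micro precision/recall/F1 over (correct, found, gold) counts.
# Illustrative only; not REBEL's scoring code.
def micro_prf(correct: int, found: int, gold: int):
    precision = correct / found if found else 0.0
    recall = correct / gold if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Epoch 8 line: 3515 gold relations, 987 found, 520 correct -> F1 ~ 0.23.
print(micro_prf(520, 987, 3515))
```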

Thank you!

jefflink commented 1 year ago

Managed to resolve the first two errors by downgrading to torch 1.8.1.
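
For anyone hitting the same thing: the FutureWarning above comes from torch.nn.utils.clip_grad_norm_, which from torch 1.9 onwards takes an error_if_nonfinite flag and will eventually raise on non-finite gradient norms instead of just warning. A minimal sketch of the call the warning refers to (the toy model here is purely illustrative):

```python
import torch
from torch import nn

model = nn.Linear(10, 2)  # toy stand-in model
model(torch.randn(4, 10)).sum().backward()

# error_if_nonfinite=False keeps the old "warn and continue" behaviour
# referenced in the warning; the flag exists from torch 1.9 onwards,
# which is why downgrading to 1.8.1 sidesteps the issue.
total_norm = torch.nn.utils.clip_grad_norm_(
    model.parameters(), max_norm=0.5, error_if_nonfinite=False
)
print(total_norm)
```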

LittlePea13 commented 1 year ago

Sorry for the late reply. The code is indeed a bit "outdated", since torch, PyTorch Lightning, and transformers have all seen several updates that may break it. Hopefully it wasn't too much of an issue. I may try to find some time to update everything, but no promises.

Regarding the training procedure, there isn't much to it: just training BART on the REBEL dataset should do it. Depending on your hardware that may take a while, since the model is big and there are a lot of data instances. Make sure to use the default config files, such as default_data.yaml; this ensures the model only trains on the 230 most frequent relations, which is how REBEL was trained. A rough sketch of that setup follows.
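
As an illustration only (a minimal sketch, not the repo's train.py: the <triplet>/<subj>/<obj> special tokens and the linearized target format follow the REBEL model card, while the example sentence is a placeholder):

```python
# Minimal sketch: fine-tune BART as a seq2seq model mapping raw text to
# REBEL-style linearized triplets. Illustrative only; the real training
# loop lives in train.py with the hydra config files.
from transformers import BartForConditionalGeneration, BartTokenizerFast

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-large")
# REBEL's linearization marks triplets with these special tokens.
tokenizer.add_tokens(["<triplet>", "<subj>", "<obj>"], special_tokens=True)

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
model.resize_token_embeddings(len(tokenizer))  # account for the new tokens

source = "Punta Cana is a resort town in the Dominican Republic."  # toy example
target = "<triplet> Punta Cana <subj> Dominican Republic <obj> country"

batch = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

# Standard seq2seq cross-entropy on the linearized triplets; an optimizer
# step over the full REBEL dataset would follow in a real run.
loss = model(**batch, labels=labels).loss
loss.backward()
```

At inference time you would decode with model.generate and parse the triplets back out of the generated string, as in the extract_triplets snippet on the model card.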

Best, Pere-Lluis