IBM / transition-amr-parser

SoTA Abstract Meaning Representation (AMR) parsing with word-node alignments in PyTorch. Includes checkpoints and other tools such as statistical significance Smatch.
Apache License 2.0

Infinite loss when fine-tuning #50

Open Spongeorge opened 1 year ago

Spongeorge commented 1 year ago

I'm trying to fine-tune the AMR3.0 structured-BART large checkpoint on another dataset, but during training I get the following warnings:

2023-04-29 00:02:05 | WARNING | tensorboardX.x2num | NaN or Inf found in input tensor.
2023-04-29 00:02:05 | WARNING | tensorboardX.x2num | NaN or Inf found in input tensor.
2023-04-29 00:02:05 | WARNING | tensorboardX.x2num | NaN or Inf found in input tensor.
2023-04-29 00:02:05 | WARNING | tensorboardX.x2num | NaN or Inf found in input tensor.
2023-04-29 00:02:05 | INFO | train | {"epoch": 1, "train_loss": "inf", "train_nll_loss": "inf", "train_loss_seq": "inf", "train_nll_loss_seq": "inf", "train_loss_pos": "0.710562", "train_nll_loss_pos": "0.710562", "train_wps": "687.9", "train_ups": "0.51", "train_wpb": "1354.7", "train_bsz": "55.2", "train_num_updates": "71", "train_lr": "1.87323e-06", "train_gnorm": "17.868", "train_loss_scale": "8", "train_train_wall": "45", "train_wall": "158"}
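For context, the tensorboardX warnings fire whenever a scalar being logged is NaN or Inf, so they are a symptom of the infinite train_loss rather than a separate problem. A minimal sketch of roughly that kind of finiteness check in plain PyTorch (the tensor and names here are illustrative only, not code from this repo):

import torch

def check_finite(name, tensor):
    # Roughly the condition tensorboardX's x2num checks before logging a scalar:
    # warn if any element of the tensor is NaN or +/-Inf.
    if not torch.isfinite(tensor).all():
        bad = (~torch.isfinite(tensor)).sum().item()
        print(f"{name} contains {bad} NaN/Inf element(s)")

loss = torch.tensor(float("inf"))  # stand-in for the reported train_loss
check_finite("train_loss", loss)   # prints: train_loss contains 1 NaN/Inf element(s)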

In my config I set the fairseq-preprocess arguments as:

FAIRSEQ_PREPROCESS_FINETUNE_ARGS="--srcdict /content/DATA/AMR3.0/models/amr3.0-structured-bart-large-neur-al/seed42/dict.en.txt --tgtdict /content/DATA/AMR3.0/models/amr3.0-structured-bart-large-neur-al/seed42/dict.actions_nopos.txt"

and train args as:

FAIRSEQ_TRAIN_FINETUNE_ARGS="--finetune-from-model /content/DATA/AMR3.0/models/amr3.0-structured-bart-large-neur-al/seed42/checkpoint_wiki.smatch_top5-avg.pt --memory-efficient-fp16 --batch-size 16 --max-tokens 512 --patience 10"
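One possible factor, offered as a guess rather than a confirmed diagnosis: --memory-efficient-fp16 runs the model in half precision, which overflows to inf above roughly 65504, and fairseq's dynamic loss scaling then backs the loss scale off after overflows (the train_loss_scale of 8 in the log above is that scale). A tiny sketch of the overflow behaviour, with made-up numbers:

import torch

# fp16 saturates around 65504; anything bigger becomes inf.
# The values below are made up purely to illustrate the overflow.
x = torch.tensor([1000.0, 70000.0])
print(x.half())  # tensor([1000., inf], dtype=torch.float16)

# Scaling an fp16 loss (as dynamic loss scaling does) can overflow the same way.
loss_fp16 = torch.tensor(40000.0, dtype=torch.float16)
print(loss_fp16 * 2)  # tensor(inf, dtype=torch.float16)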

Any ideas as to what I'm doing wrong? Thanks in advance.

Spongeorge commented 1 year ago

Output from tests/correctly_installed.sh

pytorch 1.10.1+cu102
cuda 10.2
Apex not installed
smatch installed
pytorch-scatter installed
fairseq works
[OK] correctly installed

I also tried with the wiki25 dataset downloaded by tests/minimal_test.sh and got the same issue (infinite loss in both training and validation), so I don't think it's an issue with my input data. When I run tests/minimal_test.sh itself, though, the loss isn't infinite.