L-Zhe / BTmPG

Code for the paper "Pushing Paraphrase Away from Original Sentence: A Multi-Round Paraphrase Generation Approach" by Zhe Lin and Xiaojun Wan, accepted to Findings of ACL 2021.
MIT License

Overflow error #1

Open tomhosking opened 2 years ago

tomhosking commented 2 years ago

Hi,

During training, I get the following error:

```
Traceback (most recent call last):
  File "train.py", line 182, in <module>
    generation_save_path=args.generation_save_path)
  File "/disk/nfs/ostrom/s1717552/btmpg/utils/run.py", line 133, in __call__
    self.run()
  File "/disk/nfs/ostrom/s1717552/btmpg/utils/run.py", line 100, in run
    max_length=self.max_length)
  File "/disk/nfs/ostrom/s1717552/btmpg/model/VAE.py", line 206, in round
    out_embed = self.embed(self.GS(sentence[:, -1:, :]))
  File "/disk/nfs/ostrom/s1717552/btmpg/btmpgenv/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/disk/nfs/ostrom/s1717552/btmpg/model/gumbleSoftmax.py", line 17, in forward
    sigma = min(self.tau_max, (self.tau_max ** (self.n / self.N)))
OverflowError: (34, 'Numerical result out of range')
```

This happens after a few days of training, around epoch 39 for MSCOCO and epoch 77 for Quora.
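
For reference, the exception itself is just Python float exponentiation running out of range; it can be reproduced outside the model. This is a standalone illustration only (not code from this repository, and the step count is a hypothetical value chosen to be large enough to overflow):

```python
# Standalone illustration only, not code from this repository.
# Python float exponentiation raises errno 34 once the result
# exceeds the double range (~1.8e308).
tau_max, N = 100, 3500
n = 600_000          # hypothetical step count, large enough to overflow
try:
    sigma = min(tau_max, tau_max ** (n / N))  # the power is evaluated before min()
except OverflowError as e:
    print(e)         # (34, 'Numerical result out of range')
```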

The command used was:

```bash
python train.py --cuda \
                --train_source ./data/qqp_train.src \
                --train_target ./data/qqp_train.tgt \
                --test_source  ./data/qqp_dev.src \
                --test_target  ./data/qqp_dev.tgt \
                --vocab_path ./checkpoints/qqp.vocab \
                --batch_size 8 \
                --epoch 100 \
                --num_rounds 2 \
                --max_length 50 \
                --clip_length 50 \
                --model_save_path ./checkpoints/qqp.model \
                --generation_save_path ./outputs/qqp/
```

L-Zhe commented 2 years ago

I am trying to reproduce this error and will reply to you soon.

hahally commented 2 years ago

Hi, this overflow happens because `gumble_softmax` is not set up the way the paper describes. In run.py it is constructed as `self.GS = gumble_softmax(3500, 100)`, i.e. N = 3500 and tau_max = 100. If you look at the code carefully, you will see that n is incremented by 1 at every step, so as training progresses n keeps growing and `self.tau_max ** (self.n / self.N)` eventually raises an overflow error.
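
If anyone needs a workaround until the schedule is aligned with the paper, one option (an untested sketch, not a fix from the authors) is to clamp the exponent before exponentiating. For tau_max >= 1 this computes exactly the same value as the existing `min(self.tau_max, self.tau_max ** (self.n / self.N))`, but the power can never exceed tau_max and therefore can never overflow. The helper name `bounded_sigma` below is mine, just for illustration:

```python
def bounded_sigma(n, N, tau_max):
    """Equivalent to min(tau_max, tau_max ** (n / N)) when tau_max >= 1,
    but clamps the exponent first so the power is always <= tau_max
    and can never overflow a Python float."""
    return tau_max ** min(1.0, n / N)


# The clamped form stays finite for any step count, e.g. with the
# settings from run.py (N=3500, tau_max=100) and a very large n:
assert bounded_sigma(600_000, 3500, 100) == 100.0
```

Inside gumbleSoftmax.py this would correspond to `sigma = self.tau_max ** min(1.0, self.n / self.N)`, but the schedule values should still be checked against what the paper reports.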