Doubiiu / CodeTalker

[CVPR 2023] CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior
MIT License

The tensor output by self.vertice_mapping in the stage-1 TransformerEncoder is all NaN #10

Open FortisCK opened 1 year ago

FortisCK commented 1 year ago

Usually, by the time training reaches the second epoch, the outputs become all NaN. When I inspect the bias and weight of the linear layer at that point, they are all NaN as well.

self.encoder.vertice_mapping[0]
Linear(in_features=15069, out_features=1024, bias=True)
self.encoder.vertice_mapping[0].bias
Parameter containing:
tensor([nan, nan, nan,  ..., nan, nan, nan], device='cuda:0',
       requires_grad=True)
self.encoder.vertice_mapping[0].weight
Parameter containing:
tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], device='cuda:0',
       requires_grad=True)
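
For anyone debugging this, here is a minimal sketch for scanning all parameters for non-finite values during training (assumes a standard PyTorch nn.Module; find_bad_params is a hypothetical helper, not part of this repo):

import torch

def find_bad_params(model: torch.nn.Module):
    # Collect the names of all parameters containing NaN or Inf.
    bad = []
    for name, param in model.named_parameters():
        if not torch.isfinite(param).all():
            bad.append(name)
    return bad

# Call right after optimizer.step() to catch the first corrupted update:
# bad = find_bad_params(model)
# if bad:
#     raise RuntimeError(f"non-finite parameters: {bad}")

Running the check every step (or every N steps) narrows the failure down to the exact iteration where the weights first blow up, instead of only noticing it in the next epoch.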
Doubiiu commented 1 year ago

I have not encountered this before. Are you using the default config for training? It might be solved by scaling down the learning rate, I guess?
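
One way to make that concrete is a guarded training step: skip the update when the loss is already non-finite, and clip gradients before stepping. A minimal sketch, assuming a generic PyTorch loop (guarded_step and the names below are placeholders, not the actual train_vq.py code):

import torch

def guarded_step(model, optimizer, loss, max_grad_norm=1.0):
    optimizer.zero_grad(set_to_none=True)
    # If the loss is already NaN/Inf, skip the update entirely so the
    # weights are never contaminated.
    if not torch.isfinite(loss):
        return False
    loss.backward()
    # Clip gradients: exploding gradients are a frequent cause of
    # weights going NaN mid-epoch.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return True

# Scaling down the learning rate when building the optimizer, e.g.:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # smaller than default

Gradient clipping plus a smaller learning rate usually stops the loss from diverging in the first place; the skip logic is just a safety net.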

youngstu commented 1 year ago

I have the same problem.

[2023-04-21 17:04:49,310 INFO train_vq.py line 181 11368]=>Epoch: [1/200][70/314] Data: 0.024 (0.035) Batch: 0.109 (0.121) Remain: 02:06:09 Loss: 0.1313 
[2023-04-21 17:04:50,289 INFO train_vq.py line 181 11368]=>Epoch: [1/200][80/314] Data: 0.026 (0.033) Batch: 0.130 (0.118) Remain: 02:03:08 Loss: 0.1381 
[2023-04-21 17:04:51,143 INFO train_vq.py line 181 11368]=>Epoch: [1/200][90/314] Data: 0.029 (0.033) Batch: 0.130 (0.114) Remain: 01:59:22 Loss: 0.1342 
[2023-04-21 17:04:51,857 INFO train_vq.py line 181 11368]=>Epoch: [1/200][100/314] Data: 0.024 (0.032) Batch: 0.063 (0.110) Remain: 01:54:52 Loss: 0.1323 
[2023-04-21 17:04:52,757 INFO train_vq.py line 181 11368]=>Epoch: [1/200][110/314] Data: 0.026 (0.031) Batch: 0.066 (0.108) Remain: 01:52:58 Loss: 0.1308 
[2023-04-21 17:04:53,606 INFO train_vq.py line 181 11368]=>Epoch: [1/200][120/314] Data: 0.025 (0.031) Batch: 0.072 (0.106) Remain: 01:50:53 Loss: 0.1322 
[2023-04-21 17:04:54,501 INFO train_vq.py line 181 11368]=>Epoch: [1/200][130/314] Data: 0.024 (0.030) Batch: 0.071 (0.105) Remain: 01:49:34 Loss: nan 
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
[2023-04-21 17:04:55,388 INFO train_vq.py line 181 11368]=>Epoch: [1/200][140/314] Data: 0.024 (0.030) Batch: 0.076 (0.104) Remain: 01:48:20 Loss: nan 
INFO:main-logger:Epoch: [1/200][140/314] Data: 0.024 (0.030) Batch: 0.076 (0.104) Remain: 01:48:20 Loss: nan 
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
[2023-04-21 17:04:56,192 INFO train_vq.py line 181 11368]=>Epoch: [1/200][150/314] Data: 0.024 (0.029) Batch: 0.071 (0.102) Remain: 01:46:41 Loss: nan 
INFO:main-logger:Epoch: [1/200][150/314] Data: 0.024 (0.029) Batch: 0.071 (0.102) Remain: 01:46:41 Loss: nan 
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
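
To pinpoint which operation first produces the NaN (e.g. inside vertice_mapping versus elsewhere in the encoder), PyTorch's anomaly detection can be enabled while debugging. A sketch, assuming a standard training loop (this slows training considerably, so disable it afterwards):

import torch

# Makes loss.backward() raise a RuntimeError whose traceback points at
# the forward op that produced the first non-finite value.
torch.autograd.set_detect_anomaly(True)

# for batch in loader:
#     loss = compute_loss(model, batch)  # placeholder for the repo's loss
#     loss.backward()                    # error identifies the offending op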