Start landmark_generator_training
Project_name: landmarks
Load checkpoint from: ./checkpoints/landmark_generation/Pro_landmarks/landmarks_epoch_1166_checkpoint_step000035000.pth
Load optimizer state from ./checkpoints/landmark_generation/Pro_landmarks/landmarks_epoch_1166_checkpoint_step000035000.pth
init dataset,filtering very short videos.....
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 49644/49644 [00:04<00:00, 11383.00it/s]
complete,with available vids: 49475
init dataset,filtering very short videos.....
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:00<00:00, 11265.29it/s]
complete,with available vids: 9976
  0%|          | 0/30 [00:00<?, ?it/s]
Saved checkpoint: ./checkpoints/landmark_generation/Pro_landmarks/landmarks_epoch1166_step000035000.pth
Evaluating model for 25 epochs
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [02:45<00:00, 6.62s/it]
eval_L1_loss 0.005300633320584894 global_step: 35000
eval_velocity_loss 0.04097183309495449 global_step: 35000
0%| | 0/30 [02:56<?, ?it/s]
Traceback (most recent call last):
  File "train_landmarks_generator.py", line 341, in <module>
    optimizer.step()
  File "/opt/anaconda3/envs/iplap_py37/lib/python3.7/site-packages/torch/optim/optimizer.py", line 140, in wrapper
    out = func(*args, **kwargs)
  File "/opt/anaconda3/envs/iplap_py37/lib/python3.7/site-packages/torch/optim/optimizer.py", line 23, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/opt/anaconda3/envs/iplap_py37/lib/python3.7/site-packages/torch/optim/adam.py", line 252, in step
    found_inf=found_inf)
  File "/opt/anaconda3/envs/iplap_py37/lib/python3.7/site-packages/torch/optim/adam.py", line 316, in adam
    found_inf=found_inf)
  File "/opt/anaconda3/envs/iplap_py37/lib/python3.7/site-packages/torch/optim/adam.py", line 363, in _single_tensor_adam
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
While training the landmark_generator, I tried to resume from the last interruption. I found that when load_checkpoint() is called with reset_optimizer=False, the error shown above occurs.
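The traceback says Adam's exp_avg buffer is still on the CPU while the gradients are on cuda:0, i.e. the restored optimizer state never made it onto the GPU. One possible workaround is a resume helper along these lines; note the "optimizer" key name and the helper itself are assumptions for illustration, not IP_LAP's actual load_checkpoint() API:

```python
import torch

def load_optimizer_state(optimizer, checkpoint, device="cuda:0"):
    # Load the checkpoint with map_location so deserialized tensors land on
    # the training device rather than wherever they were saved from.
    ckpt = torch.load(checkpoint, map_location=device)
    # Key name "optimizer" is an assumption about the checkpoint layout.
    optimizer.load_state_dict(ckpt["optimizer"])
    # Defensive pass: move any state tensor still on the CPU (e.g. exp_avg,
    # exp_avg_sq) onto the training device so optimizer.step() cannot mix
    # cuda:0 and cpu tensors.
    for state in optimizer.state.values():
        for k, v in state.items():
            if torch.is_tensor(v):
                state[k] = v.to(device)
```

The defensive loop matters when the model is moved to the GPU only after the optimizer state is loaded: Optimizer.load_state_dict casts state to match the parameters' device at load time, so a later model.cuda() leaves the state behind on the CPU.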
If I instead pass reset_optimizer=True, training runs normally, but the generated .pth files differ in size from the earlier ones by a dozen or so KB. Is that normal?
-rw-rw-r-- 1 tailangjun tailangjun 167279907 Jan  2 02:20 landmarks_epoch_166_checkpoint_step000005000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167279907 Jan  2 03:40 landmarks_epoch_333_checkpoint_step000010000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167279907 Jan  2 05:00 landmarks_epoch_500_checkpoint_step000015000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167279907 Jan  2 06:21 landmarks_epoch_666_checkpoint_step000020000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167279907 Jan  2 07:41 landmarks_epoch_833_checkpoint_step000025000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167281157 Jan  2 09:01 landmarks_epoch_1000_checkpoint_step000030000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167281157 Jan  2 10:21 landmarks_epoch_1166_checkpoint_step000035000.pth
Checkpoints saved after calling load_checkpoint() with reset_optimizer=True:
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan  2 13:36 landmarks_epoch1199_step000036000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan  2 13:54 landmarks_epoch1232_step000037000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan  2 15:01 landmarks_epoch1266_step000038000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan  2 15:19 landmarks_epoch1299_step000039000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan  2 15:38 landmarks_epoch1332_step000040000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan  2 15:56 landmarks_epoch1366_step000041000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan  2 16:15 landmarks_epoch1399_step000042000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan  2 16:33 landmarks_epoch1432_step000043000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan  2 16:51 landmarks_epoch1466_step000044000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan  2 17:10 landmarks_epoch1499_step000045000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan  2 17:28 landmarks_epoch1532_step000046000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan  2 17:47 landmarks_epoch1566_step000047000.pth
-rw-rw-r-- 1 tailangjun tailangjun 167261549 Jan  2 18:05 landmarks_epoch1599_step000048000.pth
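A few-KB difference between checkpoints usually comes not from the model weights but from the bookkeeping saved alongside them: the optimizer state (step counters and per-parameter exp_avg/exp_avg_sq buffers, which exist only for parameters that have received gradients) plus the zip archive's internal records. One way to check that nothing substantial changed is to compare where the bytes actually live in each file. This is a hedged sketch, and the paths in the commented usage are examples taken from the listings above:

```python
import torch

def tensor_bytes(obj):
    """Recursively sum the raw byte size of every tensor found in a
    checkpoint structure (nested dicts, lists, tuples, tensors)."""
    if torch.is_tensor(obj):
        return obj.numel() * obj.element_size()
    if isinstance(obj, dict):
        return sum(tensor_bytes(v) for v in obj.values())
    if isinstance(obj, (list, tuple)):
        return sum(tensor_bytes(v) for v in obj)
    return 0

# Example usage: load two checkpoints on the CPU and compare each top-level
# section's tensor payload side by side.
# old = torch.load("landmarks_epoch_1166_checkpoint_step000035000.pth", map_location="cpu")
# new = torch.load("landmarks_epoch1599_step000048000.pth", map_location="cpu")
# for key in set(old) | set(new):
#     print(key, tensor_bytes(old.get(key)), "->", tensor_bytes(new.get(key)))
```

If the per-section totals match and only the file sizes differ slightly, the gap is serialization overhead rather than lost weights, and the checkpoints are interchangeable for inference.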