shuodehaoa opened this issue 1 week ago (status: Open)
That argument is optional; it is not required when you start training.
2024-06-25 21:50:50,403 [INFO ] Logging file is /nas/project/2024-MambaVC/checkpoints/0.05/20240625_215050.log
2024-06-25 21:50:50,404 [INFO ] model:bmshj2018-factorized
2024-06-25 21:50:50,404 [INFO ] dataset:/nas/dataset/QP22/Split/mambaVC
2024-06-25 21:50:50,404 [INFO ] epochs:500
2024-06-25 21:50:50,404 [INFO ] learning_rate:0.0001
2024-06-25 21:50:50,404 [INFO ] num_workers:128
2024-06-25 21:50:50,404 [INFO ] lmbda:0.05
2024-06-25 21:50:50,404 [INFO ] batch_size:8
2024-06-25 21:50:50,404 [INFO ] test_batch_size:1
2024-06-25 21:50:50,404 [INFO ] aux_learning_rate:0.001
2024-06-25 21:50:50,404 [INFO ] patch_size:(256, 256)
2024-06-25 21:50:50,404 [INFO ] cuda:True
2024-06-25 21:50:50,404 [INFO ] save:True
2024-06-25 21:50:50,404 [INFO ] seed:42
2024-06-25 21:50:50,404 [INFO ] clip_max_norm:1.0
2024-06-25 21:50:50,404 [INFO ] checkpoint:None
2024-06-25 21:50:50,404 [INFO ] type:mse
2024-06-25 21:50:50,404 [INFO ] save_path:/nas/project/2024-MambaVC/checkpoints
2024-06-25 21:50:50,404 [INFO ] skip_epoch:0
2024-06-25 21:50:50,404 [INFO ] N:128
2024-06-25 21:50:50,404 [INFO ] lr_epoch:[450, 490]
2024-06-25 21:50:50,404 [INFO ] continue_train:False
cuda
/home/gao/anaconda3/envs/mambaVC/lib/python3.8/site-packages/torch/utils/data/dataloader.py:558: UserWarning: This DataLoader will create 128 worker processes in total. Our suggested max number of worker in current system is 28, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
milestones: [450, 490]
0%| | 0/500 [00:00<?, ?it/s]2024-06-25 21:50:51,487 [INFO ] ======Current epoch 0 ======
2024-06-25 21:50:51,487 [INFO ] Learning rate: 0.0001
0it [00:00, ?it/s]
0%| | 0/500 [00:03<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 466, in
Invoked with: tensor([[[ 2.5545e-01, 3.2468e-01, 3.1105e-01, ..., 1.1445e-01, 1.1466e-01, 1.1707e-01], [ 2.2994e-01, 2.8560e-01, 2.1514e-01, ..., 2.9078e-01, 3.4029e-01, 3.0308e-01], [ 7.6905e-02, 8.9864e-02, 1.0398e-01, ..., -6.8421e-02, -7.0336e-02, -3.3716e-02], ..., [ 2.7783e-02, 6.0757e-02, 1.5139e-01, ..., 7.5845e-02, 1.5906e-02, 9.4314e-02], [-1.2216e-02, 7.2406e-02, 1.3762e-01, ..., 2.0739e-02, 1.8137e-02, 1.6718e-02], [ 6.8043e-02, 8.0412e-02, 5.6260e-02, ..., 3.7458e-02, 2.4173e-02, 4.2964e-02]],
[[ 2.6391e-01, 3.4885e-01, 3.1744e-01, ..., 9.8407e-02,
1.0396e-01, 1.1010e-01],
[ 2.2787e-01, 2.8049e-01, 2.5370e-01, ..., 2.9538e-01,
3.1826e-01, 3.1299e-01],
[ 6.8315e-02, 7.2892e-02, 1.1020e-01, ..., -7.1713e-02,
-8.1271e-02, -4.5279e-02],
...,
[ 2.6714e-02, 6.7872e-02, 1.5585e-01, ..., 8.8021e-02,
8.6730e-03, 7.6960e-02],
[-2.5178e-02, 6.1933e-02, 9.9606e-02, ..., -1.2581e-02,
-3.6286e-03, 6.7224e-03],
[ 6.7677e-02, 8.8382e-02, 6.3443e-02, ..., 1.5692e-02,
-4.9952e-03, 3.0603e-02]],
[[ 2.4910e-01, 3.2694e-01, 2.9670e-01, ..., 1.1882e-01,
1.1018e-01, 1.1878e-01],
[ 2.1325e-01, 2.7530e-01, 2.4434e-01, ..., 2.9738e-01,
3.5416e-01, 3.1198e-01],
[ 6.0548e-02, 7.1420e-02, 1.0630e-01, ..., -7.2580e-02,
-7.3519e-02, -3.7565e-02],
...,
[ 2.1059e-02, 7.0689e-02, 1.6398e-01, ..., 7.0748e-02,
1.8903e-02, 9.4413e-02],
[-1.9248e-02, 8.2502e-02, 1.4179e-01, ..., 2.1170e-02,
5.5312e-02, 1.8289e-02],
[ 6.7050e-02, 7.8700e-02, 5.8725e-02, ..., 3.4673e-02,
2.3773e-02, 3.2414e-02]],
Can you give any advice on this error?
Is the --checkpoint parameter in the run command required, i.e. does it have to point to a pre-trained model?
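For what it's worth, the log above prints `checkpoint:None`, which suggests the flag was simply left unset. Below is a minimal sketch of how an optional `--checkpoint` flag is commonly declared with `argparse`. It assumes the script uses `argparse`; the `parse_args` and `describe_start` helpers are hypothetical names for illustration and are not taken from this repository's `train.py`.

```python
import argparse


def parse_args(argv=None):
    # Hypothetical parser; the real train.py may define this differently.
    parser = argparse.ArgumentParser(description="Training CLI sketch")
    parser.add_argument(
        "--checkpoint",
        type=str,
        default=None,  # optional: None means no pre-trained weights are loaded
        help="Path to a checkpoint to load (optional)",
    )
    return parser.parse_args(argv)


def describe_start(args):
    # Report how training would start for the given arguments.
    if args.checkpoint is None:
        return "training from scratch"
    return f"loading weights from {args.checkpoint}"


if __name__ == "__main__":
    print(describe_start(parse_args()))
```

With a declaration like this, omitting `--checkpoint` leaves the value at `None`, so the flag alone would not be expected to cause a failure at the start of training.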