WangYueFt / dcp


Can I train a model as good as the pre-trained model? #17

Closed qiaozhijian closed 4 years ago

qiaozhijian commented 4 years ago

I loaded the pre-trained model "dcp_v2.t7" first and tested it, getting a loss of 0.000212. Then I continued training from this checkpoint, but I got a larger loss of about 0.005. Later I discovered that this was caused by my learning rate being set too large (0.001), so I set it to 0.00001, the value your code switches to after 200 epochs. Strangely, the loss has not continued to decrease. So I want to ask: can I train a model as good as the pre-trained model after 250 epochs, as your paper says?
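For reference, a minimal sketch of how I resume from the checkpoint with the reduced learning rate. It assumes the `DCP` class from this repo's model.py; the Namespace fields are a stand-in mirroring the defaults shown in the logs later in this thread, and the exact set of fields DCP needs may differ:

```python
import argparse
import torch
from model import DCP  # model class defined in this repo's model.py

# Stand-in for the argparse Namespace that main.py builds; values mirror
# the Namespace dumps later in this thread, but may not be exhaustive.
args = argparse.Namespace(emb_nn='dgcnn', pointer='transformer', head='svd',
                          emb_dims=512, n_blocks=1, n_heads=4, ff_dims=1024,
                          dropout=0.0, cycle=False)

model = DCP(args).cuda()
model.load_state_dict(torch.load('pretrained/dcp_v2.t7'), strict=False)

# Resume with the post-epoch-200 learning rate (1e-5) rather than the
# initial 0.001, so fine-tuning does not disturb the converged weights.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
```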

qiaozhijian commented 4 years ago

Another observation: without loading the pre-trained model, the loss drops very slowly, so slowly that reaching the pre-trained model's performance feels very far off.

qiaozhijian commented 4 years ago

By the way, the batch_size I set is 8. Looking forward to your reply, thank you!

WangYueFt commented 4 years ago

> By the way, the batch_size I set is 8. Looking forward to your reply, thank you!

Hi,

Can you use a larger batch size, say 32? I suspect the batch norm/gradients are not stable with a small batch size.
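For example, using the repo's `--batch_size` flag (the same flag that appears in the Namespace dumps later in this thread):

```
python main.py --exp_name=dcp_v2 --model=dcp --emb_nn=dgcnn --pointer=transformer --head=svd --batch_size=32
```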

qiaozhijian commented 4 years ago

> > By the way, the batch_size I set is 8. Looking forward to your reply, thank you!
>
> Hi,
>
> Can you use a larger batch size, say 32? I suspect the batch norm/gradients are not stable with a small batch size.

OK, I'll try it out. Thank you!

MaxChanger commented 3 years ago

Hello @qiaozhijian, I'm stuck on a strange problem; I hope you can provide some more detailed information about your test (if you remember).

  1. What PyTorch version and graphics card did you use to get loss=0.000212? (I guess it was a 1080Ti.)
  2. Have you run it on other hardware (other graphics cards), and did you get the same loss?

I found that in the same conda environment but on different hardware platforms (1080Ti and 2080Ti), I get different test results. Part of the output is as follows.

### in 1080Ti  torch1.0.0
python main.py --exp_name=dcp_v2 --model=dcp --emb_nn=dgcnn --pointer=transformer --head=svd --eval --model_path=pretrained/dcp_v2.t7
Namespace(batch_size=32, cycle=False, dataset='modelnet40', dropout=0.0, emb_dims=512, emb_nn='dgcnn', epochs=250, eval=True, exp_name='dcp_v2', factor=4, ff_dims=1024, gaussian_noise=False, head='svd', lr=0.001, model='dcp', model_path='pretrained/dcp_v2.t7', momentum=0.9, n_blocks=1, n_heads=4, no_cuda=False, num_points=1024, pointer='transformer', seed=1234, test_batch_size=10, unseen=False, use_sgd=False)
/home/****/Repo/dcp/data.py:36: H5pyDeprecationWarning: The default file mode will change to 'r' (read-only) in h5py 3.0. To suppress this warning, pass the mode you need to h5py.File(), or set the global default h5.get_config().default_file_mode, or set the environment variable H5PY_DEFAULT_READONLY=1. Available modes are: 'r', 'r+', 'w', 'w-'/'x', 'a'. See the docs for details.
  f = h5py.File(h5_name)
pretrained/dcp_v2.t7
Let's use 2 GPUs!
100%|█████████████████████████████████████████████████████████| 247/247 [00:42<00:00,  5.79it/s]
==FINAL TEST==
A--------->B
EPOCH:: -1, Loss: 0.000212, Cycle Loss: 0.000000, MSE: 0.236371, RMSE: 0.486180, MAE: 0.378167, rot_MSE: 1.196769, rot_RMSE: 1.093970, 
                                    rot_MAE: 0.751517, trans_MSE: 0.000003, trans_RMSE: 0.001717, trans_MAE: 0.001173
B--------->A
EPOCH:: -1, Loss: 0.000212, MSE: 0.236371, RMSE: 0.486180, MAE: 0.355744, rot_MSE: 1.196769, rot_RMSE: 1.093970, rot_MAE: 0.751517, 
                                    trans_MSE: 0.000054, trans_RMSE: 0.007321, trans_MAE: 0.004827
### in 2080Ti  torch1.0.0
python main.py --exp_name=dcp_v2 --model=dcp --emb_nn=dgcnn --pointer=transformer --head=svd --eval --model_path=pretrained/dcp_v2.t7
Namespace(batch_size=32, cycle=False, dataset='modelnet40', dropout=0.0, emb_dims=512, emb_nn='dgcnn', epochs=250, eval=True, exp_name='dcp_v2', factor=4, ff_dims=1024, gaussian_noise=False, head='svd', lr=0.001, model='dcp', model_path='pretrained/dcp_v2.t7', momentum=0.9, n_blocks=1, n_heads=4, no_cuda=False, num_points=1024, pointer='transformer', seed=1234, test_batch_size=10, unseen=False, use_sgd=False)
/home/********/Repo/tmp/dcp/data.py:36: H5pyDeprecationWarning: The default file mode will change to 'r' (read-only) in h5py 3.0. To suppress this warning, pass the mode you need to h5py.File(), or set the global default h5.get_config().default_file_mode, or set the environment variable H5PY_DEFAULT_READONLY=1. Available modes are: 'r', 'r+', 'w', 'w-'/'x', 'a'. See the docs for details.
  f = h5py.File(h5_name)
pretrained/dcp_v2.t7
Let's use 2 GPUs!
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 247/247 [00:30<00:00,  8.21it/s]
==FINAL TEST==
A--------->B
EPOCH:: -1, Loss: 0.000218, Cycle Loss: 0.000000, MSE: 0.236347, RMSE: 0.486156, MAE: 0.378243, rot_MSE: 1.217545, rot_RMSE: 1.103424, 
                                    rot_MAE: 0.750243, trans_MSE: 0.000003, trans_RMSE: 0.001696, trans_MAE: 0.001170
B--------->A
EPOCH:: -1, Loss: 0.000218, MSE: 0.236347, RMSE: 0.486156, MAE: 0.355723, rot_MSE: 1.217545, rot_RMSE: 1.103424, rot_MAE: 0.750243,
                                    trans_MSE: 0.000056, trans_RMSE: 0.007500, trans_MAE: 0.004823
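For context, the usual PyTorch 1.x reproducibility switches are sketched below (the repo already exposes --seed=1234, as the Namespace dumps above show). As far as I know, these pin run-to-run variance on a single machine, but they do not guarantee bit-identical results across different GPU architectures, since cuDNN may select different kernels on a 1080Ti versus a 2080Ti:

```python
import random
import numpy as np
import torch

# Standard PyTorch 1.x seeding plus cuDNN determinism flags. Seeding
# controls data shuffling and init; the cudnn flags pin kernel selection
# on one machine, but not across different GPU architectures.
seed = 1234
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```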
qiaozhijian commented 3 years ago

> Hello @qiaozhijian, I'm stuck on a strange problem; I hope you can provide some more detailed information about your test (if you remember).
>
> 1. What PyTorch version and graphics card did you use to get loss=0.000212? (I guess it was a 1080Ti.)
> 2. Have you run it on other hardware (other graphics cards), and did you get the same loss?
>
> I found that in the same conda environment but on different hardware platforms (1080Ti and 2080Ti), I get different test results. Part of the output is as follows.

> (evaluation logs quoted above)

Torch version is 1.4.0, GPU: 1080Ti. In the same conda environment on different hardware platforms (1080Ti and 2080Ti), I get the same training results, but the time needed is different.

MaxChanger commented 3 years ago

Thanks for your quick reply @qiaozhijian. Yes, I also think it is hard to reach the results in the paper within 250 epochs; it may take more time.

But the results I gave above are not from the training stage; they come from evaluating the pretrained model provided by the author, using the command from the README, e.g.

python main.py --exp_name=dcp_v2 --model=dcp --emb_nn=dgcnn --pointer=transformer --head=svd --eval --model_path=xx/yy
  1. I am increasing the number of epochs to verify my guess. By the way, how many epochs did you run to reach approximately the results in the paper?
  2. You said you could get the same results on different hardware (I think this refers to the training phase). Have you tried running the author's evaluation code with the pretrained model on different devices? Unfortunately, I got different results, as shown above.
qiaozhijian commented 3 years ago

> Thanks for your quick reply @qiaozhijian. Yes, I also think it is hard to reach the results in the paper within 250 epochs; it may take more time.
>
> But the results I gave above are not from the training stage; they come from evaluating the pretrained model provided by the author, using the command from the README, e.g.
>
> python main.py --exp_name=dcp_v2 --model=dcp --emb_nn=dgcnn --pointer=transformer --head=svd --eval --model_path=xx/yy
>
> 1. I am increasing the number of epochs to verify my guess. By the way, how many epochs did you run to reach approximately the results in the paper?
> 2. You said you could get the same results on different hardware (I think this refers to the training phase). Have you tried running the author's evaluation code with the pretrained model on different devices? Unfortunately, I got different results, as shown above.
  1. I forget, but 250 is enough and even unnecessary. You can watch TensorBoard (see the command sketch below).
  2. No. Your result is possible, but I don't think the difference matters.
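A minimal sketch of that, assuming the run writes SummaryWriter logs under checkpoints/<exp_name>; adjust --logdir to wherever your run actually writes logs:

```
tensorboard --logdir checkpoints/dcp_v2
```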
MaxChanger commented 3 years ago

Ok, thanks again for your help 👍 @qiaozhijian

MaxChanger commented 3 years ago

Hello @qiaozhijian, excuse me again; I have some new progress.

All my previous tests ran on 4×2080Ti (~7000MB per GPU), and the results of my own training were worse than those in the paper. By chance I used 3×2080Ti (two at ~9600MB, one at ~8800MB) for training, and I was very surprised to find that about 160 epochs can reach the same or better results than in the paper.

train_batch_size=32, test_batch_size=10, consistent with the default settings in the README.
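One possible factor (my guess, not something the author has confirmed): the "Let's use 2 GPUs!" line in the logs suggests the model is wrapped in nn.DataParallel, which splits the global batch across the visible GPUs, so BatchNorm sees fewer samples per GPU as the GPU count grows. A minimal sketch of that wrapping; the stand-in model is purely illustrative:

```python
import torch
import torch.nn as nn

# Stand-in for the DCP model built in main.py; purely illustrative.
model = nn.Sequential(nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU())

# DataParallel splits each global batch of 32 across the visible GPUs:
# 4 GPUs -> 8 samples per GPU, 3 GPUs -> ~11 per GPU. BatchNorm computes
# its statistics per GPU, so the GPU count changes training behaviour.
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    model = nn.DataParallel(model)
model = model.cuda()
```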

What do you think of this problem? By the way, how many graphics cards did you train on?

Thanks.