NVlabs / geomapnet

Geometry-Aware Learning of Maps for Camera Localization (CVPR2018)
https://goo.gl/mRB3Au.

Every time I restart training from the beginning, the obtained model gives a larger error. #38

Closed jialuwang123321 closed 3 years ago

jialuwang123321 commented 3 years ago

Dear Mr. Samarth Brahmbhatt, thank you for your code! It is fantastic. However, I trained MapNet from scratch three times (epochs 0 to 100, without resuming from a checkpoint) and used eval.py to test each trained model.

I train strictly with the provided experimental parameters and environment, and the three training runs are completely independent of each other.
So when I use eval.py to test the epoch_100.pth.tar from the first, second, and third runs, the results should be almost the same. Unfortunately, the model from the second run gives a clearly larger error than the one from the first run, and the third run's results are worse again compared with the first and second runs.

I am very confused and would like to ask for your opinion. Any suggestion would be very helpful. Thank you in advance!

Best Jialu

jialuwang123321 commented 3 years ago

I am using torch 0.4.1, CUDA 9.2, Python 3.7.3, and Ubuntu 18.04.

My command for training: python train.py --dataset RobotCar --scene loop --config_file configs/mapnet.ini --model mapnet --device 0 --learn_beta --learn_gamma

My command for testing: python eval.py --dataset RobotCar --scene loop --model mapnet --weights /project/scripts/logs/RobotCar_loop_mapnet_mapnet_learn_beta_learn_gamma/base/epoch_100.pth.tar --config_file configs/mapnet.ini --val

config file is:

[training]
n_epochs = 100
batch_size = 20
do_val = no
seed = 7
shuffle = yes
num_workers = 5
snapshot = 5
val_freq = 50
max_grad_norm = 0

[optimization]
opt = adam
lr = 1e-4
weight_decay = 0.0005
;momentum = 0.9
;lr_decay = 0.1
;lr_stepvalues = [60, 80]

[logging]
visdom = no
print_freq = 20

[hyperparameters]
beta = -3.0
gamma = -3.0
dropout = 0.5
skip = 10
variable_skip = no
real = no
steps = 3
color_jitter = 0.7
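
A note on the seed = 7 setting above: even with this seed, repeated runs can diverge if the other RNG sources and cuDNN are left nondeterministic. Below is a minimal sketch of seeding everything in PyTorch 0.4-era code; the seed_everything name is illustrative and not from the repo.

```python
import random

import numpy as np
import torch


def seed_everything(seed=7):
    """Seed every RNG source so repeated runs start from the same state."""
    random.seed(seed)                 # Python's built-in RNG
    np.random.seed(seed)              # NumPy RNG (sampling, augmentation)
    torch.manual_seed(seed)           # CPU RNG
    torch.cuda.manual_seed_all(seed)  # all GPU RNGs
    # Ask cuDNN for deterministic kernels (slower, but reproducible)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


seed_everything(7)
```

With num_workers = 5, DataLoader workers can still reorder augmentation randomness unless each worker is seeded via a worker_init_fn, so bit-exact reproducibility is not guaranteed; differences as large as the ones described above, though, usually point to something else.
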
samarth-robo commented 3 years ago

Hi @jialuwang123321 , thanks for using the code!

It is a strange issue. Some suggestions:

jialuwang123321 commented 3 years ago

Huge thanks for your answer! I checked pose_stats.txt and it changed (see below). I am fixing this and retraining now.

5736290.7242747 620253.5877489 109.5124567 110.8290592 99.4226409 0.8314870

BTW, I am confused about why it changed. I used 2014-06-26-09-24-58 for both training and evaluation, but I changed two things: 1) I set color_jitter = 0 in the config file to disable color jitter during training, and 2) I used https://github.com/ori-mrg/robotcar-dataset-sdk to convert the RobotCar dataset's images to color images in advance. Do you think either of these could be the reason why pose_stats.txt changed? Thank you again for your patient help!

samarth-robo commented 3 years ago

@jialuwang123321 you provided the values from one pose_stats.txt, but you should check whether it changes significantly across training runs. Also, the values you posted are not significantly different from the included pose_stats.txt.
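
A quick way to check this, assuming each run writes its own pose_stats.txt of whitespace-separated numbers like the line posted above (the paths below are placeholders):

```python
import numpy as np

# Placeholder paths: point these at the pose_stats.txt saved by each training run
runs = ['run1/pose_stats.txt', 'run2/pose_stats.txt', 'run3/pose_stats.txt']

stats = [np.loadtxt(p) for p in runs]
for i, s in enumerate(stats[1:], start=2):
    # Element-wise absolute difference from the first run's stats
    print('max |run{} - run1| = {:.6f}'.format(i, np.abs(s - stats[0]).max()))
```
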

jialuwang123321 commented 3 years ago

Following your suggestion, I checked it for every training run. It remains unchanged for my training and testing datasets.

samarth-robo commented 3 years ago

@jialuwang123321 OK, then we can rule out pose_stats.txt as the cause of this issue. I don't have other suggestions, unfortunately. How large is the monotonic increase in error?

jialuwang123321 commented 3 years ago

For example, when evaluating the epoch_100.pth.tar from three independent trainings of MapNet (each from epoch 0 to 100), I got obviously different results:

First time: Error in translation: median 4.37 m, mean 6.13 m. Error in rotation: median 2.65 degrees, mean 3.51 degrees.

Second time: Error in translation: median 255.56 m, mean 256.39 m. Error in rotation: median 118.63 degrees, mean 114.35 degrees.

Third time: Error in translation: median 155.40 m, mean 159.66 m. Error in rotation: median 135.14 degrees, mean 129.25 degrees.
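
For reference, these numbers follow the usual median/mean pose-error reporting. A rough sketch of how such errors are typically computed from predicted and ground-truth poses (positions plus unit quaternions); this only illustrates the metric and is not the repo's eval.py:

```python
import numpy as np


def pose_errors(t_pred, t_gt, q_pred, q_gt):
    """t_*: (N, 3) positions in meters; q_*: (N, 4) unit quaternions (w, x, y, z)."""
    t_err = np.linalg.norm(t_pred - t_gt, axis=1)              # translation error, meters
    # Angular distance between quaternions: 2 * arccos(|<q1, q2>|)
    dots = np.abs(np.sum(q_pred * q_gt, axis=1)).clip(max=1.0)
    q_err = np.degrees(2.0 * np.arccos(dots))                   # rotation error, degrees
    return t_err, q_err


# Toy example just to show the reporting format
t_err, q_err = pose_errors(np.zeros((5, 3)), np.ones((5, 3)),
                           np.tile([1.0, 0, 0, 0], (5, 1)),
                           np.tile([1.0, 0, 0, 0], (5, 1)))
print('Error in translation: median {:.2f} m, mean {:.2f} m'.format(
    np.median(t_err), np.mean(t_err)))
print('Error in rotation: median {:.2f} degrees, mean {:.2f} degrees'.format(
    np.median(q_err), np.mean(q_err)))
```
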

samarth-robo commented 3 years ago

Oh, so it is not increasing monotonically. Is the training process somehow changing your images, pose labels, or config files on disk? Is backpropagation somehow disabled after the first run?
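
If the backpropagation question needs a quick sanity check, here is a minimal sketch with a stand-in model (not the repo's MapNet code; in practice you would run one real training step instead):

```python
import torch
import torch.nn as nn

# Stand-in model and batch; replace with the real model and data to check a real step
model = nn.Linear(10, 3)
x, target = torch.randn(4, 10), torch.randn(4, 3)

loss = nn.functional.mse_loss(model(x), target)
loss.backward()

# If backprop is working, every trainable parameter should require grad
# and receive a non-None gradient with a non-zero norm after backward().
for name, p in model.named_parameters():
    grad_norm = None if p.grad is None else p.grad.norm().item()
    print(name, 'requires_grad =', p.requires_grad, 'grad norm =', grad_norm)
```
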

jialuwang123321 commented 3 years ago

The ideas you offered are really helpful! I think it is very likely that I accidentally modified the code and blocked backpropagation. Because I was not sure where the accidental change was, I downloaded the code again, and fortunately that solved the problem. I will keep looking into what went wrong, and if I find the answer, I will share it with you.

samarth-robo commented 3 years ago

Glad you solved the issue!