Closed jialuwang123321 closed 3 years ago
I am using PyTorch 0.4.1, CUDA 9.2, Python 3.7.3, and Ubuntu 18.04.
My commands are:
Training:
python train.py --dataset RobotCar --scene loop --config_file configs/mapnet.ini --model mapnet --device 0 --learn_beta --learn_gamma
Testing:
python eval.py --dataset RobotCar --scene loop --model mapnet --weights /project/scripts/logs/RobotCar_loop_mapnet_mapnet_learn_beta_learn_gamma/base/epoch_100.pth.tar --config_file configs/mapnet.ini --val
The config file is:
[training]
n_epochs = 100
batch_size = 20
do_val = no
seed = 7
shuffle = yes
num_workers = 5
snapshot = 5
val_freq = 50
max_grad_norm = 0
[optimization]
opt = adam
lr = 1e-4
weight_decay = 0.0005
;momentum = 0.9
;lr_decay = 0.1
;lr_stepvalues = [60, 80]
[logging]
visdom = no
print_freq = 20
[hyperparameters]
beta = -3.0
gamma = -3.0
dropout = 0.5
skip = 10
variable_skip = no
real = no
steps = 3
color_jitter = 0.7
Hi @jialuwang123321 , thanks for using the code!
It is a strange issue. Some suggestions:
Does pose_stats.txt somehow have significantly different values on every training run? If so, and you then use an outdated pose_stats.txt for evaluation, that might produce a high evaluation error. Every training run overwrites pose_stats.txt (see this line), but the process is not random, so theoretically the values should remain unchanged. Still, it is worth checking.
Huge thanks for your answer! I checked pose_stats.txt and it had changed (see below). I am making changes and training again.
5736290.7242747 620253.5877489 109.5124567 110.8290592 99.4226409 0.8314870
BTW, I am confused about why it changed. I used 2014-06-26-09-24-58 for both training and evaluation, but I changed two things:
1) I set color_jitter = 0 in the config file to disable color jitter during training.
2) I used https://github.com/ori-mrg/robotcar-dataset-sdk to convert the RobotCar dataset's images to color in advance.
Do you think either of these could be the reason pose_stats.txt changed? Thank you again for your patient help!
@jialuwang123321 you provided only one value of pose_stats.txt. But you should look at whether it changes significantly between training runs. Also, the values you posted are not significantly different from the included pose_stats.txt.
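A minimal sketch of the check suggested above: keep a copy of pose_stats.txt after each training run and compare the copies numerically. The function names and the tolerance are illustrative assumptions, not part of the repository; the only assumption about the file is that it holds whitespace-separated numbers, like the values posted earlier.

```python
import math
from pathlib import Path


def load_stats(path):
    """Read every whitespace-separated number in a stats file into a flat list."""
    return [float(tok) for tok in Path(path).read_text().split()]


def stats_differ(path_a, path_b, rel_tol=1e-6):
    """Return True if the two files hold a different count of values or different values."""
    a, b = load_stats(path_a), load_stats(path_b)
    if len(a) != len(b):
        return True
    # rel_tol is an arbitrary illustrative threshold; loosen it to ignore tiny drifts.
    return any(not math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(a, b))
```

For example, after copying the file to hypothetical names such as pose_stats_run1.txt and pose_stats_run2.txt, stats_differ("pose_stats_run1.txt", "pose_stats_run2.txt") reports whether the second run overwrote it with different values.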
Following your suggestion, I checked it after every training run. It remains unchanged for my training and testing datasets.
@jialuwang123321 OK then we can rule out stats.txt as the cause of this issue. I don't have other suggestions, unfortunately. How large is the monotonic increase in error?
For example, when evaluating epoch_100.pth.tar, which was obtained by independently training mapnet from 0 to 100 epochs three times, I got obviously different results:
First time: Error in translation: median 4.37 m, mean 6.13 m. Error in rotation: median 2.65 degrees, mean 3.51 degrees.
Second time: Error in translation: median 255.56 m, mean 256.39 m. Error in rotation: median 118.63 degrees, mean 114.35 degrees.
Third time: Error in translation: median 155.40 m, mean 159.66 m. Error in rotation: median 135.14 degrees, mean 129.25 degrees.
Oh, so it is not increasing monotonically. Is the training process somehow changing your images, pose labels, or config files on the disk? Is backpropagation somehow disabled after the first run?
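One way to test the first question (whether training modifies files on disk) is to hash the relevant directories before and after a run. This is only a sketch using the Python standard library; the function names are made up for illustration and nothing here comes from the repository.

```python
import hashlib
from pathlib import Path


def fingerprint(root):
    """Map each file under root (by relative path) to the SHA-256 of its contents."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digests[str(path.relative_to(root))] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def changed_files(before, after):
    """Return sorted relative paths that were added, removed, or modified."""
    return sorted(k for k in before.keys() | after.keys() if before.get(k) != after.get(k))
```

Calling fingerprint on the dataset and config directories before training, again afterwards, and passing both results to changed_files lists anything the run touched. Log and checkpoint directories are expected to change, so they are best left out of the comparison.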
The ideas you offered are really helpful! I think it is very likely that I accidentally modified the code and blocked backpropagation. Because I was not sure where the accidental change was, I downloaded the code again, and fortunately that solved the problem. I will keep looking into what went wrong, and if I find the answer, I will share it with you.
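To track down an accidental edit like this, the modified checkout can be compared file by file against the fresh download. The sketch below uses the standard-library filecmp module; the function names and directory arguments are hypothetical.

```python
import filecmp
from pathlib import Path


def relative_files(root):
    """Return the set of file paths under root, relative to root."""
    root = Path(root)
    return {str(p.relative_to(root)) for p in root.rglob("*") if p.is_file()}


def compare_trees(modified_dir, fresh_dir):
    """Return (changed, only_in_modified, only_in_fresh) as sorted lists of relative paths."""
    left, right = relative_files(modified_dir), relative_files(fresh_dir)
    # shallow=False compares file contents instead of trusting size and mtime.
    _, mismatch, errors = filecmp.cmpfiles(
        modified_dir, fresh_dir, sorted(left & right), shallow=False
    )
    # Files that could not be compared (errors) are reported together with changed files.
    return sorted(mismatch + errors), sorted(left - right), sorted(right - left)
```

The first list returned is the place to start looking: any source file in it differs between the two copies. Generated directories such as logs or .git will add noise and can be filtered out of the result.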
glad you solved the issue!
Dear Mr. Samarth Brahmbhatt, thank you for your code! It is fantastic. However, I ran training three times, each time starting from scratch (without resuming from a checkpoint), training mapnet from epoch 0 to 100, and then used eval.py to test the trained model.
I trained in strict accordance with the provided experimental parameters and environment. The three training runs are completely independent and do not affect each other.
So the test results (for example, using eval.py to test the epoch_100.pth.tar from the first, second, and third runs respectively) should in theory be almost the same. Unfortunately, the trained model (e.g. epoch_100.pth.tar) from the second run gave an obviously larger error than the one from the first run. The same happened with the third run's results compared with those of the second and first runs.
I feel very confused. I'd like to ask for your opinion. Your suggestion will be very helpful to me. Thank you in advance!
Best Jialu