TRI-ML / packnet-sfm

TRI-ML Monocular Depth Estimation Repository
https://tri-ml.github.io/packnet-sfm/
MIT License

can't get the desired training result #16

Closed yuqi1991 closed 4 years ago

yuqi1991 commented 4 years ago

Hi there, thanks so much for sharing the training code. I tried to run the self-supervised model on the KITTI dataset, but I get weird results after running for a few epochs with the pretrained model.

[Screenshot attached: 2020-05-19 13-46-18]

The loss gets smaller during training, but the evaluation error metrics get higher every epoch. Have you ever encountered this situation?

@AdrienGaidon-TRI @VitorGuizilini-TRI @spillai

VitorGuizilini-TRI commented 4 years ago

Which pretrained model are you using, and which config file?

yuqi1991 commented 4 years ago

@VitorGuizilini-TRI Pretrained model: PackNet, Self-Supervised, 192x640, KITTI (K) Config file: train_kitti.yaml

VitorGuizilini-TRI commented 4 years ago

Are you doing single-GPU training, and with what batch size? The learning rates we specify are for 8 GPUs (as mentioned in the paper) and a batch size of 4; you might need to decrease them if you are training on a single GPU.
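
For what it's worth, a minimal sketch of the linear learning-rate scaling this implies (my own illustration, not code from the repository; the reference numbers just restate the 8-GPU, batch-size-4, lr 0.0002 setup discussed in this thread):

```python
# Hypothetical helper illustrating the linear LR scaling rule.
# Assumed reference setup: 8 GPUs, batch size 4 per GPU, base lr 2e-4.
BASE_LR = 2e-4
BASE_GPUS = 8
BASE_BATCH = 4  # per-GPU batch size

def scaled_lr(num_gpus: int, batch_size: int) -> float:
    """Scale the learning rate linearly with the effective batch size."""
    effective = num_gpus * batch_size
    reference = BASE_GPUS * BASE_BATCH
    return BASE_LR * effective / reference

print(scaled_lr(1, 4))  # single GPU, batch size 4 -> 2.5e-05
```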

yuqi1991 commented 4 years ago

@VitorGuizilini-TRI I use 8 GPUs as well and have been running for more than 10 hours; the average loss is stagnating at around 0.09. I also tried the tiny KITTI dataset at first: the loss decreases significantly when the lr is 5e-5 and I eventually get a decent result, but that doesn't happen on the raw KITTI dataset.

VitorGuizilini-TRI commented 4 years ago

We regularly use that model for fine-tuning on KITTI and other datasets, and never observed that behavior. I will look deeper, but one last question: are you loading the checkpoint in the config file (checkpoint_path) or are you resuming from the checkpoint itself?

yuqi1991 commented 4 years ago

I tried both training from scratch and loading the model by manually extracting the state dicts from the pretrained file.
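
As an illustration of the "manually extracting the state dicts" step, a minimal sketch might look like this (the `'state_dict'` key and the checkpoint filename are assumptions about the checkpoint layout, not the repository's official loading path):

```python
import torch

def load_pretrained(model, ckpt_path="PackNet01_MR_selfsup_K.ckpt"):
    """Copy weights from a pretrained checkpoint into an existing model.

    The 'state_dict' key is an assumed checkpoint layout; strict=False
    reports mismatched keys instead of failing on them.
    """
    checkpoint = torch.load(ckpt_path, map_location="cpu")
    state_dict = checkpoint.get("state_dict", checkpoint)
    missing, unexpected = model.load_state_dict(state_dict, strict=False)
    print("missing keys:", missing)
    print("unexpected keys:", unexpected)
    return model
```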

yuqi1991 commented 4 years ago

Another question, about the semi-sup model: why did you disable the photometric loss? I think it would be better to combine the photometric loss and the point cloud loss, since the point cloud is sparse on the re-projection plane.

VitorGuizilini-TRI commented 4 years ago

About fine-tuning on pretrained models, I don't know what to say; I just ran an experiment here with the same code from this repository and it looks alright. The loss starts out already stable at ~0.073, and the metrics are also stable at the numbers we report. Try decreasing the learning rate to 0.00005, which is the value at the end of the training session (0.0002/4). Are you running inside our docker?

[Attached image: validation inverse-depth prediction on the KITTI raw eigen_test_files split]

About the semi-sup model: we do use the self-supervised photometric loss in addition to the supervised loss. You can set the ratio in the config file; if you set it to 1.0 then it is purely supervised.
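
Conceptually, that ratio amounts to a convex combination of the two losses, roughly like this sketch (the `supervised_ratio` name and the placeholder values are mine, not the repository's actual config fields):

```python
def semisup_loss(photometric_loss, supervised_loss, supervised_ratio):
    """Blend the self-supervised photometric loss with the supervised loss.

    supervised_ratio = 1.0 -> purely supervised (photometric term vanishes);
    supervised_ratio = 0.0 -> purely self-supervised.
    """
    return (1.0 - supervised_ratio) * photometric_loss \
        + supervised_ratio * supervised_loss

# Example: weight both terms equally.
total = semisup_loss(photometric_loss=0.08, supervised_loss=0.12,
                     supervised_ratio=0.5)
```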

yuqi1991 commented 4 years ago

Yes, I was running inside the docker, and I just replaced the wandb logger with tensorboard for visualization. I will keep looking into it. So 0.073 is the best loss you get on your side, right? Thanks for your help.

MingYang-buaa commented 4 years ago

@yuqi1991 Hi, I'm new to PyTorch. Could you share the code for visualization with tensorboard?

yuqi1991 commented 4 years ago

@MingYang-buaa Hi, since I have already left the project I cannot provide the source code, but you can refer to the tensorboardX usage documentation; it's quite easy to use.
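
For reference, a minimal sketch of that kind of tensorboard logging (using torch.utils.tensorboard, which mirrors the tensorboardX API; the tag names and values are placeholders):

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/packnet_sfm")

# Inside the training/validation loop, log scalar metrics per step...
for step in range(100):
    loss = 0.1 / (step + 1)  # placeholder value, not a real training loss
    writer.add_scalar("train/photometric_loss", loss, global_step=step)

# ...and image tensors such as a predicted inverse-depth map (C, H, W):
# writer.add_image("val/inv_depth", inv_depth_colormap, global_step=step)

writer.close()
```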