chensong1995 / HybridPose

HybridPose: 6D Object Pose Estimation under Hybrid Representation (CVPR 2020)
MIT License

Training from scratch #38

Closed neixlo closed 4 years ago

neixlo commented 4 years ago

Hi @chensong1995, I'd like to train from scratch and, for testing purposes, run something like this:

$ python src/train_core.py --batch_size 1 --n_epochs 2 --object_name cat --load_dir None

But this drops into the debugger with the following output:

-> print('Could not restore session properly, check the load_dir')
(Pdb)

If I pass --load_dir None, load_dir is set to the string 'None' rather than the None object, so this check still evaluates to true: if args.load_dir is not None:

However, if I change it to something like if args.load_dir != 'None': it seems to work. At least it's training the ResNet networks.
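The string comparison above can also be handled at parse time. The following is a minimal sketch, assuming the script reads its options with argparse (not confirmed in this thread); the helper name none_or_str and the function build_parser are hypothetical and not part of HybridPose.

```python
import argparse


def none_or_str(value):
    """Map the literal text 'None' (any case) to the Python None object."""
    if value is None or value.strip().lower() == 'none':
        return None
    return value


def build_parser():
    # Hypothetical parser with only the option discussed here.
    parser = argparse.ArgumentParser()
    parser.add_argument('--load_dir', type=none_or_str, default=None)
    return parser


if __name__ == '__main__':
    # Passing an explicit list keeps the example independent of sys.argv.
    args = build_parser().parse_args(['--load_dir', 'None'])
    if args.load_dir is not None:
        print('would restore weights from', args.load_dir)
    else:
        print('training from scratch')
```

With a converter like this, the original check if args.load_dir is not None: could stay unchanged, and omitting --load_dir entirely behaves the same as passing None.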

After the nets are trained, there is another error in the trainer.generate_data() call. I think it happens in this line, where the pr_para, pi_para = self.search_para(...) function gets called.

It outputs the following at the end:

Epoch: [1][946/949] Time: 0.063 (0.066) Sym: 1.4552 (4.4397)    Mask: 0.0115 (0.0254)   Pts: 0.0234 (0.0563)    Graph: 4.4105 (7.8646)  Total: 0.8322 (1.8185)
Epoch: [1][947/949] Time: 0.062 (0.066) Sym: 3.8132 (4.4390)    Mask: 0.0290 (0.0254)   Pts: 0.0578 (0.0563)    Graph: 3.8165 (7.8603)  Total: 1.3695 (1.8181)
Epoch: [1][948/949] Time: 0.063 (0.066) Sym: 1.5503 (4.4360)    Mask: 0.0092 (0.0253)   Pts: 0.0182 (0.0562)    Graph: 3.4466 (7.8556)  Total: 0.6913 (1.8169)
python: eigen/Eigen/src/Core/DenseCoeffsBase.h:410: Eigen::DenseCoeffsBase<Derived, 1>::Scalar& Eigen::DenseCoeffsBase<Derived, 1>::operator[](Eigen::Index) [with Derived = Eigen::Matrix<double, -1, 1>; Eigen::DenseCoeffsBase<Derived, 1>::Scalar = double; Eigen::Index = long int]: Assertion `index >= 0 && index < size()' failed.
Aborted (core dumped)

If I run it with --load_dir set to a saved model, it runs through both the training and trainer.generate_data().

$ python src/train_core.py --batch_size 1 --n_epochs 501 --object_name cat --load_dir /home/nixi/Projects/HybridPose_custom/data/saved_weights/occlusion_linemod/cat/checkpoints/0.02/499

That will output: saved

So for me that means the regressor can access the Eigen library and $LD_LIBRARY_PATH is set up correctly.

Am I missing something? Any idea what's going on?

Thanks and keep up the good work!

neixlo commented 4 years ago

I think I found out why this happens.

When I train for more epochs, training without a preloaded model works fine. It seems that if the model is not trained well enough (in this case, not long enough), the later processing in the refinement module fails. At this early stage of training the model may output nothing or essentially random predictions, which seems to result in an empty matrix and, in turn, the assertion failure in the Eigen library.

Solution that worked for me: train for longer than 1 epoch.

50 epochs worked fine for me; I didn't try fewer.
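If the empty-prediction guess above is right, a defensive check before handing data to native code would turn the hard abort into a skipped sample. This is only a sketch under that assumption: has_usable_predictions, refine_if_possible, and the stand-in fake_refine are hypothetical names, not functions from HybridPose or its regressor.

```python
def has_usable_predictions(*arrays, min_count=1):
    """Return True only if every input holds at least min_count entries."""
    return all(a is not None and len(a) >= min_count for a in arrays)


def refine_if_possible(refine_fn, *arrays, min_count=1):
    """Call refine_fn only when all inputs are non-empty; otherwise return None.

    refine_fn stands in for whatever routine forwards the predictions to
    the compiled refinement code.
    """
    if not has_usable_predictions(*arrays, min_count=min_count):
        return None
    return refine_fn(*arrays)


if __name__ == '__main__':
    def fake_refine(keypoints, edges):
        # Placeholder for the real refinement call.
        return len(keypoints) + len(edges)

    print(refine_if_possible(fake_refine, [(0.1, 0.2)], [(0.3, 0.4)]))
    print(refine_if_possible(fake_refine, [], [(0.3, 0.4)]))
```

The check relies only on len(), so it works for lists and for NumPy arrays alike. It does not replace training for enough epochs; it only avoids the core dump while the network output is still unusable.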