ZheC / Realtime_Multi-Person_Pose_Estimation

Code repo for realtime multi-person pose estimation in CVPR'17 (Oral)

Training loss goes into NAN #102

Open Nestarneal opened 6 years ago

Nestarneal commented 6 years ago

Hi, I tried to reproduce the result, but the training loss goes to NaN after iteration 15.

Here are the steps I've taken:

  1. Use training/get_lmdb.sh to get the LMDB data (since MATLAB is not available on my PC)
  2. Download your customized caffe and build it
  3. In training/example_proto/train_pose.sh, I modified
    • the path to caffe executable
    • the path to the VGG model (downloaded from the link in the README)
    • gpu to all
  4. In training/example_proto/pose_train_test.prototxt, I modified
    • the source path in the data layer to the location of the LMDB data downloaded with training/get_lmdb.sh
  5. In training/example_proto/pose_solver.prototxt, I modified
    • the snapshot_prefix path to a location I prefer; the training parameters are not modified

Did any of these steps go wrong, or should I adjust the parameters to keep the loss from going to NaN? Many thanks.

anatolix commented 6 years ago

Hi, I am not sure, but I have a guess about your problem: it could be a bug in the PAF calculation.

I've written my own Python augmentation code (by the way, it is about 5 times faster than the C++ version and has significantly less code), and I ran into a bug like yours: the loss became NaN.

I found and fixed this bug in the following commit: https://github.com/anatolix/keras_Realtime_Multi-Person_Pose_Estimation/commit/9e5adc4d4af64b642562882cedf9e30cbf00ed05 The cause of the bug was that the limb vector sometimes has zero length (the body parts at the start and end of the PAF are in the same place, i.e. the limb is perpendicular to the image plane, for example pointing at the camera). Normalizing that vector produces NaN, which kills the whole neural network instantly.
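For illustration, here is a minimal numpy sketch of that kind of guard; the function name and epsilon threshold are my own, not taken from the linked commit:

  import numpy as np

  def limb_unit_vector(joint_a, joint_b, eps=1e-8):
      """Return the unit vector from joint_a to joint_b, or None for a
      (near-)zero-length limb, e.g. one pointing straight at the camera."""
      vec = np.asarray(joint_b, dtype=np.float32) - np.asarray(joint_a, dtype=np.float32)
      norm = float(np.sqrt(vec[0] ** 2 + vec[1] ** 2))
      if norm < eps:
          # Dividing by a zero norm would write NaNs into the PAF label,
          # and a single NaN in the label poisons the loss for the whole net.
          return None
      return vec / norm

  # Degenerate limb: both joints at the same pixel, so no PAF is written for it
  assert limb_unit_vector((120.0, 80.0), (120.0, 80.0)) is None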

After that I noticed that the original code probably has the same bug too: https://github.com/CMU-Perceptual-Computing-Lab/caffe_train/blob/master/src/caffe/cpm_data_transformer.cpp

  float norm_bc = sqrt(bc.x*bc.x + bc.y*bc.y);
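  // note: there is no zero-length check here; if norm_bc is 0, the divisions below produce NaN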
  bc.x = bc.x /norm_bc;
  bc.y = bc.y /norm_bc;

The solution for your problem: just check that the network input doesn't contain NaNs, and if it does, remove that picture from training.
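A minimal Python/numpy sketch of such a check (the helper name and the loop it is used in are hypothetical, just to show the idea):

  import numpy as np

  def contains_nan(*arrays):
      """Return True if any of the given arrays contains a NaN value."""
      return any(np.isnan(np.asarray(a, dtype=np.float32)).any() for a in arrays)

  # Hypothetical usage in a data-preparation loop: drop any sample whose
  # image or generated heatmap/PAF labels already contain NaNs.
  # for image, heatmaps, pafs in samples:
  #     if contains_nan(image, heatmaps, pafs):
  #         continue  # skip this picture instead of feeding NaNs to the network
  #     write_to_lmdb(image, heatmaps, pafs)  # hypothetical helper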

coldgemini commented 6 years ago

I got the same issue; the loss suddenly goes to NaN at some iteration. Did you guys solve this problem by fixing this PAF bug?