ZheC / Realtime_Multi-Person_Pose_Estimation

Code repo for realtime multi-person pose estimation in CVPR'17 (Oral)

About generating vector field in CPM_data_transformer.cpp #124

Open RuWang15 opened 6 years ago

RuWang15 commented 6 years ago

I read the code and I am confused: is the vector field generated merely from the positions of the joints? I didn't find anything about placing the vectors in a region. https://github.com/CMU-Perceptual-Computing-Lab/caffe_train/blob/76dd9563fb24cb1702d0245cda7cc36ec2aed43b/src/caffe/cpm_data_transformer.cpp#L1137

anatolix commented 6 years ago

For each PAF there are 2 layers: one with the X coordinate of the vector and one with the Y coordinate. Note the difference between (np + 1 + 2*i) and (np + 2 + 2*i).
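To make the indexing concrete, the two expressions in that comment can be read like this (a sketch, not the original C++; `np_` stands in for the `np` offset used in the transformer code):

```python
def paf_channels(np_, i):
    """Channel indices of the i-th PAF: X component first, then Y component."""
    x_channel = np_ + 1 + 2 * i  # layer with the X coordinate of the vector
    y_channel = np_ + 2 + 2 * i  # layer with the Y coordinate of the vector
    return x_channel, y_channel
```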

RuWang15 commented 6 years ago

@anatolix Thanks, but I'm still confused about whether the vector field has something to do with the area of the limbs. And where can I find this in the code?

anatolix commented 6 years ago

I don't fully understand the question. The code you've linked does the whole PAF generation.

A PAF looks approximately like this: [PAF visualization image] (1 vector for each 8x8 image patch). The picture was generated with https://github.com/anatolix/keras_Realtime_Multi-Person_Pose_Estimation/blob/master/py_rmpe_server/rmpe_server_tester.py

RuWang15 commented 6 years ago

Thank you very much for the excellent example! I will explain my question with your picture. [PAF visualization image] Take the vector field on the arm as an example. My question is: "how do you know the area of the forearm, so that you put the vectors on it rather than on the region beside the arm?"

anatolix commented 6 years ago

It doesn't know the area. (It could have used segmentation for that, but segmentation is not actually used.) The PAF matching the arm's width is a lucky accident in this picture; there is no exact match for the other PAFs.

Currently it just takes the segment A->B and draws the PAF in every 8x8 square whose center is closer than 8 pixels to A->B. Look at the putVecMaps function for details. The line if(dist <= thre){ controls the PAF placement; the code above it calculates the distance to the segment.
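Here is a minimal numpy sketch of that logic (not the original C++; the stride of 8 and the 8-pixel threshold are taken from the description above, and the function name merely mirrors putVecMaps):

```python
import numpy as np

def put_vec_maps(paf_x, paf_y, a, b, stride=8, thre=8.0):
    """Rasterize the unit vector of limb segment a->b into PAF channels.

    paf_x, paf_y: (H, W) float arrays on the output grid (e.g. 46x46).
    a, b: joint positions in input-image pixel coordinates.
    stride: downsampling factor between input image and output grid.
    thre: max distance (in input pixels) from the segment a->b.
    """
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    ab = b - a
    norm = np.linalg.norm(ab)
    if norm < 1e-8:
        return
    u = ab / norm  # unit vector along the limb

    h, w = paf_x.shape
    # center of every grid cell, in input-image pixel coordinates
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    cx = xs * stride + stride / 2.0
    cy = ys * stride + stride / 2.0

    # distance from each cell center to the segment a->b:
    # project onto the segment (clamped to its ends), then measure offset
    dx, dy = cx - a[0], cy - a[1]
    t = np.clip(dx * u[0] + dy * u[1], 0.0, norm)
    dist = np.hypot(dx - t * u[0], dy - t * u[1])

    close = dist <= thre     # the if(dist <= thre) condition
    paf_x[close] = u[0]      # X component of the unit vector
    paf_y[close] = u[1]      # Y component of the unit vector
```

The actual transformer also averages vectors where limbs of several people overlap; this sketch simply overwrites, which is enough to show where the if(dist <= thre) test places the vectors.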

RuWang15 commented 6 years ago

Thank you very much! One last question (actually more than one): is all the training data masked by mask_all or mask_miss before going through the network, with the masked parts black? And if a picture contains 5 annotated people, the picture appears 5 times in the data, right?

anatolix commented 6 years ago

> Is all the training data masked by mask_all or mask_miss before going through the network, with the masked parts black?

mask_all is never used for anything except visualization. mask_miss is not actually a picture; it is an array of floats from 0.0 to 1.0, and the loss is just multiplied by this mask. If you multiply mask_miss by 255 and convert to integer, you will get something that looks like a picture where the masked parts are black and the non-masked parts are white. The original picture is not modified, except for the VGG preprocessing.
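For illustration, turning the float mask into a viewable picture could look like this (mask_miss here is an assumed numpy array on the output grid; this is just the "multiply by 255" step described above):

```python
import numpy as np

# mask_miss: assumed float array in [0.0, 1.0] on the output grid
mask_img = (mask_miss * 255).astype(np.uint8)  # masked parts -> 0 (black)
```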

> And if a picture contains 5 annotated people, the picture appears 5 times in the data, right?

The short answer is yes. The long answer: some of them are filtered out, so that we don't feed in pictures of people who are too close to each other.

RuWang15 commented 6 years ago

You mean that the VGG layers at the beginning of the network use the masked pictures, and the layers after them use the original pictures? I'm really confused about the 'mask' part 😂

anatolix commented 6 years ago

The mask never touches the pictures.

0) The mask has exactly the same dimensions as the ground truth and the network output, i.e. 46 x 46 x num_layers.

The mask is applied to: 1) the ground-truth heatmaps and PAFs (multiplied by the mask) 2) the network output (multiplied by the mask)

If the mask is zero at some point, it means "ignore the answer at this point while training the network", because the loss will be zero there.
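A minimal numpy sketch of that masked loss (the names and the 46 x 46 shape follow the description above; the real training code implements this inside the framework's loss layer):

```python
import numpy as np

def masked_l2_loss(pred, gt, mask):
    """pred, gt, mask: (46, 46, num_layers) arrays.

    Both the prediction and the ground truth are multiplied by the
    mask, so wherever mask == 0 the difference is 0 and that point
    contributes nothing to the loss or the gradient.
    """
    diff = pred * mask - gt * mask
    return np.sum(diff ** 2)
```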

Some pictures about masks here https://github.com/michalfaber/keras_Realtime_Multi-Person_Pose_Estimation/issues/8#issuecomment-342977756

Ai-is-light commented 6 years ago

@anatolix Thanks for your answers. But I would like to know the labels' format (i.e. the ground truth of the CPM and PAF branches). I mean, could you show the labels or ground truth of the CPM and PAF branches, and the output of every stage? I'm confused about the labels' format and the output of every stage. Thanks

anatolix commented 6 years ago

I am not sure I completely understand the question, but regarding the loss and the stages:

Actually each stage has the same output format and exactly the same ground truth, and the loss is calculated at each stage. In an ideal world the last stage would be enough, but in the real world the network is very deep and the gradients would be completely lost by the last layer. To push them through the network, we "tweak" them in the middle layers in the right direction. This tweak is called 'intermediate supervision', and if you want to know more about it you should read the previous work, "Convolutional Pose Machines": https://arxiv.org/pdf/1602.00134.pdf
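As a hedged sketch, intermediate supervision amounts to summing the same masked loss over every stage (this reuses the assumed masked_l2_loss helper from the earlier sketch; it is an illustration, not the original Caffe prototxt):

```python
def total_loss(stage_outputs, gt, mask):
    """Sum the masked L2 loss over all stages.

    stage_outputs: one prediction per stage, each with the same
    46 x 46 x num_layers shape, all compared against the SAME ground
    truth — so the gradient is injected at every stage instead of
    having to survive the whole depth of the network.
    """
    return sum(masked_l2_loss(out, gt, mask) for out in stage_outputs)
```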