ByungKeon-Ko / mlstudy_week7

Paper Study : A Convolutional Neural Network Cascade for Face Detection

Hi, a few questions #1

Open tangyudi opened 8 years ago

tangyudi commented 8 years ago

Are you Chinese? I am also studying this paper now. Do you have QQ or WeChat? I want to ask some questions about the code and the paper.

ByungKeon-Ko commented 8 years ago

Hi, TangYudi. I'm Korean, and I don't use either of those. But if you want, you can ask me by e-mail, and I will answer as far as I can.

tangyudi commented 8 years ago

Hi, thank you. I have a few questions:

  1. The paper says: "On a single CPU core, it takes 12-net less than 36 ms to densely scan an image of size 800 × 600 for 40 × 40 faces with 4-pixel spacing, which generates 2,494 detection windows." But I do not know how the 2,494 is calculated.
  2. The image pyramid is resized by 12/F as the input image for the 12-net. What does 12/F mean, and how many scales does the pyramid usually have?
  3. I fed a 466 x 699 image to the network. After resize_image it is 139 x 209, and for (out = net_12c_full_conv.blobs['prob'].data[0][1, :, :]) out.shape is 64 x 99. Does each confidence point in the 64 x 99 map give the probability of a face, and if so, how can a single point represent a rectangle?
  4. I use 1W face images and 1W background images without faces to train the 12-net. Is that enough?

I am new to face detection. It is really nice of you to help me. Can you tell me your e-mail address?

ByungKeon-Ko commented 8 years ago
  1. Roughly, ( 800 * 12/F - 12 ) * ( 600 * 12/F - 12 ) / 4 / 4, where F is the minimum face size (40) and 12 is the 12-net input size. The real number is a little different because I skipped the quantization (rounding) step.
  2. As I described in (1), before sliding the window you need to decimate the input image by 12/F. If you do not decimate, no windows are detected at the first pyramid level, so the computation there is wasted. For the number of pyramid levels, the author describes it on his Q&A page:

    http://personal.stevens.edu/~hli18/papers/faq_CVPR2015_CasCNN.html

  3. Each point in the network's output matrix is the position of a face of size 12 in the resized image. If you revert the scale, you get the original face size and location.
  4. What is 1W? I don't understand. The author said he used 200K negative samples.
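
As a rough check of answers (1) and (3), the sketch below redoes the arithmetic with the rounding included. Without rounding, the formula in (1) gives 2,394; counting window positions as (length - 12) / 4 + 1 along each axis gives the paper's 2,494. The function names are hypothetical and are not taken from the repository. The stride of 2 used for the full-convolution output is an assumption inferred only from the shapes quoted in question 3 (139 x 209 in, 64 x 99 out), and truncating the resized dimensions is likewise assumed because it matches 466 x 699 becoming 139 x 209.

```python
def scaled_size(length, min_face=40, net_size=12):
    """Length of one image side after resizing by net_size / min_face (truncated)."""
    return length * net_size // min_face


def grid_shape(height, width, min_face=40, net_size=12, stride=4):
    """Rows and columns of net_size x net_size windows on the resized image."""
    h = scaled_size(height, min_face, net_size)
    w = scaled_size(width, min_face, net_size)
    rows = (h - net_size) // stride + 1
    cols = (w - net_size) // stride + 1
    return rows, cols


def count_windows(height, width, min_face=40, net_size=12, stride=4):
    """Total number of windows scanned at a single pyramid level."""
    rows, cols = grid_shape(height, width, min_face, net_size, stride)
    return rows * cols


def window_in_original(row, col, min_face=40, net_size=12, stride=2):
    """Map one output-map point back to (x, y, size) in the original image.

    The point (row, col) stands for the net_size x net_size window whose
    top-left corner is (col * stride, row * stride) in the resized image;
    dividing by the resize factor net_size / min_face reverts the scale.
    """
    x = col * stride * min_face / net_size
    y = row * stride * min_face / net_size
    return x, y, min_face


if __name__ == "__main__":
    print(count_windows(600, 800))           # 2494, the figure quoted from the paper
    print(grid_shape(466, 699, stride=2))    # (64, 99), the shape from question 3
    print(window_in_original(3, 6))          # (40.0, 20.0, 40)
```

So a single point in the output map can represent a rectangle because its index fixes the top-left corner and the network input size fixes the side length. Under these assumptions, point (3, 6) corresponds to a 40 x 40 box at x = 40, y = 20 in the original image.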