ResNet-50 trained model questions

bobetocalo commented 5 years ago

Dear Zhenhua Feng,

First of all, I would like to congratulate you on your excellent work. I'm a PhD student in Spain, and my research is focused on face alignment. I have read your paper (https://arxiv.org/pdf/1711.06753.pdf) and I would like to ask some questions.

Could you explain in more detail how you achieved these results using ResNet-50? The only thing I know is that you fine-tuned the ImageNet weights, but do you freeze any layers of the ResNet-50 architecture? Are you only replacing the last fully connected layer of 1000 classes with another one of num_landmarks*2 outputs?

According to Section 2 (Related work), some approaches learn a heat map for each landmark. However, your approach regresses the landmark positions (x, y) directly. Is that right? Is it possible to obtain your ResNet-50 model trained on 300W in a framework such as TensorFlow or PyTorch, so that it can be loaded easily from a Python script?

I look forward to your response.

Best regards, Roberto Valle

FengZhenhua commented 5 years ago

@bobetocalo

Dear Roberto,

Thank you very much for your message. For your questions:

We have tried multiple settings of the ResNet-50 model. The model used in the paper simply adds another fully connected layer after the last 1000-class output layer; this new layer outputs a vector of num_landmarks*2 values directly.
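A minimal PyTorch sketch of the setup described above, not the authors' actual code. The class name `LandmarkRegressor`, the toy backbone, and the absence of a nonlinearity between the 1000-way output and the new layer are all assumptions made for illustration; in practice the backbone would be an ImageNet-pretrained ResNet-50.

```python
import torch
from torch import nn


class LandmarkRegressor(nn.Module):
    """Backbone with a 1000-way output, followed by one extra FC layer."""

    def __init__(self, backbone: nn.Module, num_landmarks: int = 68,
                 backbone_out: int = 1000):
        super().__init__()
        self.backbone = backbone
        # Extra fully connected layer stacked on top of the 1000-class output.
        self.fc_landmarks = nn.Linear(backbone_out, num_landmarks * 2)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        logits = self.backbone(images)       # (batch, 1000)
        return self.fc_landmarks(logits)     # (batch, num_landmarks * 2)


# Tiny stand-in backbone so the sketch runs without downloading any weights.
# Assumption: a real run would use a pretrained ResNet-50 here, which also
# ends in a 1000-way fully connected layer.
toy_backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 1000),
)

model = LandmarkRegressor(toy_backbone, num_landmarks=68)
output = model(torch.randn(2, 3, 64, 64))
print(output.shape)  # one row of 136 values (68 landmarks * 2) per image
```

With 68 landmarks, each image yields 136 regressed values, which can be compared against the ground-truth coordinates with whichever regression loss is being used.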

Yes, you are right, the model directly outputs landmark positions (x1, y1, ..., xL, yL) in the form of a vector.
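For readers consuming such an output, the interleaved layout (x1, y1, ..., xL, yL) can be unpacked into per-landmark points as in the sketch below. The helper name `unpack_landmarks` is hypothetical and only illustrates the vector layout stated above.

```python
from typing import List, Sequence, Tuple


def unpack_landmarks(flat: Sequence[float]) -> List[Tuple[float, float]]:
    """Turn (x1, y1, ..., xL, yL) into [(x1, y1), ..., (xL, yL)]."""
    if len(flat) % 2 != 0:
        raise ValueError("expected an even number of values (x, y pairs)")
    # Step through the vector two values at a time: x at i, y at i + 1.
    return [(flat[i], flat[i + 1]) for i in range(0, len(flat), 2)]


points = unpack_landmarks([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
print(points)  # [(10.0, 20.0), (30.0, 40.0), (50.0, 60.0)]
```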

At the moment we only have a single CNN7 Caffe model with 68 landmarks, and it is for internal use only. We do not have a ResNet-50 model that is readable from Python.

Best regards,

Zhenhua

FengZhenhua commented 5 years ago

@bobetocalo I forgot to mention that it is better to freeze the first few layers by setting their learning rate to 0.
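In PyTorch, one way to express "learning rate 0 for the first few layers" is with per-group learning rates in the optimizer. The sketch below uses a toy two-layer network as a stand-in for ResNet-50 (an assumption to keep it small); which layers count as "the first few" is not specified in this thread.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy network standing in for ResNet-50 (assumption, keeps the sketch small).
model = nn.Sequential(
    nn.Linear(4, 8),   # "early" layer, to be frozen
    nn.ReLU(),
    nn.Linear(8, 2),   # "late" layer, still trained
)
early, late = model[0], model[2]

# One parameter group per part of the network; lr=0.0 freezes the early layer,
# while the late layer falls back to the default lr given below.
optimizer = torch.optim.SGD(
    [
        {"params": early.parameters(), "lr": 0.0},
        {"params": late.parameters()},
    ],
    lr=0.01,
)

early_before = early.weight.detach().clone()
late_before = late.weight.detach().clone()

# One training step on random data.
inputs, targets = torch.randn(16, 4), torch.randn(16, 2)
loss = nn.functional.mse_loss(model(inputs), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(torch.equal(early.weight, early_before))  # frozen layer is unchanged
print(torch.equal(late.weight, late_before))    # trained layer has moved
```

An alternative with the same effect on the weights is to call `requires_grad_(False)` on the frozen parameters, which also skips computing their gradients.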

xjcvip007 commented 5 years ago

Thanks for your work. From the above, you mean that you connect a num_landmarks*2 output layer directly to the last 1000-class output layer instead of replacing that layer, right? Another question: in your two-stage landmark localisation method, how does it work if one image contains two people? Thank you~

FengZhenhua commented 5 years ago

@xjcvip007 Hi, in the paper we did use the 1000-class layer directly. However, we also tried replacing it with a 1024- or 2048-unit FC layer, and the results were quite similar. For your second question, if a cropped image contains two people, the correct one is usually at the centre, so the network automatically learns to locate the correct one. We did not encounter this issue in practice.

FengZhenhua / Wing-Loss

ResNet-50 trained model questions #13