davidsandberg / facenet

Face recognition using Tensorflow
MIT License

SphereFace & A-Softmax #385

Open sczhengyabin opened 7 years ago

sczhengyabin commented 7 years ago

Hi @davidsandberg: The recent paper SphereFace: Deep Hypersphere Embedding for Face Recognition introduces a novel loss for the face recognition training procedure, and it achieves a very decent result on MegaFace and LFW (99.42% according to the authors) by training only on CASIA-WebFace. I tried to add this loss to the facenet code, but I failed to translate it from the paper into TensorFlow code, since it seems much more complicated than the center loss. Can you or anyone else try this? It could be an exciting breakthrough.

davidsandberg commented 7 years ago

Thanks! I will have a look at it. From what I can see, the modification can be added before the TensorFlow softmax, so hopefully it's not too cumbersome to implement.

bkj commented 7 years ago

FWIW -- it looks like they're about to release code for their models here

https://github.com/wy1iu/sphereface

I've tried to implement some of their ideas with some success, but not to the levels that they report in the paper yet. Excited to see their code to figure out where my implementation differs.

It's not particularly difficult to implement -- you basically just shrink the logit of the correct class label before applying the softmax. I'd guess it's easier than center loss.
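
To make that concrete, here is a toy TF sketch of "shrink the correct-class logit" (my own simplification for illustration, not the SphereFace formulation, which penalizes the angle rather than subtracting a constant):

import tensorflow as tf

def shrink_correct_class_logit(logits, labels, margin=0.35):
    # Lower only the logit at the label position before the softmax, so the
    # network has to win the correct class by an extra margin to get the same loss.
    one_hot = tf.one_hot(labels, depth=tf.shape(logits)[-1])
    return logits - margin * one_hot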

sczhengyabin commented 7 years ago

@bkj Yeah, I noticed that.

However, I'd kinda prefer a TF version rather than a Caffe one.

bkj commented 7 years ago

Yeah, but it will provide a reference implementation that we can work from -- without code that produces the results reported in a paper, it's usually pretty difficult to reproduce those numbers (e.g. because of hyperparameters and other details that aren't mentioned in the paper).

aleksandar commented 7 years ago

Hi guys, take a look at this paper: L2-constrained Softmax Loss for Discriminative Face Verification (https://arxiv.org/pdf/1703.09507.pdf). It's a similar solution but easier to implement.

bkj commented 7 years ago

Yeah, definitely easier to implement. It doesn't look like they run the same experiments as the SphereFace paper (train on CASIA, evaluate on LFW), so it's hard to compare head to head, but it's easy enough to implement and test.

AFAICT, there are various permutations of these losses that could be tried (a rough sketch follows this list):

  yes/no normalize weights
  yes/no normalize features
  yes/no margin penalty (and if yes, in what form)
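
A minimal sketch of that design space in TF1-style code like the rest of this thread (the function name, flags, and default values are mine, purely for illustration, and are not tuned):

import tensorflow as tf

def margin_softmax_head(features, labels, num_classes,
                        normalize_weights=True, normalize_features=True,
                        scale=20.0, margin=0.35):
    # Generic classification head covering the permutations above: optionally
    # L2-normalize the weights and/or the features, optionally subtract a margin
    # from the correct-class logit, then rescale before the softmax cross-entropy.
    w = tf.get_variable('softmax_w', [int(features.get_shape()[-1]), num_classes],
                        initializer=tf.truncated_normal_initializer(stddev=0.1))
    if normalize_weights:
        w = tf.nn.l2_normalize(w, 0)
    if normalize_features:
        features = tf.nn.l2_normalize(features, 1)
    logits = tf.matmul(features, w)
    if margin > 0:
        logits = logits - margin * tf.one_hot(labels, depth=num_classes)
    logits = logits * scale
    return tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)
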
aleksandar commented 7 years ago

In this paper they mention the SphereFace method and report better results.

bkj commented 7 years ago

Better results, but using 7x more training images -- 3.7M vs. 0.5M. The particularly notable thing IMO about SphereFace is that they report good results (on both LFW and MegaFace) using a relatively small training dataset.

ugtony commented 7 years ago

L2-Softmax Loss was also trained on a 0.5M dataset (MS-small instead of CASIA-WebFace) and got 99.28% on LFW, which is lower than SphereFace's 99.42%.

In my opinion, L2-Softmax doesn't have much power to reduce intra-class variation. That is why its performance can be further improved to 99.33% when used with center loss. I'm curious whether the parameter m introduced in SphereFace could be added to L2-Softmax Loss, since it seems quite powerful in terms of intra-class variation reduction.

aleksandar commented 7 years ago

I trained this model with L2-softmax on the CASIA set and got lower results than with David's standard loss function. I put the normalization and scaling between the two bottlenecks and used the alpha parameter the paper suggests for big training datasets. Maybe the normalization should go after the prelogits layer?

bkj commented 7 years ago

I've never done an implementation in Tensorflow.

apollo-time commented 7 years ago

How can I implement A-Softmax with TensorFlow?

ugtony commented 7 years ago

@aleksandar I put the normalization and scaling with alpha=20 after the bottleneck and got slightly worse results as well (trained on CASIA).

JianbangZ commented 7 years ago

@ugtony Same here. I use alpha=24, do the normalization after the prelogits just like the embeddings, and use the normalized features before the scale layer for testing. With softmax loss + center loss I can get around 98.5% on CASIA, but I'm stuck at 98% with L2-softmax + center loss. The authors use a batch size of 256 (across 2 GPUs) for training, so maybe a larger batch size would help?

bkj commented 7 years ago

Related question: has anyone ever been able to replicate the > 99% LFW accuracy from CASIA training reported in the center loss paper, and implemented in this repo:

https://github.com/ydwen/caffe-face

I tried training the models and got accuracies around 0.98, and even using their pretrained vectors only accuracies of 0.987.

aleksandar commented 7 years ago

@ugtony, @JianbangZ I used alpha=40 and got 99.6% LFW accuracy on random test pairs. Do you maybe measure TPR@FAR=10e-3 or similar?

bkj commented 7 years ago

@aleksandar Could you explain more precisely what you did to get those results (architecture, training set, etc)? Could you post your code on a fork somewhere? And is that 99.6 on the standard LFW benchmark set (6000 pairs)?

aleksandar commented 7 years ago

The training set is a cleaned CASIA without the 3 subjects that overlap with the LFW set. The architecture is Inception-ResNet v1, but I generated new random test pairs from the LFW set. So yes, it's not comparable with other results :(

bkj commented 7 years ago

Could you run on the canonical set for comparison? I'd be very interested.

aleksandar commented 7 years ago

validate_on_lfw results:
Accuracy: 0.964+-0.005
Validation rate: 0.74667+-0.03141 @ FAR=0.00133
Area Under Curve (AUC): 0.994
Equal Error Rate (EER): 0.038

bkj commented 7 years ago

Ah interesting -- so still not getting close to the results reported in the paper. Thanks for running that benchmark.

apollo-time commented 7 years ago

Is my L2-Softmax implementation right? I calculate the softmax on the alpha-scaled embeddings instead of the prelogits.

# Bottleneck (prelogits) features from the backbone network
prelogits, _ = network.inference(images, args.keep_probability,
    phase_train=phase_train_placeholder, bottleneck_layer_size=args.embedding_size,
    weight_decay=args.weight_decay)
# L2-normalize the features to unit length
embeddings = tf.nn.l2_normalize(prelogits, 1, 1e-10, name='embeddings')
# L2-constrained softmax: scale the unit-length embeddings to radius alpha
if args.l2_softmax_alpha > 0:
    prelogits = embeddings * args.l2_softmax_alpha
# Class logits computed on the (scaled) embeddings
logits = slim.fully_connected(prelogits, len(train_set), activation_fn=None,
    weights_initializer=tf.truncated_normal_initializer(stddev=0.1),
    weights_regularizer=slim.l2_regularizer(args.weight_decay),
    scope='Logits', reuse=False)
# Standard softmax cross-entropy on the scaled logits
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=labels, logits=logits, name='cross_entropy')

ugtony commented 7 years ago

@apollo-time Your version is quite similar to mine. I put l2_normalize after bottleneck layer instead. I think both versions are fine.

apollo-time commented 7 years ago

@ugtony Actually the network inference already contains the bottleneck layer, so I think it is the same as yours.

def inception_resnet_v2(inputs, is_training=True,
                        dropout_keep_prob=0.8,
                        bottleneck_layer_size=128,
                        reuse=None,
                        scope='InceptionResnetV2'):
    # ... (backbone layers elided) ...
    with tf.variable_scope('Logits'):
        end_points['PrePool'] = net
        #pylint: disable=no-member
        net = slim.avg_pool2d(net, net.get_shape()[1:3], padding='VALID',
                              scope='AvgPool_1a_8x8')
        net = slim.flatten(net)
        net = slim.dropout(net, dropout_keep_prob, is_training=is_training,
                           scope='Dropout')
        end_points['PreLogitsFlatten'] = net

    # The bottleneck layer is part of the inference function itself
    net = slim.fully_connected(net, bottleneck_layer_size, activation_fn=None,
                               scope='Bottleneck', reuse=False)

    return net, end_points

I'm training this on MS-Celeb and I'll share the result. Also, @ugtony, please teach me how to implement A-Softmax with TensorFlow.

aleksandar commented 7 years ago

@apollo-time I did the same.

apollo-time commented 7 years ago

I'm sharing my training result with L2-softmax on the MS-Celeb dataset.

Running forward pass on LFW images
Accuracy: 0.990+-0.005
Validation rate: 0.91100+-0.02621 @ FAR=0.00100

I think the result is worse than with plain softmax, which reached accuracy 0.992. But I see the regularization loss is smaller. Does that mean L2-softmax generalizes better?

[attached image: l2-softmax]

zhouhui1992 commented 7 years ago

@apollo-time When you calculate the center loss, do you use the original prelogits or the prelogits after the L2-softmax normalization? Which network did you use?

apollo-time commented 7 years ago

@zhouhui1992 I calculate the center loss on the L2-normalized embeddings, and I use inception_resnet_v1.
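
For anyone wiring this up, a sketch of that choice, continuing from @apollo-time's snippet above and assuming the center_loss helper in this repo's facenet.py takes (features, label, alfa, nrof_classes) with argument names mirroring train_softmax.py; check the signature in your checkout before copying:

import facenet

# Center loss on the unit-norm embeddings (defined in the snippet above),
# rather than on the raw prelogits
center_loss_term, _ = facenet.center_loss(embeddings, labels,
                                           args.center_loss_alfa, len(train_set))
total_loss = tf.reduce_mean(cross_entropy) + args.center_loss_factor * center_loss_term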

JianbangZ commented 7 years ago

@apollo-time Have any of you tried training L2-softmax without center loss? What we are supposed to see, according to the paper, is L2-softmax + center > softmax + center > softmax.

zhouhui1992 commented 7 years ago

@apollo-time Why did you use the L2-normalized embeddings to calculate the center loss rather than the original prelogits or the scaled embeddings? I use inception_resnet_v2; when I use L2-softmax to fine-tune my model trained on the MS-Celeb-1M dataset, the accuracy and val rate always drop. My best accuracy without L2-softmax is 0.994 with val_rate=0.9603@FAR=0.001. What are your best accuracy and val_rate on LFW?

JianbangZ commented 7 years ago

@apollo-time @ugtony I think L2-softmax was applied after the first FC layer, which is high-dimensional (1792 etc.). Not sure if the reason we are not seeing improvements is because we are actually applying the normalization to the reduced-dimension bottleneck layer.

apollo-time commented 7 years ago

@JianbangZ I'm trying to train L2-softmax without the bottleneck layer.

JianbangZ commented 7 years ago

@apollo-time Please share the results when it's done. Thanks!

apollo-time commented 7 years ago

[attached image: no_bottleneck] I'm sharing the result of training L2-softmax without the bottleneck layer. Database: MS-Celeb, Model: inception_resnet_v1, Final LFW accuracy: 99.25%, val rate: 0.9781.

I did not test with A-softmax yet because I don't know how to implement it with tensorflow. :-(

jack55436001 commented 7 years ago

Hello everyone, I traced the code on a branch. What's the difference between angular_softmax_loss_decomp (I have no idea what this code means) and angular_softmax_loss (I think this is the SphereFace implementation)?

JianbangZ commented 7 years ago

@jack55436001 I think those are some experimental code David wrote in a new branch. The master branch doesn't contain that code.

jack55436001 commented 7 years ago

@JianbangZ Thanks for your answer^^

zhengge commented 7 years ago

Hi everyone, I used A-Softmax, but I cannot reproduce the experimental results in the SphereFace paper. My result: Database: MS-Celeb, Model: inception_resnet_v1, LFW accuracy: 99.4%, MegaFace rank-1: 60.0%.

But in the SphereFace paper, the MegaFace rank-1 is 72.7%.

JianbangZ commented 7 years ago

@zhengge What does your SphereFace implementation look like?

zhouhui1992 commented 7 years ago

@zhengge could you share your code?

auroua commented 7 years ago

SphereFace is hard to implement in TensorFlow because you can't do element-wise assignment. I think a better way is to implement it with tf.py_func or to define your own layer.

yao5461 commented 7 years ago

@zhengge Hi, how do you define a new scoring model when evaluating on MegaFace? SphereFace uses cosine similarity to measure distance, but I don't know how to change the scoring model. Thanks!

xmuszq commented 6 years ago

@aleksandar @apollo-time

For the L2-norm method, the authors want to emphasize the hard samples by putting all samples (embeddings) on the same norm. If that's the case, I think setting the batch size very small, like 1, would have the same effect, since each sample, including the hard ones, would also be emphasized?

billtiger commented 6 years ago

@zhengge How did you implement the A-Softmax loss in TensorFlow? Could you share your SphereFace source code? Thanks! Good luck!

pppoe commented 6 years ago

I put up an implementation in TF. Comments & suggestions are highly welcomed. https://github.com/pppoe/tensorflow-sphereface-asoftmax

Erdos001 commented 6 years ago

I implemented A-Softmax in TensorFlow. It's not hard: you just need to implement the fully connected layer of the softmax by normalizing W and x before applying tf.matmul, so that the result gives you cos<Wi, x>. If you know the COCO loss, you will find you can easily implement it this way as well. I first reproduced the COCO loss and found that I could use the one-hot label trick to implement SphereFace.

You can use the multiple-angle formula to get cos(m<Wi, x>). Afterwards, convert the labels to one-hot with one_hot_label = tf.one_hot(label_batch), then use tf.multiply(tf.matmul(x, W), one_hot_label) to element-wise select the label-corresponding entry of the logits.

As in equation (6) of the paper, you insert cos(m<Wi, x>) into the label-corresponding location of the logits by multiplying with the one-hot label and adding, so it's easy. Sorry, I cannot share the source code because of my company's rights, but I think you can easily implement it with just one-hot label selection and the multiple-angle formula.
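
In case a sketch helps, here is a minimal TF version of that one-hot substitution (the helper name is mine; this keeps only the cos(m*theta) term for m=4 and omits SphereFace's piecewise psi and lambda annealing, so it is not a full reproduction of the paper):

import tensorflow as tf

def asoftmax_logits(features, labels, num_classes, scope='ASoftmax'):
    # One-hot trick for A-Softmax with m=4, as described above
    with tf.variable_scope(scope):
        w = tf.get_variable('weights', [int(features.get_shape()[-1]), num_classes],
                            initializer=tf.truncated_normal_initializer(stddev=0.1))
        # Normalize W and x so their product is cos(theta_i) for every class i
        w_norm = tf.nn.l2_normalize(w, 0)
        x_norm = tf.nn.l2_normalize(features, 1)
        x_len = tf.sqrt(tf.reduce_sum(tf.square(features), axis=1, keep_dims=True))
        cos_theta = tf.matmul(x_norm, w_norm)   # [batch, num_classes]
        # Multiple-angle formula: cos(4*theta) = 8*cos^4(theta) - 8*cos^2(theta) + 1
        cos_m_theta = 8.0 * tf.pow(cos_theta, 4) - 8.0 * tf.square(cos_theta) + 1.0
        # Swap in cos(m*theta) only at the label position via the one-hot mask,
        # avoiding any element-wise assignment
        one_hot = tf.one_hot(labels, depth=num_classes)
        return x_len * (cos_theta + one_hot * (cos_m_theta - cos_theta))

The returned logits can then go straight into tf.nn.sparse_softmax_cross_entropy_with_logits, as in the snippets earlier in the thread.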

JianbangZ commented 6 years ago

I have implemented CosFace and the latest ArcFace. The cosine margin converges more easily and gives much stronger results than center loss.
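
For reference, the two margins differ only in where the penalty is applied. A hedged sketch, assuming cos_theta comes from a normalized-W, normalized-x matmul as in the earlier A-Softmax sketch (the scale and margin values are just commonly cited defaults, not anything validated in this thread):

import tensorflow as tf

def cosface_logits(cos_theta, labels, s=30.0, m=0.35):
    # CosFace / AM-Softmax: subtract the margin from the target-class cosine
    one_hot = tf.one_hot(labels, depth=tf.shape(cos_theta)[-1])
    return s * (cos_theta - m * one_hot)

def arcface_logits(cos_theta, labels, s=30.0, m=0.5):
    # ArcFace: add the margin to the target-class angle instead
    theta = tf.acos(tf.clip_by_value(cos_theta, -1.0 + 1e-7, 1.0 - 1e-7))
    one_hot = tf.one_hot(labels, depth=tf.shape(cos_theta)[-1])
    return s * tf.cos(theta + m * one_hot)

Either set of logits can then be fed to tf.nn.sparse_softmax_cross_entropy_with_logits as usual.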

zhenglaizhang commented 6 years ago

@JianbangZ agree with you, cosine performs much better in real data set test.

Heermosi commented 6 years ago

> (quoting @Erdos001's A-Softmax explanation above)

Hello, should I re-train the model from scratch? How long would it take if I train it with two 1080Ti GPUs on vgg2?

Heermosi commented 6 years ago

> @JianbangZ agree with you, cosine performs much better in real data set test.

Really? I found that MTCNN stretches the extracted faces, so people who differ in head shape can become indistinguishable. Which face alignment method did you use?