sczhengyabin opened this issue 7 years ago
Thanks! I will have a look at it. But from what I can see, the modification can be added before the TensorFlow softmax, so hopefully it's not too cumbersome to implement.
FWIW -- it looks like they're about to release code for their models here
https://github.com/wy1iu/sphereface
I've tried to implement some of their ideas with some success, but not to the levels that they report in the paper yet. Excited to see their code to figure out where my implementation differs.
It's not particularly difficult to implement -- you basically just shrink the logit of the correct class label before applying the softmax. I'd guess it's easier than center loss.
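For anyone curious, here is a minimal sketch of that idea (hypothetical helper and names, TF1-style; a plain additive shrink for illustration, not SphereFace's exact multiplicative angular margin):

```python
import tensorflow as tf

def shrunk_logit_loss(logits, labels, num_classes, margin=0.35):
    # Shrink only the true-class logit by a fixed margin before the softmax.
    # (Simple additive penalty for illustration; SphereFace uses a multiplicative
    # angular margin instead.)
    one_hot = tf.one_hot(labels, depth=num_classes)      # [batch, num_classes]
    penalized = logits - margin * one_hot                # only the correct class shrinks
    return tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=one_hot, logits=penalized))
```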
@bkj Yeah, I noticed that.
However, I'd kinda like a TF version rather than Caffe.
Yeah but it will provide a reference implementation that we can work from -- w/o code that produces the results reported in a paper, it's usually pretty difficult to get those numbers (eg because of hyperparameters and other things that aren't mentioned in the paper)
Hi guys, look at this paper: L2-constrained Softmax Loss for Discriminative Face Verification (https://arxiv.org/pdf/1703.09507.pdf). It's a similar solution but easier to implement.
Yeah definitely easier to implement. It doesn't look like they do the same experiments as the Sphereface paper (train on CASIA, evaluate on LFW), so hard to compare head to head, but easy enough to implement and test.
AFAICT, there are various permutations of these losses that could be tried (rough sketch after this list):
- normalize weights: yes/no
- normalize features: yes/no
- margin penalty: yes/no (and if yes, in what form)
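A rough sketch of what those switches could look like in one place (hypothetical helper and names; treat it as pseudocode for the permutations rather than any single paper's exact loss):

```python
import tensorflow as tf

def logits_with_options(features, labels, num_classes,
                        normalize_weights=False, normalize_features=False,
                        scale=30.0, margin=0.0):
    # Hypothetical helper covering the yes/no switches listed above.
    W = tf.get_variable('softmax_W',
                        [features.get_shape()[-1].value, num_classes],
                        initializer=tf.truncated_normal_initializer(stddev=0.1))
    if normalize_weights:
        W = tf.nn.l2_normalize(W, 0)                          # unit-norm class weights
    if normalize_features:
        features = scale * tf.nn.l2_normalize(features, 1)    # L2-constrained features
    logits = tf.matmul(features, W)
    if margin > 0.0:
        logits -= margin * tf.one_hot(labels, num_classes)    # penalize the true class
    return logits
```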
In this paper they mention the SphereFace method and report better results.
Better results using 7x more training images -- 3.7M training images vs. 0.5M. The particularly notable thing IMO about SphereFace is they report good results (on both LFW and Megaface) using a relatively small training dataset.
L2-Softmax Loss was also trained on a 0.5M dataset (trained on MS-small instead of CASIA-Webface) and got 99.28% on LFW, which is lower than SphereFace's 99.42%.
In my opinion, L2-Softmax doesn't have much power to make intra-class variations smaller. That is why the performance can be further improved to 99.33% when used with center loss. I was curious whether the parameter m introduced in SphereFace can be added to L2-Softmax Loss, since it seems quite powerful in terms of intra-class variation reduction.
I trained this model with L2-softmax on the CASIA set and got lower results than with David's standard softmax loss. I put the normalization and scaling between the two bottlenecks, and used the alpha param suggested by the paper for big training datasets. Maybe the normalization should go after the prelogits layer?
I've never done an implementation in Tensorflow.
How can I implement A-softmax with tensorflow?
@aleksandar I put normalization and scaling with alpha=20 after bottleneck and got slightly worse results as well. (trained on casia)
@ugtony Me too. I use alpha=24, do the normalization after the prelogits just like the embeddings, and use the normalized features before the scale layer for testing. With softmax loss + center loss I can get around 98.5% on CASIA, but I'm stuck at 98% when doing L2-softmax + center loss. The authors use a batch size of 256 (2 GPUs though) for training, so maybe a larger batch size would help?
Related question: has anyone ever been able to replicate the > 99% LFW accuracy from CASIA training reported in the center loss paper, and implemented in this repo:
https://github.com/ydwen/caffe-face
I tried training the models and got accuracies around 0.98, and even using their pretrained model got only accuracies of 0.987.
@ugtony @JianbangZ I used alpha=40 and got 99.6% LFW accuracy on random test pairs. Do you maybe measure TPR @ FAR 10e-3 or similar?
@aleksandar Could you explain more precisely what you did to get those results (architecture, training set, etc)? Could you post your code on a fork somewhere? And is that 99.6 on the standard LFW benchmark set (6000 pairs)?
The training set is cleaned CASIA without the 3 subjects that overlap with the LFW set. The architecture is Inception-ResNet v1, but I generated new random test pairs from the LFW set. Yes, it's not comparable with other results :(
Could you run on the canonical set for comparison? Would be very interested.
validate_on_lfw results:
Accuracy: 0.964+-0.005
Validation rate: 0.74667+-0.03141 @ FAR=0.00133
Area Under Curve (AUC): 0.994
Equal Error Rate (EER): 0.038
Ah interesting -- so still not getting close to the results reported in the paper. Thanks for running that benchmark.
Is this a correct implementation of L2-Softmax? I calculate the softmax on the alpha-scaled embeddings instead of the prelogits.
```python
prelogits, _ = network.inference(images, args.keep_probability,
                                 phase_train=phase_train_placeholder,
                                 bottleneck_layer_size=args.embedding_size,
                                 weight_decay=args.weight_decay)
embeddings = tf.nn.l2_normalize(prelogits, 1, 1e-10, name='embeddings')
if args.l2_softmax_alpha > 0:
    prelogits = embeddings * args.l2_softmax_alpha
logits = slim.fully_connected(prelogits, len(train_set), activation_fn=None,
                              weights_initializer=tf.truncated_normal_initializer(stddev=0.1),
                              weights_regularizer=slim.l2_regularizer(args.weight_decay),
                              scope='Logits', reuse=False)
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=labels, logits=logits, name='cross_entropy')
```
@apollo-time Your version is quite similar to mine. I put l2_normalize after bottleneck layer instead. I think both versions are fine.
@ugtony Actually the network inference contains the bottleneck layer, so I think it is the same as yours.
```python
def inception_resnet_v2(inputs, is_training=True,
                        dropout_keep_prob=0.8,
                        bottleneck_layer_size=128,
                        reuse=None,
                        scope='InceptionResnetV2'):
    # ... (intermediate layers omitted) ...
    with tf.variable_scope('Logits'):
        end_points['PrePool'] = net
        #pylint: disable=no-member
        net = slim.avg_pool2d(net, net.get_shape()[1:3], padding='VALID',
                              scope='AvgPool_1a_8x8')
        net = slim.flatten(net)
        net = slim.dropout(net, dropout_keep_prob, is_training=is_training,
                           scope='Dropout')
        end_points['PreLogitsFlatten'] = net
        net = slim.fully_connected(net, bottleneck_layer_size, activation_fn=None,
                                   scope='Bottleneck', reuse=False)
    return net, end_points
```
I'm training this on MS-Celeb and I'll share the result. And please teach me how to implement A-softmax with tensorflow @ugtony .
@apollo-time I did the same.
I'm sharing the training result with L2-softmax on the MS-Celeb dataset.
Running forward pass on LFW images
Accuracy: 0.990+-0.005
Validation rate: 0.91100+-0.02621 @ FAR=0.00100
I think the result is worse than with plain softmax, which got accuracy 0.992. But I see the regularization loss is smaller. Does that mean L2-softmax generalizes better?
@apollo-time When you calculate the center loss, do you use the original prelogits or the prelogits after L2-softmax? Which network did you use?
@zhouhui1992 I calculate the center loss on l2 normalized embeddings, and I use inception_resnet_v1.
@apollo-time Have any of you tried training L2-SOFTMAX without center loss? What we are supposed to see according to the paper is that L2-softmax + center > softmax + center > softmax
@apollo-time Why did you use the L2-normalized embeddings to calculate the center loss rather than the original prelogits or the scaled embeddings? I use inception_resnet_v2; when I use L2-softmax to fine-tune my model trained on the MS-Celeb-1M dataset, the accuracy and val rate always drop. The best accuracy without L2-softmax is 0.994 and val_rate=0.9603@FAR=0.001. What is your best accuracy and val_rate on LFW?
@apollo-time @ugtony I think L2-softmax was applied after the first FC layer, which is high-dimensional (1792 etc.). Not sure if the reason we are not seeing improvements is because we are actually applying the normalization to the reduced-dimension bottleneck layer.
@JianbangZ I'm trying to train L2-softmax without bottleneck layer.
@apollo-time Please share the results when it's done. Thanks!
I'm sharing the result of training L2-softmax without the bottleneck layer.
Database: MS-Celeb
Model: inception_resnet_v1
Final LFW Accuracy: 99.25%, val-rate: 0.9781
I did not test with A-softmax yet because I don't know how to implement it with tensorflow. :-(
Hello everyone, I traced the code on the branch. What's the difference between angular_softmax_loss_decomp (I have no idea what this code means) and angular_softmax_loss (I think this is the SphereFace implementation)?
@jack55436001 I think that's some experimental code David wrote in a new branch. The master branch doesn't contain that code.
@JianbangZ Thanks for your answer^^
Hi everyone, I used A-softmax, but I cannot reproduce the experimental results from the SphereFace paper. My result:
Database: MS-Celeb
Model: inception_resnet_v1
LFW Accuracy: 99.4%
Rank 1 on MegaFace: 60.0%
But in the paper of sphereface, the Rank 1 of MegaFace is 72.7%.
@zhengge What does your SphereFace implementation look like?
@zhengge could you share your code?
SphereFace is hard to implement in tensorflow because you can't do element-wise assignment in tensorflow. I think a better way is to implement it with tf.py_func, or to define your own layer.
@zhengge Hi, how do I define a new scoring model when I evaluate my model on MegaFace? SphereFace uses cosine similarity to measure the distance, but I don't know how to change the scoring model. Thanks!
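In case it helps, scoring with cosine similarity is just a dot product of L2-normalized embeddings; here is a small NumPy sketch (hypothetical helper name):

```python
import numpy as np

def cosine_score(emb1, emb2):
    # Cosine similarity between two embedding vectors; higher = more similar.
    emb1 = emb1 / np.linalg.norm(emb1)
    emb2 = emb2 / np.linalg.norm(emb2)
    return float(np.dot(emb1, emb2))

# Note: on L2-normalized embeddings, squared Euclidean distance is
# ||a - b||^2 = 2 - 2*cos(a, b), so distance-based rankings (e.g. MegaFace rank-1)
# come out the same as cosine-based rankings.
```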
@aleksandar @apollo-time
For the L2-norm method, the authors want to emphasize the hard samples by putting all samples (embeddings) at the same norm. If that's the case, I think setting the batch size very small, like 1, would have the same effect, since each sample, including the hard samples, would also be emphasized??
@zhengge How do you implement the A-Softmax loss in tensorflow? Would you like to share your SphereFace source code? Thanks! Good luck!
I put up an implementation in TF. Comments & suggestions are highly welcomed. https://github.com/pppoe/tensorflow-sphereface-asoftmax
I implemented A-softmax in TensorFlow. It's not hard: you just need to implement the fully connected layer before the softmax by normalizing W and x before applying tf.matmul, which gives you cos<Wi,x>.
If you know the COCO loss, you will find you can easily implement it this way as well. I first reproduced the COCO loss and found that I can use just the one-hot label to implement SphereFace.
You can use the multiple-angle formula to get cos(m<Wi,x>). Afterwards, you convert the labels of a batch to one-hot with one_hot_label = tf.one_hot(label_batch), then use tf.multiply(tf.matmul(x, W), one_hot_label) to element-wise select the label-corresponding entries of the logits, so you get the output at the label location in equation (6).
In this way, you insert cos(m<Wi,x>) into the label-corresponding location of the softmax output by multiplying with the one-hot label and a simple addition, so it's easy.
Sorry, I cannot give the source code because of my company's rights, but I think you can easily implement it yourself with just the one-hot selection and the multiple-angle formula.
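To make the description above concrete, here is a rough, hedged sketch of the one-hot trick for m=2 (hypothetical helper and names; it omits the psi(theta) sign/monotonicity correction that the SphereFace paper applies to cos(m*theta)):

```python
import tensorflow as tf

def asoftmax_logits(features, label_batch, num_classes, scale_by_norm=True):
    # Normalize W (and x) so the matmul gives cos<Wi, x> directly.
    W = tf.get_variable('asoftmax_W',
                        [features.get_shape()[-1].value, num_classes],
                        initializer=tf.truncated_normal_initializer(stddev=0.1))
    W_norm = tf.nn.l2_normalize(W, 0)
    x_norm = tf.norm(features, axis=1, keep_dims=True)             # |x|, shape [batch, 1]
    cos_theta = tf.matmul(tf.nn.l2_normalize(features, 1), W_norm)  # cos<Wi, x>
    # Double-angle formula: cos(2*theta) = 2*cos(theta)^2 - 1  (the m=2 case).
    cos_m_theta = 2.0 * tf.square(cos_theta) - 1.0
    # One-hot select: keep cos(theta) for the wrong classes, insert cos(m*theta)
    # at the label-corresponding location, as described above.
    one_hot = tf.one_hot(label_batch, num_classes)
    combined = cos_theta * (1.0 - one_hot) + cos_m_theta * one_hot
    return combined * x_norm if scale_by_norm else combined
```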
I have implemented cosine face and the latest ArcFace. The cosine loss converges more easily and gives a much stronger result than center loss.
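For reference, a hedged sketch of the general form of those margins (hypothetical helper; not necessarily the exact formulation used here): CosFace/AM-Softmax uses s*(cos(theta) - m) and ArcFace uses s*cos(theta + m), both applied only at the true-class position.

```python
import tensorflow as tf

def cos_margin_logits(features, label_batch, num_classes, s=30.0, m=0.35, arc=False):
    # Normalized weights and features give cosine logits.
    W = tf.get_variable('margin_W',
                        [features.get_shape()[-1].value, num_classes],
                        initializer=tf.truncated_normal_initializer(stddev=0.1))
    cos_t = tf.matmul(tf.nn.l2_normalize(features, 1), tf.nn.l2_normalize(W, 0))
    if arc:
        theta = tf.acos(tf.clip_by_value(cos_t, -1.0 + 1e-7, 1.0 - 1e-7))
        target = tf.cos(theta + m)     # ArcFace-style additive angular margin
    else:
        target = cos_t - m             # CosFace/AM-Softmax-style additive cosine margin
    one_hot = tf.one_hot(label_batch, num_classes)
    return s * (one_hot * target + (1.0 - one_hot) * cos_t)
```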
@JianbangZ agree with you, cosine performs much better in real data set test.
Hello, should I re-train the model from scratch? How long would it take if I train it with two 1080 Tis on vgg2?
> @JianbangZ agree with you, cosine performs much better in real data set test.
Really? I found MTCNN would stretch the extracted faces, so people who differ mainly in head shape become indistinguishable. Which face alignment method did you use?
Hi @davidsandberg: The recent paper SphereFace: Deep Hypersphere Embedding for Face Recognition introduces a novel loss for the face recognition training procedure, and it gets a very decent result on MegaFace and LFW (99.42% according to the authors) by training only on CASIA-WebFace. I tried to add this loss to the facenet code, but I failed to translate it from the paper into tensorflow code, since it seems much more complicated than the center loss. Can you or anyone else try this? It could be an exciting breakthrough.