deepinsight / insightface

State-of-the-art 2D and 3D Face Analysis Project
https://insightface.ai

iBUG_DeepInsight #49

Closed · d4nst closed this 6 years ago

d4nst commented 6 years ago

I have seen that the current top algorithm in the MegaFace challenge is iBug_DeepInsight, with an accuracy that corresponds to your latest update: "2018.02.13: We achieved state-of-the-art performance on MegaFace-Challenge-1, at 98.06%."

After reading your paper and the README in this repo, it seems to me that this accuracy is achieved using the cleaned/refined MegaFace dataset. Is this correct?

nttstar commented 6 years ago

Right.

d4nst commented 6 years ago

In that case, I don't think it's fair to publish those results on the public results page of the MegaFace challenge. As far as I know, it's not allowed to modify the evaluation set when reporting results, as this would obviously make it impossible to compare the accuracy of different algorithms. I suggest that you report the results obtained on the original dataset instead.

nttstar commented 6 years ago

I cannot agree with you. First, it is impossible to guarantee that any submitted result is fair unless it has a published paper and open-source code; I could name dozens of ways to cheat. Second, we followed all the rules of the MegaFace challenges but corrected the errors they made in the distractor images. What we report is the real performance of every face recognition algorithm we experimented with. As we showed in our paper, the accuracy can be essentially random if we do not remove these distractor noises, which is what actually makes it unfair to compare the performance of two models/algorithms. Last, we made everything clear and open source, compared against all published state-of-the-art algorithms, and demonstrated that our approach performs best.

ghost commented 6 years ago

It's worth mentioning that, to be fair, you should publish this cleaned data so that other researchers can validate it.

d4nst commented 6 years ago

Don't get me wrong, I think it's great that you made everything clear and open source, and as a fellow researcher I thank you for that!

As you say, it's very easy to cheat in these types of challenges. However, the assumption is that all teams work with the same test set and do not tamper with the submitted results. Otherwise, as I said before, it would be impossible to compare accuracy between algorithms. I agree with you that it is best to test on a clean dataset, and your paper clearly shows how the noise affects the accuracy of the algorithm. My problem with your submission is that nowhere on the MegaFace website does it say that you used a cleaned test set. Nor is there a link to this repo or to your paper, where this information is provided.

In my opinion, you should let the organisers know about this so they can decide what to do. I think adding a note under the "Method details" section on the MegaFace website could work as well.

ghost commented 6 years ago

I think this affects the companies and the business of face technologies, which cost us a huge budget; to be fair, I believe you should delete this repository.

d4nst commented 6 years ago

@MartinDione I'm not sure what you are talking about, but that has nothing to do with this discussion.

ghost commented 6 years ago

I'm talking about the MegaFace objective of advancing facial recognition. When you publish code like this, you affect other companies, like Vocord, who invested a lot of money in development to achieve state-of-the-art performance on MegaFace.

d4nst commented 6 years ago

I won't even bother replying to that... Again, that has nothing to do with the issue we are discussing here.

nttstar commented 6 years ago

We will put the noise list in this repo soon. But please read the notes carefully when it's up.

nttstar commented 6 years ago

@MartinDione It is unbelievable that Vocord would spend a lot of money on a public competition like MegaFace.

ivazhu commented 6 years ago

@MartinDione Vocord didn't spend it :) @nttstar And what have you done with the errors in the FaceScrub subset, which is used in the MegaFace Challenge?

nttstar commented 6 years ago

It was described in our paper.

ivazhu commented 6 years ago

@nttstar I have read your article and now I absolutely disagree that you are playing a fair game. First of all, you changed the test dataset: you did not merely delete the wrong samples, you replaced the "wrong" FaceScrub images with other images! Moreover, you write that additional artificial features were added to your feature vectors to flag the "bad" images in the distractor dataset. That is the same as using manual annotation! The MegaFace rules forbid both. I would also like to mention that your method of "cleaning" the dataset produces a dataset that is merely "clean for your algorithm". I am sure the community will find mistakes in your "error" list when you publish it.

P.S. And have you considered that the MegaFace team has a clean correspondence list? If they recompute the results with it, your team will take last place, because images of different people will give you a very high FAR (this concerns your image replacement).

nttstar commented 6 years ago

@ivazhu I'm confused by the words 'deleted' and 'changed' in your comment. In any case, what you said is almost the same as d4nst's point, and I don't want to restate my position. The MegaFace team checked our list and solution; otherwise the result would not be on the leaderboard.

ivazhu commented 6 years ago

@nttstar From your article: "During testing, we change the noisy face to another right face" and "During testing, we add one additional feature dimension to distinguish these noisy faces". From the MegaFace Challenge instructions: "1. Download MegaFace and FaceScrub datasets and development kit. 2. Run your algorithm to produce features for both datasets."

In your article you state that you are not using the provided dataset and that you are augmenting features with manual labels. It is obvious to everyone that you broke the MegaFace rules.

nttstar commented 6 years ago

@ivazhu How can you achieve 91% without removing these noises? It's beyond my imagination.

chichan01 commented 6 years ago

@nttstar I have read your paper and also your noise list and code under https://github.com/deepinsight/insightface/tree/master/src/megaface as well. I am a bit confused.

In your text: "We manually clean the FaceScrub dataset and finally find 605 noisy face images. During testing, we change the noisy face to another right face, which can increase the identification accuracy by about 1%. In Figure 6(b), we give the noisy face image examples from the MegaFace distractors. All of the four face images from the MegaFace distractors are Alec Baldwin. We manually clean the MegaFace distractors and finally find 707 noisy face images. During testing, we add one additional feature dimension to distinguish these noisy faces, which can increase the identification accuracy by about 15%."

In your noise lists, megaface_noises.txt has 719 noisy face images and the FaceScrub list has 605. In remove_noises.py, the noisy FaceScrub feature is replaced by the subject's class centre plus random uniform noise. Do you really need the random noise there? Why?


Your code for removing noise from the FaceScrub set:

```python
# The noisy FaceScrub feature is replaced by the class centre plus tiny uniform noise.
center = fname2center[a]
g = np.zeros((feature_dim+feature_ext,), dtype=np.float32)
g2 = np.random.uniform(-0.001, 0.001, (feature_dim,))
g[0:feature_dim] = g2
f = center + g
_norm = np.linalg.norm(f)
f /= _norm  # L2-normalise the perturbed centre
feature_path_out = os.path.join(args.facescrub_feature_dirout, a, "%s%s.bin" % (b, out_algo))
write_bin(feature_path_out, f)
```

However, for the MegaFace set, I don't know what you do. On first reading it seems that you fill the feature with 100 for those noisy images, but after reading your load_bin function, that is not the case, since you overwrite those 100-filled entries with the original feature extracted from the noisy image.

Your code handling noise in the MegaFace set:

```python
feature = load_bin(feature_path, 100.0)
write_bin(feature_path_out, feature)
```

and the load_bin function:

```python
def load_bin(path, fill=0.0):
    with open(path, 'rb') as f:
        bb = f.read(4*4)
        #print(len(bb))
        v = struct.unpack('4i', bb)
        #print(v[0])
        bb = f.read(v[0]*4)
        v = struct.unpack("%df" % (v[0]), bb)
        feature = np.full((feature_dim+feature_ext,), fill, dtype=np.float32)
        feature[0:feature_dim] = v
        #feature = np.array(v, dtype=np.float32)
        #print(feature.shape)
        #print(np.linalg.norm(feature))
        return feature
```

  1. @ivazhu @nttstar Is there something I have misunderstood? Please advise. (It seems that your code is not what you describe in the paper!)

  2. Did you use your code and lists to reproduce your MegaFace result, or is this a typo? If you have updated your code or lists, would you tell us your updated MegaFace result? Please verify it with your pretrained model.

ivazhu commented 6 years ago

@nttstar First of all: WE DIDN'T CHANGE THE DATASET, as you did. There are some secrets :) For instance, think about what to do when you see more than one face in an "error" distractor image.

And also, as I promised, take a look at Alley_Mills_52029, Lindsay_Hartley_33188, Michael_Landes_43643, ... These are not ERRORS; they are errors of your algorithm. In your "work" you simply deleted all the samples on which your algorithm did not work correctly.

Any more questions?

chichan01 commented 6 years ago

@ivazhu, what do you mean by "more than one face on an 'error' distractor image"? In verification, it is pair matching; also, you do not know whether an image comes from the distractor set or the gallery (FaceScrub).

ivazhu commented 6 years ago

There can be more than one face in an image.

chichan01 commented 6 years ago

Do you mean that there can be more than one face in an image, whether it is in the FaceScrub subset or the distractor set? So you do not use their provided JSON as a landmark reference in your case?

ivazhu commented 6 years ago

We didn't do anything with the FaceScrub dataset because there is no FAIR way to correct that type of error. As for the distractors: yes, there are some samples with more than one face, one face from FaceScrub and one other.

About the MegaFace JSON: read the MegaFace docs. You should use the MegaFace JSON only in the case where you can't detect a face.

chichan01 commented 6 years ago

I see. I think this is debatable. Originally, I thought the JSON files were there to tell participants where the face is in an image, so that the algorithm computes the similarity score of those faces (not images). Perhaps some face locations in the JSON files are incorrect, but they are the ground truth of this challenge, and therefore we should base our scores on them, errors included.

Anyway, your trick may also not be sound, since you only apply it to the MegaFace dataset, which means any image with more than one face can be treated as a distractor. In other words, you already know the side information that one image of your pair is a distractor, so you can do whatever it takes to get a good score for a mismatched pair. I think this also violates the verification protocol, as the algorithm should produce a score for a test pair without any side information. Finally, is this what you did in your submitted work on MegaFace?

nttstar commented 6 years ago

@ivazhu Choosing one face from among multiple faces in an image according to your own knowledge is also a data trick. It is the same as cleaning data noise.

nttstar commented 6 years ago

@chichan01 Adding random noise to the centre vectors avoids identical feature vectors. The result would not change if no noise were applied.
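
A small illustration of the point (not the repo's code; the helper below is my own sketch of the replacement step):

```python
import numpy as np

feature_dim = 512
rng = np.random.default_rng(0)
center = rng.normal(size=feature_dim).astype(np.float32)  # class centre of one identity

def replaced_feature(center, jitter=True):
    # Mirrors the remove_noises.py idea: centre plus tiny uniform noise, L2-normalised.
    f = center.copy()
    if jitter:
        f = f + rng.uniform(-0.001, 0.001, center.shape).astype(np.float32)
    return f / np.linalg.norm(f)

a = replaced_feature(center)
b = replaced_feature(center)
print(np.array_equal(a, b))  # False: the jitter keeps the two vectors distinct
print(float(a @ b))          # ~1.0: similarity scores are effectively unchanged
```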

chichan01 commented 6 years ago

@nttstar So would you tell me what it is you do on the distractor set?

nttstar commented 6 years ago

@chichan01 What do you mean by 'want to do'?

chichan01 commented 6 years ago

I do not understand how you handle the noise in the distractor (MegaFace) set. Please read what I posted: for the MegaFace set, I don't know what you do there. On first reading it seems that you fill the feature with 100 for those noisy images, but after reading your load_bin function, that is not the case, since you overwrite those 100-filled entries with the original feature extracted from the noisy image. Your code handling noise in the MegaFace set:

```python
feature = load_bin(feature_path, 100.0)
write_bin(feature_path_out, feature)
```

and the load_bin function:

```python
def load_bin(path, fill=0.0):
    with open(path, 'rb') as f:
        bb = f.read(4*4)
        v = struct.unpack('4i', bb)
        bb = f.read(v[0]*4)
        v = struct.unpack("%df" % (v[0]), bb)
        feature = np.full((feature_dim+feature_ext,), fill, dtype=np.float32)
        feature[0:feature_dim] = v
        return feature
```

nttstar commented 6 years ago

Maybe you're missing something... The solution is to add one more dimension with a large value (here 100.0) to those noisy distractor images, so that the L2 distance from the 'noisy distractors' to all FaceScrub images is large enough.

```python
bb = f.read(v[0]*4)
v = struct.unpack("%df" % (v[0]), bb)  # v is the original 512-d feature
feature = np.full((feature_dim+feature_ext,), fill, dtype=np.float32)
feature[0:feature_dim] = v             # the extra dimension keeps `fill` (0.0 or 100.0)
```

Here v is the original feature; shape(feature) = (513,) with feature_dim = 512.
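
A minimal numeric sketch of why this works (illustrative only; random vectors stand in for real embeddings, and the variable names are mine):

```python
import numpy as np

feature_dim, feature_ext = 512, 1
rng = np.random.default_rng(0)

def unit(v):
    return v / np.linalg.norm(v)

# Probe (FaceScrub) feature: the extra dimension stays at 0.0.
probe = np.zeros(feature_dim + feature_ext, dtype=np.float32)
probe[:feature_dim] = unit(rng.normal(size=feature_dim))

# A distractor feature, before and after being flagged as noisy.
clean = np.zeros(feature_dim + feature_ext, dtype=np.float32)
clean[:feature_dim] = unit(rng.normal(size=feature_dim))
noisy = clean.copy()
noisy[feature_dim] = 100.0

print(np.linalg.norm(probe - clean))  # at most 2.0 between unit vectors
print(np.linalg.norm(probe - noisy))  # ~100: can never outrank a genuine match
```

Since unit-normalised 512-d features are at most 2.0 apart in L2 distance, the extra 100.0 guarantees every flagged distractor loses to any real candidate.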

chichan01 commented 6 years ago

Thanks. So your feature dimension is 513, and the last dimension is 0 or 100 depending on whether or not it is a noisy image.

nttstar commented 6 years ago

Right.

chichan01 commented 6 years ago

@nttstar Since your code and the number of images in your lists do not match your paper, did you update the result for your pretrained model?

nttstar commented 6 years ago

There are duplicated items in the original MegaFace noises file (megaface_image_path -> FaceScrub_identity_name): for example, one MegaFace image listed under two FaceScrub names. But we're sure each belongs to at least one FaceScrub identity.

chichan01 commented 6 years ago

@nttstar What do you mean?

- Your paper: 707 noisy face images in MegaFace. Your list: 719 noisy face images.
- Your paper: "We manually clean the FaceScrub dataset and finally find 605 noisy face images. During testing, we change the noisy face to another right face." Your code: adds random noise to the centre vectors.

So you can see the differences; I believe your result will be different.

nttstar commented 6 years ago

Do `sort -u` on the MegaFace noise list.
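
For anyone checking the count, a quick sketch (assuming, per the comment above, each line pairs a MegaFace image path with a FaceScrub identity name):

```python
# Count unique MegaFace image paths; one image listed under two FaceScrub
# names would explain the 719 vs 707 gap.
unique_paths = set()
with open("megaface_noises.txt") as f:
    for line in f:
        if line.strip():
            unique_paths.add(line.split()[0])
print(len(unique_paths))
```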

chichan01 commented 6 years ago

Thanks. Although I do not agree with cleaning the test dataset, I appreciate that you released the code publicly, so that others can reproduce your results using your code and lists.

ivazhu commented 6 years ago

We did not choose according to "knowledge", as you put it. We choose the less similar face in the image, so the process is automatic, with no manual annotation.

Jia, and what about the errors in your "error" list? Are you going to accept that your results are not fair?

nttstar commented 6 years ago

@ivazhu What do you mean by 'less similar from the image'?

chichan01 commented 6 years ago

@ivazhu Well, as I mentioned, you only work on the distractor set, and you use face detection to find those faces. I can imagine that you could take the detected face with the lowest confidence score to represent an image, extract a feature from it, and compute the so-called "less similar from the image" score. In fact, it could be a non-face region. Say you had a very poor face detection algorithm: you could easily achieve a good mismatch score with this trick. Please remember that you do not do this on FaceScrub.

In the extreme case, this is equivalent to assigning every non-match pair the lowest similarity score, and then beating the iBUG result would be easy. Certainly, I do not think you did that, but it is a possible outcome. Also, as I mentioned, your trick violates the rules of the challenge as well.

ivazhu commented 6 years ago

No, you are not right. The detector for the FaceScrub set and the distractor set is the same. I am speaking about the recognition score. Imagine that you have two faces of different people in an image and you know that one of them must be a distractor for the FaceScrub dataset; it is not very difficult to make a choice in this situation :)

nttstar commented 6 years ago

@ivazhu I could get 100% on the MegaFace challenge using your approach: detect dozens of faces in each MegaFace distractor image and choose the one with the lowest similarity to the FaceScrub dataset, even if it is sometimes not a real face. You can't use knowledge of the FaceScrub dataset when you're processing MegaFace. I'm very surprised that you and Vocord consider this a normal procedure.

ivazhu commented 6 years ago

I wonder, then, how images from FaceScrub themselves would be compared!

P.S. It is very funny to hear about 100% from the man who added a dimension with labels to his features.

d4nst commented 6 years ago

I think this whole discussion proves my point. Both of your teams (DeepInsight and Vocord) have not strictly followed the MegaFace protocol, so it is pointless to compare the performance of your algorithms with that of the rest of the participants.

ivazhu commented 6 years ago

Daniel, please show me where I didn't follow the protocol.

d4nst commented 6 years ago

@nttstar and @chichan01 have already explained this, but I'll try to make it clearer...

As I understand it, this is what you do: in the FaceScrub set, you always crop the correct face (using the provided landmarks or bounding box as a reference). In the MegaFace distractor set, you crop all the detected faces, compare them against the probe face, and select the lowest score as the "valid" score.

The problem with your approach is that you are using your knowledge about the origin of the image (probe set or distractor set) to make a decision. You know that a probe and a distractor image shouldn't match, so you just take the lowest score. As others have pointed out, you could use a poor face detector that doesn't even detect faces and this approach would give you a very high accuracy.

If you really wanted to take the lowest score, you should do it all the time, not just for distractor images, i.e. when you add the matching face from the probe set to the distractor set, you should also compare against all the detected faces and take the lowest score. If you do that, your performance will probably be much worse.

Lastly, just think about a real identification system in which you don't know anything about the origin of the faces. I'm sure that you would agree with me that always selecting the lowest score from all the detected faces would be a very poor design.
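
Purely as an illustration of that asymmetry (random vectors standing in for real embeddings; this is not anyone's actual pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

def unit(v):
    return v / np.linalg.norm(v)

probe = unit(rng.normal(size=512))
# A distractor image with two detected faces: one happens to resemble the probe.
faces = [unit(probe + 0.5 * rng.normal(size=512)),  # looks like the probe
         unit(rng.normal(size=512))]                # unrelated face

blind = max(float(probe @ f) for f in faces)  # origin-blind rule: best face wins
trick = min(float(probe @ f) for f in faces)  # uses the knowledge "this is a distractor"
print(f"origin-blind score: {blind:.3f}, min-score trick: {trick:.3f}")
```

The trick drives every distractor's score toward its least similar face, so rank-1 identification looks far better than any origin-blind system could achieve.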

Please let me know if my assumptions about your approach are wrong.

ivazhu commented 6 years ago

Not quite. First of all, I tune the detector settings so that there are no failed detections in the FaceScrub dataset. That lets me hope there are no such detections in the MegaFace dataset either; to tell the truth, I checked manually, and there are no failed detections in MegaFace with those settings. So I just choose the real distractor from two real faces.

There is another, more important problem: there are images without faces in the MegaFace dataset. Nevertheless, we had to compute "features" from the JSON coordinates. These "features" are something completely different from face features. I'd like to mention that the better the trained algorithm, the worse its results on such samples.

P.S. Daniel, we are not talking about a real identification system here. Those have very different problems :)

chichan01 commented 6 years ago

@nttstar I have just found that you also released a result on the FGNet MegaFace challenge. I have some questions about that result:

  1. Do you use only the same training set (i.e. your provided MS-1M-celeb) to train the deep network tested on both the FaceScrub and FGNet MegaFace challenges?
  2. If yes, that means I can verify your pretrained model on both challenges. Would you mind telling me which released pretrained model you used for testing on FGNet?
  3. Did you also clean that test dataset? If yes, would you mind releasing the list?
  4. Will you update your paper to include the FGNet results?

chichan01 commented 6 years ago

@ivazhu Would you mind explaining what the different problems are between the MegaFace challenge and real identification and verification?

My point, with which @d4nst and @nttstar agree, is that you treat only the distractor set specially, which is not the case in a real scenario, where we have no side information about the image pair. I agree that @nttstar's result is not comparable with other work because other participants did not do the same, but they released the list, code, and pretrained models, so we can regard them as proposing a new protocol and can verify their work. Former and future participants can therefore do a bit of extra work to follow this new protocol and produce comparable results if they want to.

On the other hand, your work will be much more difficult to follow and reproduce, as you do not release anything. Luckily, you showed up here, so I understand a bit of your work. Most importantly, your proposed trick violates the fundamental principle of biometric verification and identification.

happynear commented 6 years ago

It is common knowledge in the face recognition community that absolute performance on MegaFace is meaningless. Lots of cheating tricks can be applied to achieve very high scores, and MegaFace has no mechanism to prevent them.

Only relative scores can be trusted, which means one can only compare against oneself on MegaFace. As in this issue: models evaluated on the cleaned list should only be compared with models evaluated on the same list. The authors have these experiments in their paper. That is already enough, not to mention that they released their code and will release the cleaned list.

The only problem is that the official MegaFace organizers should have made two leaderboards once they became aware of the "cleaned list". However, they were not willing to do so and chose to put everything on the same leaderboard. That is the problem. The authors of InsightFace did nothing wrong.