AIChallenger / AI_Challenger_2017

AI Challenger, a platform offering open datasets and programming competitions for artificial intelligence (AI) talents around the world.
https://challenger.ai/

key point eval error #22

Closed zhaishengfu closed 7 years ago

zhaishengfu commented 7 years ago

Hello, I have looked at your keypoint evaluation file, `keypoint_eval.py`, and I think the method has a problem. If an image contains 2 people but your ground truth annotates only 1, and my model predicts both, the result will be bad, because of this code:

```
oks_all = np.concatenate((oks_all, np.max(oks, axis=0)), axis=0)
oks_num += np.max(oks.shape)
```

I think this should be changed to:

```
oks_all = np.concatenate((oks_all, np.max(oks, axis=1)), axis=0)
oks_num += np.min(oks.shape)
```
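The difference between the two versions can be sketched with a toy OKS matrix. This is a hypothetical illustration (the values and the layout, rows = ground-truth annotations and columns = predictions, are assumed for the sake of the example):

```python
import numpy as np

# One person annotated, two people predicted; the second predicted
# person has no matching label, so their best OKS is low.
oks = np.array([[0.9, 0.2]])  # shape (1, 2): 1 ground truth x 2 predictions

# Current script: best OKS per PREDICTION, denominator = larger dimension.
scores_current = np.max(oks, axis=0)   # [0.9, 0.2] -> unlabeled person hurts
num_current = np.max(oks.shape)        # 2
print(scores_current.sum() / num_current)    # 0.55

# Proposed change: best OKS per ANNOTATION, denominator = smaller dimension.
scores_proposed = np.max(oks, axis=1)  # [0.9]
num_proposed = np.min(oks.shape)       # 1
print(scores_proposed.sum() / num_proposed)  # 0.9
```

Under the current rule the correctly detected extra person halves the image score; under the proposal, only annotated people are scored.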

AIChallenger commented 7 years ago

@zhaishengfu Thanks for posting the issue. We did in fact consider using `oks_num += np.min(oks.shape)`, but decided against it because it makes the evaluation unfair. For example, one way to game it is to predict ONLY ONE person per image, the one with the highest confidence score. In that case `oks_num` is always 1, so the final score equals that single highest-confidence OKS instead of the average over all people in the image.
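The single-prediction cheat the organizers describe can be sketched with another toy OKS matrix (values and layout, rows = ground-truth annotations and columns = predictions, are assumed for illustration):

```python
import numpy as np

# Three people annotated; a cheater predicts only the easiest one.
oks = np.array([[0.95],
                [0.10],
                [0.05]])  # shape (3, 1): 3 ground truths x 1 prediction

# Under the proposed rule (oks_num = min(shape) = 1), the two missed
# people cost nothing: the image score is just the one confident OKS.
proposed = np.max(oks, axis=0).sum() / np.min(oks.shape)
print(proposed)  # 0.95

# Under the current rule (oks_num = max(shape) = 3), missing people
# dilutes the score, so the cheat no longer pays off.
current = np.max(oks, axis=0).sum() / np.max(oks.shape)
print(current)   # ~0.317
```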

Again, thanks for the advice. Good luck!

zhaishengfu commented 7 years ago

I understand your point, but this is also unfair to some models. For example, in the following image: 0a00c0b5493774b3de2cf439c84702dd839af9a2 your ground truth annotates only 1 person, but my model can predict the other person as well, which is better than the given label, isn't it? Yet the result is worse than for a model that can only predict 1 person. So the best score does not mean the best model, and vice versa. I think this is really bad for your competition, because I believe this case is common in your dataset! As for the case you mentioned, I think it should be handled as follows: for each annotation in the ground truth, look through all the predictions and find the best OKS score, and fix `oks_num` at the number of annotations. For example, if an image's ground truth has one joints array A, and I predict A1 and B1, then you compare A against both A1 and B1, take the best OKS of the two as the similarity result, and fix `oks_num` at 1. I think this method overcomes the case you mentioned and still picks the genuinely good model. Thanks, and looking forward to your reply!

AIChallenger commented 7 years ago

@zhaishengfu Again, we did consider the case you just described. If we set `oks_num` equal to the number of annotated human bodies and took the best OKS score from the submission, that would create a different dilemma: a participant could submit many alternative predictions for one human body simultaneously, because the evaluation script would always pick the best one and the rest would have no negative impact on the mAP at all.

To prevent both cheating cases (1 prediction per image, or too many predictions per human body), we carefully picked the current evaluation metric, where `oks_num += np.max(oks.shape)`.
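The second cheating case, many guesses per person, can be sketched the same way (a hypothetical illustration, again assuming rows = ground-truth annotations and columns = predictions):

```python
import numpy as np

# One person annotated; a cheater submits 5 alternative guesses for them.
oks = np.array([[0.3, 0.5, 0.9, 0.4, 0.6]])  # shape (1, 5)

# If the metric took the best match per annotation with oks_num = min(shape),
# spamming guesses would be free: only the best guess would ever count.
best_only = np.max(oks, axis=1).sum() / np.min(oks.shape)
print(best_only)   # 0.9

# With oks_num = max(shape), every extra guess inflates the denominator,
# so submitting many alternatives is penalized.
penalized = np.max(oks, axis=1).sum() / np.max(oks.shape)
print(penalized)   # 0.18
```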

Thanks. Good luck!

laoxihongshi commented 7 years ago

I was just about to ask this question as well. Honestly, most participants in this competition are Chinese, so why are two Chinese people discussing it in English? What I would rather point out on the organizers' behalf is that the reluctance to change this comes down to laziness. And the impact is genuinely large: on the validation set it can make a difference of ten percent! From one angle, releasing the visible bounding boxes, so that we can compute IOU and remove the extra people ourselves, might be fairer. Thanks.

zhaishengfu commented 7 years ago

The official explanation is about preventing cheating. Releasing the boxes would lower the difficulty a lot. I do understand their explanation, and it makes sense, but it still does not solve the problem I raised: if the given label is inaccurate while the model is more accurate, the result actually gets worse. I don't think your suggestion is very feasible; ideally we would solve both of the cheating cases they mentioned and the problem I raised. I'll post again if I think of something.

AIChallenger commented 7 years ago

@zhaishengfu Hello, our test dataset has gone through multiple rounds of manual review by a professional data-annotation team, and the annotation quality meets the high standards commonly used in the industry. So there is no need to worry about the case where "the given label is inaccurate while the model is more accurate".

Thank you for supporting AI Challenger. We wish you good results!

foolwood commented 7 years ago

@AIChallenger Many thanks to the organizing committee for providing the data and this platform for discussion. I hope the committee will seriously reconsider the evaluation metric.

Please change the following line in the evaluation code:

```
oks_all = np.concatenate((oks_all, np.max(oks, axis=0)), axis=0)
```

to:

```
oks_all = np.concatenate((oks_all, np.max(oks, axis=1)), axis=0)
```

The reason: with `np.max(oks, axis=0)`, the committee is computing the metric per prediction, i.e. on a precision basis, and that leaves a bug. Since the metric is precision-based, I could predict just one target (suppose that prediction matches a label exactly) and submit the result copied 1,000,000 times. Then, per the committee's evaluation code, and assuming the real number of labels is on the order of 100,000, the resulting mean OKS is (1,000,000 × 1 + 99,999 × 0) / 1,099,999 ≈ 0.9.

The problem is that duplicate matches are not eliminated: one labeled target can be matched by many predictions. With `np.max(oks, axis=1)`, each prediction can match only one of the committee's labels (one prediction could still match two labels, but when that happens the OKS is already very low, so it doesn't matter).

Please fix this as soon as possible, so that it does not affect the first bi-weekly contest. Many thanks.

The committee could also test this on the back end first; whether the loophole works will be obvious at a glance.
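The duplicate-submission exploit described above can be reproduced in miniature (a hypothetical sketch, assuming rows = ground-truth annotations and columns = predictions):

```python
import numpy as np

# Two people annotated; a cheater submits one perfect prediction for
# person A copied 4 times and ignores person B entirely.
oks = np.array([
    [1.0, 1.0, 1.0, 1.0],   # person A: every copy matches perfectly
    [0.0, 0.0, 0.0, 0.0],   # person B: never predicted
])

# np.max(oks, axis=0) scores each PREDICTION: all duplicates count,
# and the missed person B never appears in oks_all at all.
per_prediction = np.max(oks, axis=0)
print(per_prediction.mean())   # 1.0 -- copying inflates the metric

# np.max(oks, axis=1) scores each ANNOTATION: the duplicates collapse
# to a single match, and the missed person drags the score down.
per_annotation = np.max(oks, axis=1)
print(per_annotation.mean())   # 0.5
```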

foolwood commented 7 years ago

@AIChallenger I suggest the committee refer to the COCO evaluation code.

foolwood commented 7 years ago

@AIChallenger Another reminder to the committee to fix this bug.

foolwood commented 7 years ago

@AIChallenger Committee, please take this bug fix seriously.

tensorboy commented 7 years ago

I agree with @foolwood!

foolwood commented 7 years ago

@AIChallenger @tensorboy It looks like the committee isn't going to fix this bug today; the first bi-weekly contest is going to be quite exciting.

AIChallenger commented 7 years ago

@foolwood Thank you very much for pointing out the oversight in the code; we have fixed the problem and updated the evaluation script on GitHub as well. Thanks again for your great help and support, and we wish you good results in the competition!