deepinsight / insightface

State-of-the-art 2D and 3D Face Analysis Project
https://insightface.ai

Asian training dataset(from glint) discussion. #256

Closed nttstar closed 1 year ago

nttstar commented 6 years ago
  1. Download the dataset from http://trillionpairs.deepglint.com/data (after signup). msra is a cleaned subset of MS1M from Glint, while celebrity is the Asian dataset.
  2. Generate the lst file by calling src/data/glint2lst.py. For example:

    python glint2lst.py /data/glint_data msra,celebrity > glint.lst

or generate the Asian dataset only by:

    python glint2lst.py /data/glint_data celebrity > glint_cn.lst

  3. Call face2rec2.py to generate the .rec file.
  4. Merge the dataset with an existing one by calling src/data/dataset_merge.py without setting the model param, which will combine all IDs from the two datasets.

Finally you will get a dataset containing about 180K IDs.

Use src/eval/gen_glint.py to prepare the test feature file using a pretrained insightface model.

You can also post your private testing results here.

aaaaaaaak commented 6 years ago

Thanks for sharing.

cysin commented 6 years ago

@nttstar will you train new models with these data?

406747925 commented 6 years ago

A real service to the community.

meanmee commented 6 years ago

Is anyone from Glint here? I'm from Bitmain in the building downstairs. The download is too slow; can I just come upstairs and copy it directly?

lmmcc commented 6 years ago

Are there any overlapping identities between the msra and celebrity datasets?

aa12356jm commented 6 years ago

I attended Glint's talk a few days ago where they announced this dataset. I just finished downloading it (several hundred GB). I didn't expect it to show up here so quickly. Thanks!

JianbangZ commented 6 years ago

After testing, this dataset is pretty clean, but it still contains 0.3%~0.8% noise. We also found that the ms1m and Asian parts still have about 15-30 overlapping identities, though I guess that doesn't matter when the scale is already so large. Another finding is that this dataset suffers badly from a long-tail distribution. Taking the Asian part as an example, only 18K identities have over 25 images per class, and only a few thousand identities have over 60 images.
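The long-tail observation above can be quantified with a small sketch like this (the labels below are made-up toy data, not the real dataset):

```python
# Count how many identities have at least a given number of images.
from collections import Counter

def ids_with_at_least(labels, min_images):
    """Return the number of identities having >= min_images images."""
    counts = Counter(labels)
    return sum(1 for c in counts.values() if c >= min_images)

# Toy labels: identity 0 has 3 images, identity 1 has 1, identity 2 has 2.
labels = [0, 0, 0, 1, 2, 2]
print(ids_with_at_least(labels, 2))  # -> 2 (identities 0 and 2 qualify)
```

Run with the label column of your .lst file to reproduce the per-identity counts.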

meanmee commented 6 years ago

@aa12356jm could you share it on BaiduYun?

zhenglaizhang commented 6 years ago

@nttstar I downloaded the dataset from glint; the faces appear to be similarity-transformed and resized to 400x400. For arcface, how do I crop/resize them to 112x112?

nttstar commented 6 years ago

@zhenglaizhang I already provided the scripts.

HaoLiuHust commented 6 years ago

@JianbangZ do you have any ideas for solving these problems?

starimpact commented 6 years ago

awesome !

devymex commented 6 years ago

Thanks DeepGlint!

xxllp commented 6 years ago

This is meaningful work.

vzvzx commented 6 years ago

@nttstar the download link is broken.

libohit commented 6 years ago

@nttstar @JianbangZ how did you download the Glint Asian face dataset? I can't find where to register and sign up.

aaaaaaaak commented 6 years ago

@nttstar @JianbangZ Why does the Asian face dataset I downloaded only extract to 1.7G with 2000+ IDs? How should I handle this 90+G .tar.gz file? Could you give some guidance? Thanks.

anguoyang commented 6 years ago

There are no lmk files in the dataset. Is `lmk_file = os.path.join(input_dir, "%s_lmk.txt"%(ds))` correct?

Wisgon commented 6 years ago

The same problem as @libohit: I can't sign in at http://trillionpairs.deepglint.com/data; the "sign in" button is grayed out!

anguoyang commented 6 years ago

@Wisgon maybe you need to use another browser

wangchust commented 6 years ago

Can anyone share a copy of the lmk files? Their official site seems to be under maintenance; I couldn't download anything.

goodpp commented 6 years ago

The download isn't working right now. What's going on?

Wisgon commented 6 years ago

I can't download the dataset. When I click the Download button, an error appears: "This XML file does not appear to have any style information associated with it. The document tree is shown below."

```
<Error>
  <Code>InvalidAccessKeyId</Code>
  <Message>The OSS Access Key Id you provided is disabled.</Message>
  <RequestId>5B31AE6FF68A5D785875635D</RequestId>
  <HostId>dgplaygroundopen.oss-cn-qingdao.aliyuncs.com</HostId>
  <OSSAccessKeyId>LTAIKdTReMdV71Zi</OSSAccessKeyId>
</Error>
```

shineway14 commented 6 years ago

I can't sign in at http://trillionpairs.deepglint.com/data; the "sign in" button is grayed out!

Wisgon commented 6 years ago

@shineway14 You can use http://trillionpairs.deepglint.com/login to sign in; when you finish filling in the fields, press Enter instead of clicking the "log in" button. BTW, register at http://trillionpairs.deepglint.com/register.

meanmee commented 6 years ago

@nttstar what is the exact command to merge msra and celeb?

goodpp commented 6 years ago

@aaaaaaaak The Asian face dataset I recently downloaded via BitTorrent is fine and matches the official data. For reference, mine looks like this:

  1. Directory size: 98G ./asian-celeb
  2. Identity count: `ls -lR | grep "^d" | wc -l` gives 93979
  3. Image count: `ls -lR | grep "^-" | grep ".jpg" | wc -l` gives 2830146
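The shell counts above can also be reproduced in Python; a minimal sketch, assuming the usual one-folder-per-identity layout (the dataset root path is whatever you extracted to):

```python
# Count identity folders and .jpg images under a dataset root directory.
import os

def audit_dataset(root):
    """Return (num_identity_dirs, num_jpg_images) for a dataset laid out
    as one subdirectory per identity containing .jpg files."""
    n_ids = 0
    n_imgs = 0
    for entry in os.scandir(root):
        if entry.is_dir():
            n_ids += 1
            n_imgs += sum(1 for f in os.listdir(entry.path)
                          if f.endswith(".jpg"))
    return n_ids, n_imgs
```

For an intact asian-celeb download this should report 93979 identities and 2830146 images.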

jackytu256 commented 6 years ago

Hi all, I have already done step 1 and obtained the glint_cn file; however, I got an error while trying to do step 2. The error output follows; please help me fix this issue. Thanks.

```
OpenCV Error: Assertion failed (src.cols > 0 && src.rows > 0) in warpAffine, file /build/buildd/opencv-2.4.8+dfsg1/modules/imgproc/src/imgwarp.cpp, line 3445
Traceback (most recent call last):
  File "face2rec2.py", line 256, in <module>
    image_encode(args, i, item, q_out)
  File "face2rec2.py", line 99, in image_encode
    img = face_preprocess.preprocess(img, bbox = item.bbox, landmark=item.landmark, image_size='%d,%d'%(args.image_h, args.image_w))
  File "../common/face_preprocess.py", line 107, in preprocess
    warped = cv2.warpAffine(img,M,(image_size[1],image_size[0]), borderValue = 0.0)
cv2.error: /build/buildd/opencv-2.4.8+dfsg1/modules/imgproc/src/imgwarp.cpp:3445: error: (-215) src.cols > 0 && src.rows > 0 in function warpAffine
```

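For what it's worth, this warpAffine assertion usually fires when `cv2.imread` returns an empty image (missing or corrupt file), so `src.cols`/`src.rows` are 0. A minimal pre-filter sketch using only file-level checks (an assumption on my part; a real pipeline would decode with `cv2.imread` and test for `None` before calling `preprocess`):

```python
# Cheap sanity check for image files before feeding them to face2rec2.py.
import os

def is_readable_image(path):
    """Return False for missing or empty files, or files that do not
    start with the JPEG SOI marker; such entries make cv2.imread fail."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False
    with open(path, "rb") as f:
        return f.read(2) == b"\xff\xd8"  # JPEG start-of-image marker
```

Filtering the .lst file with a check like this should make the bad entry easy to locate.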
anguoyang commented 6 years ago

@jackytu256 Does the package you downloaded contain the lmk files? lmk_file = os.path.join(input_dir, "%s_lmk"%(ds))

jackytu256 commented 6 years ago

@anguoyang Yes, I got a file called celebrity_lmk

YunYang1994 commented 6 years ago

Hello, thanks for open-sourcing the code. I've run into a problem: I need to merge the msra and celebrity data, but the celebrity dataset doesn't seem to have a property file? Looking forward to your answer, thanks.

nttstar commented 6 years ago

Write the property file yourself. The format is: <total number of identities in the dataset>,112,112
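A sketch of generating that property file by counting distinct labels in a .lst file (the tab-separated layout with the label in the last column is an assumption; adjust to your actual .lst format):

```python
# Write the "<num_ids>,112,112" property file from a .lst file.
def write_property(lst_path, out_path, image_size=(112, 112)):
    """Count distinct labels (assumed to be the last tab-separated
    column of each .lst line) and write the property file."""
    ids = set()
    with open(lst_path) as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            ids.add(cols[-1])
    with open(out_path, "w") as f:
        f.write("%d,%d,%d" % (len(ids), image_size[0], image_size[1]))
    return len(ids)
```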

wangchust commented 6 years ago

Has anyone here successfully uploaded results? I've uploaded several times and never got any results back.

TopcoderX commented 6 years ago

@wangchust Try a different browser? Chrome works for me.

wangchust commented 6 years ago

@TopcoderX Do I just need to upload my own bin file? And where do I see the results? :)

TopcoderX commented 6 years ago

@wangchust Yes, just upload the bin file. You can see the results on the results page.

wangchust commented 6 years ago

@TopcoderX Thanks!

goodpp commented 6 years ago

How can I improve my training accuracy? Would anyone share their training setup? Here is mine:

  1. dataset: msra + celebrity
  2. network backbone: r34 (output=E, emb_size=512, prelu)
  3. loss function: arcface (m=0.5)
  4. training pipeline: batch_size=384, per_batch_size=96 (4 GPU x 12G), verbose=2000
  5. Highest LFW: 99.767%; Highest CFP_FP: 93.829%; Highest AgeDB30: 97.567% (epoch=14); megaface: 96.1564%

The test accuracy is good, but the training accuracy has stalled at close to 60%, as below:

```
INFO:root:Epoch[14] Batch [16000] Speed: 499.24 samples/sec acc=0.583464
INFO:root:Epoch[18] Batch [8380] Speed: 499.95 samples/sec acc=0.596615

INFO:root:Epoch[11] Train-acc=0.500434
INFO:root:Epoch[12] Train-acc=0.570312
INFO:root:Epoch[13] Train-acc=0.575087
INFO:root:Epoch[14] Train-acc=0.578993
INFO:root:Epoch[15] Train-acc=0.584201
INFO:root:Epoch[16] Train-acc=0.582465
INFO:root:Epoch[17] Train-acc=0.595920
```

  1. The initial lr is 0.1 and I did not set lr_step, but the log shows lr_steps [133333, 186666, 213333], and the learning rate has already dropped to 0.0001. As I understand it, training accuracy will not improve if I keep training like this. Is that understanding correct?
  2. In this situation, how can I raise the training accuracy? Continue training with an even smaller learning rate? Do I need to switch to TripletLoss?
  3. Is it necessary to keep pushing the training accuracy higher at all?
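For reference, lr_steps like [133333, 186666, 213333] implement a step decay: the learning rate is multiplied by a factor (commonly 0.1) at each listed batch count. A sketch of that schedule, assuming the 0.1 decay factor (not read from any config):

```python
# Step-decay learning-rate schedule: multiply lr by `factor` each time
# the global batch counter passes one of the listed step boundaries.
def lr_at(batch, base_lr=0.1, steps=(133333, 186666, 213333), factor=0.1):
    lr = base_lr
    for s in steps:
        if batch >= s:
            lr *= factor
    return lr
```

Under this assumption, by batch 213333 the lr has decayed three times, from 0.1 down to 0.0001, which matches the value reported above.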

mengzhibin commented 6 years ago

For those who cannot log in to the website: try switching to Chrome.

zhaowwenzhong commented 6 years ago

Can face_emore and celebrity be trained together? Do the two datasets overlap (the same person under different IDs)?

nttstar commented 6 years ago

@zhaowwenzhong Yes, you can merge them directly.

zhaowwenzhong commented 6 years ago

When fine-tuning with triplet loss, I see "call reset" in the output log. Is this normal?

```
call reset()
eval 4200 images.. 12600
triplet time stat [0.00022899999999999998, 27.907889, 5.54027, 0.0, 0.0, 0.0]
found triplets 1873
seq len 5550
INFO:root:Epoch[0] Batch [30] Speed: 124.86 samples/sec lossvalue=0.185821
INFO:root:Epoch[0] Batch [32] Speed: 931.34 samples/sec lossvalue=0.058671
INFO:root:Epoch[0] Batch [34] Speed: 648.38 samples/sec lossvalue=0.066276
INFO:root:Epoch[0] Batch [36] Speed: 638.66 samples/sec lossvalue=0.058656
INFO:root:Epoch[0] Batch [38] Speed: 636.64 samples/sec lossvalue=0.067309
call reset()
eval 4200 images.. 16800
triplet time stat [0.00036899999999999997, 34.373402, 7.256107, 0.0, 0.0, 0.0]
found triplets 1731
seq len 5100
INFO:root:Epoch[0] Batch [40] Speed: 124.52 samples/sec lossvalue=0.188634
INFO:root:Epoch[0] Batch [42] Speed: 673.45 samples/sec lossvalue=0.078769
INFO:root:Epoch[0] Batch [44] Speed: 645.02 samples/sec lossvalue=0.061025
INFO:root:Epoch[0] Batch [46] Speed: 628.74 samples/sec lossvalue=0.063693
call reset()
eval 4200 images.. 21000
triplet time stat [0.000468, 41.096862, 9.061286, 0.0, 0.0, 0.0]
found triplets 1864
seq len 5550
INFO:root:Epoch[0] Batch [48] Speed: 118.35 samples/sec lossvalue=0.202036
INFO:root:Epoch[0] Batch [50] Speed: 681.18 samples/sec lossvalue=0.073124
INFO:root:Epoch[0] Batch [52] Speed: 642.58 samples/sec lossvalue=0.074811
INFO:root:Epoch[0] Batch [54] Speed: 640.49 samples/sec lossvalue=0.067620
call reset()
```

nttstar commented 6 years ago

That's the output of the data iterator being reset; you can ignore it.

mengzhibin commented 6 years ago

The download is too slow; next time, please don't use a jar file.

anguoyang commented 6 years ago

@goodpp can you share the model?

YunYang1994 commented 6 years ago

How many identities should merging the msra and celebrity datasets yield? I merged them and got fewer than 100K (celebrity: 93979 IDs; msra: 85164 IDs).

    python /insightface/src/data/dataset_merge.py --include ~/data/celebrity/,~/data/msrc/ --output ~/data/combined/

I see there is a deduplication step in the code, so I'd like to ask: given the threshold you set, is the size of my merged (deduplicated) dataset as expected?

Looking forward to your answer. Many thanks!

zhaowwenzhong commented 6 years ago

Didn't the author say "you can merge them directly"? My understanding of direct merging is simply to put the two datasets together and distinguish them by ID.

YunYang1994 commented 6 years ago

@nttstar @zhaowwenzhong Doesn't "merge directly" mean merging with the /insightface/src/data/dataset_merge.py script? When I merged celebrity and msra with the default threshold, only fewer than 1000 IDs were added on top of celebrity.

nttstar commented 6 years ago

Leave the model param empty to merge directly.

zhaowwenzhong commented 6 years ago

" 直接合并" 是 不是可以这样做 数据集:celebrity的ID是86876->180854(查看celebrity_lmk) 数据集:msra的ID是0->86875 两个数据集放一起,以ID区分每个人。(我目前是这样做的,没有用到dataset_merge.py,不知道这样做对不对???,我目前还在用这些数据训练过程中,还不知结果如何) @nttstar @YunYang1994