deepinsight / insightface

State-of-the-art 2D and 3D Face Analysis Project
https://insightface.ai
22.86k stars 5.35k forks

Asian training dataset(from glint) discussion. #256

Closed nttstar closed 1 year ago

nttstar commented 6 years ago
  1. Download the dataset from http://trillionpairs.deepglint.com/data (after signup). msra is a cleaned subset of MS1M from glint, while celebrity is the Asian dataset.
  2. Generate the lst file by calling src/data/glint2lst.py. For example:

    python glint2lst.py /data/glint_data msra,celebrity > glint.lst

     or generate the Asian dataset only:

    python glint2lst.py /data/glint_data celebrity > glint_cn.lst

  3. Call face2rec2.py to generate the .rec file.
  4. Merge the dataset with an existing one by calling src/data/dataset_merge.py without setting the param model, which will combine all IDs from the two datasets.

Finally you will get a dataset containing about 180K IDs.

Use src/eval/gen_glint.py to prepare the test feature file with a pretrained insightface model.

You can also post your private testing results here.

YunYang1994 commented 6 years ago

@zhaowwenzhong Merge with the dataset_merge.py script, setting model=''.

cysin commented 6 years ago

I tried to fine tune with triplet:

CUDA_VISIBLE_DEVICES='0,1,2' python -u train.py --network r50 --loss-type 12 --lr 0.005 --mom 0.0 --per-batch-size 96 --data-dir /data/glint_train/ --pretrained /data1/models/model-r50,1 --prefix /data2/models/model-m1-triplet

but got the following error:

gpu num: 3
num_layers 50
image_size [112, 112]
num_classes 180855
Called with argument: Namespace(batch_size=288, beta=1000.0, beta_freeze=0, beta_min=5.0, c2c_mode=-10, c2c_threshold=0.0, center_alpha=0.5, center_scale=0.003, ckpt=1, coco_scale=9.052722677456407, ctx_num=3, cutoff=0, data_dir='/data/glint_train/', easy_margin=0, emb_size=512, end_epoch=100000, gamma=0.12, image_channel=3, image_h=112, image_w=112, images_per_identity=5, incay=0.0, logits_verbose=0, loss_type=12, lr=0.005, lr_steps='', margin=4, margin_a=0.0, margin_b=0.0, margin_m=0.5, margin_s=64.0, margin_verbose=0, max_steps=0, mom=0.0, network='r50', noise_sgd=0.0, num_classes=180855, num_layers=50, output_c2c=0, patch='0_0_96_112_0', per_batch_size=96, per_identities=19, power=1.0, prefix='/data2/models/model-m1-triplet', pretrained='/data1/models/model-r50,1', rand_mirror=1, rescale_threshold=0, scale=0.9993, target='lfw,cfp_fp,agedb_30', train_limit=0, triplet_alpha=0.3, triplet_bag_size=3600, triplet_max_ap=0.0, use_deformable=0, use_val=False, verbose=2000, version_act='prelu', version_input=1, version_output='E', version_se=0, version_unit=3, wd=0.0005)
loading ['/data1/models/model-r50', '1']
[19:17:40] src/engine/engine.cc:55: MXNet start using engine: ThreadedEnginePerDevice
init resnet 50
0 1 E 3 prelu
INFO:root:loading recordio /data/glint_train/train.rec...
header0 label [6753546. 6934401.]
id2range 180855
0 0 6753545
c2c_stat [0, 180855]
6753545
rand_mirror 1
5 19 3
(288,)
oseq 822654
lr_steps [71111, 106666, 142222]
/usr/lib/python2.7/site-packages/mxnet/module/base_module.py:490: UserWarning: Optimizer created manually outside Module but rescale_grad is not normalized to 1.0/batch_size/num_workers (0.333333333333 vs. 0.00347222222222). Is this intended?
  optimizer_params=optimizer_params)
call reset()
eval 3600 images.. 0
triplet time stat [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Traceback (most recent call last):
  File "train.py", line 1062, in <module>
    main()
  File "train.py", line 1059, in main
    train_net(args)
  File "train.py", line 1053, in train_net
    epoch_end_callback = epoch_cb )
  File "/usr/lib/python2.7/site-packages/mxnet/module/base_module.py", line 506, in fit
    next_data_batch = next(data_iter)
  File "/root/work/insightface/src/data.py", line 1010, in next
    ret = self.cur_iter.next()
  File "/root/work/insightface/src/data.py", line 860, in next
    self.reset()
  File "/root/work/insightface/src/data.py", line 726, in reset
    self.triplet_reset()
  File "/root/work/insightface/src/data.py", line 575, in triplet_reset
    self.select_triplets()
  File "/root/work/insightface/src/data.py", line 528, in select_triplets
    label[i-ba][:] = header.label
  File "/usr/lib/python2.7/site-packages/mxnet/ndarray/ndarray.py", line 444, in __setitem__
    self._set_nd_basic_indexing(key, value)
  File "/usr/lib/python2.7/site-packages/mxnet/ndarray/ndarray.py", line 706, in _set_nd_basic_indexing
    value = np.broadcast_to(value, shape)
  File "/usr/lib64/python2.7/site-packages/numpy/lib/stride_tricks.py", line 173, in broadcast_to
    return _broadcast_to(array, shape, subok=subok, readonly=True)
  File "/usr/lib64/python2.7/site-packages/numpy/lib/stride_tricks.py", line 128, in _broadcast_to
    op_flags=[op_flag], itershape=shape, order='C').itviews[0]
ValueError: operands could not be broadcast together with remapped shapes [original->remapped]: (2,) and requested shape (1,)

Any idea about this? Thanks

zhaowwenzhong commented 6 years ago

@cysin The rec generation tool needs to be modified. (See "Generating rec from lst", #265.)

zhaowwenzhong commented 6 years ago

            Epoch[0]     Epoch[1]     Epoch[2]     Epoch[3]     Epoch[4]     Epoch[5]     Epoch[6]   Epoch[7]
agedb_30    65.62+3.64   86.88+2.26   76.05+1.56   50.00+0.00   50.00+0.00   79.88+2.00
cfp_ff      81.24+2.36   97.14+0.74   94.29+1.63   50.11+0.12   50.09+0.15   87.94+2.24
cfp_fp      65.69+1.44   77.03+2.68   71.76+0.62   50.07+0.26   50.00+0.00   66.54+1.92
lfw         81.63+1.86   97.67+1.01   94.42+1.09   50.18+0.30   50.00+0.00   90.53+1.10
Train-acc   0.028951     0.050758     0.061312     0.066338     0.073008     0.078673

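From the test results above, test accuracy drops as the number of training epochs increases, e.g. lfw: 81.63 -> 97.67 -> 94.42 -> 50.18 -> 50.00 -> 90.53, while training accuracy keeps rising. Is this overfitting, or is something wrong somewhere? Which parameters should I try adjusting? The training data is mainly msra + celebrity (at least 3 photos per person, about 180K people).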

cysin commented 6 years ago

@zhaowwenzhong Did you mean the rec format used for triplet training is different from the one used for softmax training?

Edwardmark commented 6 years ago

I fine-tune on asian celebrity dataset, using the command below:

#!/usr/bin/env bash

export MXNET_CPU_WORKER_NTHREADS=24
export MXNET_CUDNN_AUTOTUNE_DEFAULT=0
export MXNET_ENGINE_TYPE=ThreadedEnginePerDevice

NETWORK=r50
JOB=asian
MODELDIR="../model-$NETWORK-$JOB"
mkdir -p "$MODELDIR"
PREFIX="$MODELDIR/model-asian"
LOGFILE="$MODELDIR/log"

CUDA_VISIBLE_DEVICES='0,1' python -u train_softmax.py \
  --network "$NETWORK" \
  --loss-type 0 \
  --lr 0.005 \
  --per-batch-size 64 \
  --data-dir ../datasets/faces_asian_112x112 \
  --pretrained ../models/model-r50-am-lfw/model,0000 \
  --prefix "$PREFIX"

but I get the following warning:

UserWarning: Optimizer created manually outside Module but rescale_grad is not normalized to 1.0/batch_size/num_workers (0.5 vs. 0.0078125). Is this intended? optimizer_params=optimizer_params)

Do you know what it means? @nttstar
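The two numbers in the warning can be reproduced with simple arithmetic. The sketch below assumes the training script passes rescale_grad = 1 / (number of GPUs) to the optimizer. That assumption is not confirmed in this thread, but it matches both warnings quoted here (0.5 vs. 0.0078125 with 2 GPUs and per-batch-size 64, and 0.333... vs. 0.00347... with 3 GPUs and per-batch-size 96). The function name rescale_grad_values is hypothetical.

```python
def rescale_grad_values(per_batch_size, num_gpus, num_workers=1):
    """Return (value_in_use, value_expected_by_mxnet) for the warning."""
    batch_size = per_batch_size * num_gpus
    # Assumption: the training script sets rescale_grad to 1 / num_gpus.
    used = 1.0 / num_gpus
    # The formula named in the warning text itself.
    expected = 1.0 / batch_size / num_workers
    return used, expected


# 2 GPUs, --per-batch-size 64: the pair reported as "0.5 vs. 0.0078125"
print(rescale_grad_values(64, 2))
# 3 GPUs, --per-batch-size 96: the pair reported in the triplet run above
print(rescale_grad_values(96, 3))
```

This only shows where the numbers come from. Whether the non-default scaling is intended depends on how the loss is normalized in the training script, which the thread does not settle.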

YunYang1994 commented 6 years ago

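@nttstar Hello, I merged these two datasets and got about 180K IDs. I then used your trained model to extract the center feature vector of each ID, computed the pairwise similarity between the vectors (cosine value), and pulled out the IDs with a value above 0.85. This gave 6417 ID pairs, i.e. 6417 pairs of IDs whose cosine value exceeds 0.85. I went through them roughly by eye and with Baidu image search and found that they are indeed the same person. This means the merged dataset may contain duplicated IDs; so far I have found 6417 duplicated ID pairs.

I'm not sure whether my approach is wrong or the dataset itself is not very clean.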

@meanmee @406747925 @zhaowwenzhong
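A minimal sketch of the duplicate check described above, assuming the embeddings have already been extracted into a NumPy array with one row per image and a parallel array of ID labels. The function name find_duplicate_id_pairs and the toy data are hypothetical; only the 0.85 threshold comes from the comment.

```python
import numpy as np


def find_duplicate_id_pairs(embeddings, labels, threshold=0.85):
    """Return (id_a, id_b, cosine) for ID centers more similar than threshold."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    # Center of each ID = mean of that ID's embeddings, then L2-normalized
    centers = np.stack([embeddings[labels == i].mean(axis=0) for i in ids])
    centers /= np.linalg.norm(centers, axis=1, keepdims=True)
    # Cosine similarity between every pair of centers
    sim = centers @ centers.T
    # Keep the upper triangle only so each pair is reported once
    rows, cols = np.where(np.triu(sim, k=1) > threshold)
    return [(ids[r], ids[c], float(sim[r, c])) for r, c in zip(rows, cols)]


# Toy data: IDs 0 and 1 point in nearly the same direction, ID 2 does not
toy_embeddings = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.05], [0.0, 1.0]]
toy_labels = [0, 0, 1, 2]
pairs = find_duplicate_id_pairs(toy_embeddings, toy_labels)
print(pairs)
```

For roughly 180K IDs the full similarity matrix would not fit in memory on most machines, so a real run would need to compute it in row blocks or use a nearest-neighbor index.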

nttstar commented 6 years ago

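@YunYang1994 It is possible that the data is not clean. You could compare training after deduplication with training on the direct merge. I think the difference should be almost negligible.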

Edwardmark commented 6 years ago

@nttstar Hello, why do I only get 30 samples/sec, and at most 60 samples/sec, on 4x 1080 Ti? Others seem to get at least 300 to 1000 images/sec. What could be causing such slow training? Thanks.

zchflyer commented 6 years ago

@YunYang1994 How do you obtain the center feature vector of each ID?

tornadomeet commented 6 years ago

@nttstar The description says the provided images are aligned. Is there a non-aligned image set? Thanks!

YunYang1994 commented 6 years ago

@zchflyer The source code is in there; just compute the center vector of each ID.

becauseofAI commented 6 years ago

Why do I get such low results (identification is only 0.01270) on TrillionPairs of Glint? Maybe I did not generate the result correctly. I use src/eval/gen_glint.py to get the bin file for submission, but the code may not be usable directly, so I modified it as follows. The original code in gen_glint.py:

image_path, label, bbox, landmark, aligned = face_preprocess.parse_lst_line(line)
buffer.append( (image_path, landmark) )

The original code in src/common/face_preprocess.py:

def parse_lst_line(line):
  vec = line.strip().split("\t")
  assert len(vec)>=3
  aligned = int(vec[0])
  image_path = vec[1]
  label = int(vec[2])
  bbox = None
  landmark = None
  #print(vec)
  if len(vec)>3:
    bbox = np.zeros( (4,), dtype=np.int32)
    for i in xrange(3,7):
      bbox[i-3] = int(vec[i])
    landmark = None
    if len(vec)>7:
      _l = []
      for i in xrange(7,17):
        _l.append(float(vec[i]))
      landmark = np.array(_l).reshape( (2,5) ).T
  #print(aligned)
  return image_path, label, bbox, landmark, aligned

I modified gen_glint.py to:

    image_path, landmark = face_preprocess.parse_lst_line(line)  
    image_path = "/to/my/path/TrillionPairs/testdata/"+line.split(" ")[0]
    buffer.append( (image_path, landmark) ) 

and modified src/common/face_preprocess.py to:

def parse_lst_line(line):
  vec = line.strip().split(" ")
  assert len(vec)>=2
  image_path = vec[0]
  landmark = None
  #print(vec)
  if len(vec)>2:
    _l = []
    for i in xrange(1,11):
      _l.append(float(vec[i]))
    landmark = np.array(_l).reshape( (2,5) ).T
  #print(aligned)
  return image_path, landmark

My input is:

--input='/to/my/path/TrillionPairs/testdata/testdata_lmk/testdata_lmk.txt'

because the format of testdata_lmk.txt is:

testdata/00/00/00000d7e95948372025bdaca5a203832.jpg 153.4 180.0 246.6 180.0 196.8 215.8 158.5 278.7 230.6 277.6
testdata/00/00/00000f9f87210c8eb9f5fb488b1171d7.jpg 156.1 180.0 243.9 180.0 207.4 229.2 159.8 262.9 237.4 263.0
testdata/00/00/000010e4c136b77a07eeeea84d84d804.jpg 156.4 180.0 243.6 180.0 201.6 223.0 168.0 264.7 237.7 268.0

So I think my modification is right, and the resulting bin file is about 1.8G.

I don't know what's wrong with it. Can someone find my problem or share working code directly?

Any help would be appreciated! @nttstar
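One thing worth checking in the modified parser is the landmark layout. In the sample lines of testdata_lmk.txt quoted above, the 2nd and 4th values are both 180.0, which would be consistent with interleaved pairs (x1 y1 x2 y2 ... x5 y5). By contrast, reshape((2,5)).T assumes all x values come first (x1 ... x5 y1 ... y5). This is a reading of the sample, not something the thread confirms, and it is not established here that it explains the low score. A Python 3 sketch that supports both layouts, with the hypothetical name parse_landmark_line:

```python
import numpy as np


def parse_landmark_line(line, interleaved=True):
    """Parse '<path> v1 ... v10' into (path, landmarks of shape (5, 2))."""
    vec = line.strip().split(" ")
    assert len(vec) >= 11, "expected a path followed by 10 landmark values"
    path = vec[0]
    values = np.array([float(v) for v in vec[1:11]], dtype=np.float32)
    if interleaved:
        # Assumed layout: x1 y1 x2 y2 ... x5 y5
        landmark = values.reshape((5, 2))
    else:
        # Layout implied by the original code: x1 ... x5 y1 ... y5
        landmark = values.reshape((2, 5)).T
    return path, landmark


sample = ("testdata/00/00/00000d7e95948372025bdaca5a203832.jpg "
          "153.4 180.0 246.6 180.0 196.8 215.8 158.5 278.7 230.6 277.6")
path, lm = parse_landmark_line(sample)
print(path)
print(lm)
```

Printing both variants for a few lines and comparing them against the image is a quick way to see which layout places the five points on the eyes, nose, and mouth corners.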

Edwardmark commented 6 years ago

@nttstar When I run src/eval/gen_glint.py, I observe that memory usage keeps increasing, which is weird. Is that normal? Another question: what does the following line mean? https://github.com/deepinsight/insightface/blob/master/src/eval/gen_glint.py#L131
When I run the code, I get the following error:
sh: 1: bypy: not found
Please help me out, thank you very much.

yhw-yhw commented 6 years ago

@becauseofAI I have the same problem as you. My network's training accuracy reaches 0.82 when using the whole deepglint dataset, but when I submit my result I get an identification result of 0.016. @nttstar

AaronYKing commented 6 years ago

@yhw-yhw How did you get the .bin result file? If you also used src/eval/gen_glint.py, did you modify it anywhere? And do you know what the file Trillion Pairs/testdata/feature_tools/matio.py, downloaded with the dataset, is for?

AaronYKing commented 6 years ago

@nttstar I used the same modification as @becauseofAI to generate the result with the LResNet100E-IR|Emore model in the Model-Zoo, but only got an identification result of 0.00178. Can you test it and share your result and code with us?

nttstar commented 6 years ago

@becauseofAI @yhw-yhw @AaronYKing Can you give us a complete, correct way to generate the submit file? I'm sorry that I have had no time to test it recently. Thanks~

Edwardmark commented 6 years ago

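Has anyone gotten a test result? I uploaded the bin to the deepglint website, but after the upload finished the page just hung with no response, and the result page shows nothing either. Can anyone help with how to proceed?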

goodpp commented 6 years ago

@nttstar @becauseofAI @yhw-yhw I have the same problem: I submitted my result and got an identification result of 0.01465. I have not modified the code. My steps are:

  1. src/data/glint2lst.py /xxx/glint testdata > /home/xxx/glint_test.lst
  2. src/eval/gen_glint.py --input /home/xxx/glint_test.lst --output my_result.bin {...other param}

yhw-yhw commented 6 years ago

@Edwardmark I ran into this too. Logging out and logging back in fixed it. If there is still no result, the data upload may have failed.

becauseofAI commented 6 years ago

@nttstar @yhw-yhw @AaronYKing I have uploaded the code to generate the submit file to GoogleDrive. You need to put it in the directory insightface/src/eval/, and you can use the LResNet100E-IR|Emore model in the Model-Zoo to generate the submit file. But something may be wrong with the code, as it only gets an identification result of 0.00178. I would be grateful to anyone who can check the code and solve the problem!

tornadomeet commented 6 years ago

@nttstar could glint provide the original non-aligned face data set?

Edwardmark commented 6 years ago

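I tested the result of fine-tuning on the celebrity dataset; the results are below. The test itself should be correct, but the performance is very poor... verification@1e-9:res1 identification@1e-3:res2
The code is just the one given by @nttstar, with no modification needed.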

Edwardmark commented 6 years ago

@yhw-yhw Thanks. Have you run into the following problem? What does this line of code mean? https://github.com/deepinsight/insightface/blob/master/src/eval/gen_glint.py#L131 When I run the code, I get the following error: sh: 1: bypy: not found Please help me out, thank you very much.

Edwardmark commented 6 years ago

@cysin @zhaowwenzhong Does the rec format need to change when generating the rec for triplet training data? I also ran into the following problem:
ValueError: operands could not be broadcast together with remapped shapes [original->remapped]: (2,) and requested shape (1,)

yhw-yhw commented 6 years ago

@Edwardmark bypy is the name of a Baidu Cloud upload script; you can ignore it. I also hit this problem when training on glint data with triplet loss. It happens because the two datasets produce rec files in different formats. Fix: in data.py, change lines 528 and 529 (https://github.com/deepinsight/insightface/blob/master/src/data.py#L528) from

    label[i-ba][:] = header.label
    tag.append( ( int(header.label), _idx) )

to

    label[i-ba][:] = header.label[0]
    tag.append( ( int(header.label[0]), _idx) )
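The broadcast error and the fix can be reproduced with NumPy alone. The sketch below assumes the record label is sometimes a scalar and sometimes an array whose first element is the identity id, as the change above implies. The helper name identity_label is hypothetical and is not part of insightface.

```python
import numpy as np


def identity_label(label):
    """Return the identity id as int, whether label is a scalar or an array."""
    return int(np.atleast_1d(label)[0])


slot = np.zeros((1,))
two_value_label = np.array([12.0, 34.0])

# Writing a 2-element label into a 1-element slot fails, as in the traceback
try:
    slot[:] = two_value_label
except ValueError as err:
    print("broadcast error:", err)

# Taking the first element works for both label shapes
slot[:] = identity_label(two_value_label)
print(slot, identity_label(7.0))
```

A helper like this would let the same data iterator read rec files of either format, instead of hard-coding header.label[0].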

Edwardmark commented 6 years ago

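@yhw-yhw @nttstar OK. One more question. If I directly run:
python glint2lst.py /data/glint_data msra,celebrity > glint.lst
to generate the list for the two datasets combined, and then run:
python src/data/face2rec2.py ${path-to-glint-data-and-glint-lst}
(that path contains glint.lst), isn't the rec file already complete? Do I still need to run merge.py? Why run merge.py at all?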

Edwardmark commented 6 years ago

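@yhw-yhw Thanks a lot. I also don't quite understand why merge.py has to be run to merge the data. Doesn't running python glint2lst.py /data/glint_data msra,celebrity > glint.lst already generate the list? Can't the rec be generated directly from that list? Why merge as well?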

nttstar commented 6 years ago

That is for merging with other datasets.

Edwardmark commented 6 years ago

@nttstar Thank you very much for your patient reply.

Edwardmark commented 6 years ago

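@nttstar @yhw-yhw For a classification task with 180K classes like this, is it hard to get good results by training or fine-tuning with softmax loss and its variants? Should one simply fine-tune on the glint data with triplet loss instead? I fine-tuned the r50 model on the glint data for a day with the arcface loss; the training accuracy stayed near 0 and the test accuracy also dropped a lot. I would like to discuss which methods work well for multi-class tasks with this many classes.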

yhw-yhw commented 6 years ago

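@Edwardmark I am fine-tuning the author's released r50 model on the glint dataset with arcface loss. So far I have trained 100K iterations with lr 0.0001, and the current acc is 0.55. This lr still needs tuning. In my experience the choice of lr matters a lot. When training on datasets with many IDs, I generally train 100K iterations at 0.01, then 200K iterations at 0.001, then 100K iterations at 0.0001, which basically reaches a very good result. Finally, after training for a while at 0.00001, acc stops changing. My batch size is 128.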
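The step schedule described above can be written as a small lookup. This is only a sketch of the schedule as stated in the comment (0.01 for 100K iterations, 0.001 for the next 200K, 0.0001 for the next 100K, then 0.00001). The function name lr_at is hypothetical, and how these boundaries map onto the training script's own arguments is not shown in this thread.

```python
def lr_at(iteration,
          boundaries=(100_000, 300_000, 400_000),
          rates=(0.01, 0.001, 0.0001, 0.00001)):
    """Piecewise-constant learning rate for a given iteration index."""
    # boundaries are cumulative iteration counts at which the rate drops
    for boundary, rate in zip(boundaries, rates):
        if iteration < boundary:
            return rate
    # After the last boundary, stay at the final (smallest) rate
    return rates[-1]


for it in (0, 99_999, 100_000, 299_999, 300_000, 400_000):
    print(it, lr_at(it))
```

The boundaries are cumulative, so the 200K iterations at 0.001 run from iteration 100,000 up to but not including 300,000.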

Edwardmark commented 6 years ago

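@yhw-yhw Thanks. I previously fine-tuned the author's r50 using only celebrity; acc was 58%, but uploading to the glint website gave only 18% accuracy. Have you uploaded your trained model for the glint test? What was the result?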

yhw-yhw commented 6 years ago

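@Edwardmark When I trained on the whole dataset, fine-tuning r50, the training acc was 50% and the glint test result was also only 16%. Many people have run into this, which is strange. I am now training and testing on ms1m and celebrity separately.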

Edwardmark commented 6 years ago

@yhw-yhw Hello, when I fine-tune with triplet, with batch set to 120 and training on 4 GPUs, I get the error below. Have you run into it?

Traceback (most recent call last):
  File "train.py", line 1062, in <module>
    main()
  File "train.py", line 1059, in main
    train_net(args)
  File "train.py", line 1053, in train_net
    epoch_end_callback = epoch_cb )
  File "/usr/local/lib/python2.7/dist-packages/mxnet/module/base_module.py", line 506, in fit
    next_data_batch = next(data_iter)
  File "/mntML/dongbin/insightface/src/data.py", line 1011, in next
    ret = self.cur_iter.next()
  File "/mntML/dongbin/insightface/src/data.py", line 861, in next
    self.reset()
  File "/mntML/dongbin/insightface/src/data.py", line 727, in reset
    self.triplet_reset()
  File "/mntML/dongbin/insightface/src/data.py", line 576, in triplet_reset
    self.select_triplets()
  File "/mntML/dongbin/insightface/src/data.py", line 545, in select_triplets
    embeddings[ba:bb,:] = net_out
ValueError: could not broadcast input array from shape (480,512) into shape (240,512)

Edwardmark commented 6 years ago

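Has anyone gotten good submission results after training on glint? However I train, results on the test sets are good, but after uploading, the results are poor, and I don't know why. @nttstar @becauseofAI @yhw-yhw Today, after switching to fine-tuning with triplet loss, the result rose from 18% to 20%, but that is still worse than the r50 model without any fine-tuning; uploading the r50 model to the official site got about 48% accuracy.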

nttstar commented 6 years ago

We will try to ask Glint to check recent test results soon

xmuszq commented 6 years ago

Is this dataset better than the one provided by @nttstar?

Edwardmark commented 6 years ago

@nttstar Have you reported this to Glint? Is it a code problem, or is the training result really that poor?

Edwardmark commented 6 years ago

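Has anyone gotten good submission results after training on glint? However I train, results on the test sets are good, but after uploading, the results are poor, and I don't know why. @becauseofAI @yhw-yhw Today, after switching to fine-tuning with triplet loss, the result rose from 18% to 20%, but that is still worse than the r50 model without any fine-tuning; uploading the r50 model to the official site got about 48% accuracy.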

wenjie710 commented 6 years ago

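@Edwardmark Was the model that got you 48% accuracy the LResNet50E-IR provided here, used directly? Was the bin also generated with gen_glint.py, or did you make other improvements? Uploading features extracted with the LResNet50E-IR model, I only get 31% accuracy.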

JianbangZ commented 6 years ago

@xsr-ai You still have 150K after balancing? I looked at ac_glint and the long tail is severe; most identities have only a few images, up to around 20.

HaoLiuHust commented 6 years ago

@xsr-ai How do you handle data balancing?

YunYang1994 commented 6 years ago

@nttstar Hello, roughly what accuracy do you reach on the training set? I find that my training-set accuracy is very low, but the accuracy on lfw is very high.

Edwardmark commented 6 years ago

@nttstar @yhw-yhw Has anyone trained with triplet loss? Why does it seem not to converge at all?

Edwardmark commented 6 years ago

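Has anyone trained with triplet loss? Why does it seem not to converge at all? Even with online hard negative mining, there should be some overall trend, right? The loss never seems to go down. Any good suggestions?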

goodpp commented 6 years ago

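@Edwardmark @yhw-yhw Have you solved the problem of self-trained models giving poor test results after training on glint? I have tried many times and my own model just doesn't work... My own glint-trained model has no problems on other tests, including megaface.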

Name             TPR@FPR=1e-3   metric
Baseline demo    0.43883        cos
Pretrained r34   0.49736        cos
Pretrained r50   0.49473        cos
Own glint_r34    0.01465        cos
Own MS1M_r34     0.50138        cos
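For readers unfamiliar with the column header, TPR@FPR=1e-3 with a cosine metric can be computed roughly as follows. This is a generic sketch of the metric, not the evaluation code used by the Glint server (which is not shown in this thread). The function name tpr_at_fpr and the toy scores are hypothetical.

```python
import numpy as np


def tpr_at_fpr(genuine_scores, impostor_scores, fpr=1e-3):
    """Return (tpr, threshold) at the given false positive rate.

    genuine_scores: cosine similarities of same-identity pairs.
    impostor_scores: cosine similarities of different-identity pairs.
    """
    genuine = np.asarray(genuine_scores, dtype=np.float64)
    impostor = np.sort(np.asarray(impostor_scores, dtype=np.float64))[::-1]
    # Allow at most floor(fpr * N) impostor pairs above the threshold
    k = min(int(np.floor(fpr * impostor.size)), impostor.size - 1)
    threshold = impostor[k]
    # A pair is accepted when its score is strictly above the threshold
    tpr = float(np.mean(genuine > threshold))
    return tpr, float(threshold)


toy_genuine = [0.9, 0.8, 0.3]
toy_impostor = np.linspace(0.0, 0.5, 1000)
tpr, thr = tpr_at_fpr(toy_genuine, toy_impostor, fpr=1e-3)
print(tpr, thr)
```

Under this definition, a score near 0.01 means that almost no genuine pair scores above the highest-scoring impostor pairs.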

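I generated the test file with the same steps each time; only the model differs.

  1. src/data/glint2lst.py /xxx/glint testdata > /home/xxx/glint_test.lst
  2. src/eval/gen_glint.py --input /home/xxx/glint_test.lst --output my_result.bin {...other param}

Addendum: today I tried the r34 model that I previously trained myself on refined MS1M as a reproduction. The result had no problems and was quite good: iden.=0.50138 veri.=0.53127...

Edwardmark commented 6 years ago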

@goodpp I solved it by using triplet loss. After fine-tuning, my test result is 48%, slightly lower than the original, but still reasonable.

goodpp commented 6 years ago

@Edwardmark Thanks, I'll try it too.