Convergence problem training with arc_torch/train_one_gpu.py

hhhThomas commented 3 years ago

I am training a MobileFaceNet on the CASIA_WebFace dataset downloaded from the insightface Dataset Zoo. However, I have trouble getting it to converge. The loss stops going down after a few steps. Below are my config settings. I have also tried training with cosface, setting batch size=128, lr=0.001, and s=32 in the arcface/cosface loss. Am I missing something? Could you give me some advice or share your settings? I would greatly appreciate it if you could help me out.

config = edict()
config.loss = "arcface"
config.network = "mbf"
config.resume = False
config.output = None
config.embedding_size = 512
config.sample_rate = 1.0
config.fp16 = True
config.momentum = 0.9
config.weight_decay = 2e-4
config.batch_size = 64
config.lr = 0.01  # batch size is 512

config.rec = "./train_tmp/faces_webface_112x112"
config.num_classes = 10572
config.num_image = "forget"
config.num_epoch = 34
config.warmup_epoch = -1
config.decay_epoch = [8, 12, 15, 18]
config.val_targets = ["lfw", "cfp_fp", "agedb_30"]

eeric commented 3 years ago

Sorry, with one GPU the model can be trained, but the verification results barely improve, so I advise against using it.

hhhThomas commented 3 years ago

Hi @eeric ,

Thanks for your reply. Do you know if the problem comes from the training stage or the verification stage? Besides, I would like to try it on a multi-GPU server. Do you have any idea how many GPUs I need and how large the RAM should be?
For example, if I want to train a MobileFaceNet on the MS1M-RetinaFace dataset (93K ids), how much RAM do I need at minimum? Could you please give me some suggestions? Thank you.

eeric commented 3 years ago

1. No fewer than 2 GPUs, with 11 GB of memory per GPU.

hhhThomas commented 3 years ago

Excuse me, one more question. Could I also ask about the hardware requirements for training with Deepglint360K (360K ids, 17M images) and WebFace42M (2M ids, 42M images)? Thank you.

eeric commented 3 years ago

GPU: NVIDIA GeForce RTX 2080 Ti
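
A rough way to reason about the GPU memory side of the questions above is to estimate the size of the final classification layer, which grows linearly with the number of identities. The sketch below is a back-of-the-envelope calculation, not a measurement: the function names are hypothetical, the identity counts are the rounded figures quoted in this thread, and it assumes fp32 storage with three copies of the class-center matrix (weights, gradients, and SGD momentum). It ignores the backbone, activations, and framework overhead, and it says nothing about host RAM.

```python
def fc_layer_gib(num_classes, embedding_size=512, bytes_per_value=4, copies=3):
    """Approximate GiB held by the classification layer.

    copies=3 is an assumption: weights + gradients + SGD momentum buffer.
    """
    return num_classes * embedding_size * bytes_per_value * copies / 2**30


def logits_gib(batch_size, num_classes, bytes_per_value=4):
    """Approximate GiB for one batch of logits (batch_size x num_classes)."""
    return batch_size * num_classes * bytes_per_value / 2**30


# Rounded identity counts as quoted in the thread.
datasets = {
    "MS1M-RetinaFace": 93_000,
    "Deepglint360K": 360_000,
    "WebFace42M": 2_000_000,
}

for name, ids in datasets.items():
    print(f"{name}: fc layer ~{fc_layer_gib(ids):.2f} GiB, "
          f"logits at batch 128 ~{logits_gib(128, ids):.2f} GiB")
```

Under these assumptions the classification layer alone comes to roughly 0.5 GiB for 93K ids, 2.1 GiB for 360K ids, and 11.4 GiB for 2M ids. The last figure is already larger than an 11 GB card, which is one reason large-identity training is usually split across several GPUs.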

eeric commented 2 years ago

For one GPU, see https://github.com/fdbtrs/ElasticFace, but it has no Partial FC.

taosean commented 2 years ago

@eeric I used the script you provided to fine-tune the model glint360k_cosface_r100_fp16_0.1 downloaded here on the glint_asia dataset.

My config is

config = edict()
config.loss = "cosface"
config.network = "r100"
config.resume = True
config.output = "insightface_outputs/glintasia_cosface_r100_fp16_1.0/"
config.embedding_size = 512
config.sample_rate = 1.0
config.fp16 = True
config.momentum = 0.9
config.weight_decay = 5e-4
config.batch_size = 128
# config.lr = 0.1  # batch size is 512
config.lr = 0.01  # batch size is 128

config.dataset = "glintasia"
config.rec = "faces/faces_glintasia/"
config.num_classes = 93979
config.num_image = 2830146
config.warmup_epoch = -1

However, after 1 epoch, the loss starts to diverge and gets bigger and bigger. I did not modify other things.
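
For context on the `# batch size is 512` comments in the config, one common heuristic (not something confirmed in this thread) is to scale the learning rate linearly with the batch size. The sketch below applies that heuristic to the reference pair taken from the config comment, lr 0.1 at batch size 512. The function name is hypothetical.

```python
def linearly_scaled_lr(reference_lr, reference_batch_size, batch_size):
    """Linear scaling heuristic: keep lr / batch_size constant."""
    return reference_lr * batch_size / reference_batch_size


# Reference pair from the config comment: lr = 0.1 at batch size 512.
print(linearly_scaled_lr(0.1, 512, 128))  # batch size used in this config
print(linearly_scaled_lr(0.1, 512, 64))   # batch size used in the first post
```

This gives 0.025 for batch size 128 and 0.0125 for batch size 64, so the learning rate of 0.01 used here is already below what the heuristic would suggest. By this rule of thumb alone, the learning rate does not look like the obvious cause of the divergence.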

Have you ever encountered this problem? Do you know why this happens?

Thanks!

eeric commented 2 years ago

There may be an error in train_one_gpu.py. Please refer to https://github.com/eeric/Face_recognition_cnn/issues/4#issuecomment-997318897 or use the Paddle framework: https://github.com/deepinsight/insightface/tree/master/recognition/arcface_paddle

There, you can use one GPU.

taosean commented 2 years ago

@eeric So you mean there may be errors in train_one_gpu.py? Do you have any idea where they are?

Thanks!

eeric commented 2 years ago

Yes, please use another open project that supports one GPU: https://github.com/fdbtrs/ElasticFace or another framework, PaddlePaddle: https://github.com/deepinsight/insightface/tree/master/recognition/arcface_paddle

yenai3726 commented 2 years ago

@taosean Hello!
I ran into the same problem when training with train_one_gpu.py. After looking into it, I found that x_grad at Line 116 of train_one_gpu.py is an all-zero matrix, so the model cannot learn. I did the following:

In partial_fc.py, comment out (#) Line 219 and change Line 222 to x_grad = total_features.grad

It's working fine now. Hope this answer helps you! : )
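
To illustrate why an all-zero x_grad stops learning, here is a framework-free toy of a two-stage backward pass, where the head computes the gradient with respect to the features and hands it back to the backbone. It is only a sketch of the idea described above, not the code in partial_fc.py or train_one_gpu.py: the function and variable names are hypothetical, the "backbone" is a single weight, and the head is kept fixed for simplicity (in real training the head can still update, so the loss may drop a little before stalling).

```python
def train(steps, lr, zero_feature_grad):
    """Toy model: features f = w * x (backbone), loss = (v * f - y) ** 2 (head)."""
    w, v = 0.5, 1.0   # backbone weight (trained) and head weight (fixed here)
    x, y = 2.0, 3.0   # a single training sample and its target
    losses = []
    for _ in range(steps):
        f = w * x                            # backbone forward: the "features"
        loss = (v * f - y) ** 2              # head forward
        losses.append(loss)
        head_grad = 2.0 * (v * f - y) * v    # d loss / d f, computed in the head
        # Buggy variant: return zeros instead of the gradient the head computed.
        x_grad = 0.0 if zero_feature_grad else head_grad
        # Backbone backward and SGD step: d loss / d w = x_grad * (d f / d w) = x_grad * x
        w -= lr * x_grad * x
    return w, losses


w_bug, losses_bug = train(steps=20, lr=0.05, zero_feature_grad=True)
w_ok, losses_ok = train(steps=20, lr=0.05, zero_feature_grad=False)
print("zero x_grad:  w =", w_bug, " last loss =", losses_bug[-1])
print("real x_grad:  w =", w_ok, " last loss =", losses_ok[-1])
```

With a zero feature gradient the backbone weight never moves and the loss stays at its initial value. Passing the head's actual gradient through lets the backbone weight move toward the value that fits the sample.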
