hhhThomas closed this issue 2 years ago.
Sorry, with a single GPU the model can be trained, but the verification accuracy barely improves, so I advise against using it.
Hi @eeric ,
Thanks for your reply. Do you know if the problem comes from the training stage or the verification stage?
Besides, I would like to try it on a multi-GPU server. Do you have any idea how many GPUs I need and how much RAM is required?
For example, if I want to train a MobileFaceNet with the MS1M-RetinaFace dataset (93K IDs), how much RAM do I need at a minimum?
Could you please give me some suggestions?
Thank you.
No fewer than 2 GPUs, with at least 11 GB of memory on each.
Excuse me, one more question. Could I also inquire about the hardware requirements for training with Deepglint360K (360K IDs, 17M images) and WebFace42M (2M IDs, 42M images)? Thank you.
GPU: NVIDIA GeForce RTX 2080 Ti.
For a single GPU you can use https://github.com/fdbtrs/ElasticFace, but it has no partial FC.
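For a rough sense of the scale involved in those questions, here is a back-of-the-envelope estimate of the final fully connected (classifier) layer's memory alone. The weight matrix is num_classes × embedding_size; the factor of three copies (weights, gradients, momentum buffer) and fp32 storage are assumptions for illustration, and actual usage depends heavily on the framework and on whether partial FC sampling is used.

```python
# Hypothetical estimate of the final FC layer's GPU memory.
# Assumes fp32 weights (4 bytes each) plus gradient and momentum buffers
# (3 copies in total); real usage will differ.
def fc_memory_gb(num_classes: int, embedding_size: int = 512, copies: int = 3) -> float:
    bytes_total = num_classes * embedding_size * 4 * copies
    return bytes_total / 1024 ** 3

print(fc_memory_gb(93_000))     # MS1M-RetinaFace scale, ~0.5 GB
print(fc_memory_gb(360_000))    # Glint360K scale, ~2.1 GB
print(fc_memory_gb(2_000_000))  # WebFace42M scale, ~11.4 GB
```

The roughly 11 GB needed just for a 2M-class classifier is why partial FC sampling, or spreading the classifier across several GPUs, matters at that scale.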
@eeric I used the script you provided to fine-tune the model glint360k_cosface_r100_fp16_0.1 downloaded here on the glint_asia dataset.
My config is:
from easydict import EasyDict as edict  # missing import added for completeness

config = edict()
config.loss = "cosface"
config.network = "r100"
config.resume = True
config.output = "insightface_outputs/glintasia_cosface_r100_fp16_1.0/"
config.embedding_size = 512
config.sample_rate = 1.0
config.fp16 = True
config.momentum = 0.9
config.weight_decay = 5e-4
config.batch_size = 128
# config.lr = 0.1 # batch size is 512
config.lr = 0.01 # batch size is 128
config.dataset = "glintasia"
config.rec = "faces/faces_glintasia/"
config.num_classes = 93979
config.num_image = 2830146
config.warmup_epoch = -1
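As a side note on the lr comments above: they hint at the common linear scaling heuristic (learning rate proportional to total batch size). A tiny illustrative sketch of that rule, not taken from the repo:

```python
# Hypothetical helper for the linear LR scaling heuristic implied by the
# comments above (base lr 0.1 at total batch size 512).
def scaled_lr(base_lr: float, base_batch: int, batch: int) -> float:
    return base_lr * batch / base_batch

print(scaled_lr(0.1, 512, 128))  # 0.025; the config above uses an even smaller 0.01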
However, after 1 epoch the loss starts to diverge and keeps growing. I did not modify anything else.
Have you ever encountered this problem? Do you know why it happens?
Thanks!
There may be an error in train_one_gpu.py; please refer to https://github.com/eeric/Face_recognition_cnn/issues/4#issuecomment-997318897, or use the Paddle framework: https://github.com/deepinsight/insightface/tree/master/recognition/arcface_paddle
There you can train with a single GPU.
@eeric So you mean there may be errors in train_one_gpu.py? Do you have any idea where they are?
Thanks!
Yes. Please use another open project that supports a single GPU: https://github.com/fdbtrs/ElasticFace, or another framework (PaddlePaddle): https://github.com/deepinsight/insightface/tree/master/recognition/arcface_paddle
@taosean Hello!
Regarding train_one_gpu.py, I had the same problem during training. While trying to understand it, I found that at line 116 of train_one_gpu.py, x_grad comes back as an all-zero matrix, so the network cannot learn.
I did the following:
In partial_fc.py (see line 219), I changed line 222 to x_grad = total_features.grad.
It's working fine now. Hope this answer helps you! :)
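For anyone wondering why reading total_features.grad works: once retain_grad() has been called on an intermediate tensor (or the tensor is a leaf), backward() fills its .grad field, which can then be copied out manually. A toy PyTorch sketch of that mechanism, not the actual partial_fc.py code:

```python
import torch

# Toy demonstration of harvesting gradients from an intermediate tensor,
# the same mechanism behind the `x_grad = total_features.grad` fix above.
features = torch.randn(4, 512, requires_grad=True)
total_features = features * 2.0   # stands in for the gathered features
total_features.retain_grad()      # keep .grad on this non-leaf tensor

loss = (total_features ** 2).mean()
loss.backward()

x_grad = total_features.grad      # populated by backward()
print(x_grad.abs().sum())         # > 0, so the network can learn
```

If .grad came back all zeros instead, the gradient would not be reaching that tensor, which matches the symptom described at line 116.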
I am training a MobileFaceNet with the CASIA-WebFace dataset downloaded from the insightface Dataset Zoo. However, I have trouble getting it to converge: the loss stops going down after a few steps. Below are my config settings. I have also tried training with cosface, setting batch size = 128, lr = 0.001, and s = 32 in the arcface/cosface loss. Am I missing something? Could you give me some advice or share your settings? I would greatly appreciate it if you could help me out.
from easydict import EasyDict as edict  # missing import added for completeness

config = edict()
config.loss = "arcface"
config.network = "mbf"
config.resume = False
config.output = None
config.embedding_size = 512
config.sample_rate = 1.0
config.fp16 = True
config.momentum = 0.9
config.weight_decay = 2e-4
config.batch_size = 64
config.lr = 0.01  # batch size is 512
config.rec = "./train_tmp/faces_webface_112x112"
config.num_classes = 10572
config.num_image = "forget"
config.num_epoch = 34
config.warmup_epoch = -1
config.decay_epoch = [8, 12, 15, 18]
config.val_targets = ["lfw", "cfp_fp", "agedb_30"]
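In case it helps with debugging the convergence issue, here is a small sketch of the step decay implied by decay_epoch. It assumes the common convention of multiplying the learning rate by 0.1 at each listed epoch; the repo's actual schedule may differ.

```python
# Hypothetical step-decay helper matching decay_epoch = [8, 12, 15, 18];
# assumes the lr is multiplied by 0.1 at each decay epoch (an assumption).
def lr_at_epoch(base_lr: float, epoch: int,
                decay_epochs=(8, 12, 15, 18), gamma: float = 0.1) -> float:
    drops = sum(1 for e in decay_epochs if epoch >= e)
    return base_lr * (gamma ** drops)

print(lr_at_epoch(0.01, 0))   # 0.01
print(lr_at_epoch(0.01, 10))  # 0.001
print(lr_at_epoch(0.01, 20))  # 1e-06
```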