shreyanshdas00 opened this issue 2 years ago
I also trained vit_t and had the loss stuck at 22 for quite a few epochs. I stopped the training and restarted it with SGD to see if anything changed.
@anxiangsir I have tried with the exact same config (apart from batch_size and lr) and it still does not seem to converge.
Hi shreyanshdas00, we will update PFC with gradient accumulation
to handle ViT models that require very large batch sizes. For example, accumulating gradients for 16 iterations approximates a batch size of 24K.
Hi jacqueline-weng, I will train ViT-T tonight with the latest code to check whether it can be reproduced.
Training: 2022-07-04 19:49:59,615-: margin_list [1.0, 0.0, 0.4]
Training: 2022-07-04 19:49:59,616-: network vit_t_dp005_mask0
Training: 2022-07-04 19:49:59,616-: resume False
Training: 2022-07-04 19:49:59,616-: save_all_states False
Training: 2022-07-04 19:49:59,616-: output work_dirs/wf42m_pfc03_40epoch_8gpu_vit_t
Training: 2022-07-04 19:49:59,616-: embedding_size 512
Training: 2022-07-04 19:49:59,616-: sample_rate 0.3
Training: 2022-07-04 19:49:59,616-: interclass_filtering_threshold 0
Training: 2022-07-04 19:49:59,616-: fp16 True
Training: 2022-07-04 19:49:59,616-: batch_size 512
Training: 2022-07-04 19:49:59,616-: optimizer adamw
Training: 2022-07-04 19:49:59,616-: lr 0.001
Training: 2022-07-04 19:49:59,616-: momentum 0.9
Training: 2022-07-04 19:49:59,616-: weight_decay 0.1
Training: 2022-07-04 19:49:59,616-: verbose 2000
Training: 2022-07-04 19:49:59,616-: frequent 10
Training: 2022-07-04 19:49:59,617-: dali True
Training: 2022-07-04 19:49:59,617-: seed 2048
Training: 2022-07-04 19:49:59,617-: num_workers 2
Training: 2022-07-04 19:49:59,617-: rec /train_tmp/WebFace42M
Training: 2022-07-04 19:49:59,617-: num_classes 2059906
Training: 2022-07-04 19:49:59,617-: num_image 42474557
Training: 2022-07-04 19:49:59,617-: num_epoch 40
Training: 2022-07-04 19:49:59,617-: warmup_epoch 4
Training: 2022-07-04 19:49:59,617-: val_targets []
Training: 2022-07-04 19:49:59,617-: total_batch_size 4096
Training: 2022-07-04 19:49:59,617-: warmup_step 41476
Training: 2022-07-04 19:49:59,617-: total_step 414760
Training: 2022-07-04 19:50:02,872-Reducer buckets have been rebuilt in this iteration.
Training: 2022-07-04 19:50:11,077-Speed 8975.71 samples/sec Loss 42.8994 LearningRate 0.000000 Epoch: 0 Global Step: 20 Fp16 Grad Scale: 65536 Required: 60 hours
Training: 2022-07-04 19:50:15,617-Speed 9025.80 samples/sec Loss 42.8919 LearningRate 0.000001 Epoch: 0 Global Step: 30 Fp16 Grad Scale: 65536 Required: 56 hours
Training: 2022-07-04 19:50:20,164-Speed 9008.30 samples/sec Loss 42.8912 LearningRate 0.000001 Epoch: 0 Global Step: 40 Fp16 Grad Scale: 65536 Required: 56 hours
Training: 2022-07-04 19:50:24,703-Speed 9027.74 samples/sec Loss 42.8725 LearningRate 0.000001 Epoch: 0 Global Step: 50 Fp16 Grad Scale: 65536 Required: 56 hours
Training: 2022-07-04 19:50:29,265-Speed 8980.16 samples/sec Loss 42.8666 LearningRate 0.000001 Epoch: 0 Global Step: 60 Fp16 Grad Scale: 65536 Required: 55 hours
Training: 2022-07-04 19:50:33,809-Speed 9018.63 samples/sec Loss 42.8707 LearningRate 0.000002 Epoch: 0 Global Step: 70 Fp16 Grad Scale: 65536 Required: 55 hours
Training: 2022-07-04 19:50:38,347-Speed 9026.76 samples/sec Loss 42.8631 LearningRate 0.000002 Epoch: 0 Global Step: 80 Fp16 Grad Scale: 65536 Required: 54 hours
Training: 2022-07-04 19:50:42,909-Speed 8980.33 samples/sec Loss 42.7995 LearningRate 0.000002 Epoch: 0 Global Step: 90 Fp16 Grad Scale: 65536 Required: 54 hours
Training: 2022-07-04 19:50:47,477-Speed 8969.62 samples/sec Loss 42.7913 LearningRate 0.000002 Epoch: 0 Global Step: 100 Fp16 Grad Scale: 131072 Required: 54 hours
Training: 2022-07-04 19:50:52,046-Speed 8966.48 samples/sec Loss 42.7773 LearningRate 0.000003 Epoch: 0 Global Step: 110 Fp16 Grad Scale: 131072 Required: 54 hours
Training: 2022-07-04 19:50:56,611-Speed 8973.38 samples/sec Loss 42.7429 LearningRate 0.000003 Epoch: 0 Global Step: 120 Fp16 Grad Scale: 131072 Required: 53 hours
Training: 2022-07-04 19:51:01,163-Speed 9001.72 samples/sec Loss 42.6983 LearningRate 0.000003 Epoch: 0 Global Step: 130 Fp16 Grad Scale: 131072 Required: 54 hours
Training: 2022-07-04 19:51:05,709-Speed 9011.11 samples/sec Loss 42.6910 LearningRate 0.000003 Epoch: 0 Global Step: 140 Fp16 Grad Scale: 131072 Required: 54 hours
Training: 2022-07-04 19:51:10,250-Speed 9023.66 samples/sec Loss 42.6167 LearningRate 0.000004 Epoch: 0 Global Step: 150 Fp16 Grad Scale: 131072 Required: 53 hours
Training: 2022-07-04 19:51:14,790-Speed 9025.02 samples/sec Loss 42.5821 LearningRate 0.000004 Epoch: 0 Global Step: 160 Fp16 Grad Scale: 131072 Required: 54 hours
Training: 2022-07-04 19:51:19,324-Speed 9035.34 samples/sec Loss 42.5213 LearningRate 0.000004 Epoch: 0 Global Step: 170 Fp16 Grad Scale: 131072 Required: 53 hours
Training: 2022-07-04 19:51:23,878-Speed 8997.40 samples/sec Loss 42.4732 LearningRate 0.000004 Epoch: 0 Global Step: 180 Fp16 Grad Scale: 131072 Required: 53 hours
Training: 2022-07-04 19:51:28,417-Speed 9024.36 samples/sec Loss 42.4071 LearningRate 0.000005 Epoch: 0 Global Step: 190 Fp16 Grad Scale: 131072 Required: 53 hours
Training: 2022-07-04 19:51:32,965-Speed 9008.65 samples/sec Loss 42.3038 LearningRate 0.000005 Epoch: 0 Global Step: 200 Fp16 Grad Scale: 262144 Required: 53 hours
Training: 2022-07-04 19:51:37,510-Speed 9014.58 samples/sec Loss 42.2272 LearningRate 0.000005 Epoch: 0 Global Step: 210 Fp16 Grad Scale: 262144 Required: 53 hours
Training: 2022-07-04 19:51:42,061-Speed 9001.34 samples/sec Loss 42.1281 LearningRate 0.000005 Epoch: 0 Global Step: 220 Fp16 Grad Scale: 262144 Required: 53 hours
Training: 2022-07-04 19:51:46,620-Speed 8987.33 samples/sec Loss 42.0046 LearningRate 0.000006 Epoch: 0 Global Step: 230 Fp16 Grad Scale: 262144 Required: 53 hours
This is my server config: 8 * 32GB V100.
Hi jacqueline-weng, this is my result:
Training: 2022-07-05 08:29:37,789-Speed 9036.35 samples/sec Loss 6.3236 LearningRate 0.000717 Epoch: 9 Global Step: 98720 Fp16 Grad Scale: 32768 Required: 41 hours
Training: 2022-07-05 08:29:42,322-Speed 9038.57 samples/sec Loss 6.2150 LearningRate 0.000717 Epoch: 9 Global Step: 98730 Fp16 Grad Scale: 32768 Required: 41 hours
Training: 2022-07-05 08:29:46,851-Speed 9045.42 samples/sec Loss 6.2928 LearningRate 0.000717 Epoch: 9 Global Step: 98740 Fp16 Grad Scale: 32768 Required: 41 hours
Training: 2022-07-05 08:29:51,379-Speed 9049.36 samples/sec Loss 6.3178 LearningRate 0.000717 Epoch: 9 Global Step: 98750 Fp16 Grad Scale: 32768 Required: 41 hours
Training: 2022-07-05 08:29:55,904-Speed 9054.48 samples/sec Loss 6.2231 LearningRate 0.000717 Epoch: 9 Global Step: 98760 Fp16 Grad Scale: 32768 Required: 41 hours
Training: 2022-07-05 08:30:00,433-Speed 9047.04 samples/sec Loss 6.2004 LearningRate 0.000717 Epoch: 9 Global Step: 98770 Fp16 Grad Scale: 32768 Required: 41 hours
Training: 2022-07-05 08:30:04,969-Speed 9031.39 samples/sec Loss 6.2435 LearningRate 0.000717 Epoch: 9 Global Step: 98780 Fp16 Grad Scale: 65536 Required: 41 hours
Training: 2022-07-05 08:30:09,478-Speed 9087.44 samples/sec Loss 6.2260 LearningRate 0.000716 Epoch: 9 Global Step: 98790 Fp16 Grad Scale: 32768 Required: 41 hours
Training: 2022-07-05 08:30:14,009-Speed 9041.70 samples/sec Loss 6.2667 LearningRate 0.000716 Epoch: 9 Global Step: 98800 Fp16 Grad Scale: 32768 Required: 41 hours
Training: 2022-07-05 08:30:18,544-Speed 9032.99 samples/sec Loss 6.2365 LearningRate 0.000716 Epoch: 9 Global Step: 98810 Fp16 Grad Scale: 32768 Required: 41 hours
Training: 2022-07-05 08:30:23,069-Speed 9054.91 samples/sec Loss 6.1696 LearningRate 0.000716 Epoch: 9 Global Step: 98820 Fp16 Grad Scale: 32768 Required: 41 hours
Training: 2022-07-05 08:30:27,599-Speed 9044.99 samples/sec Loss 6.2435 LearningRate 0.000716 Epoch: 9 Global Step: 98830 Fp16 Grad Scale: 32768 Required: 41 hours
Training: 2022-07-05 08:30:32,143-Speed 9016.28 samples/sec Loss 6.3440 LearningRate 0.000716 Epoch: 9 Global Step: 98840 Fp16 Grad Scale: 32768 Required: 41 hours
Training: 2022-07-05 08:30:36,677-Speed 9035.36 samples/sec Loss 6.2971 LearningRate 0.000716 Epoch: 9 Global Step: 98850 Fp16 Grad Scale: 32768 Required: 41 hours
Training: 2022-07-05 08:30:41,205-Speed 9050.01 samples/sec Loss 6.2351 LearningRate 0.000716 Epoch: 9 Global Step: 98860 Fp16 Grad Scale: 32768 Required: 41 hours
Training: 2022-07-05 08:30:45,744-Speed 9025.46 samples/sec Loss 6.2534 LearningRate 0.000716 Epoch: 9 Global Step: 98870 Fp16 Grad Scale: 32768 Required: 41 hours
Training: 2022-07-05 08:30:50,278-Speed 9035.56 samples/sec Loss 6.2644 LearningRate 0.000716 Epoch: 9 Global Step: 98880 Fp16 Grad Scale: 32768 Required: 41 hours
Training: 2022-07-05 08:30:54,831-Speed 8999.76 samples/sec Loss 6.2196 LearningRate 0.000716 Epoch: 9 Global Step: 98890 Fp16 Grad Scale: 65536 Required: 41 hours
Training: 2022-07-05 08:30:59,455-Speed 8858.93 samples/sec Loss 6.2450 LearningRate 0.000716 Epoch: 9 Global Step: 98900 Fp16 Grad Scale: 65536 Required: 41 hours
Training: 2022-07-05 08:31:03,987-Speed 9040.14 samples/sec Loss 6.2852 LearningRate 0.000716 Epoch: 9 Global Step: 98910 Fp16 Grad Scale: 65536 Required: 41 hours
Training: 2022-07-05 08:31:08,519-Speed 9040.10 samples/sec Loss 6.2288 LearningRate 0.000716 Epoch: 9 Global Step: 98920 Fp16 Grad Scale: 65536 Required: 41 hours
Training: 2022-07-05 08:31:13,056-Speed 9031.09 samples/sec Loss 6.3165 LearningRate 0.000716 Epoch: 9 Global Step: 98930 Fp16 Grad Scale: 65536 Required: 41 hours
Training: 2022-07-05 08:31:17,586-Speed 9044.34 samples/sec Loss 6.2852 LearningRate 0.000716 Epoch: 9 Global Step: 98940 Fp16 Grad Scale: 65536 Required: 41 hours
Training: 2022-07-05 08:31:22,101-Speed 9073.97 samples/sec Loss 6.2712 LearningRate 0.000716 Epoch: 9 Global Step: 98950 Fp16 Grad Scale: 32768 Required: 40 hours
Training: 2022-07-05 08:31:26,625-Speed 9057.15 samples/sec Loss 6.2816 LearningRate 0.000716 Epoch: 9 Global Step: 98960 Fp16 Grad Scale: 32768 Required: 40 hours
Training: 2022-07-05 08:31:31,154-Speed 9046.29 samples/sec Loss 6.2800 LearningRate 0.000716 Epoch: 9 Global Step: 98970 Fp16 Grad Scale: 32768 Required: 40 hours
Training: 2022-07-05 08:31:35,684-Speed 9043.57 samples/sec Loss 6.2846 LearningRate 0.000716 Epoch: 9 Global Step: 98980 Fp16 Grad Scale: 32768 Required: 40 hours
Hi shreyanshdas00, I think you should keep your learning rate at 0.001 and increase your batch size when the optimizer is AdamW.
Hi jacqueline-weng, have you tried this?
Thank you for the reply and for showing me the vit_t training result. Things were a bit more complicated in my case.
I failed to run the newest version of the training code for ViT. The machine would reboot itself before it finished loading the whole dataset (WebFace42M). I suspect it is something in my environment configuration, but I did not have many chances to figure it out. Therefore, I directly transferred and embedded your ViT backbone code and Partial FC code for AdamW.
I have a machine with 8 T4 GPUs. I could only set the batch size to 128 while decreasing the learning rate. I'm not sure whether, for AdamW, the initial learning rate should also be scaled with the batch size. Is it possible to replicate the result with a smaller batch size?
Any advice for my case is really appreciated.
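Regarding the learning-rate question above: a common heuristic (an illustration only, not something this repo documents, and whether it applies to AdamW at all is exactly the open question here) is to scale the learning rate linearly with the total batch size relative to a reference run.

```python
# Linear LR-scaling sketch using numbers from this thread (assumption/illustration).
reference_lr = 0.001             # LR of the 8-GPU reference run above
reference_total_batch = 8 * 512  # total_batch_size 4096 in the log above
my_total_batch = 8 * 128         # 8 T4 GPUs with batch_size 128 each
scaled_lr = reference_lr * my_total_batch / reference_total_batch
print(scaled_lr)                 # 0.00025
```

Note that the maintainer's advice earlier in this thread is to keep the AdamW learning rate at 0.001 and grow the effective batch size instead (e.g. via gradient accumulation).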
Hi @anxiangsir, with our batch size (1500x4), using AdamW with an LR of 0.001 results in NaNs in the loss. We are using 4 80GB A100s and it is not possible to increase the batch size further. How can we train ViT architectures with a smaller batch size than yours?
Hi @anxiangsir, I noticed that when training with your framework (recognition/arcface_torch) the loss jumps after each epoch. What do you think could cause this? Maybe the DistributedSampler (utils.utils_distributed_sampler) is not working correctly?
Hi @anxiangsir, I observed that the ViT architectures converge when we use a CosFace margin (of 0.4), but do not converge (they stagnate at a loss of about 20) when ArcFace (with a margin of 0.5) is used. Do you have any insights on why this happens?
Hi jacqueline-weng, shreyanshdas00,
you can try this:
We use gradient accumulation to increase the batch size. Gradient accumulation is a mechanism that splits the batch of samples used for training a neural network into several mini-batches that are run sequentially. Accumulating the gradients over all of these steps results in the same sum of gradients as if we were using the global batch size.
You can increase config.gradient_acc
to increase the total batch size:
total_batchsize = config.batch_size * config.gradient_acc * worldsize
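A minimal sketch of how such an accumulation loop could look (an illustration under assumptions: train_loader, backbone, module_partial_fc, opt, amp and cfg mirror this repo's naming, but this is not its exact training code):

```python
import torch

# Assumed to exist already: train_loader, backbone, module_partial_fc,
# opt (optimizer), amp (torch.cuda.amp.GradScaler), cfg (config).
for step, (img, labels) in enumerate(train_loader):
    embeddings = backbone(img)
    loss = module_partial_fc(embeddings, labels)
    # Gradients from each mini-batch accumulate in .grad until the update below.
    amp.scale(loss).backward()
    if (step + 1) % cfg.gradient_acc == 0:
        amp.unscale_(opt)
        torch.nn.utils.clip_grad_norm_(backbone.parameters(), 5)
        amp.step(opt)   # one optimizer update per gradient_acc mini-batches
        amp.update()
        opt.zero_grad()
```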
This is my result:
training log: https://raw.githubusercontent.com/anxiangsir/insightface_arcface_log/master/pfc03_wf42m_vit_b_8gpu/training.log
IJBC result:
+---------------+-------+-------+--------+-------+-------+-------+
| Methods | 1e-06 | 1e-05 | 0.0001 | 0.001 | 0.01 | 0.1 |
+---------------+-------+-------+--------+-------+-------+-------+
| IJBC.npy-IJBC | 92.96 | 97.05 | 97.91 | 98.47 | 98.86 | 99.34 |
+---------------+-------+-------+--------+-------+-------+-------+
Thanks @anxiangsir, I will try this out. Can you also share any insights you have on why ViT architectures do not converge when the ArcFace loss is used? They seem to work fine when a CosFace margin is used (as you have done in your experiments) but stagnate around a loss of 20 when ArcFace is used, even though the two loss functions are quite similar. Why could this be happening?
Despite the loss not decreasing, does accuracy remain competitive on validation sets? I don't use this repo, but I have a similar issue with a parallel model where the loss stays around 20 yet the model converges correctly and achieves >98% on IJB-C.
Thanks @anxiangsir, I'm trying gradient accumulation. An error occurs at the first backward pass saying some parameters in the module are marked as ready to reduce twice. After changing the module distribution code, it works:
backbone = torch.nn.parallel.DistributedDataParallel(
    module=backbone, broadcast_buffers=False,
    device_ids=[args.local_rank], find_unused_parameters=False)
(The original 'find_unused_parameters' is set to True.)
From my experience, ArcFace is often more difficult to get to converge than CosFace, no matter what the backbone is. ArcFace starts with a high loss (which probably depends on the margin setting) and remains higher during training. By nature, angular margins make the logits more sensitive and precipitous than additive margins, and thus harder to converge. My understanding is that the margin hyper-parameter should be set carefully and perhaps made dynamic during training. Using different losses at different training stages may be a better choice than sticking to one.
Any ideas are welcome.
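For reference, the two margins under discussion differ only in where the penalty is applied; here is a small sketch of the standard formulations (not this repo's exact implementation):

```python
import math

def cosface_logit(cos_theta: float, m: float = 0.4) -> float:
    # Additive cosine margin: shift the cosine value directly.
    return cos_theta - m

def arcface_logit(cos_theta: float, m: float = 0.5) -> float:
    # Additive angular margin: shift the angle, then take the cosine,
    # so the penalty grows non-linearly as the angle increases.
    return math.cos(math.acos(cos_theta) + m)
```

Because the ArcFace penalty acts on the angle, the target logit falls off much more steeply for hard (large-angle) samples, which is consistent with the sensitivity described above.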
@jacqueline-weng how about CurricularFace or AdaFace?
Hi everyone,
@anxiangsir did you try using accumulated gradients with a larger batch size for vit_b? Were you able to obtain results similar to the ones here?
If not, what is your gut feeling about it?
Best, /M
Hi everyone, we also had the same problem, but we have solved it. Hope this helps you.
If you use ArcFace + AdamW, firstly your learning rate must be 1e-3 or below, and secondly your margin must not be too large.
The result of my tests is as follows: we observed that the author uses CosFace with margin=0.4, but when ArcFace and AdamW are used together, the initial learning is very difficult, which leads to NaN values and no convergence. When we reduce the margin of ArcFace, it converges normally.
CosFace seems to converge more easily, while AdamW seems to compute larger and more sensitive gradients than SGD. If you use AdamW, you may need to adjust your margin and learning rate.
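As a concrete, hypothetical example of that adjustment, assuming the repo's (m1, m2, m3) margin_list convention in which ArcFace uses m2 and CosFace uses m3 (the CosFace runs in the logs above use [1.0, 0.0, 0.4]):

```python
# Hypothetical config tweak for ArcFace + AdamW (illustration only; the
# reduced margin value 0.3 is an example, not a tuned recommendation).
config.optimizer = "adamw"
config.lr = 1e-3                       # keep the LR at or below 1e-3
config.margin_list = (1.0, 0.3, 0.0)   # smaller ArcFace angular margin (default 0.5)
```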
Hi @anxiangsir, I followed the same configuration as yours (from this issue and https://raw.githubusercontent.com/anxiangsir/insightface_arcface_log/master/wf42m_pfc02_40epoch_8gpu_vit_t/training.log) and trained on A100 x 4 GPUs and 3090 x 8 GPUs; the total batch size is always 2048. However, the loss does not drop as in your result.
NCCL version 2.10.3+cuda11.1
Training: 2023-05-25 16:21:01,319-: margin_list [1.0, 0.0, 0.4]
Training: 2023-05-25 16:21:01,326-: network vit_t_dp005_mask0
Training: 2023-05-25 16:21:01,326-: resume False
Training: 2023-05-25 16:21:01,326-: resume_checkpoint None
Training: 2023-05-25 16:21:01,326-: load_pretrained None
Training: 2023-05-25 16:21:01,326-: save_all_states True
Training: 2023-05-25 16:21:01,326-: output /model_dir
Training: 2023-05-25 16:21:01,326-: embedding_size 512
Training: 2023-05-25 16:21:01,326-: sample_rate 0.2
Training: 2023-05-25 16:21:01,326-: interclass_filtering_threshold 0
Training: 2023-05-25 16:21:01,326-: fp16 True
Training: 2023-05-25 16:21:01,326-: batch_size 512
Training: 2023-05-25 16:21:01,326-: optimizer adamw
Training: 2023-05-25 16:21:01,326-: lr 0.001
Training: 2023-05-25 16:21:01,326-: momentum 0.9
Training: 2023-05-25 16:21:01,326-: weight_decay 0.1
Training: 2023-05-25 16:21:01,326-: verbose 2000
Training: 2023-05-25 16:21:01,326-: frequent 10
Training: 2023-05-25 16:21:01,326-: dali True
Training: 2023-05-25 16:21:01,326-: dali_aug False
Training: 2023-05-25 16:21:01,326-: gradient_acc 1
Training: 2023-05-25 16:21:01,326-: seed 2048
Training: 2023-05-25 16:21:01,326-: num_workers 2
Training: 2023-05-25 16:21:01,326-: wandb_key XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Training: 2023-05-25 16:21:01,326-: suffix_run_name None
Training: 2023-05-25 16:21:01,327-: using_wandb False
Training: 2023-05-25 16:21:01,327-: wandb_entity entity
Training: 2023-05-25 16:21:01,327-: wandb_project project
Training: 2023-05-25 16:21:01,327-: wandb_log_all True
Training: 2023-05-25 16:21:01,327-: save_artifacts False
Training: 2023-05-25 16:21:01,327-: wandb_resume False
Training: 2023-05-25 16:21:01,327-: rec ./wf42m_mx
Training: 2023-05-25 16:21:01,327-: num_classes 2059906
Training: 2023-05-25 16:21:01,327-: num_image 42474557
Training: 2023-05-25 16:21:01,327-: num_epoch 40
Training: 2023-05-25 16:21:01,327-: warmup_epoch 4
Training: 2023-05-25 16:21:01,327-: val_targets []
Training: 2023-05-25 16:21:01,327-: total_batch_size 2048
Training: 2023-05-25 16:21:01,327-: warmup_step 82956
Training: 2023-05-25 16:21:01,327-: total_step 829560
Training: 2023-05-25 16:21:02,386-Reducer buckets have been rebuilt in this iteration.
Training: 2023-05-25 16:21:08,387-Speed 6156.32 samples/sec Loss 42.4324 LearningRate 0.000000 Epoch: 0 Global Step: 20 Fp16 Grad Scale: 65536 Required: 77 hours
Training: 2023-05-25 16:21:11,717-Speed 6154.12 samples/sec Loss 42.4117 LearningRate 0.000000 Epoch: 0 Global Step: 30 Fp16 Grad Scale: 65536 Required: 74 hours
Training: 2023-05-25 16:21:15,039-Speed 6165.80 samples/sec Loss 42.4171 LearningRate 0.000000 Epoch: 0 Global Step: 40 Fp16 Grad Scale: 65536 Required: 73 hours
Training: 2023-05-25 16:21:18,361-Speed 6167.04 samples/sec Loss 42.4250 LearningRate 0.000001 Epoch: 0 Global Step: 50 Fp16 Grad Scale: 65536 Required: 77 hours
Training: 2023-05-25 16:21:21,686-Speed 6161.36 samples/sec Loss 42.4015 LearningRate 0.000001 Epoch: 0 Global Step: 60 Fp16 Grad Scale: 65536 Required: 76 hours
Epoch 9 (loss is about 6 in your training log):
Training: 2023-05-26 09:40:40,291-Speed 6146.81 samples/sec Loss 10.5731 LearningRate 0.000861 Epoch: 8 Global Step: 186570 Fp16 Grad Scale: 32768 Required: 60 hours
Training: 2023-05-26 09:40:43,623-Speed 6148.40 samples/sec Loss 10.3575 LearningRate 0.000861 Epoch: 8 Global Step: 186580 Fp16 Grad Scale: 32768 Required: 60 hours
Training: 2023-05-26 09:40:46,953-Speed 6150.55 samples/sec Loss 10.4854 LearningRate 0.000861 Epoch: 8 Global Step: 186590 Fp16 Grad Scale: 32768 Required: 60 hours
Training: 2023-05-26 09:40:50,284-Speed 6148.70 samples/sec Loss 10.4781 LearningRate 0.000861 Epoch: 8 Global Step: 186600 Fp16 Grad Scale: 32768 Required: 60 hours
Training: 2023-05-26 09:40:53,615-Speed 6149.34 samples/sec Loss 10.5193 LearningRate 0.000861 Epoch: 8 Global Step: 186610 Fp16 Grad Scale: 32768 Required: 60 hours
Training: 2023-05-26 09:40:56,947-Speed 6147.19 samples/sec Loss 10.4351 LearningRate 0.000861 Epoch: 8 Global Step: 186620 Fp16 Grad Scale: 32768 Required: 60 hours
Training: 2023-05-26 09:41:00,282-Speed 6142.53 samples/sec Loss 10.5754 LearningRate 0.000861 Epoch: 8 Global Step: 186630 Fp16 Grad Scale: 32768 Required: 60 hours
Training: 2023-05-26 09:41:03,612-Speed 6151.46 samples/sec Loss 10.4418 LearningRate 0.000861 Epoch: 8 Global Step: 186640 Fp16 Grad Scale: 65536 Required: 60 hours
Training: 2023-05-26 09:41:06,945-Speed 6144.56 samples/sec Loss 10.4398 LearningRate 0.000861 Epoch: 8 Global Step: 186650 Fp16 Grad Scale: 65536 Required: 60 hours
Training: 2023-05-26 09:41:10,778-Speed 5344.15 samples/sec Loss 10.4727 LearningRate 0.000861 Epoch: 9 Global Step: 186660 Fp16 Grad Scale: 32768 Required: 60 hours
Training: 2023-05-26 09:41:14,113-Speed 6142.45 samples/sec Loss 10.3351 LearningRate 0.000861 Epoch: 9 Global Step: 186670 Fp16 Grad Scale: 32768 Required: 60 hours
Training: 2023-05-26 09:41:17,444-Speed 6149.21 samples/sec Loss 10.5266 LearningRate 0.000861 Epoch: 9 Global Step: 186680 Fp16 Grad Scale: 32768 Required: 60 hours
Training: 2023-05-26 09:41:20,776-Speed 6146.86 samples/sec Loss 10.4695 LearningRate 0.000861 Epoch: 9 Global Step: 186690 Fp16 Grad Scale: 32768 Required: 60 hours
Training: 2023-05-26 09:41:24,107-Speed 6149.43 samples/sec Loss 10.4526 LearningRate 0.000861 Epoch: 9 Global Step: 186700 Fp16 Grad Scale: 32768 Required: 60 hours
Training: 2023-05-26 09:41:27,440-Speed 6146.48 samples/sec Loss 10.4886 LearningRate 0.000861 Epoch: 9 Global Step: 186710 Fp16 Grad Scale: 32768 Required: 60 hours
Final (40th) epoch:
Training: 2023-05-28 21:18:16,789-Speed 6147.49 samples/sec Loss 3.3835 LearningRate 0.000000 Epoch: 39 Global Step: 829460 Fp16 Grad Scale: 8192 Required: 0 hours
Training: 2023-05-28 21:18:20,121-Speed 6146.14 samples/sec Loss 3.4250 LearningRate 0.000000 Epoch: 39 Global Step: 829470 Fp16 Grad Scale: 8192 Required: 0 hours
Training: 2023-05-28 21:18:23,456-Speed 6142.79 samples/sec Loss 3.4518 LearningRate 0.000000 Epoch: 39 Global Step: 829480 Fp16 Grad Scale: 16384 Required: 0 hours
Training: 2023-05-28 21:18:26,764-Speed 6192.20 samples/sec Loss 3.4075 LearningRate 0.000000 Epoch: 39 Global Step: 829490 Fp16 Grad Scale: 8192 Required: 0 hours
Training: 2023-05-28 21:18:30,096-Speed 6146.77 samples/sec Loss 3.3720 LearningRate 0.000000 Epoch: 39 Global Step: 829500 Fp16 Grad Scale: 8192 Required: 0 hours
Training: 2023-05-28 21:18:33,426-Speed 6151.32 samples/sec Loss 3.4363 LearningRate 0.000000 Epoch: 39 Global Step: 829510 Fp16 Grad Scale: 8192 Required: 0 hours
Training: 2023-05-28 21:18:36,756-Speed 6151.00 samples/sec Loss 3.4455 LearningRate 0.000000 Epoch: 39 Global Step: 829520 Fp16 Grad Scale: 8192 Required: 0 hours
Training: 2023-05-28 21:18:40,087-Speed 6148.99 samples/sec Loss 3.3558 LearningRate 0.000000 Epoch: 39 Global Step: 829530 Fp16 Grad Scale: 8192 Required: 0 hours
Training: 2023-05-28 21:18:43,420-Speed 6145.44 samples/sec Loss 3.4064 LearningRate 0.000000 Epoch: 39 Global Step: 829540 Fp16 Grad Scale: 8192 Required: 0 hours
Training: 2023-05-28 21:18:46,750-Speed 6152.61 samples/sec Loss 3.4125 LearningRate 0.000000 Epoch: 39 Global Step: 829550 Fp16 Grad Scale: 8192 Required: 0 hours
Training: 2023-05-28 21:18:50,078-Speed 6154.23 samples/sec Loss 3.3826 LearningRate 0.000000 Epoch: 39 Global Step: 829560 Fp16 Grad Scale: 8192 Required: -0 hours
Training: 2023-05-28 21:18:53,411-Speed 6145.60 samples/sec Loss 3.3920 LearningRate 0.000000 Epoch: 39 Global Step: 829570 Fp16 Grad Scale: 8192 Required: -0 hours
Training: 2023-05-28 21:18:56,740-Speed 6152.61 samples/sec Loss 3.4548 LearningRate 0.000000 Epoch: 39 Global Step: 829580 Fp16 Grad Scale: 8192 Required: -0 hours
In your training log, the final loss is about 2.5. Is there any trick I missed? Did you use dali_aug?
I wonder if we need to average the accumulated gradients, e.g. by adding "loss = loss / cfg.gradient_acc" to the code below:
'''
loss: torch.Tensor = module_partial_fc(local_embeddings, local_labels)
if cfg.fp16:
    amp.scale(loss).backward()
    if global_step % cfg.gradient_acc == 0:
        amp.unscale_(opt)
        torch.nn.utils.clip_grad_norm_(backbone.parameters(), 5)
        amp.step(opt)
        amp.update()
        opt.zero_grad()
'''
I tried to replicate the results for ViT Base model trained on WebFace42M but the model does not seem to converge. The loss starts at 53 and stagnates at about 22 after a few epochs of training. I have used the exact same config, with the max. learning rate scaled according to my batch size. I am using 4 GPUs for training and the config variables are as below:
config.network = "vit_b"
config.embedding_size = 256

# Partial FC
config.sample_rate = 1
config.fp16 = True
config.batch_size = 1500

# For AdamW
config.optimizer = "adamw"
config.lr = 0.00025
config.weight_decay = 0.1

config.verbose = 1415
config.dali = False

config.rec = "/media/data/Webface42M_rec"
config.num_classes = 2059906
config.num_epoch = 40
config.warmup_epoch = config.num_epoch // 10
Do you have any insights on why this could be happening?
Any help would be highly appreciated @anxiangsir