shreyanshdas00 opened this issue 2 years ago
I also trained vit_t and had the loss stuck at 22 for quite a few epochs. I stopped the training and restarted it with SGD to see if anything changed.
@anxiangsir I have tried with the exact same config (apart from batch_size and lr) and it still does not seem to converge.
Hi shreyanshdas00, we will update PFC with gradient accumulation
to handle ViT models that require very large batch sizes. For example, accumulating gradients for 16 iterations approximates a batch size of 24K.
Hi jacqueline-weng, I will train ViT-T tonight with the latest code to check whether it can be reproduced.
Training: 2022-07-04 19:49:59,615-: margin_list [1.0, 0.0, 0.4]
Training: 2022-07-04 19:49:59,616-: network vit_t_dp005_mask0
Training: 2022-07-04 19:49:59,616-: resume False
Training: 2022-07-04 19:49:59,616-: save_all_states False
Training: 2022-07-04 19:49:59,616-: output work_dirs/wf42m_pfc03_40epoch_8gpu_vit_t
Training: 2022-07-04 19:49:59,616-: embedding_size 512
Training: 2022-07-04 19:49:59,616-: sample_rate 0.3
Training: 2022-07-04 19:49:59,616-: interclass_filtering_threshold 0
Training: 2022-07-04 19:49:59,616-: fp16 True
Training: 2022-07-04 19:49:59,616-: batch_size 512
Training: 2022-07-04 19:49:59,616-: optimizer adamw
Training: 2022-07-04 19:49:59,616-: lr 0.001
Training: 2022-07-04 19:49:59,616-: momentum 0.9
Training: 2022-07-04 19:49:59,616-: weight_decay 0.1
Training: 2022-07-04 19:49:59,616-: verbose 2000
Training: 2022-07-04 19:49:59,616-: frequent 10
Training: 2022-07-04 19:49:59,617-: dali True
Training: 2022-07-04 19:49:59,617-: seed 2048
Training: 2022-07-04 19:49:59,617-: num_workers 2
Training: 2022-07-04 19:49:59,617-: rec /train_tmp/WebFace42M
Training: 2022-07-04 19:49:59,617-: num_classes 2059906
Training: 2022-07-04 19:49:59,617-: num_image 42474557
Training: 2022-07-04 19:49:59,617-: num_epoch 40
Training: 2022-07-04 19:49:59,617-: warmup_epoch 4
Training: 2022-07-04 19:49:59,617-: val_targets []
Training: 2022-07-04 19:49:59,617-: total_batch_size 4096
Training: 2022-07-04 19:49:59,617-: warmup_step 41476
Training: 2022-07-04 19:49:59,617-: total_step 414760
Training: 2022-07-04 19:50:02,872-Reducer buckets have been rebuilt in this iteration.
Training: 2022-07-04 19:50:11,077-Speed 8975.71 samples/sec Loss 42.8994 LearningRate 0.000000 Epoch: 0 Global Step: 20 Fp16 Grad Scale: 65536 Required: 60 hours
Training: 2022-07-04 19:50:15,617-Speed 9025.80 samples/sec Loss 42.8919 LearningRate 0.000001 Epoch: 0 Global Step: 30 Fp16 Grad Scale: 65536 Required: 56 hours
Training: 2022-07-04 19:50:20,164-Speed 9008.30 samples/sec Loss 42.8912 LearningRate 0.000001 Epoch: 0 Global Step: 40 Fp16 Grad Scale: 65536 Required: 56 hours
Training: 2022-07-04 19:50:24,703-Speed 9027.74 samples/sec Loss 42.8725 LearningRate 0.000001 Epoch: 0 Global Step: 50 Fp16 Grad Scale: 65536 Required: 56 hours
Training: 2022-07-04 19:50:29,265-Speed 8980.16 samples/sec Loss 42.8666 LearningRate 0.000001 Epoch: 0 Global Step: 60 Fp16 Grad Scale: 65536 Required: 55 hours
Training: 2022-07-04 19:50:33,809-Speed 9018.63 samples/sec Loss 42.8707 LearningRate 0.000002 Epoch: 0 Global Step: 70 Fp16 Grad Scale: 65536 Required: 55 hours
Training: 2022-07-04 19:50:38,347-Speed 9026.76 samples/sec Loss 42.8631 LearningRate 0.000002 Epoch: 0 Global Step: 80 Fp16 Grad Scale: 65536 Required: 54 hours
Training: 2022-07-04 19:50:42,909-Speed 8980.33 samples/sec Loss 42.7995 LearningRate 0.000002 Epoch: 0 Global Step: 90 Fp16 Grad Scale: 65536 Required: 54 hours
Training: 2022-07-04 19:50:47,477-Speed 8969.62 samples/sec Loss 42.7913 LearningRate 0.000002 Epoch: 0 Global Step: 100 Fp16 Grad Scale: 131072 Required: 54 hours
Training: 2022-07-04 19:50:52,046-Speed 8966.48 samples/sec Loss 42.7773 LearningRate 0.000003 Epoch: 0 Global Step: 110 Fp16 Grad Scale: 131072 Required: 54 hours
Training: 2022-07-04 19:50:56,611-Speed 8973.38 samples/sec Loss 42.7429 LearningRate 0.000003 Epoch: 0 Global Step: 120 Fp16 Grad Scale: 131072 Required: 53 hours
Training: 2022-07-04 19:51:01,163-Speed 9001.72 samples/sec Loss 42.6983 LearningRate 0.000003 Epoch: 0 Global Step: 130 Fp16 Grad Scale: 131072 Required: 54 hours
Training: 2022-07-04 19:51:05,709-Speed 9011.11 samples/sec Loss 42.6910 LearningRate 0.000003 Epoch: 0 Global Step: 140 Fp16 Grad Scale: 131072 Required: 54 hours
Training: 2022-07-04 19:51:10,250-Speed 9023.66 samples/sec Loss 42.6167 LearningRate 0.000004 Epoch: 0 Global Step: 150 Fp16 Grad Scale: 131072 Required: 53 hours
Training: 2022-07-04 19:51:14,790-Speed 9025.02 samples/sec Loss 42.5821 LearningRate 0.000004 Epoch: 0 Global Step: 160 Fp16 Grad Scale: 131072 Required: 54 hours
Training: 2022-07-04 19:51:19,324-Speed 9035.34 samples/sec Loss 42.5213 LearningRate 0.000004 Epoch: 0 Global Step: 170 Fp16 Grad Scale: 131072 Required: 53 hours
Training: 2022-07-04 19:51:23,878-Speed 8997.40 samples/sec Loss 42.4732 LearningRate 0.000004 Epoch: 0 Global Step: 180 Fp16 Grad Scale: 131072 Required: 53 hours
Training: 2022-07-04 19:51:28,417-Speed 9024.36 samples/sec Loss 42.4071 LearningRate 0.000005 Epoch: 0 Global Step: 190 Fp16 Grad Scale: 131072 Required: 53 hours
Training: 2022-07-04 19:51:32,965-Speed 9008.65 samples/sec Loss 42.3038 LearningRate 0.000005 Epoch: 0 Global Step: 200 Fp16 Grad Scale: 262144 Required: 53 hours
Training: 2022-07-04 19:51:37,510-Speed 9014.58 samples/sec Loss 42.2272 LearningRate 0.000005 Epoch: 0 Global Step: 210 Fp16 Grad Scale: 262144 Required: 53 hours
Training: 2022-07-04 19:51:42,061-Speed 9001.34 samples/sec Loss 42.1281 LearningRate 0.000005 Epoch: 0 Global Step: 220 Fp16 Grad Scale: 262144 Required: 53 hours
Training: 2022-07-04 19:51:46,620-Speed 8987.33 samples/sec Loss 42.0046 LearningRate 0.000006 Epoch: 0 Global Step: 230 Fp16 Grad Scale: 262144 Required: 53 hours
This is my server config: 8 * 32GB V100.
Hi jacqueline-weng, this is my result:
Training: 2022-07-05 08:29:37,789-Speed 9036.35 samples/sec Loss 6.3236 LearningRate 0.000717 Epoch: 9 Global Step: 98720 Fp16 Grad Scale: 32768 Required: 41 hours
Training: 2022-07-05 08:29:42,322-Speed 9038.57 samples/sec Loss 6.2150 LearningRate 0.000717 Epoch: 9 Global Step: 98730 Fp16 Grad Scale: 32768 Required: 41 hours
Training: 2022-07-05 08:29:46,851-Speed 9045.42 samples/sec Loss 6.2928 LearningRate 0.000717 Epoch: 9 Global Step: 98740 Fp16 Grad Scale: 32768 Required: 41 hours
Training: 2022-07-05 08:29:51,379-Speed 9049.36 samples/sec Loss 6.3178 LearningRate 0.000717 Epoch: 9 Global Step: 98750 Fp16 Grad Scale: 32768 Required: 41 hours
Training: 2022-07-05 08:29:55,904-Speed 9054.48 samples/sec Loss 6.2231 LearningRate 0.000717 Epoch: 9 Global Step: 98760 Fp16 Grad Scale: 32768 Required: 41 hours
Training: 2022-07-05 08:30:00,433-Speed 9047.04 samples/sec Loss 6.2004 LearningRate 0.000717 Epoch: 9 Global Step: 98770 Fp16 Grad Scale: 32768 Required: 41 hours
Training: 2022-07-05 08:30:04,969-Speed 9031.39 samples/sec Loss 6.2435 LearningRate 0.000717 Epoch: 9 Global Step: 98780 Fp16 Grad Scale: 65536 Required: 41 hours
Training: 2022-07-05 08:30:09,478-Speed 9087.44 samples/sec Loss 6.2260 LearningRate 0.000716 Epoch: 9 Global Step: 98790 Fp16 Grad Scale: 32768 Required: 41 hours
Training: 2022-07-05 08:30:14,009-Speed 9041.70 samples/sec Loss 6.2667 LearningRate 0.000716 Epoch: 9 Global Step: 98800 Fp16 Grad Scale: 32768 Required: 41 hours
Training: 2022-07-05 08:30:18,544-Speed 9032.99 samples/sec Loss 6.2365 LearningRate 0.000716 Epoch: 9 Global Step: 98810 Fp16 Grad Scale: 32768 Required: 41 hours
Training: 2022-07-05 08:30:23,069-Speed 9054.91 samples/sec Loss 6.1696 LearningRate 0.000716 Epoch: 9 Global Step: 98820 Fp16 Grad Scale: 32768 Required: 41 hours
Training: 2022-07-05 08:30:27,599-Speed 9044.99 samples/sec Loss 6.2435 LearningRate 0.000716 Epoch: 9 Global Step: 98830 Fp16 Grad Scale: 32768 Required: 41 hours
Training: 2022-07-05 08:30:32,143-Speed 9016.28 samples/sec Loss 6.3440 LearningRate 0.000716 Epoch: 9 Global Step: 98840 Fp16 Grad Scale: 32768 Required: 41 hours
Training: 2022-07-05 08:30:36,677-Speed 9035.36 samples/sec Loss 6.2971 LearningRate 0.000716 Epoch: 9 Global Step: 98850 Fp16 Grad Scale: 32768 Required: 41 hours
Training: 2022-07-05 08:30:41,205-Speed 9050.01 samples/sec Loss 6.2351 LearningRate 0.000716 Epoch: 9 Global Step: 98860 Fp16 Grad Scale: 32768 Required: 41 hours
Training: 2022-07-05 08:30:45,744-Speed 9025.46 samples/sec Loss 6.2534 LearningRate 0.000716 Epoch: 9 Global Step: 98870 Fp16 Grad Scale: 32768 Required: 41 hours
Training: 2022-07-05 08:30:50,278-Speed 9035.56 samples/sec Loss 6.2644 LearningRate 0.000716 Epoch: 9 Global Step: 98880 Fp16 Grad Scale: 32768 Required: 41 hours
Training: 2022-07-05 08:30:54,831-Speed 8999.76 samples/sec Loss 6.2196 LearningRate 0.000716 Epoch: 9 Global Step: 98890 Fp16 Grad Scale: 65536 Required: 41 hours
Training: 2022-07-05 08:30:59,455-Speed 8858.93 samples/sec Loss 6.2450 LearningRate 0.000716 Epoch: 9 Global Step: 98900 Fp16 Grad Scale: 65536 Required: 41 hours
Training: 2022-07-05 08:31:03,987-Speed 9040.14 samples/sec Loss 6.2852 LearningRate 0.000716 Epoch: 9 Global Step: 98910 Fp16 Grad Scale: 65536 Required: 41 hours
Training: 2022-07-05 08:31:08,519-Speed 9040.10 samples/sec Loss 6.2288 LearningRate 0.000716 Epoch: 9 Global Step: 98920 Fp16 Grad Scale: 65536 Required: 41 hours
Training: 2022-07-05 08:31:13,056-Speed 9031.09 samples/sec Loss 6.3165 LearningRate 0.000716 Epoch: 9 Global Step: 98930 Fp16 Grad Scale: 65536 Required: 41 hours
Training: 2022-07-05 08:31:17,586-Speed 9044.34 samples/sec Loss 6.2852 LearningRate 0.000716 Epoch: 9 Global Step: 98940 Fp16 Grad Scale: 65536 Required: 41 hours
Training: 2022-07-05 08:31:22,101-Speed 9073.97 samples/sec Loss 6.2712 LearningRate 0.000716 Epoch: 9 Global Step: 98950 Fp16 Grad Scale: 32768 Required: 40 hours
Training: 2022-07-05 08:31:26,625-Speed 9057.15 samples/sec Loss 6.2816 LearningRate 0.000716 Epoch: 9 Global Step: 98960 Fp16 Grad Scale: 32768 Required: 40 hours
Training: 2022-07-05 08:31:31,154-Speed 9046.29 samples/sec Loss 6.2800 LearningRate 0.000716 Epoch: 9 Global Step: 98970 Fp16 Grad Scale: 32768 Required: 40 hours
Training: 2022-07-05 08:31:35,684-Speed 9043.57 samples/sec Loss 6.2846 LearningRate 0.000716 Epoch: 9 Global Step: 98980 Fp16 Grad Scale: 32768 Required: 40 hours
Hi shreyanshdas00, I think you should keep your learning rate at 0.001 and increase your batch size when the optimizer is AdamW.
Hi jacqueline-weng, have you tried this?
Thank you for the reply and for showing me the vit_t training result. Things were a bit more complicated in my case.
I failed to run the newest version of the training code for ViT. The machine would reboot itself before it finished loading the whole dataset (WebFace42M). I suspect it is something in my environment configuration, but I did not have many chances to figure it out. Therefore, I directly transferred and embedded your ViT backbone code and Partial FC code for AdamW.
I have a machine with 8 T4 GPUs. I could only set the batch size to 128 while decreasing the learning rate. I'm not sure whether, for AdamW, the initial learning rate should also be scaled with the batch size. Is it possible to replicate the result with a smaller batch size?
Any advice for my case is really appreciated.
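Regarding the learning-rate question above: a common heuristic (an illustration only, not something this repo documents, and whether it applies to AdamW at all is exactly the open question here) is to scale the learning rate linearly with the total batch size relative to a reference run.

```python
# Linear LR-scaling sketch using numbers from this thread (assumption/illustration).
reference_lr = 0.001             # LR of the 8-GPU reference run above
reference_total_batch = 8 * 512  # total_batch_size 4096 in the log above
my_total_batch = 8 * 128         # 8 T4 GPUs with batch_size 128 each
scaled_lr = reference_lr * my_total_batch / reference_total_batch
print(scaled_lr)                 # 0.00025
```

Note that the maintainer's advice earlier in this thread is to keep the AdamW learning rate at 0.001 and grow the effective batch size instead (e.g. via gradient accumulation).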
Hi @anxiangsir, with our batch size (1500x4), using AdamW with an LR of 0.001 results in NaNs in the loss. We are using 4 80GB A100s and it is not possible to increase the batch size further. How can we train ViT architectures with a smaller batch size than yours?
Hi @anxiangsir, I noticed that when training with your framework (recognition/arcface_torch) the loss jumps after each epoch. What do you think could cause this? Maybe the DistributedSampler (utils.utils_distributed_sampler) is not working correctly?
Hi @anxiangsir, I observed that the ViT architectures converge when we use a CosFace margin (of 0.4), but do not converge (they stagnate at a loss of about 20) when ArcFace (with a margin of 0.5) is used. Do you have any insights on why this happens?
Hi jacqueline-weng, shreyanshdas00,
you can try this:
We use gradient accumulation to increase the batch size. Gradient accumulation is a mechanism that splits the batch of samples used for training a neural network into several mini-batches that are run sequentially. Accumulating the gradients over all of these steps results in the same sum of gradients as if we were using the global batch size.
You can increase config.gradient_acc
to increase the total batch size:
total_batchsize = config.batch_size * config.gradient_acc * worldsize
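A minimal sketch of how such an accumulation loop could look (an illustration under assumptions: train_loader, backbone, module_partial_fc, opt, amp and cfg mirror this repo's naming, but this is not its exact training code):

```python
import torch

# Assumed to exist already: train_loader, backbone, module_partial_fc,
# opt (optimizer), amp (torch.cuda.amp.GradScaler), cfg (config).
for step, (img, labels) in enumerate(train_loader):
    embeddings = backbone(img)
    loss = module_partial_fc(embeddings, labels)
    # Gradients from each mini-batch accumulate in .grad until the update below.
    amp.scale(loss).backward()
    if (step + 1) % cfg.gradient_acc == 0:
        amp.unscale_(opt)
        torch.nn.utils.clip_grad_norm_(backbone.parameters(), 5)
        amp.step(opt)   # one optimizer update per gradient_acc mini-batches
        amp.update()
        opt.zero_grad()
```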
This is my result:
training log: https://raw.githubusercontent.com/anxiangsir/insightface_arcface_log/master/pfc03_wf42m_vit_b_8gpu/training.log
IJBC result:
+---------------+-------+-------+--------+-------+-------+-------+
| Methods | 1e-06 | 1e-05 | 0.0001 | 0.001 | 0.01 | 0.1 |
+---------------+-------+-------+--------+-------+-------+-------+
| IJBC.npy-IJBC | 92.96 | 97.05 | 97.91 | 98.47 | 98.86 | 99.34 |
+---------------+-------+-------+--------+-------+-------+-------+
Thanks @anxiangsir, I will try this out. Can you also share any insights you have on why ViT architectures do not converge when the ArcFace loss is used? They seem to work fine when a CosFace margin is used (as you have done in your experiments) but stagnate around a loss of 20 when ArcFace is used, even though the two loss functions are quite similar. Why could this be happening?
Despite the loss not decreasing, does accuracy remain competitive on validation sets? I don't use this repo, but I have a similar issue with a parallel model where the loss stays around 20 yet the model converges correctly and achieves >98% on IJB-C.
Thanks @anxiangsir, I'm trying gradient accumulation. An error occurs at the first backward pass saying some parameters in the module are marked as ready to reduce twice. After changing the module distribution code, it works:
backbone = torch.nn.parallel.DistributedDataParallel(
    module=backbone, broadcast_buffers=False,
    device_ids=[args.local_rank], find_unused_parameters=False)
(The original 'find_unused_parameters' is set to True.)
From my experience, ArcFace is often more difficult to get to converge than CosFace, no matter what the backbone is. ArcFace starts with a high loss (which probably depends on the margin setting) and remains higher during training. By nature, angular margins make the logits more sensitive and precipitous than additive margins, and thus harder to converge. My understanding is that the margin hyper-parameter should be set carefully and perhaps made dynamic during training. Using different losses at different training stages may be a better choice than sticking to one.
Any ideas are welcome.
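For reference, the two margins under discussion differ only in where the penalty is applied; here is a small sketch of the standard formulations (not this repo's exact implementation):

```python
import math

def cosface_logit(cos_theta: float, m: float = 0.4) -> float:
    # Additive cosine margin: shift the cosine value directly.
    return cos_theta - m

def arcface_logit(cos_theta: float, m: float = 0.5) -> float:
    # Additive angular margin: shift the angle, then take the cosine,
    # so the penalty grows non-linearly as the angle increases.
    return math.cos(math.acos(cos_theta) + m)
```

Because the ArcFace penalty acts on the angle, the target logit falls off much more steeply for hard (large-angle) samples, which is consistent with the sensitivity described above.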
@jacqueline-weng how about CurricularFace or AdaFace?
Hi everyone,
@anxiangsir did you try using accumulated gradients with a larger batch size for vit_b? Were you able to obtain results similar to the ones here?
If not, what is your gut feeling about it?
Best, /M
Hi everyone, we also had the same problem, but we have solved it. Hope this helps you.
If you use ArcFace + AdamW, firstly your learning rate must be 1e-3 or below, and secondly your margin must not be too large.
The result of my tests is as follows: we observed that the author uses CosFace with margin=0.4, but when ArcFace and AdamW are used together, the initial learning is very difficult, which leads to NaN values and no convergence. When we reduce the margin of ArcFace, it converges normally.
CosFace seems to converge more easily, while AdamW seems to compute larger and more sensitive gradients than SGD. If you use AdamW, you may need to adjust your margin and learning rate.
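As a concrete, hypothetical example of that adjustment, assuming the repo's (m1, m2, m3) margin_list convention in which ArcFace uses m2 and CosFace uses m3 (the CosFace runs in the logs above use [1.0, 0.0, 0.4]):

```python
# Hypothetical config tweak for ArcFace + AdamW (illustration only; the
# reduced margin value 0.3 is an example, not a tuned recommendation).
config.optimizer = "adamw"
config.lr = 1e-3                       # keep the LR at or below 1e-3
config.margin_list = (1.0, 0.3, 0.0)   # smaller ArcFace angular margin (default 0.5)
```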
Hi @anxiangsir, I followed the same configuration as yours (from this issue and https://raw.githubusercontent.com/anxiangsir/insightface_arcface_log/master/wf42m_pfc02_40epoch_8gpu_vit_t/training.log) and trained on A100 x 4 GPUs and 3090 x 8 GPUs; the total batch size is always 2048. However, the loss does not drop as in your result.
NCCL version 2.10.3+cuda11.1
Training: 2023-05-25 16:21:01,319-: margin_list [1.0, 0.0, 0.4]
Training: 2023-05-25 16:21:01,326-: network vit_t_dp005_mask0
Training: 2023-05-25 16:21:01,326-: resume False
Training: 2023-05-25 16:21:01,326-: resume_checkpoint None
Training: 2023-05-25 16:21:01,326-: load_pretrained None
Training: 2023-05-25 16:21:01,326-: save_all_states True
Training: 2023-05-25 16:21:01,326-: output /model_dir
Training: 2023-05-25 16:21:01,326-: embedding_size 512
Training: 2023-05-25 16:21:01,326-: sample_rate 0.2
Training: 2023-05-25 16:21:01,326-: interclass_filtering_threshold 0
Training: 2023-05-25 16:21:01,326-: fp16 True
Training: 2023-05-25 16:21:01,326-: batch_size 512
Training: 2023-05-25 16:21:01,326-: optimizer adamw
Training: 2023-05-25 16:21:01,326-: lr 0.001
Training: 2023-05-25 16:21:01,326-: momentum 0.9
Training: 2023-05-25 16:21:01,326-: weight_decay 0.1
Training: 2023-05-25 16:21:01,326-: verbose 2000
Training: 2023-05-25 16:21:01,326-: frequent 10
Training: 2023-05-25 16:21:01,326-: dali True
Training: 2023-05-25 16:21:01,326-: dali_aug False
Training: 2023-05-25 16:21:01,326-: gradient_acc 1
Training: 2023-05-25 16:21:01,326-: seed 2048
Training: 2023-05-25 16:21:01,326-: num_workers 2
Training: 2023-05-25 16:21:01,326-: wandb_key XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Training: 2023-05-25 16:21:01,326-: suffix_run_name None
Training: 2023-05-25 16:21:01,327-: using_wandb False
Training: 2023-05-25 16:21:01,327-: wandb_entity entity
Training: 2023-05-25 16:21:01,327-: wandb_project project
Training: 2023-05-25 16:21:01,327-: wandb_log_all True
Training: 2023-05-25 16:21:01,327-: save_artifacts False
Training: 2023-05-25 16:21:01,327-: wandb_resume False
Training: 2023-05-25 16:21:01,327-: rec ./wf42m_mx
Training: 2023-05-25 16:21:01,327-: num_classes 2059906
Training: 2023-05-25 16:21:01,327-: num_image 42474557
Training: 2023-05-25 16:21:01,327-: num_epoch 40
Training: 2023-05-25 16:21:01,327-: warmup_epoch 4
Training: 2023-05-25 16:21:01,327-: val_targets []
Training: 2023-05-25 16:21:01,327-: total_batch_size 2048
Training: 2023-05-25 16:21:01,327-: warmup_step 82956
Training: 2023-05-25 16:21:01,327-: total_step 829560
Training: 2023-05-25 16:21:02,386-Reducer buckets have been rebuilt in this iteration.
Training: 2023-05-25 16:21:08,387-Speed 6156.32 samples/sec Loss 42.4324 LearningRate 0.000000 Epoch: 0 Global Step: 20 Fp16 Grad Scale: 65536 Required: 77 hours
Training: 2023-05-25 16:21:11,717-Speed 6154.12 samples/sec Loss 42.4117 LearningRate 0.000000 Epoch: 0 Global Step: 30 Fp16 Grad Scale: 65536 Required: 74 hours
Training: 2023-05-25 16:21:15,039-Speed 6165.80 samples/sec Loss 42.4171 LearningRate 0.000000 Epoch: 0 Global Step: 40 Fp16 Grad Scale: 65536 Required: 73 hours
Training: 2023-05-25 16:21:18,361-Speed 6167.04 samples/sec Loss 42.4250 LearningRate 0.000001 Epoch: 0 Global Step: 50 Fp16 Grad Scale: 65536 Required: 77 hours
Training: 2023-05-25 16:21:21,686-Speed 6161.36 samples/sec Loss 42.4015 LearningRate 0.000001 Epoch: 0 Global Step: 60 Fp16 Grad Scale: 65536 Required: 76 hours
Epoch 9 (loss is about 6 in your training log):
Training: 2023-05-26 09:40:40,291-Speed 6146.81 samples/sec Loss 10.5731 LearningRate 0.000861 Epoch: 8 Global Step: 186570 Fp16 Grad Scale: 32768 Required: 60 hours
Training: 2023-05-26 09:40:43,623-Speed 6148.40 samples/sec Loss 10.3575 LearningRate 0.000861 Epoch: 8 Global Step: 186580 Fp16 Grad Scale: 32768 Required: 60 hours
Training: 2023-05-26 09:40:46,953-Speed 6150.55 samples/sec Loss 10.4854 LearningRate 0.000861 Epoch: 8 Global Step: 186590 Fp16 Grad Scale: 32768 Required: 60 hours
Training: 2023-05-26 09:40:50,284-Speed 6148.70 samples/sec Loss 10.4781 LearningRate 0.000861 Epoch: 8 Global Step: 186600 Fp16 Grad Scale: 32768 Required: 60 hours
Training: 2023-05-26 09:40:53,615-Speed 6149.34 samples/sec Loss 10.5193 LearningRate 0.000861 Epoch: 8 Global Step: 186610 Fp16 Grad Scale: 32768 Required: 60 hours
Training: 2023-05-26 09:40:56,947-Speed 6147.19 samples/sec Loss 10.4351 LearningRate 0.000861 Epoch: 8 Global Step: 186620 Fp16 Grad Scale: 32768 Required: 60 hours
Training: 2023-05-26 09:41:00,282-Speed 6142.53 samples/sec Loss 10.5754 LearningRate 0.000861 Epoch: 8 Global Step: 186630 Fp16 Grad Scale: 32768 Required: 60 hours
Training: 2023-05-26 09:41:03,612-Speed 6151.46 samples/sec Loss 10.4418 LearningRate 0.000861 Epoch: 8 Global Step: 186640 Fp16 Grad Scale: 65536 Required: 60 hours
Training: 2023-05-26 09:41:06,945-Speed 6144.56 samples/sec Loss 10.4398 LearningRate 0.000861 Epoch: 8 Global Step: 186650 Fp16 Grad Scale: 65536 Required: 60 hours
Training: 2023-05-26 09:41:10,778-Speed 5344.15 samples/sec Loss 10.4727 LearningRate 0.000861 Epoch: 9 Global Step: 186660 Fp16 Grad Scale: 32768 Required: 60 hours
Training: 2023-05-26 09:41:14,113-Speed 6142.45 samples/sec Loss 10.3351 LearningRate 0.000861 Epoch: 9 Global Step: 186670 Fp16 Grad Scale: 32768 Required: 60 hours
Training: 2023-05-26 09:41:17,444-Speed 6149.21 samples/sec Loss 10.5266 LearningRate 0.000861 Epoch: 9 Global Step: 186680 Fp16 Grad Scale: 32768 Required: 60 hours
Training: 2023-05-26 09:41:20,776-Speed 6146.86 samples/sec Loss 10.4695 LearningRate 0.000861 Epoch: 9 Global Step: 186690 Fp16 Grad Scale: 32768 Required: 60 hours
Training: 2023-05-26 09:41:24,107-Speed 6149.43 samples/sec Loss 10.4526 LearningRate 0.000861 Epoch: 9 Global Step: 186700 Fp16 Grad Scale: 32768 Required: 60 hours
Training: 2023-05-26 09:41:27,440-Speed 6146.48 samples/sec Loss 10.4886 LearningRate 0.000861 Epoch: 9 Global Step: 186710 Fp16 Grad Scale: 32768 Required: 60 hours
Final (40th) epoch:
Training: 2023-05-28 21:18:16,789-Speed 6147.49 samples/sec Loss 3.3835 LearningRate 0.000000 Epoch: 39 Global Step: 829460 Fp16 Grad Scale: 8192 Required: 0 hours
Training: 2023-05-28 21:18:20,121-Speed 6146.14 samples/sec Loss 3.4250 LearningRate 0.000000 Epoch: 39 Global Step: 829470 Fp16 Grad Scale: 8192 Required: 0 hours
Training: 2023-05-28 21:18:23,456-Speed 6142.79 samples/sec Loss 3.4518 LearningRate 0.000000 Epoch: 39 Global Step: 829480 Fp16 Grad Scale: 16384 Required: 0 hours
Training: 2023-05-28 21:18:26,764-Speed 6192.20 samples/sec Loss 3.4075 LearningRate 0.000000 Epoch: 39 Global Step: 829490 Fp16 Grad Scale: 8192 Required: 0 hours
Training: 2023-05-28 21:18:30,096-Speed 6146.77 samples/sec Loss 3.3720 LearningRate 0.000000 Epoch: 39 Global Step: 829500 Fp16 Grad Scale: 8192 Required: 0 hours
Training: 2023-05-28 21:18:33,426-Speed 6151.32 samples/sec Loss 3.4363 LearningRate 0.000000 Epoch: 39 Global Step: 829510 Fp16 Grad Scale: 8192 Required: 0 hours
Training: 2023-05-28 21:18:36,756-Speed 6151.00 samples/sec Loss 3.4455 LearningRate 0.000000 Epoch: 39 Global Step: 829520 Fp16 Grad Scale: 8192 Required: 0 hours
Training: 2023-05-28 21:18:40,087-Speed 6148.99 samples/sec Loss 3.3558 LearningRate 0.000000 Epoch: 39 Global Step: 829530 Fp16 Grad Scale: 8192 Required: 0 hours
Training: 2023-05-28 21:18:43,420-Speed 6145.44 samples/sec Loss 3.4064 LearningRate 0.000000 Epoch: 39 Global Step: 829540 Fp16 Grad Scale: 8192 Required: 0 hours
Training: 2023-05-28 21:18:46,750-Speed 6152.61 samples/sec Loss 3.4125 LearningRate 0.000000 Epoch: 39 Global Step: 829550 Fp16 Grad Scale: 8192 Required: 0 hours
Training: 2023-05-28 21:18:50,078-Speed 6154.23 samples/sec Loss 3.3826 LearningRate 0.000000 Epoch: 39 Global Step: 829560 Fp16 Grad Scale: 8192 Required: -0 hours
Training: 2023-05-28 21:18:53,411-Speed 6145.60 samples/sec Loss 3.3920 LearningRate 0.000000 Epoch: 39 Global Step: 829570 Fp16 Grad Scale: 8192 Required: -0 hours
Training: 2023-05-28 21:18:56,740-Speed 6152.61 samples/sec Loss 3.4548 LearningRate 0.000000 Epoch: 39 Global Step: 829580 Fp16 Grad Scale: 8192 Required: -0 hours
In your training log, the final loss is about 2.5. Is there any trick I missed? Did you use dali_aug?
I wonder if we need to average the accumulated gradients, e.g. by adding "loss = loss / cfg.gradient_acc" to the code below:
'''
loss: torch.Tensor = module_partial_fc(local_embeddings, local_labels)
if cfg.fp16:
    amp.scale(loss).backward()
    if global_step % cfg.gradient_acc == 0:
        amp.unscale_(opt)
        torch.nn.utils.clip_grad_norm_(backbone.parameters(), 5)
        amp.step(opt)
        amp.update()
        opt.zero_grad()
'''
I tried to replicate the results for ViT Base model trained on WebFace42M but the model does not seem to converge. The loss starts at 53 and stagnates at about 22 after a few epochs of training. I have used the exact same config, with the max. learning rate scaled according to my batch size. I am using 4 GPUs for training and the config variables are as below:
config.network = "vit_b"
config.embedding_size = 256

# Partial FC
config.sample_rate = 1
config.fp16 = True
config.batch_size = 1500

# For AdamW
config.optimizer = "adamw"
config.lr = 0.00025
config.weight_decay = 0.1

config.verbose = 1415
config.dali = False

config.rec = "/media/data/Webface42M_rec"
config.num_classes = 2059906
config.num_epoch = 40
config.warmup_epoch = config.num_epoch // 10
Do you have any insights on why this could be happening?
Any help would be highly appreciated @anxiangsir