Open leoluopy opened 3 years ago
For example:
- GPU_0 (rank 0): class center range is (0, 45k)
- GPU_1 (rank 1): class center range is (45k, 90k)
- ......
- GPU_7 (rank 7): class center range is (315k, 360k)
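The partition in this example can be sketched as a small helper. This is only an illustration: the function name `class_center_range` is hypothetical, and it assumes the number of classes divides evenly across ranks, which holds for 360k classes on 8 GPUs.

```python
from typing import Tuple


def class_center_range(rank: int, world_size: int, num_classes: int) -> Tuple[int, int]:
    """Return the half-open range [start, end) of class centers owned by `rank`."""
    if num_classes % world_size != 0:
        # Assumption of this sketch: an even split, as in the 360k / 8 example.
        raise ValueError("num_classes must divide evenly across ranks")
    per_rank = num_classes // world_size
    start = rank * per_rank
    return start, start + per_rank


# 360k class centers spread over 8 GPUs, as in the example above.
ranges = [class_center_range(rank, 8, 360_000) for rank in range(8)]
for rank, (start, end) in enumerate(ranges):
    print(f"GPU_{rank} (rank {rank}): class centers [{start}, {end})")
```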
Because each GPU holds different weights, the model does not sync these gradients across GPUs. Each GPU is responsible for computing the dot product of its own sub-matrix of class centers with the input features, and the local sum over its own classes. Each GPU then gathers the local sums from the other GPUs to obtain the full-class softmax. Because only the local sums are communicated, the softmax can be completed with very little communication.
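The idea can be checked with a single-process simulation in plain Python, where each "GPU" is just a slice of the weight matrix. This is a minimal sketch, not the arcface_torch implementation: the function names are hypothetical, the all-reduce steps are replaced by ordinary `max` and `sum`, and subtracting a global maximum before `exp` is an added assumption for numerical stability.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_softmax(logits):
    """Reference softmax over all classes on a single device."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sharded_softmax(feature, weight_shards):
    """Softmax where each shard (one per simulated GPU) only sees its own class centers."""
    # Each rank computes logits for its own sub-matrix only.
    local_logits = [[dot(w, feature) for w in shard] for shard in weight_shards]
    # Stand-in for an all-reduce(max): one scalar per rank is exchanged.
    global_max = max(max(logits) for logits in local_logits)
    # Each rank computes its local sum of exponentials ...
    local_sums = [sum(math.exp(x - global_max) for x in logits) for logits in local_logits]
    # ... and an all-reduce(sum) of one scalar per rank gives the full denominator.
    global_sum = sum(local_sums)
    # Each rank normalises its own slice of the probabilities.
    return [[math.exp(x - global_max) / global_sum for x in logits] for logits in local_logits]


random.seed(0)
EMB, NUM_CLASSES, WORLD_SIZE = 4, 8, 4  # tiny sizes, chosen only for the demo
weight = [[random.uniform(-1, 1) for _ in range(EMB)] for _ in range(NUM_CLASSES)]
feature = [random.uniform(-1, 1) for _ in range(EMB)]

per_rank = NUM_CLASSES // WORLD_SIZE
shards = [weight[r * per_rank:(r + 1) * per_rank] for r in range(WORLD_SIZE)]

sharded = [p for part in sharded_softmax(feature, shards) for p in part]
reference = full_softmax([dot(w, feature) for w in weight])
max_diff = max(abs(a - b) for a, b in zip(sharded, reference))
print("largest difference from the single-device softmax:", max_diff)
```

Only two scalars per sample cross the simulated rank boundary here; the weight shards themselves never move.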
Embedding Size: 512
World Size (Number of GPUs): 8
C (Number of class centers): 1,000,000
Batch Size: 128
| | Data Parallel | Model Parallel |
|---|---|---|
| W | C * Embedding Size * 3 | C / World Size * Embedding Size * 3 |
| Logits | C * Batch Size | C / World Size * Batch Size * C |

`* 3` means weight, gradient, and momentum.
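Plugging the settings above into the W row gives concrete numbers. This is a back-of-the-envelope sketch: the element counts follow directly from the table, while the conversion to GiB assumes 4-byte FP32 storage, which the thread does not state.

```python
EMBEDDING_SIZE = 512
WORLD_SIZE = 8
NUM_CLASSES = 1_000_000
COPIES = 3  # weight, gradient, and momentum

# Elements held per GPU for the class-center matrix W.
data_parallel_w = NUM_CLASSES * EMBEDDING_SIZE * COPIES                 # 1,536,000,000
model_parallel_w = NUM_CLASSES // WORLD_SIZE * EMBEDDING_SIZE * COPIES  # 192,000,000

BYTES_PER_ELEMENT = 4  # assumption: FP32 storage


def to_gib(num_elements: int) -> float:
    """Convert an element count to GiB under the FP32 assumption."""
    return num_elements * BYTES_PER_ELEMENT / 2**30


print(f"data parallel : {data_parallel_w:,} elements, {to_gib(data_parallel_w):.2f} GiB per GPU")
print(f"model parallel: {model_parallel_w:,} elements, {to_gib(model_parallel_w):.2f} GiB per GPU")
```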
| | Data Parallel | Model Parallel |
|---|---|---|
| Communication | C * Embedding Size (500M) | (2 + 2 * Embedding Size) * Batch Size (0.1M) |
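The two communication figures can be reproduced from the settings listed earlier. The formulas are taken from the table as given; the values are element counts rather than bytes, which is an assumption about the units behind "500M" and "0.1M".

```python
EMBEDDING_SIZE = 512
NUM_CLASSES = 1_000_000
BATCH_SIZE = 128

# Data parallel: the gradient of the whole class-center matrix is synchronised.
data_parallel_comm = NUM_CLASSES * EMBEDDING_SIZE            # 512,000,000, roughly 500M

# Model parallel: formula from the table above.
model_parallel_comm = (2 + 2 * EMBEDDING_SIZE) * BATCH_SIZE  # 131,328, roughly 0.1M

ratio = data_parallel_comm / model_parallel_comm
print(f"data parallel : {data_parallel_comm:,} elements")
print(f"model parallel: {model_parallel_comm:,} elements")
print(f"model parallel communicates about {ratio:,.0f}x less")
```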
| | Data Parallel | Model Parallel | Model Parallel With Partial FC |
|---|---|---|---|
| Number of Identities | 500K | 8 Million | 25 Million |
arcface_torch can also train on multiple nodes; the class centers are partitioned across all GPUs.
@anxiangsir Thanks for your quick reply. How long does it take to train on the Glint dataset using 8 GPUs?
Training: 2021-03-16 00:16:03,415-Speed 3262.77 samples/sec Loss 34.2626 Epoch: 0 Global Step: 1250 Fp16 Grad Scale: 16384 Required: 31 hours
Training: 2021-03-16 00:16:19,780-Speed 3128.82 samples/sec Loss 33.9931 Epoch: 0 Global Step: 1300 Fp16 Grad Scale: 16384 Required: 31 hours
Training: 2021-03-16 00:16:35,901-Speed 3176.06 samples/sec Loss 33.6923 Epoch: 0 Global Step: 1350 Fp16 Grad Scale: 16384 Required: 31 hours
Training: 2021-03-16 00:16:53,313-Speed 2940.70 samples/sec Loss 33.3387 Epoch: 0 Global Step: 1400 Fp16 Grad Scale: 16384 Required: 31 hours
Training: 2021-03-16 00:17:12,419-Speed 2679.78 samples/sec Loss 33.0541 Epoch: 0 Global Step: 1450 Fp16 Grad Scale: 16384 Required: 31 hours
Training: 2021-03-16 00:17:29,129-Speed 3064.18 samples/sec Loss 32.6634 Epoch: 0 Global Step: 1500 Fp16 Grad Scale: 16384 Required: 31 hours
Training: 2021-03-16 00:17:45,230-Speed 3179.93 samples/sec Loss 32.4335 Epoch: 0 Global Step: 1550 Fp16 Grad Scale: 16384 Required: 31 hours
Training: 2021-03-16 00:18:01,016-Speed 3243.57 samples/sec Loss 32.0923 Epoch: 0 Global Step: 1600 Fp16 Grad Scale: 16384 Required: 31 hours
Training: 2021-03-16 00:18:16,899-Speed 3223.66 samples/sec Loss 31.7564 Epoch: 0 Global Step: 1650 Fp16 Grad Scale: 16384 Required: 31 hours
Training: 2021-03-16 00:18:33,028-Speed 3174.48 samples/sec Loss 31.4021 Epoch: 0 Global Step: 1700 Fp16 Grad Scale: 16384 Required: 31 hours
Training: 2021-03-16 00:18:48,854-Speed 3235.33 samples/sec Loss 31.1092 Epoch: 0 Global Step: 1750 Fp16 Grad Scale: 16384 Required: 31 hours
Training: 2021-03-16 00:19:04,634-Speed 3244.78 samples/sec Loss 30.7287 Epoch: 0 Global Step: 1800 Fp16 Grad Scale: 16384 Required: 31 hours
Training: 2021-03-16 00:19:20,688-Speed 3189.25 samples/sec Loss 30.4005 Epoch: 0 Global Step: 1850 Fp16 Grad Scale: 16384 Required: 31 hours
Training: 2021-03-16 00:19:37,478-Speed 3049.57 samples/sec Loss 30.0752 Epoch: 0 Global Step: 1900 Fp16 Grad Scale: 16384 Required: 31 hours
Training: 2021-03-16 00:19:53,631-Speed 3169.92 samples/sec Loss 29.7154 Epoch: 0 Global Step: 1950 Fp16 Grad Scale: 16384 Required: 31 hours
Training: 2021-03-16 00:20:09,726-Speed 3181.20 samples/sec Loss 29.4090 Epoch: 0 Global Step: 2000 Fp16 Grad Scale: 16384 Required: 31 hours
Training: 2021-03-14 21:52:53,887-Speed 4079.84 samples/sec Loss 37.9597 Epoch: 0 Global Step: 650 Fp16 Grad Scale: 8192 Required: 24 hours
Training: 2021-03-14 21:53:05,425-Speed 4437.64 samples/sec Loss 37.7011 Epoch: 0 Global Step: 700 Fp16 Grad Scale: 16384 Required: 24 hours
Training: 2021-03-14 21:53:17,055-Speed 4402.58 samples/sec Loss 37.3634 Epoch: 0 Global Step: 750 Fp16 Grad Scale: 16384 Required: 24 hours
Training: 2021-03-14 21:53:28,626-Speed 4424.86 samples/sec Loss 37.0828 Epoch: 0 Global Step: 800 Fp16 Grad Scale: 16384 Required: 24 hours
Training: 2021-03-14 21:53:40,234-Speed 4410.72 samples/sec Loss 36.8450 Epoch: 0 Global Step: 850 Fp16 Grad Scale: 16384 Required: 24 hours
Training: 2021-03-14 21:53:51,735-Speed 4452.12 samples/sec Loss 36.5827 Epoch: 0 Global Step: 900 Fp16 Grad Scale: 16384 Required: 24 hours
Training: 2021-03-14 21:54:03,278-Speed 4435.85 samples/sec Loss 36.3103 Epoch: 0 Global Step: 950 Fp16 Grad Scale: 16384 Required: 23 hours
Training: 2021-03-14 21:54:14,653-Speed 4501.39 samples/sec Loss 36.0483 Epoch: 0 Global Step: 1000 Fp16 Grad Scale: 16384 Required: 23 hours
Training: 2021-03-14 21:54:26,408-Speed 4355.53 samples/sec Loss 35.7620 Epoch: 0 Global Step: 1050 Fp16 Grad Scale: 16384 Required: 23 hours
Training: 2021-03-14 21:54:37,917-Speed 4448.95 samples/sec Loss 35.4236 Epoch: 0 Global Step: 1100 Fp16 Grad Scale: 16384 Required: 23 hours
Training: 2021-03-14 21:54:49,511-Speed 4416.14 samples/sec Loss 35.1475 Epoch: 0 Global Step: 1150 Fp16 Grad Scale: 16384 Required: 23 hours
Training: 2021-03-14 21:55:01,012-Speed 4452.12 samples/sec Loss 34.9056 Epoch: 0 Global Step: 1200 Fp16 Grad Scale: 16384 Required: 23 hours
Training: 2021-03-14 21:55:12,764-Speed 4357.08 samples/sec Loss 34.5845 Epoch: 0 Global Step: 1250 Fp16 Grad Scale: 16384 Required: 23 hours
Training: 2021-03-14 21:55:24,248-Speed 4458.36 samples/sec Loss 34.3024 Epoch: 0 Global Step: 1300 Fp16 Grad Scale: 16384 Required: 23 hours
Training: 2021-03-14 21:55:36,116-Speed 4314.40 samples/sec Loss 33.9949 Epoch: 0 Global Step: 1350 Fp16 Grad Scale: 16384 Required: 23 hours
Training: 2021-03-14 21:55:47,714-Speed 4414.60 samples/sec Loss 33.6711 Epoch: 0 Global Step: 1400 Fp16 Grad Scale: 16384 Required: 23 hours
Training: 2021-03-14 21:55:59,377-Speed 4390.22 samples/sec Loss 33.4009 Epoch: 0 Global Step: 1450 Fp16 Grad Scale: 16384 Required: 23 hours
Training: 2021-03-14 21:56:11,007-Speed 4402.57 samples/sec Loss 33.0303 Epoch: 0 Global Step: 1500 Fp16 Grad Scale: 16384 Required: 23 hours
Training: 2021-03-14 21:56:24,250-Speed 3866.31 samples/sec Loss 32.7610 Epoch: 0 Global Step: 1550 Fp16 Grad Scale: 16384 Required: 23 hours
Training: 2021-03-14 21:56:39,212-Speed 3422.07 samples/sec Loss 32.4523 Epoch: 0 Global Step: 1600 Fp16 Grad Scale: 16384 Required: 23 hours
Training: 2021-03-14 21:56:51,598-Speed 4133.89 samples/sec Loss 32.0828 Epoch: 0 Global Step: 1650 Fp16 Grad Scale: 16384 Required: 23 hours
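To compare runs like the two above, the log lines can be parsed and averaged. The regular expression below is a sketch written against the line format shown in this thread, not an official format specification; the three sample lines are copied from the first run.

```python
import re

# Pattern inferred from the log lines quoted in this thread (an assumption, not a spec).
LOG_PATTERN = re.compile(
    r"Training: (?P<timestamp>\S+ \S+)-Speed (?P<speed>[\d.]+) samples/sec\s+"
    r"Loss (?P<loss>[\d.]+)\s+Epoch: (?P<epoch>\d+)\s+Global Step: (?P<step>\d+)\s+"
    r"Fp16 Grad Scale: (?P<scale>\d+)\s+Required: (?P<hours>\d+) hours"
)


def parse_log(text: str):
    """Return one dict per matching log line, with numeric fields converted."""
    records = []
    for match in LOG_PATTERN.finditer(text):
        records.append({
            "timestamp": match["timestamp"],
            "speed": float(match["speed"]),
            "loss": float(match["loss"]),
            "epoch": int(match["epoch"]),
            "step": int(match["step"]),
            "required_hours": int(match["hours"]),
        })
    return records


sample = """\
Training: 2021-03-16 00:16:03,415-Speed 3262.77 samples/sec Loss 34.2626 Epoch: 0 Global Step: 1250 Fp16 Grad Scale: 16384 Required: 31 hours
Training: 2021-03-16 00:16:19,780-Speed 3128.82 samples/sec Loss 33.9931 Epoch: 0 Global Step: 1300 Fp16 Grad Scale: 16384 Required: 31 hours
Training: 2021-03-16 00:16:35,901-Speed 3176.06 samples/sec Loss 33.6923 Epoch: 0 Global Step: 1350 Fp16 Grad Scale: 16384 Required: 31 hours
"""

records = parse_log(sample)
mean_speed = sum(r["speed"] for r in records) / len(records)
print(f"{len(records)} lines parsed, mean speed {mean_speed:.2f} samples/sec")
```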
The training logs and pretrained models will be released in a few days.
Glint360K(20Epochs) + FP16 + V100*8 + PartialFC 0.1 + tmpfs + batchsize 128
+--------------------------------------+-------+-------+--------+-------+-------+-------+
| Methods | 1e-06 | 1e-05 | 0.0001 | 0.001 | 0.01 | 0.1 |
+--------------------------------------+-------+-------+--------+-------+-------+-------+
| glint360k_cosface_r50_fp16_0.1-IJBC | 91.48 | 95.61 | 96.97 | 97.98 | 98.71 | 99.29 |
| glint360k_cosface_r100_fp16_0.1-IJBC | 90.58 | 95.88 | 97.32 | 98.19 | 98.73 | 99.24 |
+--------------------------------------+-------+-------+--------+-------+-------+-------+
Batch Size: 128
@anxiangsir @leoluopy May I ask whether 128 here means batch_size_per_gpu = 16 in your setting? You said you use 8 GPUs, and 16 * 8 = 128. A Tesla V100 has 32 GB of memory, yet you only use batch_size = 16? Please correct me if I'm wrong, thank you!
@anxiangsir @leoluopy Hi, thank you for the explanation.
During training, two files are saved: softmax_weight and softmax_weight_mom. I understand the first one: because of model-parallel training, you save the weight matrix locally for each rank and later bring them together to calculate the final softmax.
Does softmax_weight_mom stand for SGD momentum and have the same shape as softmax_weight (C * embedding size)?
Thank you in advance.
Yes
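Why the momentum buffer matches the weight shape can be seen from a single update step. This is a generic sketch of SGD with momentum on nested lists, assuming the common form `buf = momentum * buf + grad; weight -= lr * buf`; it is not the arcface_torch code, and the function name and hyperparameter values are hypothetical.

```python
def sgd_momentum_step(weight, grad, momentum_buf, lr=0.1, momentum=0.9):
    """Apply one in-place SGD-with-momentum step to same-shaped nested lists."""
    for i, row in enumerate(weight):
        for j in range(len(row)):
            # One momentum entry per weight entry, so the buffer has the weight's shape.
            momentum_buf[i][j] = momentum * momentum_buf[i][j] + grad[i][j]
            row[j] -= lr * momentum_buf[i][j]


# Toy class-center matrix: 2 classes x 2 embedding dimensions.
weight = [[1.0, 2.0], [3.0, 4.0]]
grad = [[0.5, 0.5], [0.5, 0.5]]
momentum_buf = [[0.0 for _ in row] for row in weight]  # same shape as weight

sgd_momentum_step(weight, grad, momentum_buf)
print("weight      :", weight)
print("momentum_buf:", momentum_buf)
```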
I found in the code that the backbone is wrapped by DistributedDataParallel, so its gradients are synced across GPUs. Conversely, module_partial_fc isn't wrapped by DistributedDataParallel. Will the center weights differ between GPUs? And should the centers be the same on every GPU?
@anxiangsir