deepinsight / insightface

State-of-the-art 2D and 3D Face Analysis Project
https://insightface.ai

[arcface_torch] train.py in arcface_torch DistributedDataParallel should wrap the module_partial_fc ? #1430

Open leoluopy opened 3 years ago

leoluopy commented 3 years ago

I found in the code that the backbone is wrapped by DistributedDataParallel, so the model will sync its gradients across the different GPUs. Conversely, module_partial_fc isn't wrapped by DistributedDataParallel. Will the centers' weights be different on different GPUs, and should the centers be the same on all GPUs?

@anxiangsir


    backbone = torch.nn.parallel.DistributedDataParallel(
        module=backbone, broadcast_buffers=False, device_ids=[local_rank])
    backbone.train()

    margin_softmax = eval("losses.{}".format(args.loss))()
    module_partial_fc = PartialFC(
        rank=rank, local_rank=local_rank, world_size=world_size, resume=args.resume,
        batch_size=cfg.batch_size, margin_softmax=margin_softmax, num_classes=cfg.num_classes,
        sample_rate=cfg.sample_rate, embedding_size=cfg.embedding_size, prefix=cfg.output)
anxiangsir commented 3 years ago
  1. The class centers' weights are different on different GPUs.
  2. If you train 360k identities on 8 GPUs, the class centers will be partitioned into 8 sub-matrices, one on each GPU.

For example:
- GPU_0 (rank 0): class center range is (0, 45k)
- GPU_1 (rank 1): class center range is (45k, 90k)
- ......
- GPU_7 (rank 7): class center range is (315k, 360k)
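
As a rough illustration of this partitioning (a minimal sketch, not the actual PartialFC code, assuming an even split):

```python
# Hypothetical even partitioning of 360k class centers across 8 ranks,
# matching the ranges listed above.
num_classes = 360_000
world_size = 8
num_local = num_classes // world_size   # 45,000 centers stored on each GPU

for rank in range(world_size):
    start = rank * num_local
    end = start + num_local
    print(f"GPU_{rank} (rank {rank}): class centers [{start}, {end})")
```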

Because these are different weights, the model does not sync their gradients across GPUs. Instead, each GPU is responsible for computing the dot products between the input features and the sub-matrix stored on it, together with the local sum needed for the softmax. Each GPU then gathers the local sums from the other GPUs to obtain the full-class softmax. By communicating only the local sums, the softmax can be completed with a small amount of communication.
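
The following is a minimal, self-contained sketch of that scheme (not the PartialFC implementation; it assumes `torch.distributed` is already initialized and that `features` is the batch already gathered from all ranks). It shows how the full-class softmax denominator can be obtained by all-reducing per-sample scalars instead of the full logits matrix:

```python
import torch
import torch.distributed as dist


def model_parallel_softmax(features, local_centers):
    """features: (B, D) gathered embeddings; local_centers: (C // world_size, D)."""
    # Logits against only this rank's slice of the class centers.
    logits = features @ local_centers.t()                # (B, C_local)

    # Global per-sample max for numerical stability (communicates B scalars).
    global_max = logits.max(dim=1, keepdim=True).values
    dist.all_reduce(global_max, op=dist.ReduceOp.MAX)

    # Local sum of exponentials, all-reduced into the full-class denominator
    # (communicates another B scalars instead of the full (B, C) logits).
    exp_logits = (logits - global_max).exp()
    denom = exp_logits.sum(dim=1, keepdim=True)
    dist.all_reduce(denom, op=dist.ReduceOp.SUM)

    # Softmax probabilities for the classes stored on this rank.
    return exp_logits / denom
```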

Data parallel vs Model parallel

Embedding Size: 512
World Size (number of GPUs): 8
C (number of class centers): 1,000,000
Batch Size: 128

1. GPU Memory

|        | Data Parallel          | Model Parallel                      |
|--------|------------------------|-------------------------------------|
| W      | C * Embedding Size * 3 | C / World Size * Embedding Size * 3 |
| Logits | C * Batch Size         | C / World Size * Batch Size         |

*3 means weight, gradient, and momentum.
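
As a quick sanity check of the memory row (a back-of-the-envelope sketch; fp32 storage, i.e. 4 bytes per value, is my assumption):

```python
C, D, world_size = 1_000_000, 512, 8
bytes_per_value = 4  # assuming fp32

# "* 3" = weight + gradient + momentum, as noted above.
data_parallel_W = C * D * 3 * bytes_per_value                   # ~6.1 GB on every GPU
model_parallel_W = (C // world_size) * D * 3 * bytes_per_value  # ~0.77 GB per GPU
print(f"{data_parallel_W / 1e9:.2f} GB vs {model_parallel_W / 1e9:.2f} GB")
```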

2. Communication

|               | Data Parallel             | Model Parallel                               |
|---------------|---------------------------|----------------------------------------------|
| Communication | C * Embedding Size (500M) | (2 + 2 * Embedding Size) * Batch Size (0.1M) |
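
The same check for the communication row, reading the formula as two per-sample scalars plus the embedding and its gradient per sample (that reading is my assumption):

```python
C, D, batch_size = 1_000_000, 512, 128

data_parallel_comm = C * D                        # 512,000,000 values  (~"500M")
model_parallel_comm = (2 + 2 * D) * batch_size    # 131,328 values      (~"0.1M")
print(data_parallel_comm, model_parallel_comm)
```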

3. Maximum number of Identities

|                      | Data Parallel | Model Parallel | Model Parallel with Partial FC |
|----------------------|---------------|----------------|--------------------------------|
| Number of Identities | 500K          | 8 million      | 25 million                     |

TIPS

arcface_torch can also train on multiple nodes; the class centers will be partitioned across all of the GPUs.

leoluopy commented 3 years ago

@anxiangsir Thanks for your quick reply. How long does training take on the Glint dataset with 8 GPUs?

anxiangsir commented 3 years ago

When training with iresnet100, it will take 32 hours.

Training: 2021-03-16 00:16:03,415-Speed 3262.77 samples/sec   Loss 34.2626   Epoch: 0   Global Step: 1250   Fp16 Grad Scale: 16384   Required: 31 hours
Training: 2021-03-16 00:16:19,780-Speed 3128.82 samples/sec   Loss 33.9931   Epoch: 0   Global Step: 1300   Fp16 Grad Scale: 16384   Required: 31 hours
Training: 2021-03-16 00:16:35,901-Speed 3176.06 samples/sec   Loss 33.6923   Epoch: 0   Global Step: 1350   Fp16 Grad Scale: 16384   Required: 31 hours
Training: 2021-03-16 00:16:53,313-Speed 2940.70 samples/sec   Loss 33.3387   Epoch: 0   Global Step: 1400   Fp16 Grad Scale: 16384   Required: 31 hours
Training: 2021-03-16 00:17:12,419-Speed 2679.78 samples/sec   Loss 33.0541   Epoch: 0   Global Step: 1450   Fp16 Grad Scale: 16384   Required: 31 hours
Training: 2021-03-16 00:17:29,129-Speed 3064.18 samples/sec   Loss 32.6634   Epoch: 0   Global Step: 1500   Fp16 Grad Scale: 16384   Required: 31 hours
Training: 2021-03-16 00:17:45,230-Speed 3179.93 samples/sec   Loss 32.4335   Epoch: 0   Global Step: 1550   Fp16 Grad Scale: 16384   Required: 31 hours
Training: 2021-03-16 00:18:01,016-Speed 3243.57 samples/sec   Loss 32.0923   Epoch: 0   Global Step: 1600   Fp16 Grad Scale: 16384   Required: 31 hours
Training: 2021-03-16 00:18:16,899-Speed 3223.66 samples/sec   Loss 31.7564   Epoch: 0   Global Step: 1650   Fp16 Grad Scale: 16384   Required: 31 hours
Training: 2021-03-16 00:18:33,028-Speed 3174.48 samples/sec   Loss 31.4021   Epoch: 0   Global Step: 1700   Fp16 Grad Scale: 16384   Required: 31 hours
Training: 2021-03-16 00:18:48,854-Speed 3235.33 samples/sec   Loss 31.1092   Epoch: 0   Global Step: 1750   Fp16 Grad Scale: 16384   Required: 31 hours
Training: 2021-03-16 00:19:04,634-Speed 3244.78 samples/sec   Loss 30.7287   Epoch: 0   Global Step: 1800   Fp16 Grad Scale: 16384   Required: 31 hours
Training: 2021-03-16 00:19:20,688-Speed 3189.25 samples/sec   Loss 30.4005   Epoch: 0   Global Step: 1850   Fp16 Grad Scale: 16384   Required: 31 hours
Training: 2021-03-16 00:19:37,478-Speed 3049.57 samples/sec   Loss 30.0752   Epoch: 0   Global Step: 1900   Fp16 Grad Scale: 16384   Required: 31 hours
Training: 2021-03-16 00:19:53,631-Speed 3169.92 samples/sec   Loss 29.7154   Epoch: 0   Global Step: 1950   Fp16 Grad Scale: 16384   Required: 31 hours
Training: 2021-03-16 00:20:09,726-Speed 3181.20 samples/sec   Loss 29.4090   Epoch: 0   Global Step: 2000   Fp16 Grad Scale: 16384   Required: 31 hours

When training with iresnet50, it will take 23 hours.

Training: 2021-03-14 21:52:53,887-Speed 4079.84 samples/sec   Loss 37.9597   Epoch: 0   Global Step: 650   Fp16 Grad Scale: 8192   Required: 24 hours
Training: 2021-03-14 21:53:05,425-Speed 4437.64 samples/sec   Loss 37.7011   Epoch: 0   Global Step: 700   Fp16 Grad Scale: 16384   Required: 24 hours
Training: 2021-03-14 21:53:17,055-Speed 4402.58 samples/sec   Loss 37.3634   Epoch: 0   Global Step: 750   Fp16 Grad Scale: 16384   Required: 24 hours
Training: 2021-03-14 21:53:28,626-Speed 4424.86 samples/sec   Loss 37.0828   Epoch: 0   Global Step: 800   Fp16 Grad Scale: 16384   Required: 24 hours
Training: 2021-03-14 21:53:40,234-Speed 4410.72 samples/sec   Loss 36.8450   Epoch: 0   Global Step: 850   Fp16 Grad Scale: 16384   Required: 24 hours
Training: 2021-03-14 21:53:51,735-Speed 4452.12 samples/sec   Loss 36.5827   Epoch: 0   Global Step: 900   Fp16 Grad Scale: 16384   Required: 24 hours
Training: 2021-03-14 21:54:03,278-Speed 4435.85 samples/sec   Loss 36.3103   Epoch: 0   Global Step: 950   Fp16 Grad Scale: 16384   Required: 23 hours
Training: 2021-03-14 21:54:14,653-Speed 4501.39 samples/sec   Loss 36.0483   Epoch: 0   Global Step: 1000   Fp16 Grad Scale: 16384   Required: 23 hours
Training: 2021-03-14 21:54:26,408-Speed 4355.53 samples/sec   Loss 35.7620   Epoch: 0   Global Step: 1050   Fp16 Grad Scale: 16384   Required: 23 hours
Training: 2021-03-14 21:54:37,917-Speed 4448.95 samples/sec   Loss 35.4236   Epoch: 0   Global Step: 1100   Fp16 Grad Scale: 16384   Required: 23 hours
Training: 2021-03-14 21:54:49,511-Speed 4416.14 samples/sec   Loss 35.1475   Epoch: 0   Global Step: 1150   Fp16 Grad Scale: 16384   Required: 23 hours
Training: 2021-03-14 21:55:01,012-Speed 4452.12 samples/sec   Loss 34.9056   Epoch: 0   Global Step: 1200   Fp16 Grad Scale: 16384   Required: 23 hours
Training: 2021-03-14 21:55:12,764-Speed 4357.08 samples/sec   Loss 34.5845   Epoch: 0   Global Step: 1250   Fp16 Grad Scale: 16384   Required: 23 hours
Training: 2021-03-14 21:55:24,248-Speed 4458.36 samples/sec   Loss 34.3024   Epoch: 0   Global Step: 1300   Fp16 Grad Scale: 16384   Required: 23 hours
Training: 2021-03-14 21:55:36,116-Speed 4314.40 samples/sec   Loss 33.9949   Epoch: 0   Global Step: 1350   Fp16 Grad Scale: 16384   Required: 23 hours
Training: 2021-03-14 21:55:47,714-Speed 4414.60 samples/sec   Loss 33.6711   Epoch: 0   Global Step: 1400   Fp16 Grad Scale: 16384   Required: 23 hours
Training: 2021-03-14 21:55:59,377-Speed 4390.22 samples/sec   Loss 33.4009   Epoch: 0   Global Step: 1450   Fp16 Grad Scale: 16384   Required: 23 hours
Training: 2021-03-14 21:56:11,007-Speed 4402.57 samples/sec   Loss 33.0303   Epoch: 0   Global Step: 1500   Fp16 Grad Scale: 16384   Required: 23 hours
Training: 2021-03-14 21:56:24,250-Speed 3866.31 samples/sec   Loss 32.7610   Epoch: 0   Global Step: 1550   Fp16 Grad Scale: 16384   Required: 23 hours
Training: 2021-03-14 21:56:39,212-Speed 3422.07 samples/sec   Loss 32.4523   Epoch: 0   Global Step: 1600   Fp16 Grad Scale: 16384   Required: 23 hours
Training: 2021-03-14 21:56:51,598-Speed 4133.89 samples/sec   Loss 32.0828   Epoch: 0   Global Step: 1650   Fp16 Grad Scale: 16384   Required: 23 hours

The training logs and pretrained models will be released in a few days.

Configs:

Glint360K (20 epochs) + FP16 + V100 * 8 + PartialFC 0.1 + tmpfs + batch size 128

+--------------------------------------+-------+-------+--------+-------+-------+-------+
|               Methods                | 1e-06 | 1e-05 | 0.0001 | 0.001 |  0.01 |  0.1  |
+--------------------------------------+-------+-------+--------+-------+-------+-------+
| glint360k_cosface_r50_fp16_0.1-IJBC  | 91.48 | 95.61 | 96.97  | 97.98 | 98.71 | 99.29 |
| glint360k_cosface_r100_fp16_0.1-IJBC | 90.58 | 95.88 | 97.32  | 98.19 | 98.73 | 99.24 |
+--------------------------------------+-------+-------+--------+-------+-------+-------+
Light-- commented 3 years ago

> Batch Size: 128

@anxiangsir @leoluopy May I ask whether 128 here means batch_size_per_gpu = 16 in your setting? You said you use 8 GPUs, and 16 * 8 = 128. A Tesla V100 has 32 GB of memory, but you only use a batch size of 16? Please correct me if I'm wrong, thank you!

OrkhanHI commented 2 years ago

@anxiangsir @leoluopy Hi, thank you for the explanation.

During training, two files are saved: softmax_weight and softmax_weight_mom. The first one I understand: because of model-parallel training, the weight matrix is saved locally for each rank, and the pieces are later brought together to calculate the final softmax.

Does softmax_weight_mom stand for SGD momentum and have the same shape as softmax_weight (C * embedding size)?

Thank you in advance.

anxiangsir commented 2 years ago

Yes
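
For illustration only, a minimal PyTorch sketch (not the repository's code, using a small hypothetical per-rank shard) showing that SGD with momentum keeps one momentum buffer per parameter with exactly that parameter's shape, which is why softmax_weight_mom matches softmax_weight:

```python
import torch

num_local, embedding_size = 1_000, 512             # hypothetical per-rank shard of C
weight = torch.nn.Parameter(torch.randn(num_local, embedding_size))
opt = torch.optim.SGD([weight], lr=0.1, momentum=0.9)

weight.grad = torch.randn_like(weight)             # fake gradient for one update
opt.step()

# The momentum buffer created by the optimizer has the same shape as the weight.
print(opt.state[weight]["momentum_buffer"].shape)  # torch.Size([1000, 512])
```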