deepinsight / insightface

State-of-the-art 2D and 3D Face Analysis Project
https://insightface.ai

How do I deal with this when I use my own training dataset? #2121

Open starzhoume opened 1 year ago

starzhoume commented 1 year ago

1. I am using Windows 10.
2. I built the dataset from the original LFW pictures.

The error is below:

```
(face) D:\facerecogntion\insightface\recognition\arcface_torch>python -m torch.distributed.launch --nproc_per_node=1 --nnodes=1 --node_rank=0 --master_addr="127.0.0.1" --master_port=12581 train.py configs/ms1mv2_mbf
NOTE: Redirects are currently not supported in Windows or MacOs.
D:\pyproject\conda\envs\face\lib\site-packages\torch\distributed\launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  FutureWarning,
[W C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:558] [c10d] The client socket has failed to connect to [activate.navicat.com]:12581 (system error: 10049 - The requested address is not valid in its context.).
[W C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:558] [c10d] The client socket has failed to connect to [activate.navicat.com]:12581 (system error: 10049 - The requested address is not valid in its context.).
[W C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:558] [c10d] The client socket has failed to connect to [activate.navicat.com]:12581 (system error: 10049 - The requested address is not valid in its context.).
[W C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:558] [c10d] The client socket has failed to connect to [activate.navicat.com]:12581 (system error: 10049 - The requested address is not valid in its context.).
Training: 2022-10-01 08:49:58,472-rank_id: 0
D:\pyproject\conda\envs\face\lib\site-packages\tensorflow\python\pywrap_tensorflow_internal.py:15: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
D:\pyproject\conda\envs\face\lib\site-packages\tensorflow\python\util\nest.py:1286: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.9 it will stop working
  _pywrap_tensorflow.RegisterType("Mapping", _collections.Mapping)
D:\pyproject\conda\envs\face\lib\site-packages\tensorflow\python\util\nest.py:1287: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.9 it will stop working
  _pywrap_tensorflow.RegisterType("Sequence", _collections.Sequence)
D:\pyproject\conda\envs\face\lib\site-packages\tensorflow\python\training\tracking\object_identity.py:61: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.9 it will stop working
  class ObjectIdentityDictionary(collections.MutableMapping):
D:\pyproject\conda\envs\face\lib\site-packages\tensorflow\python\training\tracking\object_identity.py:112: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.9 it will stop working
  class ObjectIdentitySet(collections.MutableSet):
D:\pyproject\conda\envs\face\lib\site-packages\tensorflow\python\training\tracking\data_structures.py:374: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.9 it will stop working
  class _ListWrapper(List, collections.MutableSequence,
D:\pyproject\conda\envs\face\lib\site-packages\keras_preprocessing\image\utils.py:23: DeprecationWarning: NEAREST is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.NEAREST or Dither.NONE instead.
  'nearest': pil_image.NEAREST,
D:\pyproject\conda\envs\face\lib\site-packages\keras_preprocessing\image\utils.py:24: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
  'bilinear': pil_image.BILINEAR,
D:\pyproject\conda\envs\face\lib\site-packages\keras_preprocessing\image\utils.py:25: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
  'bicubic': pil_image.BICUBIC,
D:\pyproject\conda\envs\face\lib\site-packages\keras_preprocessing\image\utils.py:28: DeprecationWarning: HAMMING is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.HAMMING instead.
  if hasattr(pil_image, 'HAMMING'):
D:\pyproject\conda\envs\face\lib\site-packages\keras_preprocessing\image\utils.py:29: DeprecationWarning: HAMMING is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.HAMMING instead.
  _PIL_INTERPOLATION_METHODS['hamming'] = pil_image.HAMMING
D:\pyproject\conda\envs\face\lib\site-packages\keras_preprocessing\image\utils.py:30: DeprecationWarning: BOX is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BOX instead.
  if hasattr(pil_image, 'BOX'):
D:\pyproject\conda\envs\face\lib\site-packages\keras_preprocessing\image\utils.py:31: DeprecationWarning: BOX is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BOX instead.
  _PIL_INTERPOLATION_METHODS['box'] = pil_image.BOX
D:\pyproject\conda\envs\face\lib\site-packages\keras_preprocessing\image\utils.py:33: DeprecationWarning: LANCZOS is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.LANCZOS instead.
  if hasattr(pil_image, 'LANCZOS'):
D:\pyproject\conda\envs\face\lib\site-packages\keras_preprocessing\image\utils.py:34: DeprecationWarning: LANCZOS is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.LANCZOS instead.
  _PIL_INTERPOLATION_METHODS['lanczos'] = pil_image.LANCZOS
D:\pyproject\conda\envs\face\lib\site-packages\torch\nn\parallel\distributed.py:1737: UserWarning: You passed find_unused_parameters=true to DistributedDataParallel, `_set_static_graph` will detect unused parameters automatically, so you do not need to set find_unused_parameters=true, just be sure these unused parameters will not change during training loop while calling `_set_static_graph`.
  "You passed find_unused_parameters=true to DistributedDataParallel, "
Training: 2022-10-01 08:50:01,847-: margin_list [1.0, 0.5, 0.0]
Training: 2022-10-01 08:50:01,848-: network mbf
Training: 2022-10-01 08:50:01,849-: resume False
Training: 2022-10-01 08:50:01,853-: save_all_states False
Training: 2022-10-01 08:50:01,854-: output work_dirs\ms1mv2_mbf
Training: 2022-10-01 08:50:01,854-: embedding_size 128
Training: 2022-10-01 08:50:01,855-: sample_rate 1.0
Training: 2022-10-01 08:50:01,856-: interclass_filtering_threshold 0
Training: 2022-10-01 08:50:01,863-: fp16 True
Training: 2022-10-01 08:50:01,863-: batch_size 2
Training: 2022-10-01 08:50:01,863-: optimizer sgd
Training: 2022-10-01 08:50:01,864-: lr 0.1
Training: 2022-10-01 08:50:01,864-: momentum 0.9
Training: 2022-10-01 08:50:01,865-: weight_decay 0.0005
Training: 2022-10-01 08:50:01,873-: verbose 2000
Training: 2022-10-01 08:50:01,873-: frequent 10
Training: 2022-10-01 08:50:01,874-: dali False
Training: 2022-10-01 08:50:01,874-: gradient_acc 1
Training: 2022-10-01 08:50:01,875-: seed 2048
Training: 2022-10-01 08:50:01,875-: num_workers 0
Training: 2022-10-01 08:50:01,875-: rec ../datasets/faces_lfw
Training: 2022-10-01 08:50:01,876-: num_classes 6
Training: 2022-10-01 08:50:01,885-: num_image 58
Training: 2022-10-01 08:50:01,885-: num_epoch 5
Training: 2022-10-01 08:50:01,886-: warmup_epoch 0
Training: 2022-10-01 08:50:01,887-: val_targets ['lfw', 'cfp_fp', 'agedb_30']
Training: 2022-10-01 08:50:01,894-: total_batch_size 2
Training: 2022-10-01 08:50:01,894-: warmup_step 0
Training: 2022-10-01 08:50:01,895-: total_step 145
loading bin 0 loading bin 1000 loading bin 2000 loading bin 3000 loading bin 4000 loading bin 5000 loading bin 6000 loading bin 7000 loading bin 8000 loading bin 9000 loading bin 10000 loading bin 11000
torch.Size([12000, 3, 112, 112])
loading bin 0 loading bin 1000 loading bin 2000 loading bin 3000 loading bin 4000 loading bin 5000 loading bin 6000 loading bin 7000 loading bin 8000 loading bin 9000 loading bin 10000 loading bin 11000 loading bin 12000 loading bin 13000
torch.Size([14000, 3, 112, 112])
loading bin 0 loading bin 1000 loading bin 2000 loading bin 3000 loading bin 4000 loading bin 5000 loading bin 6000 loading bin 7000 loading bin 8000 loading bin 9000 loading bin 10000 loading bin 11000
torch.Size([12000, 3, 112, 112])
D:\pyproject\conda\envs\face\lib\site-packages\torch\optim\lr_scheduler.py:136: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
Training: 2022-10-01 08:50:45,541-Reducer buckets have been rebuilt in this iteration.
Training: 2022-10-01 08:50:46,859-Speed 27.43 samples/sec Loss 38.3868 LearningRate 0.075510 Epoch: 0 Global Step: 20 Fp16 Grad Scale: 64 Required: 0 hours
Training: 2022-10-01 08:50:47,678-Speed 24.54 samples/sec Loss 34.1343 LearningRate 0.064000 Epoch: 1 Global Step: 30 Fp16 Grad Scale: 64 Required: 0 hours
Training: 2022-10-01 08:50:48,408-Speed 27.43 samples/sec Loss 34.6079 LearningRate 0.053441 Epoch: 1 Global Step: 40 Fp16 Grad Scale: 32 Required: 0 hours
Training: 2022-10-01 08:50:49,147-Speed 27.06 samples/sec Loss 35.2048 LearningRate 0.043834 Epoch: 1 Global Step: 50 Fp16 Grad Scale: 32 Required: 0 hours
Training: 2022-10-01 08:50:49,936-Speed 25.35 samples/sec Loss 34.2611 LearningRate 0.035177 Epoch: 2 Global Step: 60 Fp16 Grad Scale: 32 Required: 0 hours
Training: 2022-10-01 08:50:50,683-Speed 26.77 samples/sec Loss 34.6517 LearningRate 0.027472 Epoch: 2 Global Step: 70 Fp16 Grad Scale: 32 Required: 0 hours
Training: 2022-10-01 08:50:51,426-Speed 26.92 samples/sec Loss 32.0162 LearningRate 0.020718 Epoch: 2 Global Step: 80 Fp16 Grad Scale: 32 Required: 0 hours
Training: 2022-10-01 08:50:52,204-Speed 25.74 samples/sec Loss 31.9312 LearningRate 0.014916 Epoch: 3 Global Step: 90 Fp16 Grad Scale: 32 Required: 0 hours
Training: 2022-10-01 08:50:52,952-Speed 26.77 samples/sec Loss 31.7183 LearningRate 0.010064 Epoch: 3 Global Step: 100 Fp16 Grad Scale: 32 Required: 0 hours
Training: 2022-10-01 08:50:53,692-Speed 27.03 samples/sec Loss 35.9666 LearningRate 0.006164 Epoch: 3 Global Step: 110 Fp16 Grad Scale: 32 Required: 0 hours
Training: 2022-10-01 08:50:54,470-Speed 25.74 samples/sec Loss 30.7639 LearningRate 0.003215 Epoch: 4 Global Step: 120 Fp16 Grad Scale: 32 Required: 0 hours
Training: 2022-10-01 08:50:55,206-Speed 27.21 samples/sec Loss 29.2269 LearningRate 0.001218 Epoch: 4 Global Step: 130 Fp16 Grad Scale: 32 Required: 0 hours
Training: 2022-10-01 08:50:55,939-Speed 27.32 samples/sec Loss 35.6505 LearningRate 0.000171 Epoch: 4 Global Step: 140 Fp16 Grad Scale: 32 Required: 0 hours
Exception ignored in: <function MXRecordIO.__del__ at 0x000001B30E0313A8>
Traceback (most recent call last):
  File "D:\pyproject\conda\envs\face\lib\site-packages\mxnet\recordio.py", line 84, in __del__
  File "D:\pyproject\conda\envs\face\lib\site-packages\mxnet\recordio.py", line 217, in close
TypeError: super() argument 1 must be type, not None
```
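Two asides about this log. The `[activate.navicat.com]` in the c10d socket warnings is just how 127.0.0.1 happens to resolve on this machine, which usually points to a leftover entry in `C:\Windows\System32\drivers\etc\hosts`; it is noisy but unrelated to the dataset. And the FutureWarning says `torch.distributed.launch` is deprecated, so on recent PyTorch the same single-node run would look roughly like this with torchrun (assuming train.py reads `os.environ['LOCAL_RANK']` rather than a `--local_rank` argument, as the warning advises):

```
torchrun --nproc_per_node=1 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=12581 train.py configs/ms1mv2_mbf
```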

3. How should I deal with this?
Is the error caused by how I built the dataset, or by something else?

Best regards, Star

starzhoume commented 1 year ago

When I use the faces_emore dataset, training works fine!

starzhoume commented 1 year ago

ms1mv2_mbf.py is below:

```python
from easydict import EasyDict as edict

# make training faster
# our RAM is 256G
# mount -t tmpfs -o size=140G tmpfs /train_tmp

config = edict()
config.margin_list = (1.0, 0.5, 0.0)
config.network = "mbf"
config.resume = False
config.output = None
config.embedding_size = 128
config.sample_rate = 1.0
config.fp16 = True
config.momentum = 0.9
config.weight_decay = 5e-4
config.batch_size = 2
config.lr = 0.1
config.verbose = 2000
config.dali = False

config.rec = "../datasets/faces_lfw"
config.num_classes = 6
config.num_image = 58

config.num_epoch = 5
config.warmup_epoch = 0
config.val_targets = ['lfw', 'cfp_fp', "agedb_30"]
```
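On the question above of whether the dataset build itself is broken: one quick check is to read the .rec/.idx pair back with mxnet before training. This is only a sketch; the train.idx/train.rec file names under `config.rec` and the meta-record layout follow the usual insightface convention and are assumptions here, not something confirmed in this thread:

```python
# Sanity-check a custom insightface-style .rec dataset (sketch; paths assumed).
import numbers
import mxnet as mx

prefix = "../datasets/faces_lfw/train"  # assumed: config.rec + "/train.{rec,idx}"
reader = mx.recordio.MXIndexedRecordIO(prefix + ".idx", prefix + ".rec", "r")

# In the insightface format, record 0 is a meta record (flag > 0) whose
# label marks where the per-image records end.
header0, _ = mx.recordio.unpack(reader.read_idx(0))
print("meta record:", header0.flag, header0.label)

# Decode the first real image record (unpack_img uses OpenCV internally).
header, img = mx.recordio.unpack_img(reader.read_idx(1))
label = header.label if isinstance(header.label, numbers.Number) else header.label[0]
print("first image:", img.shape, "label:", int(label))

reader.close()  # close explicitly rather than relying on __del__ at shutdown
```

If this prints a (112, 112, 3) image and a label inside [0, num_classes), the conversion is probably fine, which would match the fact that training runs all the way to the end here.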

starzhoume commented 1 year ago

```
Training: 2022-10-01 17:04:33,580-Speed 77.13 samples/sec Loss 8.5295 LearningRate 0.000517 Epoch: 55 Global Step: 1950 Fp16 Grad Scale: 8192 Required: 0 hours
Training: 2022-10-01 17:04:41,882-Speed 77.09 samples/sec Loss 8.7947 LearningRate 0.000451 Epoch: 55 Global Step: 1960 Fp16 Grad Scale: 8192 Required: 0 hours
Training: 2022-10-01 17:04:50,803-Speed 71.74 samples/sec Loss 8.6475 LearningRate 0.000389 Epoch: 56 Global Step: 1970 Fp16 Grad Scale: 8192 Required: 0 hours
Training: 2022-10-01 17:04:59,227-Speed 75.97 samples/sec Loss 8.4973 LearningRate 0.000332 Epoch: 56 Global Step: 1980 Fp16 Grad Scale: 8192 Required: 0 hours
Training: 2022-10-01 17:05:07,638-Speed 76.10 samples/sec Loss 8.5923 LearningRate 0.000279 Epoch: 56 Global Step: 1990 Fp16 Grad Scale: 8192 Required: 0 hours
Training: 2022-10-01 17:05:16,356-Speed 73.41 samples/sec Loss 8.4769 LearningRate 0.000231 Epoch: 57 Global Step: 2000 Fp16 Grad Scale: 8192 Required: 0 hours
Training: 2022-10-01 17:07:24,770-[lfw][2000]XNorm: 7.471885
Training: 2022-10-01 17:07:24,770-[lfw][2000]Accuracy-Flip: 0.59833+-0.02929
Training: 2022-10-01 17:07:24,770-[lfw][2000]Accuracy-Highest: 0.59833
Training: 2022-10-01 17:09:51,601-[cfp_fp][2000]XNorm: 8.437674
Training: 2022-10-01 17:09:51,601-[cfp_fp][2000]Accuracy-Flip: 0.54029+-0.01130
Training: 2022-10-01 17:09:51,601-[cfp_fp][2000]Accuracy-Highest: 0.54200
Training: 2022-10-01 17:11:57,545-[agedb_30][2000]XNorm: 8.578558
Training: 2022-10-01 17:11:57,545-[agedb_30][2000]Accuracy-Flip: 0.54233+-0.02018
Training: 2022-10-01 17:11:57,545-[agedb_30][2000]Accuracy-Highest: 0.54233
Training: 2022-10-01 17:12:05,558-Speed 1.56 samples/sec Loss 8.4294 LearningRate 0.000188 Epoch: 57 Global Step: 2010 Fp16 Grad Scale: 8192 Required: 0 hours
Training: 2022-10-01 17:12:13,703-Speed 78.57 samples/sec Loss 8.6742 LearningRate 0.000149 Epoch: 57 Global Step: 2020 Fp16 Grad Scale: 8192 Required: 0 hours
Training: 2022-10-01 17:12:21,909-Speed 77.99 samples/sec Loss 8.8757 LearningRate 0.000114 Epoch: 57 Global Step: 2030 Fp16 Grad Scale: 8192 Required: 0 hours
Training: 2022-10-01 17:12:30,618-Speed 73.49 samples/sec Loss 8.4928 LearningRate 0.000084 Epoch: 58 Global Step: 2040 Fp16 Grad Scale: 8192 Required: 0 hours
Training: 2022-10-01 17:12:38,987-Speed 76.62 samples/sec Loss 8.6366 LearningRate 0.000059 Epoch: 58 Global Step: 2050 Fp16 Grad Scale: 8192 Required: 0 hours
Training: 2022-10-01 17:12:47,211-Speed 77.82 samples/sec Loss 8.4482 LearningRate 0.000038 Epoch: 58 Global Step: 2060 Fp16 Grad Scale: 8192 Required: 0 hours
Training: 2022-10-01 17:12:55,842-Speed 74.15 samples/sec Loss 8.8024 LearningRate 0.000022 Epoch: 59 Global Step: 2070 Fp16 Grad Scale: 8192 Required: 0 hours
Training: 2022-10-01 17:13:04,107-Speed 77.58 samples/sec Loss 8.4868 LearningRate 0.000010 Epoch: 59 Global Step: 2080 Fp16 Grad Scale: 8192 Required: 0 hours
Training: 2022-10-01 17:13:12,392-Speed 77.25 samples/sec Loss 8.4619 LearningRate 0.000003 Epoch: 59 Global Step: 2090 Fp16 Grad Scale: 8192 Required: 0 hours
Training: 2022-10-01 17:13:20,621-Speed 77.77 samples/sec Loss 8.4113 LearningRate 0.000000 Epoch: 59 Global Step: 2100 Fp16 Grad Scale: 8192 Required: -0 hours
Exception ignored in: <function MXRecordIO.__del__ at 0x000001B30E0313A8>
Traceback (most recent call last):
  File "D:\pyproject\conda\envs\face\lib\site-packages\mxnet\recordio.py", line 84, in __del__
  File "D:\pyproject\conda\envs\face\lib\site-packages\mxnet\recordio.py", line 217, in close
TypeError: super() argument 1 must be type, not None
```

The total number of epochs is 60, so training had just finished. Maybe the error above is only printed when training terminates.
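That reading looks right: the traceback comes from `MXRecordIO.__del__` firing during interpreter shutdown. By that point the mxnet module's globals have typically been cleared, so the `super(...)` call inside `close()` receives None instead of the class, hence "super() argument 1 must be type, not None". It is cosmetic, since training has already completed. A minimal sketch of the workaround for a reader you open yourself (file names are placeholders): close it while the interpreter is still fully alive, e.g. via atexit:

```python
import atexit
import mxnet as mx

# Placeholder paths; any MXIndexedRecordIO you open yourself works the same way.
reader = mx.recordio.MXIndexedRecordIO("train.idx", "train.rec", "r")

# atexit handlers run before module globals are torn down, so close() still
# sees a live class; when __del__ runs later, the early "already closed"
# check returns before the broken super() call is reached.
atexit.register(reader.close)
```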

Thanks.