Training problem #721

Closed yonadance closed 3 months ago

yonadance commented 5 months ago

training problem:

  1. I ran into a problem while training on the Visdrone dataset. It occurs when running:
    python tools/train_net.py --config-file ./configs/Visdrone/sbs_R50-ibn.yml MODEL.DEVICE "cuda:0"


  2. Because I am on Windows, I did not run the make all step.
  3. The full log is as follows:
    Command Line Args: Namespace(config_file='./configs/Visdrone/sbs_R50-ibn.yml', dist_url='tcp://', eval_only=False, machine_rank=0, num_gpus=1, num_machines=1, opts=['MODEL.DEVICE', 'cuda:0'], resume=False)
    [04/06 13:08:42 fastreid]: Rank of current process: 0. World size: 1
    [04/06 13:08:43 fastreid]: Environment info:
    ----------------------  ------------------------------------------------------------------------------------
    sys.platform            win32
    Python                  3.7.16 (default, Jan 17 2023, 16:06:28) [MSC v.1916 64 bit (AMD64)]
    numpy                   1.21.6
    fastreid                1.3 @.\fastreid
    FASTREID_ENV_MODULE     <not set>
    PyTorch                 1.13.1+cu117 @D:\anaconda\envs\BOTsort\lib\site-packages\torch
    PyTorch debug build     False
    GPU available           True
    GPU 0                   NVIDIA GeForce RTX 3080
    CUDA_HOME               C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7
    Pillow                  9.5.0
    torchvision             0.14.1+cu117 @D:\anaconda\envs\BOTsort\lib\site-packages\torchvision
    torchvision arch flags  D:\anaconda\envs\BOTsort\lib\site-packages\torchvision\_C.pyd; cannot find cuobjdump
    cv2                     4.9.0
    ----------------------  ------------------------------------------------------------------------------------
    PyTorch built with:
    - C++ Version: 199711
    - MSVC 192829337
    - Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
    - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
    - OpenMP 2019
    - LAPACK is enabled (usually provided by MKL)
    - CPU capability usage: AVX2
    - CUDA Runtime 11.7
    - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
    - CuDNN 8.5
    - Magma 2.5.4

    [04/06 13:08:43 fastreid]: Command line arguments: Namespace(config_file='./configs/Visdrone/sbs_R50-ibn.yml', dist_url='tcp://', eval_only=False, machine_rank=0, num_gpus=1, num_machines=1, opts=['MODEL.DEVICE', 'cuda:0'], resume=False)
    [04/06 13:08:43 fastreid]: Contents of args.config_file=./configs/Visdrone/sbs_R50-ibn.yml:
    b'# coding:utf-8 _\r\nBASE: ../Base-SBS.yml\r\n\r\n# \xe8\xae\xbe\xe7\xbd\xae\xe7\x9b\xb8\xe5\xba\x94\xe7\x9a\x84\xe6\x95\xb0\xe6\x8d\xae\xe5\xa2\x9e\xe5\xbc\xba\r\nINPUT:\r\n SIZE_TRAIN: [256, 256]\r\n SIZE_TEST: [256, 256]\r\n\r\nMODEL:\r\n BACKBONE:\r\n WITH_IBN: True\r\n WITH_NL: True #\xe6\xa8\xa1\xe5\x9e\x8b\xe6\x98\xaf\xe5\x90\xa6\xe4\xbd\xbf\xe7\x94\xa8No_local module\r\n PRETRAIN: True\r\n PRETRAIN_PATH: \'pretrained\veri_sbs_R50-ibn.pth\'\r\n HEADS:\r\n POOL_LAYER: GeneralizedMeanPooling # HEAD POOL_LAYERS\r\n LOSSES:\r\n NAME: ("CrossEntropyLoss", "TripletLoss",)\r\n CE:\r\n EPSILON: 0.1\r\n SCALE: 1.0\r\n\r\n TRI:\r\n MARGIN: 0.0 # \xe8\x80\x83\xe8\x99\x91\xe8\xa6\x81\xe4\xb8\x8d\xe8\xa6\x81\xe8\xbf\x9b\xe8\xa1\x8c\xe5\xaf\xb9\xe5\xba\x94\xe7\x9a\x84\xe8\xb6\x85\xe5\x8f\x82\xe6\x95\xb0\xe7\x9a\x84\xe8\xb0\x83\xe6\x95\xb4\r\n HARD_MINING: True\r\n NORM_FEAT: False\r\n SCALE: 1.0\r\nSOLVER:\r\n OPT: SGD\r\n BASE_LR: 0.0001# 0.01\r\n ETA_MIN_LR: 7.7e-5\r\n\r\n IMS_PER_BATCH: 128 # batchsize\r\n MAX_EPOCH: 10 # 60\r\n WARMUP_ITERS: 3000\r\n FREEZE_ITERS: 3000\r\n\r\n CHECKPOINT_PERIOD: 10\r\n\r\nDATASETS:\r\n NAMES: ("Visdrone",)\r\n TESTS: ("Visdrone",)\r\n\r\nDATALOADER:\r\n SAMPLER_TRAIN: BalancedIdentitySampler\r\n NUM_INSTANCE: 4\r\n NUM_WORKERS: 8\r\nTEST:\r\n EVAL_PERIOD: 10\r\n IMS_PER_BATCH: 256 # 256\r\n\r\nOUTPUT_DIR: logs/visdrone/sbs_R50-ibn'
    [04/06 13:08:43 fastreid]: Running with full config:
    CUDNN_BENCHMARK: False
    DATALOADER:
      NUM_INSTANCE: 4
      NUM_WORKERS: 8
      SAMPLER_TRAIN: BalancedIdentitySampler
      SET_WEIGHT: []
    DATASETS:
      COMBINEALL: False
      NAMES: ('Visdrone',)
      TESTS: ('Visdrone',)
    INPUT:
      AFFINE:
        ENABLED: False
      AUGMIX:
        ENABLED: False
        PROB: 0.0
      AUTOAUG:
        ENABLED: True
        PROB: 0.1
      CJ:
        BRIGHTNESS: 0.15
        CONTRAST: 0.15
        ENABLED: False
        HUE: 0.1
        PROB: 0.5
        SATURATION: 0.1
      CROP:
        ENABLED: False
        RATIO: [0.75, 1.3333333333333333]
        SCALE: [0.16, 1]
        SIZE: [224, 224]
      FLIP:
        ENABLED: True
        PROB: 0.5
      PADDING:
        ENABLED: True
        MODE: constant
        SIZE: 10
      REA:
        ENABLED: True
        PROB: 0.5
        VALUE: [123.675, 116.28, 103.53]
      RPT:
        ENABLED: False
        PROB: 0.5
      SIZE_TEST: [256, 256]
      SIZE_TRAIN: [256, 256]
    KD:
      EMA:
        ENABLED: False
        MOMENTUM: 0.999
      MODEL_CONFIG: []
      MODEL_WEIGHTS: []
    MODEL:
      BACKBONE:
        ATT_DROP_RATE: 0.0
        DEPTH: 50x
        DROP_PATH_RATIO: 0.1
        DROP_RATIO: 0.0
        FEAT_DIM: 2048
        LAST_STRIDE: 1
        NAME: build_resnet_backbone
        NORM: BN
        PRETRAIN: True
        PRETRAIN_PATH: pretrained\veri_sbs_R50-ibn.pth
        SIE_COE: 3.0
        STRIDE_SIZE: (16, 16)
        WITH_IBN: True
        WITH_NL: True
        WITH_SE: False
      DEVICE: cuda:0
      FREEZE_LAYERS: ['backbone']
      HEADS:
        CLS_LAYER: CircleSoftmax
        EMBEDDING_DIM: 0
        MARGIN: 0.35
        NAME: EmbeddingHead
        NECK_FEAT: after
        NORM: BN
        NUM_CLASSES: 0
        POOL_LAYER: GeneralizedMeanPooling
        SCALE: 64
        WITH_BNNECK: True
      LOSSES:
        CE:
          ALPHA: 0.2
          EPSILON: 0.1
          SCALE: 1.0
        CIRCLE:
          GAMMA: 128
          MARGIN: 0.25
          SCALE: 1.0
        COSFACE:
          GAMMA: 128
          MARGIN: 0.25
          SCALE: 1.0
        FL:
          ALPHA: 0.25
          GAMMA: 2
          SCALE: 1.0
        NAME: ('CrossEntropyLoss', 'TripletLoss')
        TRI:
          HARD_MINING: True
          MARGIN: 0.0
          NORM_FEAT: False
          SCALE: 1.0
      META_ARCHITECTURE: Baseline
      PIXEL_MEAN: [123.675, 116.28, 103.53]
      PIXEL_STD: [58.395, 57.120000000000005, 57.375]
      QUEUE_SIZE: 8192
      WEIGHTS:
    OUTPUT_DIR: logs/visdrone/sbs_R50-ibn
    SOLVER:
      AMP:
        ENABLED: True
      BASE_LR: 0.0001
      BIAS_LR_FACTOR: 1.0
      CHECKPOINT_PERIOD: 10
      CLIP_GRADIENTS:
        CLIP_TYPE: norm
        CLIP_VALUE: 5.0
        ENABLED: False
        NORM_TYPE: 2.0
      DELAY_EPOCHS: 30
      ETA_MIN_LR: 7.7e-05
      FREEZE_ITERS: 3000
      GAMMA: 0.1
      HEADS_LR_FACTOR: 1.0
      IMS_PER_BATCH: 128
      MAX_EPOCH: 10
      MOMENTUM: 0.9
      NESTEROV: False
      OPT: SGD
      SCHED: CosineAnnealingLR
      STEPS: [40, 90]
      WARMUP_FACTOR: 0.1
      WARMUP_ITERS: 3000
      WARMUP_METHOD: linear
      WEIGHT_DECAY: 0.0005
      WEIGHT_DECAY_BIAS: 0.0005
      WEIGHT_DECAY_NORM: 0.0005
    TEST:
      AQE:
        ALPHA: 3.0
        ENABLED: False
        QE_K: 5
        QE_TIME: 1
      EVAL_PERIOD: 10
      FLIP:
        ENABLED: False
      IMS_PER_BATCH: 256
      METRIC: cosine
      PRECISE_BN:
        DATASET: Market1501
        ENABLED: False
        NUM_ITER: 300
      RERANK:
        ENABLED: False
        K1: 20
        K2: 6
        LAMBDA: 0.3
      ROC:
        ENABLED: False
    [04/06 13:08:43 fastreid]: Full config saved to D:\zhuangshilin\BoT_SORT\fast_reid\logs\visdrone\sbs_R50-ibn\config.yaml
    D:\anaconda\envs\BOTsort\lib\site-packages\torchvision\transforms\transforms.py:330: UserWarning: Argument 'interpolation' of type int is deprecated since 0.13 and will be removed in 0.15. Please use InterpolationMode enum.
      "Argument 'interpolation' of type int is deprecated since 0.13 and will be removed in 0.15. "

## Expected behavior:
yonadance commented 5 months ago

After setting breakpoints and debugging, I found that it hangs at the call super().__init__(model, data_loader, optimizer, param_wrapper) in class AMPTrainer in fastreid.engine.train_loop. Execution never gets past that line.
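
One way to narrow down a hang like this without stepping through by hand is Python's standard faulthandler module, which can dump the stack of every thread if a call has not returned after a timeout. The sketch below is only an illustration: run_with_hang_watchdog, the timeout value, and the log path are hypothetical names that are not part of fastreid, and the time.sleep call stands in for the trainer construction that never returns.

```python
import faulthandler
import os
import tempfile
import time


def run_with_hang_watchdog(fn, timeout_s, log_path):
    """Run fn(). If it is still running after timeout_s seconds,
    write the stack of every thread to log_path (the process keeps running)."""
    # faulthandler needs a real file (it writes through the file descriptor),
    # and the file must stay open until the watchdog is cancelled.
    with open(log_path, "w") as log_file:
        faulthandler.dump_traceback_later(timeout_s, file=log_file)
        try:
            return fn()
        finally:
            # Disarm the watchdog once fn() has returned or raised.
            faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    # Hypothetical stand-in for the blocking call; in the real script this
    # would wrap whatever builds the trainer.
    log_path = os.path.join(tempfile.mkdtemp(), "hang_trace.log")
    run_with_hang_watchdog(lambda: time.sleep(0.5), 0.1, log_path)
    with open(log_path) as f:
        print(f.read())
```

The dumped traceback shows the innermost frame each thread is blocked in, which tells you whether the stall is inside the trainer itself or further down, for example in data loading. That second possibility is only a guess here, since the log above does not show it.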
