JDAI-CV / fast-reid

SOTA Re-identification Methods and Toolbox
Apache License 2.0
3.39k stars 830 forks

Training problem #721

Closed yonadance closed 3 months ago

yonadance commented 5 months ago

training problem:

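  1. I ran into a problem while training on the Visdrone dataset. After running
    python tools/train_net.py --config-file ./configs/Visdrone/sbs_R50-ibn.yml MODEL.DEVICE "cuda:0"

    no error was raised, but the run never reached the training iterations.

  2. Because I am on Windows, I did not run the make all step.
  3. The full log is as follows: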
    
    Command Line Args: Namespace(config_file='./configs/Visdrone/sbs_R50-ibn.yml', dist_url='tcp://127.0.0.1:49153', eval_only=False, machine_rank=0, num_gpus=1, num_machines=1, opts=['MODEL.DEVICE', 'cuda:0'], resume=False)
    [04/06 13:08:42 fastreid]: Rank of current process: 0. World size: 1
    [04/06 13:08:43 fastreid]: Environment info:
    ----------------------  ------------------------------------------------------------------------------------
    sys.platform            win32
    Python                  3.7.16 (default, Jan 17 2023, 16:06:28) [MSC v.1916 64 bit (AMD64)]
    numpy                   1.21.6
    fastreid                1.3 @.\fastreid
    FASTREID_ENV_MODULE     <not set>
    PyTorch                 1.13.1+cu117 @D:\anaconda\envs\BOTsort\lib\site-packages\torch
    PyTorch debug build     False
    GPU available           True
    GPU 0                   NVIDIA GeForce RTX 3080
    CUDA_HOME               C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7
    Pillow                  9.5.0
    torchvision             0.14.1+cu117 @D:\anaconda\envs\BOTsort\lib\site-packages\torchvision
    torchvision arch flags  D:\anaconda\envs\BOTsort\lib\site-packages\torchvision\_C.pyd; cannot find cuobjdump
    cv2                     4.9.0
    ----------------------  ------------------------------------------------------------------------------------
    PyTorch built with:
    - C++ Version: 199711
    - MSVC 192829337
    - Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
    - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
    - OpenMP 2019
    - LAPACK is enabled (usually provided by MKL)
    - CPU capability usage: AVX2
    - CUDA Runtime 11.7
    - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
    - CuDNN 8.5
    - Magma 2.5.4
    - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.7, CUDNN_VERSION=8.5.0, CXX_COMPILER=C:/actions-runner/_work/pytorch/pytorch/builder/windows/tmp_bin/sccache-cl.exe, CXX_FLAGS=/DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj -DUSE_PTHREADPOOL -openmp:experimental -IC:/actions-runner/_work/pytorch/pytorch/builder/windows/mkl/include -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.13.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=OFF, USE_OPENMP=ON, USE_ROCM=OFF,

    [04/06 13:08:43 fastreid]: Command line arguments: Namespace(config_file='./configs/Visdrone/sbs_R50-ibn.yml', dist_url='tcp://127.0.0.1:49153', eval_only=False, machine_rank=0, num_gpus=1, num_machines=1, opts=['MODEL.DEVICE', 'cuda:0'], resume=False)
    [04/06 13:08:43 fastreid]: Contents of args.config_file=./configs/Visdrone/sbs_R50-ibn.yml:
    b'# coding:utf-8 _\r\nBASE: ../Base-SBS.yml\r\n\r\n# \xe8\xae\xbe\xe7\xbd\xae\xe7\x9b\xb8\xe5\xba\x94\xe7\x9a\x84\xe6\x95\xb0\xe6\x8d\xae\xe5\xa2\x9e\xe5\xbc\xba\r\nINPUT:\r\n SIZE_TRAIN: [256, 256]\r\n SIZE_TEST: [256, 256]\r\n\r\nMODEL:\r\n BACKBONE:\r\n WITH_IBN: True\r\n WITH_NL: True #\xe6\xa8\xa1\xe5\x9e\x8b\xe6\x98\xaf\xe5\x90\xa6\xe4\xbd\xbf\xe7\x94\xa8No_local module\r\n PRETRAIN: True\r\n PRETRAIN_PATH: \'pretrained\veri_sbs_R50-ibn.pth\'\r\n HEADS:\r\n POOL_LAYER: GeneralizedMeanPooling # HEAD POOL_LAYERS\r\n LOSSES:\r\n NAME: ("CrossEntropyLoss", "TripletLoss",)\r\n CE:\r\n EPSILON: 0.1\r\n SCALE: 1.0\r\n\r\n TRI:\r\n MARGIN: 0.0 # \xe8\x80\x83\xe8\x99\x91\xe8\xa6\x81\xe4\xb8\x8d\xe8\xa6\x81\xe8\xbf\x9b\xe8\xa1\x8c\xe5\xaf\xb9\xe5\xba\x94\xe7\x9a\x84\xe8\xb6\x85\xe5\x8f\x82\xe6\x95\xb0\xe7\x9a\x84\xe8\xb0\x83\xe6\x95\xb4\r\n HARD_MINING: True\r\n NORM_FEAT: False\r\n SCALE: 1.0\r\nSOLVER:\r\n OPT: SGD\r\n BASE_LR: 0.0001# 0.01\r\n ETA_MIN_LR: 7.7e-5\r\n\r\n IMS_PER_BATCH: 128 # batchsize\r\n MAX_EPOCH: 10 # 60\r\n WARMUP_ITERS: 3000\r\n FREEZE_ITERS: 3000\r\n\r\n CHECKPOINT_PERIOD: 10\r\n\r\nDATASETS:\r\n NAMES: ("Visdrone",)\r\n TESTS: ("Visdrone",)\r\n\r\nDATALOADER:\r\n SAMPLER_TRAIN: BalancedIdentitySampler\r\n NUM_INSTANCE: 4\r\n NUM_WORKERS: 8\r\nTEST:\r\n EVAL_PERIOD: 10\r\n IMS_PER_BATCH: 256 # 256\r\n\r\nOUTPUT_DIR: logs/visdrone/sbs_R50-ibn'
    [04/06 13:08:43 fastreid]: Running with full config:
    CUDNN_BENCHMARK: False DATALOADER: NUM_INSTANCE: 4 NUM_WORKERS: 8 SAMPLER_TRAIN: BalancedIdentitySampler SET_WEIGHT: [] DATASETS: COMBINEALL: False NAMES: ('Visdrone',) TESTS: ('Visdrone',) INPUT: AFFINE: ENABLED: False AUGMIX: ENABLED: False PROB: 0.0 AUTOAUG: ENABLED: True PROB: 0.1 CJ: BRIGHTNESS: 0.15 CONTRAST: 0.15 ENABLED: False HUE: 0.1 PROB: 0.5 SATURATION: 0.1 CROP: ENABLED: False RATIO: [0.75, 1.3333333333333333] SCALE: [0.16, 1] SIZE: [224, 224] FLIP: ENABLED: True PROB: 0.5 PADDING: ENABLED: True MODE: constant SIZE: 10 REA: ENABLED: True PROB: 0.5 VALUE: [123.675, 116.28, 103.53] RPT: ENABLED: False PROB: 0.5 SIZE_TEST: [256, 256] SIZE_TRAIN: [256, 256] KD: EMA: ENABLED: False MOMENTUM: 0.999 MODEL_CONFIG: [] MODEL_WEIGHTS: [] MODEL: BACKBONE: ATT_DROP_RATE: 0.0 DEPTH: 50x DROP_PATH_RATIO: 0.1 DROP_RATIO: 0.0 FEAT_DIM: 2048 LAST_STRIDE: 1 NAME: build_resnet_backbone NORM: BN PRETRAIN: True PRETRAIN_PATH: pretrained\veri_sbs_R50-ibn.pth SIE_COE: 3.0 STRIDE_SIZE: (16, 16) WITH_IBN: True WITH_NL: True WITH_SE: False DEVICE: cuda:0 FREEZE_LAYERS: ['backbone'] HEADS: CLS_LAYER: CircleSoftmax EMBEDDING_DIM: 0 MARGIN: 0.35 NAME: EmbeddingHead NECK_FEAT: after NORM: BN NUM_CLASSES: 0 POOL_LAYER: GeneralizedMeanPooling SCALE: 64 WITH_BNNECK: True LOSSES: CE: ALPHA: 0.2 EPSILON: 0.1 SCALE: 1.0 CIRCLE: GAMMA: 128 MARGIN: 0.25 SCALE: 1.0 COSFACE: GAMMA: 128 MARGIN: 0.25 SCALE: 1.0 FL: ALPHA: 0.25 GAMMA: 2 SCALE: 1.0 NAME: ('CrossEntropyLoss', 'TripletLoss') TRI: HARD_MINING: True MARGIN: 0.0 NORM_FEAT: False SCALE: 1.0 META_ARCHITECTURE: Baseline PIXEL_MEAN: [123.675, 116.28, 103.53] PIXEL_STD: [58.395, 57.120000000000005, 57.375] QUEUE_SIZE: 8192 WEIGHTS: OUTPUT_DIR: logs/visdrone/sbs_R50-ibn SOLVER: AMP: ENABLED: True BASE_LR: 0.0001 BIAS_LR_FACTOR: 1.0 CHECKPOINT_PERIOD: 10 CLIP_GRADIENTS: CLIP_TYPE: norm CLIP_VALUE: 5.0 ENABLED: False NORM_TYPE: 2.0 DELAY_EPOCHS: 30 ETA_MIN_LR: 7.7e-05 FREEZE_ITERS: 3000 GAMMA: 0.1 HEADS_LR_FACTOR: 1.0 IMS_PER_BATCH: 128 MAX_EPOCH: 10 MOMENTUM: 0.9 NESTEROV: False OPT: SGD SCHED: CosineAnnealingLR STEPS: [40, 90] WARMUP_FACTOR: 0.1 WARMUP_ITERS: 3000 WARMUP_METHOD: linear WEIGHT_DECAY: 0.0005 WEIGHT_DECAY_BIAS: 0.0005 WEIGHT_DECAY_NORM: 0.0005 TEST: AQE: ALPHA: 3.0 ENABLED: False QE_K: 5 QE_TIME: 1 EVAL_PERIOD: 10 FLIP: ENABLED: False IMS_PER_BATCH: 256 METRIC: cosine PRECISE_BN: DATASET: Market1501 ENABLED: False NUM_ITER: 300 RERANK: ENABLED: False K1: 20 K2: 6 LAMBDA: 0.3 ROC: ENABLED: False
    [04/06 13:08:43 fastreid]: Full config saved to D:\zhuangshilin\BoT_SORT\fast_reid\logs\visdrone\sbs_R50-ibn\config.yaml
    D:\anaconda\envs\BOTsort\lib\site-packages\torchvision\transforms\transforms.py:330: UserWarning: Argument 'interpolation' of type int is deprecated since 0.13 and will be removed in 0.15. Please use InterpolationMode enum.
      "Argument 'interpolation' of type int is deprecated since 0.13 and will be removed in 0.15. "



## Expected behavior:
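After this point the program hangs and the log stops updating. GPU utilization sits at only about 10%, so training is not actually running. I added a print to my own dataset.py; its output appears right after the log above, and then nothing further happens. How can I find out exactly where the program is stuck?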
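
One general way to locate a silent hang in a Python process is the standard-library faulthandler module, which can print the stack of every thread after a timeout. The sketch below is not part of fast-reid; the helper names enable_hang_dump and disable_hang_dump are hypothetical, and the assumption is that they would be called near the top of the training script before the trainer is built.

```python
import faulthandler
import sys


def enable_hang_dump(timeout_s=60.0, file=sys.stderr):
    """Dump the stack of every Python thread each `timeout_s` seconds."""
    # repeat=True keeps dumping, so a hang that starts later is still caught.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)


def disable_hang_dump():
    """Stop the periodic dumps, e.g. once iterations are running normally."""
    faulthandler.cancel_dump_traceback_later()
```

If the same frame shows up at the top of every dump, that is most likely where the process is blocked. Note that this only covers threads in the main process; DataLoader worker processes would need their own dump.
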
yonadance commented 5 months ago

After stepping through with breakpoints, I found that it hangs in class AMPTrainer in fastreid.engine.train_loop: the call super().__init__(model, data_loader, optimizer, param_wrapper) never returns.

yonadance commented 5 months ago

After changing IMS_PER_BATCH it runs, but the loss is still 0 after many iterations.
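
A possible reason the batch size matters: with identity-based (P x K) sampling, a batch of IMS_PER_BATCH images with NUM_INSTANCE images per identity needs IMS_PER_BATCH / NUM_INSTANCE distinct identities, which is 128 / 4 = 32 for the config in the log. The sketch below checks whether a dataset has that many. It assumes the standard P x K scheme; whether BalancedIdentitySampler actually blocks when too few identities exist is not confirmed in this thread, and the function names are hypothetical.

```python
def identities_per_batch(ims_per_batch, num_instance):
    """Number of distinct identities one P x K batch needs (P = batch // K)."""
    if ims_per_batch % num_instance != 0:
        raise ValueError("IMS_PER_BATCH should be a multiple of NUM_INSTANCE")
    return ims_per_batch // num_instance


def check_pk_batch(pids, ims_per_batch, num_instance):
    """Return (ok, needed, available) for a list of per-image identity labels."""
    needed = identities_per_batch(ims_per_batch, num_instance)
    available = len(set(pids))  # distinct identities in the training set
    return available >= needed, needed, available
```

For example, check_pk_batch([1] * 500, 128, 4) reports that 32 identities are needed but only 1 is available.
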

yonadance commented 5 months ago

Question: what problems would there be if the dataset's ID is 1?
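
If this means that every training image carries the same identity label (an assumption; the thread does not say), a zero loss would be expected from the two losses named in the config. Softmax cross-entropy over a single class is always log(1) = 0, and a triplet loss has no negative samples to mine. The plain-Python sketch below illustrates both points; it does not reproduce fast-reid's own loss code, and how its TripletLoss behaves with no negatives is not verified here.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample, computed in a stable way."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def count_negatives(pids):
    """For each anchor, count batch samples that have a different identity."""
    return [sum(1 for other in pids if other != pid) for pid in pids]
```

With one class, cross_entropy([3.7], 0) is 0 for any logit value, and count_negatives([1, 1, 1, 1]) is all zeros, so the model receives no useful gradient from either term.
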

github-actions[bot] commented 4 months ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 3 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.