Closed · ZhaofengSHI closed this issue 6 months ago
Can you show me what your loss looks like?
I have already addressed this issue. It seems the weight decay factor was so high that the loss couldn't converge. Thank you!
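For reference, a lower weight decay can be passed as a command-line override in the same KEY VALUE style as the run scripts in this thread; the value 0.01 below is only an illustrative example, not a setting recommended by the authors:

```shell
# Minimal sketch: append a SOLVER.WEIGHT_DECAY override to the usual run command.
# 0.01 is an arbitrary example value, not a recommended setting.
CUDA_VISIBLE_DEVICES=0,1,2,3 python tools/run_net.py \
  --cfg configs/Egtea/MVIT_B_16x4_CONV.yaml \
  SOLVER.WEIGHT_DECAY 0.01
```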
I apologize for bothering you again. I tried that, but it did not solve the problem, and the loss curve still looks like this. What could be the reason? Thank you!
What are the hyperparameters you used for training?
Hello! Thanks for your reply!
My running script is:

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 python tools/run_net.py \
  --init_method tcp://localhost:9668 \
  --cfg configs/Egtea/MVIT_B_16x4_CONV.yaml \
  TRAIN.BATCH_SIZE 16 \
  TEST.BATCH_SIZE 64 \
  NUM_GPUS 4 \
  TRAIN.CHECKPOINT_FILE_PATH /data1/zhaofeng/Ego_Gaze/GLC-main/pretrained/K400_MVIT_B_16x4_CONV.pyth \
  OUTPUT_DIR checkpoints/GLC_egtea \
  DATA.PATH_PREFIX /data1/zhaofeng/Ego_Gaze/Datasets/EGTEA_Gaze+
```
The hyperparameter settings are:

```yaml
TRAIN:
  ENABLE: True
  DATASET: egteagaze
  BATCH_SIZE: 12
  EVAL_PERIOD: 10 ##
  CHECKPOINT_PERIOD: 10 ##
  AUTO_RESUME: False
  CHECKPOINT_EPOCH_RESET: True
DATA:
  PATH_PREFIX: '/data/egtea_gp'
  NUM_FRAMES: 8
  SAMPLING_RATE: 8
  TRAIN_JITTER_SCALES: [256, 320]
  TRAIN_CROP_SIZE: 256
  TEST_CROP_SIZE: 256
  INPUT_CHANNEL_NUM: [3]
  TARGET_FPS: 24
  USE_OFFSET_SAMPLING: False
  GAUSSIAN_KERNEL: 19
MVIT:
  ZERO_DECAY_POS_CLS: False
  SEP_POS_EMBED: True
  DEPTH: 16
  NUM_HEADS: 1
  EMBED_DIM: 96
  PATCH_KERNEL: (3, 7, 7)
  PATCH_STRIDE: (2, 4, 4)
  PATCH_PADDING: (1, 3, 3)
  MLP_RATIO: 4.0
  QKV_BIAS: True
  DROPPATH_RATE: 0.2
  NORM: "layernorm"
  MODE: "conv"
  CLS_EMBED_ON: False
  GLOBAL_EMBED_ON: True
  DIM_MUL: [[1, 2.0], [3, 2.0], [14, 2.0]]
  HEAD_MUL: [[1, 2.0], [3, 2.0], [14, 2.0]]
  POOL_KVQ_KERNEL: [3, 3, 3]
  POOL_KV_STRIDE_ADAPTIVE: [1, 8, 8]
  POOL_Q_STRIDE: [[1, 1, 2, 2], [3, 1, 2, 2], [14, 1, 2, 2]]
  DROPOUT_RATE: 0.0
BN:
  USE_PRECISE_STATS: False
  NUM_BATCHES_PRECISE: 200
SOLVER:
  ZERO_WD_1D_PARAM: True
  CLIP_GRAD_L2NORM: 1.0
  BASE_LR_SCALE_NUM_SHARDS: True
  BASE_LR: 0.0001 # 0.0001
  COSINE_AFTER_WARMUP: True
  COSINE_END_LR: 1e-6
  WARMUP_START_LR: 1e-6
  WARMUP_EPOCHS: 5.0 # 5.0
  LR_POLICY: cosine
  MAX_EPOCH: 25 # 25
  MOMENTUM: 0.9
  WEIGHT_DECAY: 0.05 # 0.05
  OPTIMIZING_METHOD: adamw
MODEL:
  NUM_CLASSES: 400
  ARCH: mvit
  MODEL_NAME: GLC_Gaze
  LOSS_FUNC: kldiv
  DROPOUT_RATE: 0.5 # 0.5
TEST:
  ENABLE: True
  DATASET: egteagaze
  BATCH_SIZE: 12
  NUM_SPATIAL_CROPS: 1
  NUM_ENSEMBLE_VIEWS: 1
DATA_LOADER:
  NUM_WORKERS: 8
  PIN_MEMORY: False
TENSORBOARD:
  ENABLE: True
NUM_GPUS: 4
NUM_SHARDS: 1
RNG_SEED: 0
OUTPUT_DIR: .
```
Then I got the above result.
That's a strange phenomenon. The training F1 should go much higher, to more than 0.5. Did you try running inference with the released weights? Do you get the same numbers?
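A test-only run with the released weights can be launched roughly like this; note that the `TEST.CHECKPOINT_FILE_PATH` key follows the usual PySlowFast-style overrides and the paths are placeholders, so adjust both to your setup:

```shell
# Sketch of a test-only run with the released checkpoint.
# TEST.CHECKPOINT_FILE_PATH and the paths below are placeholders, not confirmed values.
CUDA_VISIBLE_DEVICES=0 python tools/run_net.py \
  --cfg configs/Egtea/MVIT_B_16x4_CONV.yaml \
  TRAIN.ENABLE False \
  TEST.ENABLE True \
  NUM_GPUS 1 \
  TEST.CHECKPOINT_FILE_PATH /path/to/released_GLC_egtea.pyth \
  DATA.PATH_PREFIX /path/to/EGTEA_Gaze+
```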
Hi! Thanks for your excellent work. I followed the process in the README.md to reproduce the results and found that the loss does not converge. May I ask what the reason might be?
The running script is:

```shell
CUDA_VISIBLE_DEVICES=4,5,6,7 python tools/run_net.py \
  --init_method tcp://localhost:9878 \
  --cfg configs/Egtea/MVIT_B_16x4_CONV.yaml \
  TRAIN.BATCH_SIZE 16 \
  TEST.BATCH_SIZE 128 \
  NUM_GPUS 4 \
  TRAIN.CHECKPOINT_FILE_PATH /data/zhaofeng/Gaze/GLC-main/pretrained/K400_MVIT_B_16x4_CONV.pyth \
  OUTPUT_DIR checkpoints/GLC_egtea \
  DATA.PATH_PREFIX /data/zhaofeng/Gaze/Datasets/EGTEA_Gaze+
```
The devices are 4 NVIDIA TITAN Xp GPUs.
Thank you very much! Looking forward to your reply.