AbrarKhan009 commented 3 months ago

Hello everyone, I hope you are all doing well. I have successfully trained MViTv2 with SlowFast on my custom dataset. Details of my training are given below.

I used MVITv2_B_32x3.yaml and followed the structure below.

SlowFast/
├── configs/
│   └── MyData/
│       └── MVITv2_B_32x3.yaml
├── data/
│   └── MyData/
│       ├── ClassA/
│       │   └── ins.mp4
│       ├── ClassB/
│       │   └── kep.mp4
│       ├── ClassC/
│       │   └── tak.mp4
│       ├── train.csv
│       ├── test.csv
│       ├── val.csv
│       └── classids.json
├── slowfast/
│   └── datasets/
│       ├── __init__.py
│       ├── mydata.py
│       └── ...
└── ...

All of this guidance on fine-tuning with a custom dataset is already explained by @AlexanderMelde [here](https://github.com/facebookresearch/SlowFast/issues/149). Thanks to him for his guidance.
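As a rough sketch of how the CSV files and classids.json for a layout like the one above might be generated: the snippet assumes each CSV line is `<video path> <integer label>` separated by a single space, and that classids.json maps class name to id. Both are assumptions about the Kinetics-style loader that the linked issue builds on, so check them against your own mydata.py. `build_index` and `write_index` are hypothetical helpers, not part of SlowFast.

```python
import json
from pathlib import Path


def build_index(data_dir, separator=" "):
    """Map class folders to integer ids and list every .mp4 with its label.

    Assumption: one sub-folder per class, and one "<path><separator><label>"
    line per video. Adjust to whatever your dataset class actually parses.
    """
    data_dir = Path(data_dir)
    class_names = sorted(p.name for p in data_dir.iterdir() if p.is_dir())
    class_ids = {name: idx for idx, name in enumerate(class_names)}
    rows = []
    for name in class_names:
        for video in sorted((data_dir / name).glob("*.mp4")):
            rows.append(f"{video}{separator}{class_ids[name]}")
    return class_ids, rows


def write_index(data_dir, csv_name="train.csv"):
    """Write the label list and the class-id mapping next to the class folders."""
    class_ids, rows = build_index(data_dir)
    data_dir = Path(data_dir)
    (data_dir / csv_name).write_text("\n".join(rows) + "\n")
    (data_dir / "classids.json").write_text(json.dumps(class_ids, indent=2))
    return class_ids, rows
```

This writes every video into a single file; splitting the rows into train, val, and test is deliberately left out.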

My question is about the output I get at the end:

train_net.py: 759: training done: _p50.93_f225.17 _t12.31_m10.69 _a25.00 Top5 Acc: 66.67 MEM: 10.69 f: 225.1698

Can somebody explain this output to me, in particular _p50.93_f225.17 _t12.31_m10.69, MEM: 10.69, and f: 225.1698?

AbrarKhan009 commented 3 months ago

My training config is given below,

TRAIN:
  ENABLE: True
  DATASET: mydata
  BATCH_SIZE: 1 # bcz of limited resources i am using this batch size
  EVAL_PERIOD: 5
  CHECKPOINT_PERIOD: 5
  AUTO_RESUME: True
  CHECKPOINT_EPOCH_RESET: True
  CHECKPOINT_FILE_PATH: "/home/mukhan/project/slowfast/MViTv2_B_32x3_k400_f304025456.pyth"
  CHECKPOINT_IN_INIT: True
DATA:
  USE_OFFSET_SAMPLING: True
  DECODING_BACKEND: torchvision
  NUM_FRAMES: 32
  SAMPLING_RATE: 1
  TRAIN_JITTER_SCALES: [256, 320]
  TRAIN_CROP_SIZE: 224
  TEST_CROP_SIZE: 224
  INPUT_CHANNEL_NUM: [3]
  PATH_TO_DATA_DIR: "/home/mukhan/project/slowfast/data/Mydata/" # csv files locations for train and val
  TRAIN_JITTER_SCALES_RELATIVE: [0.08, 1.0]
  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333]
MVIT:
  ZERO_DECAY_POS_CLS: False
  USE_ABS_POS: False
  REL_POS_SPATIAL: True
  REL_POS_TEMPORAL: True
  DEPTH: 24
  NUM_HEADS: 1
  EMBED_DIM: 96
  PATCH_KERNEL: (3, 7, 7)
  PATCH_STRIDE: (2, 4, 4)
  PATCH_PADDING: (1, 3, 3)
  MLP_RATIO: 4.0
  QKV_BIAS: True
  DROPPATH_RATE: 0.3
  NORM: "layernorm"
  MODE: "conv"
  CLS_EMBED_ON: True
  DIM_MUL: [[2, 2.0], [5, 2.0], [21, 2.0]]
  HEAD_MUL: [[2, 2.0], [5, 2.0], [21, 2.0]]
  POOL_KVQ_KERNEL: [3, 3, 3]
  POOL_KV_STRIDE_ADAPTIVE: [1, 8, 8]
  POOL_Q_STRIDE: [
    [0, 1, 1, 1], [1, 1, 1, 1], [2, 1, 2, 2], [3, 1, 1, 1],
    [4, 1, 1, 1], [5, 1, 2, 2], [6, 1, 1, 1], [7, 1, 1, 1],
    [8, 1, 1, 1], [9, 1, 1, 1], [10, 1, 1, 1], [11, 1, 1, 1],
    [12, 1, 1, 1], [13, 1, 1, 1], [14, 1, 1, 1], [15, 1, 1, 1],
    [16, 1, 1, 1], [17, 1, 1, 1], [18, 1, 1, 1], [19, 1, 1, 1],
    [20, 1, 1, 1], [21, 1, 2, 2], [22, 1, 1, 1], [23, 1, 1, 1]
  ]
  DROPOUT_RATE: 0.0
  DIM_MUL_IN_ATT: True
  RESIDUAL_POOLING: True
AUG:
  NUM_SAMPLE: 2
  ENABLE: True
  COLOR_JITTER: 0.4
  AA_TYPE: rand-m7-n4-mstd0.5-inc1
  INTERPOLATION: bicubic
  RE_PROB: 0.25
  RE_MODE: pixel
  RE_COUNT: 1
  RE_SPLIT: False
MIXUP:
  ENABLE: True
  ALPHA: 0.8
  CUTMIX_ALPHA: 1.0
  PROB: 1.0
  SWITCH_PROB: 0.5
  LABEL_SMOOTH_VALUE: 0.1
SOLVER:
  ZERO_WD_1D_PARAM: True
  BASE_LR_SCALE_NUM_SHARDS: True
  CLIP_GRAD_L2NORM: 1.0
  BASE_LR: 0.00001
  COSINE_AFTER_WARMUP: True
  COSINE_END_LR: 1e-6
  WARMUP_START_LR: 1e-6
  WARMUP_EPOCHS: 30.0
  LR_POLICY: cosine
  MAX_EPOCH: 50
  MOMENTUM: 0.9
  WEIGHT_DECAY: 0.05
  OPTIMIZING_METHOD: adamw
MODEL:
  NUM_CLASSES: 15
  ARCH: mvit
  MODEL_NAME: MViT
  LOSS_FUNC: soft_cross_entropy
  DROPOUT_RATE: 0.5
TEST:
  ENABLE: False
  DATASET: mydata
  BATCH_SIZE: 64
  NUM_SPATIAL_CROPS: 1
  NUM_ENSEMBLE_VIEWS: 5
DATA_LOADER:
  NUM_WORKERS: 8
  PIN_MEMORY: True
NUM_GPUS: 1
NUM_SHARDS: 1
RNG_SEED: 0
OUTPUT_DIR: "/home/mukhan/project/slowfast/output"
TENSORBOARD:
  ENABLE: True
  LOG_DIR: "/home/mukhan/project/slowfast/output/runs" # Leave empty to use cfg.OUTPUT_DIR/runs-{cfg.TRAIN.DATASET} as path.
  CLASS_NAMES_PATH: "/home/mukhan/project/slowfast/data/Mydata/classnames.json" # Path to json file providing class_name - id mapping.
  CONFUSION_MATRIX:
    ENABLE: True
    SUBSET_PATH: "/home/mukhan/project/slowfast/data/Mydata/classnames.txt" # Path to txt file contains class names separated by newline characters.
    # Only classes in this file will be visualized in the confusion matrix.

AbrarKhan009 commented 3 months ago

I have some questions regarding the output:

train_net.py: 759: training done: _p50.93_f225.17 _t12.31_m10.69 _a25.00 Top5 Acc: 66.67 MEM: 10.69 f: 225.1698

Can somebody explain this final message to me?

Also, why is the LR always shown as 0.000 during training?

AbrarKhan009 commented 3 months ago

These are the graphs and confusion matrix I got after running 50 epochs:

[Image: Confusion_Matrix]

I understand that these results are not satisfactory. Could any of you please advise on how I can improve them? Specifically, I would like to know which parameters or aspects of the training process I should adjust to achieve better performance. Any suggestions or recommendations would be greatly appreciated.

alpargun commented 3 months ago

> i have some question regarding the outputs, train_net.py: 759: training done: _p50.93_f225.17 _t12.31_m10.69 _a25.00 Top5 Acc: 66.67 MEM: 10.69 f: 225.1698 can somebody explain this final message to me.
>
> also why during the training lr is always 0.000

Answered here: https://github.com/facebookresearch/SlowFast/issues/664#issuecomment-2270738940

Also, I would say right away that a batch size of 1 can be one of the limiting factors for a vision task, and may explain part of your dissatisfaction with the results.
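For readers who do not want to follow the link, here is a hedged reading of the final line. As far as I can tell from the format string in tools/train_net.py, the fields are: parameters in millions (_p), GFLOPs (_f, repeated at the end as f), epoch time in minutes (_t), GPU memory in GB (_m, repeated as MEM), top-1 accuracy on the validation set (_a), and top-5 accuracy. Treat this mapping as an assumption; it does at least agree with the roughly 51M parameters and 225 GFLOPs listed for MViTv2-B 32x3. The parser below is a sketch with hypothetical names.

```python
import re

# Field meanings are an assumption based on reading tools/train_net.py.
SUMMARY_PATTERN = re.compile(
    r"_p(?P<params_millions>[\d.]+)_f(?P<gflops>[\d.]+)\s+"
    r"_t(?P<epoch_minutes>[\d.]+)_m(?P<gpu_mem_gb>[\d.]+)\s+"
    r"_a(?P<top1_acc>[\d.]+)\s+Top5 Acc:\s*(?P<top5_acc>[\d.]+)"
)


def parse_summary(line):
    """Split a 'training done' log line into named float fields."""
    match = SUMMARY_PATTERN.search(line)
    if match is None:
        raise ValueError("line does not look like a training summary")
    return {name: float(value) for name, value in match.groupdict().items()}


line = (
    "train_net.py: 759: training done: _p50.93_f225.17 _t12.31_m10.69 "
    "_a25.00 Top5 Acc: 66.67 MEM: 10.69 f: 225.1698"
)
for name, value in parse_summary(line).items():
    print(f"{name:>16}: {value}")
```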

AbrarKhan009 commented 3 months ago

Thanks for the advice. I have now increased the batch size from 1 to 2 and restarted training for 100 epochs. I will post the results here once training is done.

alpargun commented 3 months ago

How big is your custom dataset? If your dataset is also limited, a relatively small ConvNet can also achieve the task. I understand that training a transformer model can be resource-intensive; with a ConvNet you can try a batch size of 16, 32, or even 64 to achieve better performance, because a batch size of 2 is still far on the low side.

> i have some question regarding the outputs, train_net.py: 759: training done: _p50.93_f225.17 _t12.31_m10.69 _a25.00 Top5 Acc: 66.67 MEM: 10.69 f: 225.1698 can somebody explain this final message to me.
>
> also why during the training lr is always 0.000

Can you also show the output where it says your LR is 0.000? Looking at your config, your LR is set to 0.00001, and this value might simply have been truncated by the display precision in the terminal output. It does not look concerning: your accuracy is already improving over time, which would not be possible with an LR that is really zero, since that would yield zero weight updates.
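To illustrate the display-precision point with the values from the posted config (BASE_LR 1e-5, WARMUP_START_LR 1e-6, COSINE_END_LR 1e-6, WARMUP_EPOCHS 30, MAX_EPOCH 50): the sketch below is a simplified linear-warmup-then-cosine schedule written from my understanding of the policy, not copied from SlowFast, so the exact curve is an assumption. `lr_at_epoch` is a hypothetical name.

```python
import math

# Values taken from the SOLVER section of the posted config.
BASE_LR = 1e-5
WARMUP_START_LR = 1e-6
COSINE_END_LR = 1e-6
WARMUP_EPOCHS = 30.0
MAX_EPOCH = 50


def lr_at_epoch(epoch):
    """Linear warmup to BASE_LR, then cosine decay to COSINE_END_LR (assumed shape)."""
    if epoch < WARMUP_EPOCHS:
        slope = (BASE_LR - WARMUP_START_LR) / WARMUP_EPOCHS
        return WARMUP_START_LR + slope * epoch
    progress = (epoch - WARMUP_EPOCHS) / (MAX_EPOCH - WARMUP_EPOCHS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return COSINE_END_LR + (BASE_LR - COSINE_END_LR) * cosine


for epoch in (0, 15, 30, 40, 50):
    lr = lr_at_epoch(epoch)
    # Three decimals hide the value; scientific notation shows it.
    print(f"epoch {epoch:2d}  fixed: {lr:.3f}  scientific: {lr:.2e}")
```

Under these assumptions the LR never exceeds 1e-5, so any display with three decimals shows 0.000 for the whole run even though the optimizer is taking non-zero steps.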

AbrarKhan009 commented 3 months ago

I have a synthetic dataset consisting of 15 classes of human activities. For each class, I have around 40 videos. My task is to train a vision transformer model for human activity recognition. After training, I will test it on a real-world dataset where I have 5 videos for each class.

While I understand that using a ConvNet might be more resource-efficient, my task is domain-specific and requires the use of a vision transformer model, regardless of the initial results. Therefore, I need to focus on improving the performance of the vision transformer rather than switching to a CNN.

Any suggestions for optimizing the vision transformer to achieve better results would be greatly appreciated.

alpargun commented 3 months ago

The simplest things you can start with are:

- Increase your training batch size as high as your hardware allows.
- Train for more epochs, as your train loss has still not converged in the plots you provided. Train until your train loss no longer decreases or your validation loss starts increasing (due to overfitting).
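The stopping rule in the second point can be written down as a small check over the validation losses from successive evaluations (for example, values read off TensorBoard). This is a generic patience-based sketch; `find_stop_point` and its `patience` and `min_delta` parameters are illustrative and not a SlowFast feature.

```python
import math


def find_stop_point(val_losses, patience=3, min_delta=0.0):
    """Return (best_index, stop_index) for a list of validation losses.

    stop_index is the first evaluation at which the loss has failed to
    improve on the best value for `patience` evaluations in a row, or
    None if that never happens.
    """
    best_loss = math.inf
    best_index = -1
    for index, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_index = loss, index
        elif index - best_index >= patience:
            return best_index, index
    return best_index, None


# Made-up losses for illustration only: improving, then overfitting.
example_losses = [1.0, 0.8, 0.7, 0.75, 0.9, 1.0]
best, stop = find_stop_point(example_losses, patience=3)
print(f"best evaluation: {best}, stop at evaluation: {stop}")
```

With an evaluation period of several epochs, each index corresponds to one evaluation rather than one epoch, and the checkpoint closest to the best evaluation is the one worth keeping.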

AbrarKhan009 commented 3 months ago

Update on my training: after 105 epochs, these are the results I got with Kinetics/MVITv2_B_32x3:

train_net.py: 759: training done: _p50.93_f225.17 _t11.09_m20.68 _a19.17 Top5 Acc: 60.00 MEM: 20.68 f: 225.1698

[Image: confusion_matrix_104Epochs]

@alpargun Due to resource limitations I can't increase my batch size beyond 2. Should I continue this training for more epochs, or should I try other models like Kinetics/MVITv2_S_16x4 or Kinetics/MVITv2_L_40x3_test?

Also, if I want to try the SSv2 model (https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md#ssv2), which is pretrained on K400, what changes do I need to make to my current dataset settings? Any suggestions for achieving better results would be greatly appreciated.

alpargun commented 3 months ago

@AbrarKhan009 You can start with the smallest version, MVITv2_S_16x4, as a baseline, with a batch size as high as your hardware allows, and continue training until either the training error converges or the validation error starts increasing (overfitting). So please do not set the max epochs to 50; keep it at 200 as in the original config. Do not worry, MVITv2_S can still handle your task of classifying 15 actions, since it reached 81% top-1 accuracy on K400 with 400 different actions. This model also uses only 16 input frames, compared to your old model's 32 frames, so I hope you can use a higher batch size and train faster.

Once that training has finished, feel free to restore your last MVITv2_B_32x3 checkpoint and continue your old training directly instead of starting from scratch, to save time. That way you can compare the results of both transformer models.

Trying SSv2 with the model pretrained on K400 can be a good way to test whether the problem is due to your custom dataset. However, you will need to download SSv2 and prepare the folder structure according to the implementation in slowfast/datasets/ssv2.py. Furthermore, you need to modify the config file so that the number of output classes (NUM_CLASSES) matches SSv2's number of classes (just like you did for your custom dataset).

AbrarKhan009 commented 3 months ago

Thanks for the suggestion. I will follow your advice. I tried training MVITv2_S_16x4 with batch sizes of 16 and 8, but both gave me a CUDA out-of-memory error. Now I'm using a batch size of 4, and it's running smoothly.

AbrarKhan009 commented 3 months ago

@alpargun Hi, good morning. My training of MVITv2_S_16x4 with a batch size of 4 for 200 epochs is done, and these are the results I got:

train_net.py: 759: training done: _p34.24_f64.46 _t2.96_m11.96 _a37.50 Top5 Acc: 69.17 MEM: 11.96 f: 64.4566

[Image: Tensorboard_results_Epoch_200]

[Image: Confusion-Matrix-Epoch-200]

Should I continue this training for more epochs? These results are good compared to the Base model. Thanks for your advice.
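To put the three reported results in context, they can be compared against chance level. The sketch assumes the 15 classes are roughly balanced in the validation set, so uniform random guessing gives 1/15 top-1 and 5/15 top-5; that balance is an assumption. The accuracy figures are the ones quoted in the final log lines in this thread.

```python
NUM_CLASSES = 15


def chance_accuracy(num_classes, k=1):
    """Top-k accuracy (in percent) of uniform random guessing on balanced classes."""
    return 100.0 * min(k, num_classes) / num_classes


# (run, top-1 %, top-5 %) as reported in the final log lines in this thread.
runs = [
    ("MVITv2_B_32x3, 50 epochs, batch size 1", 25.00, 66.67),
    ("MVITv2_B_32x3, 105 epochs, batch size 2", 19.17, 60.00),
    ("MVITv2_S_16x4, 200 epochs, batch size 4", 37.50, 69.17),
]

chance_top1 = chance_accuracy(NUM_CLASSES, k=1)
chance_top5 = chance_accuracy(NUM_CLASSES, k=5)
print(f"chance: top-1 {chance_top1:.2f}%, top-5 {chance_top5:.2f}%")
for name, top1, top5 in runs:
    print(
        f"{name}: top-1 {top1 / chance_top1:.1f}x chance, "
        f"top-5 {top5 / chance_top5:.1f}x chance"
    )
```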

facebookresearch / SlowFast

Successful Training done on Custom dataset, but have some Question about the output #716
