facebookresearch / SlowFast

PySlowFast: video understanding codebase from FAIR for reproducing state-of-the-art video models.
Apache License 2.0

Error when trying to train X3D model using AVA dataset #605

Open saicharithpasula opened 2 years ago

saicharithpasula commented 2 years ago

Hello,

I get the following error when I try to train an X3D model on the AVA dataset.

  File "tools/run_net.py", line 45, in <module>
    main()
  File "tools/run_net.py", line 26, in main
    launch_job(cfg=cfg, init_method=args.init_method, func=train)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/daemon/code/Users/saicharithreddy.pasula/SCOUT-%20Behavior%20Anomaly%20Detection/trainer/SlowFast/slowfast/utils/misc.py", line 296, in launch_job
    torch.multiprocessing.spawn(
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/daemon/code/Users/saicharithreddy.pasula/SCOUT-%20Behavior%20Anomaly%20Detection/trainer/SlowFast/slowfast/utils/multiprocessing.py", line 60, in run
    ret = func(cfg)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/daemon/code/Users/saicharithreddy.pasula/SCOUT-%20Behavior%20Anomaly%20Detection/trainer/SlowFast/tools/train_net.py", line 711, in train
    train_epoch(
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/daemon/code/Users/saicharithreddy.pasula/SCOUT-%20Behavior%20Anomaly%20Detection/trainer/SlowFast/tools/train_net.py", line 156, in train_epoch
    loss = loss_fun(preds, labels)
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/torch/nn/modules/loss.py", line 1164, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/torch/nn/functional.py", line 3014, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
TypeError: cross_entropy_loss(): argument 'input' (position 1) must be Tensor, not list

It looks like the model is returning its predictions as a Python list instead of a PyTorch tensor. Has anyone else encountered this error?
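One way to confirm this diagnosis is to log what the model actually returns just before the loss is computed. The helper below is a minimal sketch, not part of SlowFast: `describe` is a hypothetical name, and it relies only on the fact that tensors and arrays expose a `.shape` attribute, so it does not import torch.

```python
def describe(obj, depth=0):
    """Return a short, indented description of a (possibly nested) model output."""
    pad = "  " * depth
    shape = getattr(obj, "shape", None)
    if shape is not None:
        # Tensors and arrays expose .shape; report it without importing torch.
        return f"{pad}{type(obj).__name__} shape={tuple(shape)}"
    if isinstance(obj, (list, tuple)):
        # Recurse into containers so a list of tensors shows every element.
        lines = [f"{pad}{type(obj).__name__} len={len(obj)}"]
        lines += [describe(item, depth + 1) for item in obj]
        return "\n".join(lines)
    return f"{pad}{type(obj).__name__}"
```

Calling `print(describe(preds))` right before the `loss = loss_fun(preds, labels)` line shown in the traceback should show whether `preds` is a list and, if so, what shapes its elements have. That placement is a suggestion based on the traceback only.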

P.S. The config I am using is:

TRAIN:
  ENABLE: True
  DATASET: ava
  BATCH_SIZE: 16
  EVAL_PERIOD: 1
  CHECKPOINT_PERIOD: 1
  AUTO_RESUME: True
  CHECKPOINT_FILE_PATH: 'checkpoints/x3d_l.pyth'
  CHECKPOINT_TYPE: pytorch
  CHECKPOINT_EPOCH_RESET: True
  CHECKPOINT_INFLATE: False
  MIXED_PRECISION: False
DATA:
  NUM_FRAMES: 15
  SAMPLING_RATE: 6
  TRAIN_JITTER_SCALES: [256, 320]
  TRAIN_CROP_SIZE: 224
  TEST_CROP_SIZE: 224
  INPUT_CHANNEL_NUM: [3]
  DECODING_BACKEND: torchvision
DETECTION:
  ENABLE: True
  ALIGNED: True
AVA:
  FRAME_DIR: 'path/frames'
  FRAME_LIST_DIR: 'path/frames_list'
  ANNOTATION_DIR: 'path/annotations'
  DETECTION_SCORE_THRESH: 0.8
  TRAIN_PREDICT_BOX_LISTS: [
    "ava_train_v2.2.csv",
    "person_box_67091280_iou90/ava_detection_train_boxes_and_labels_include_negative_v2.2.csv",
  ]
  TEST_PREDICT_BOX_LISTS: ["person_box_67091280_iou90/ava_detection_val_boxes_and_labels.csv"]
X3D:
  WIDTH_FACTOR: 2.0
  DEPTH_FACTOR: 2.2
  BOTTLENECK_FACTOR: 2.25
  DIM_C5: 2048
  DIM_C1: 12
RESNET:
  ZERO_INIT_FINAL_BN: True
  TRANS_FUNC: x3d_transform
  STRIDE_1X1: False
BN:
  USE_PRECISE_STATS: False
  NUM_BATCHES_PRECISE: 200
SOLVER:
  BASE_LR: 0.1
  BASE_LR_SCALE_NUM_SHARDS: True
  LR_POLICY: steps_with_relative_lrs
  STEPS: [0, 10, 15, 20]
  LRS: [1, 0.1, 0.01, 0.001]
  MAX_EPOCH: 1
  WEIGHT_DECAY: 1e-7
  WARMUP_EPOCHS: 5.0
  WARMUP_START_LR: 0.000125
  OPTIMIZING_METHOD: sgd
MODEL:
  NUM_CLASSES: 7
  ARCH: x3d
  MODEL_NAME: X3D
  LOSS_FUNC: cross_entropy
  DROPOUT_RATE: 0.5
  HEAD_ACT: sigmoid
TEST:
  ENABLE: False
  DATASET: ava
  BATCH_SIZE: 1
DATA_LOADER:
  NUM_WORKERS: 5
  PIN_MEMORY: True
NUM_GPUS: 2
NUM_SHARDS: 1
RNG_SEED: 0
OUTPUT_DIR: ./x3d
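A few settings in this config look worth double-checking, and the sketch below flags them mechanically. It is a hypothetical helper (`check_config` is not a SlowFast function) that works on the config loaded as a plain nested dict. The first check assumes that the X3D builder does not construct a detection head, which is consistent with the later suggestion in this thread that a head has to be connected, but it is not confirmed here. The other checks are heuristics about the values themselves, not rules taken from the SlowFast documentation.

```python
def check_config(cfg):
    """Return a list of warnings for a nested config dict (hypothetical helper)."""
    warnings = []
    model, solver = cfg.get("MODEL", {}), cfg.get("SOLVER", {})

    # Assumption: the X3D builder has no detection head, so detection may be unsupported.
    if cfg.get("DETECTION", {}).get("ENABLE") and model.get("ARCH") == "x3d":
        warnings.append("DETECTION.ENABLE is True with ARCH x3d")

    # Softmax cross-entropy combined with a sigmoid head activation is an unusual pairing.
    if model.get("LOSS_FUNC") == "cross_entropy" and model.get("HEAD_ACT") == "sigmoid":
        warnings.append("LOSS_FUNC cross_entropy with HEAD_ACT sigmoid")

    # Warmup longer than the whole run means the base learning rate is never reached.
    max_epoch = solver.get("MAX_EPOCH", 0)
    if solver.get("WARMUP_EPOCHS", 0) > max_epoch:
        warnings.append("WARMUP_EPOCHS exceeds MAX_EPOCH")

    # Learning-rate steps scheduled after the final epoch never take effect.
    if any(step > max_epoch for step in solver.get("STEPS", [])[1:]):
        warnings.append("some STEPS lie beyond MAX_EPOCH")
    return warnings
```

With the values posted above (`MAX_EPOCH: 1`, `WARMUP_EPOCHS: 5.0`, `STEPS: [0, 10, 15, 20]`), all four checks fire. Only the first one seems likely to relate to the `TypeError`; the rest concern the training schedule and the loss choice.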

alpargun commented 1 year ago

This could be related to your PyTorch version. Which version are you using? Another possibility is that your labels are not in the correct shape.
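The label-shape possibility can be checked by comparing shapes by hand. The sketch below covers only batched 2-D predictions of shape (N, C) and two loss names: `cross_entropy` is the `LOSS_FUNC` value from the posted config, while `bce` is an assumed name for binary cross-entropy. `shapes_compatible` is a hypothetical helper. The traceback in this thread complains about `input` rather than `target`, so this is a secondary check at best.

```python
def shapes_compatible(loss_name, pred_shape, label_shape):
    """Check (N, C) prediction shapes against label shapes for two common losses."""
    if len(pred_shape) != 2:
        return False  # this sketch only covers batched 2-D predictions
    n, c = pred_shape
    if loss_name == "cross_entropy":
        # Accepts class indices of shape (N,) or class probabilities of shape (N, C).
        return tuple(label_shape) in {(n,), (n, c)}
    if loss_name == "bce":
        # Binary cross-entropy expects one target per prediction.
        return tuple(label_shape) == (n, c)
    raise ValueError(f"unknown loss: {loss_name}")
```

For example, with `BATCH_SIZE: 16` and `NUM_CLASSES: 7`, `shapes_compatible("cross_entropy", (16, 7), (16,))` returns `True`, while `shapes_compatible("bce", (16, 7), (16,))` returns `False`.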

davidfreire commented 1 year ago

You need to connect the pre-trained network to a head. Check out head_helper.py and this thread, https://github.com/facebookresearch/SlowFast/issues/371, where @yurinishikawa proposes an insightful head. Also, check the paper to see what output to expect from this pre-trained network.