facebookresearch / SlowFast

PySlowFast: video understanding codebase from FAIR for reproducing state-of-the-art video models.
Apache License 2.0

Multigrid Charades with only 1 GPU #308

Open rosarioscavo opened 3 years ago

rosarioscavo commented 3 years ago

Hi, I downloaded the Charades dataset and tried to train the dataset with the command: python tools/run_net.py --cfg configs/Charades/SLOWFAST_16x8_R50_multigrid.yaml DATA.PATH_TO_DATA_DIR ../Charades_v1_rgb/

Since I have only 1 GPU, I edited the NUM_GPUS parameter in the .yaml to 1. The resulting .yaml configuration is:

MULTIGRID:
  SHORT_CYCLE: True
  LONG_CYCLE: True
TRAIN:
  ENABLE: True
  DATASET: charades
  BATCH_SIZE: 16
  EVAL_PERIOD: 6
  CHECKPOINT_PERIOD: 6
  AUTO_RESUME: True
  CHECKPOINT_FILE_PATH: SLOWFAST_16x8_R50_multigrid.pkl
  CHECKPOINT_TYPE: pytorch
DATA:
  NUM_FRAMES: 64
  SAMPLING_RATE: 2
  TRAIN_JITTER_SCALES: [256, 340]
  TRAIN_CROP_SIZE: 224
  TEST_CROP_SIZE: 256
  INPUT_CHANNEL_NUM: [3, 3]
  MULTI_LABEL: True
  INV_UNIFORM_SAMPLE: True
  ENSEMBLE_METHOD: max
  REVERSE_INPUT_CHANNEL: True
SLOWFAST:
  ALPHA: 4
  BETA_INV: 8
  FUSION_CONV_CHANNEL_RATIO: 2
  FUSION_KERNEL_SZ: 7
RESNET:
  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [2, 2]]
  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [1, 1]]
  ZERO_INIT_FINAL_BN: True
  WIDTH_PER_GROUP: 64
  NUM_GROUPS: 1
  DEPTH: 50
  TRANS_FUNC: bottleneck_transform
  STRIDE_1X1: False
  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 6], [3, 3]]
NONLOCAL:
  LOCATION: [[[], []], [[], []], [[], []], [[], []]]
  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]
  INSTANTIATION: dot_product
BN:
  USE_PRECISE_STATS: True
  NUM_BATCHES_PRECISE: 200
  NORM_TYPE: sync_batchnorm
  NUM_SYNC_DEVICES: 4
SOLVER:
  BASE_LR: 0.0375
  LR_POLICY: steps_with_relative_lrs
  LRS: [1, 0.1, 0.01, 0.001, 0.0001, 0.00001]
  STEPS: [0, 41, 49]
  MAX_EPOCH: 57
  MOMENTUM: 0.9
  WEIGHT_DECAY: 1e-4
  WARMUP_EPOCHS: 4.0
  WARMUP_START_LR: 0.0001
  OPTIMIZING_METHOD: sgd
MODEL:
  NUM_CLASSES: 157
  ARCH: slowfast
  LOSS_FUNC: bce_logit
  HEAD_ACT: sigmoid
  DROPOUT_RATE: 0.5
TEST:
  ENABLE: True
  DATASET: charades
  BATCH_SIZE: 16
  NUM_ENSEMBLE_VIEWS: 10
  NUM_SPATIAL_CROPS: 3
DATA_LOADER:
  NUM_WORKERS: 8
  PIN_MEMORY: True
NUM_GPUS: 1
NUM_SHARDS: 1
RNG_SEED: 0
OUTPUT_DIR: .
LOG_MODEL_INFO: False

Running the command, I get this error:

File "tools/run_net.py", line 42, in <module>
    main()
  File "tools/run_net.py", line 23, in main
    launch_job(cfg=cfg, init_method=args.init_method, func=train)
  File "/home/rscavo/SlowFast/slowfast/utils/misc.py", line 296, in launch_job
    func(cfg=cfg)
  File "/home/rscavo/SlowFast/tools/train_net.py", line 393, in train
    train_loader = loader.construct_loader(cfg, "train")
  File "/home/rscavo/SlowFast/slowfast/datasets/loader.py", line 88, in construct_loader
    batch_sampler = ShortCycleBatchSampler(
  File "/home/rscavo/SlowFast/slowfast/datasets/multigrid_helper.py", line 20, in __init__
    raise ValueError(
ValueError: sampler should be an instance of torch.utils.data.Sampler, but got sampler=None

The sampler is created using the create_sampler function: https://github.com/facebookresearch/SlowFast/blob/fd41618191d3c21c1ca21a61369ce9917646cf9c/slowfast/datasets/loader.py#L87

the create_sampler is defined as follows: https://github.com/facebookresearch/SlowFast/blob/fd41618191d3c21c1ca21a61369ce9917646cf9c/slowfast/datasets/utils.py#L304-L318

I tried to change

cfg.NUM_GPUS > 1 

to

cfg.NUM_GPUS > 0

With that change, the error became:

File "tools/run_net.py", line 42, in <module>
    main()
  File "tools/run_net.py", line 23, in main
    launch_job(cfg=cfg, init_method=args.init_method, func=train)
  File "/home/rscavo/SlowFast/slowfast/utils/misc.py", line 296, in launch_job
    func(cfg=cfg)
  File "/home/rscavo/SlowFast/tools/train_net.py", line 393, in train
    train_loader = loader.construct_loader(cfg, "train")
  File "/home/rscavo/SlowFast/slowfast/datasets/loader.py", line 87, in construct_loader
    sampler = utils.create_sampler(dataset, shuffle, cfg)
  File "/home/rscavo/SlowFast/slowfast/datasets/utils.py", line 316, in create_sampler
    sampler = DistributedSampler(dataset) if cfg.NUM_GPUS > 0 else None
  File "/home/rscavo/anaconda3/lib/python3.8/site-packages/torch/utils/data/distributed.py", line 54, in __init__
    num_replicas = dist.get_world_size()
  File "/home/rscavo/anaconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 620, in get_world_size
    return _get_group_size(group)
  File "/home/rscavo/anaconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 219, in _get_group_size
    _check_default_pg()
  File "/home/rscavo/anaconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 209, in _check_default_pg
    assert _default_pg is not None, \
AssertionError: Default process group is not initialized

Do you know a way to be able to use SlowFast with only one GPU or am I doing something incorrectly? Thank you!

mrevow commented 3 years ago

I ran into a similar, but not identical, issue when training across multiple nodes with 1 GPU per node (without multigrid). My workaround is to always initialize distributed training by commenting out these two lines

This may help with the change you made
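If you go that route, always initializing the default process group for a single process can be sketched like this (an illustrative snippet, not the exact lines from the repo; the port is arbitrary):

```python
import torch.distributed as dist

# Initialize a 1-process "distributed" group so that components which
# assume an initialized default process group (e.g. DistributedSampler)
# can run with a single GPU. The gloo backend also works on CPU-only
# machines, unlike nccl.
if not dist.is_initialized():
    dist.init_process_group(
        backend="gloo",
        init_method="tcp://127.0.0.1:29500",
        world_size=1,
        rank=0,
    )
```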

rosarioscavo commented 3 years ago

I think the problem was that I tried to use multigrid with only 1 GPU, because with the "standard" configuration file and the "standard" Charades model it worked without any edits.

abhaygargab commented 3 years ago

Hello,

I am facing the same problem. I checked the Multigrid paper, and they show results for a 1-GPU setting on the Kinetics dataset. Any guidance would be helpful. Thank you!

PotentialX commented 3 years ago

I think you can change

sampler = DistributedSampler(dataset) if cfg.NUM_GPUS > 1 else None

to

sampler = DistributedSampler(dataset) if cfg.NUM_GPUS > 1 else RandomSampler(dataset)

It works on my machine.
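For completeness, that suggestion amounts to a patch like the following (a sketch with the `RandomSampler` import added and a dummy `cfg` for illustration; whether it interacts correctly with multigrid's long-cycle batch rescaling is untested here):

```python
from types import SimpleNamespace

import torch
from torch.utils.data import DistributedSampler, RandomSampler, TensorDataset


def create_sampler(dataset, shuffle, cfg):
    # Multi-GPU: shard the dataset across processes, as before.
    if cfg.NUM_GPUS > 1:
        return DistributedSampler(dataset)
    # Single GPU: hand ShortCycleBatchSampler a real Sampler instance
    # instead of None, which is what triggered the original ValueError.
    return RandomSampler(dataset)


dataset = TensorDataset(torch.arange(8))
sampler = create_sampler(dataset, shuffle=True, cfg=SimpleNamespace(NUM_GPUS=1))
print(type(sampler).__name__, len(sampler))  # -> RandomSampler 8
```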