For pre-training SimCLR with ImageNet-1K, I also encountered the same problem. To help with understanding the problem, I attach the yaml config below.
config:
  VERBOSE: False
  LOG_FREQUENCY: 10000
  TEST_ONLY: False
  TEST_MODEL: False
  SEED_VALUE: 0
  MULTI_PROCESSING_METHOD: forkserver
  MONITOR_PERF_STATS: True
  PERF_STAT_FREQUENCY: 10
  ROLLING_BTIME_FREQ: 5
  DATA:
    NUM_DATALOADER_WORKERS: 5
    TRAIN:
      DATA_SOURCES: [disk_folder]
      DATASET_NAMES: [imagenet1k_folder]
      BATCHSIZE_PER_REPLICA: 128
      LABEL_TYPE: sample_index    # just an implementation detail. Label isn't used
      TRANSFORMS:
        # ... (config truncated in the original comment; only the fragment below remained) ...
        # restart_interval_length: 0.5
        wave_type: full
        is_adaptive: True
        restart_interval_length: 0.334
        interval_scaling: [rescaled, rescaled]
        update_interval: step
        lengths: [0.1, 0.9]    # 100ep
  DISTRIBUTED:
    BACKEND: nccl
    NUM_NODES: 1
    NUM_PROC_PER_NODE: 1
    INIT_METHOD: tcp
    RUN_ID: auto
  MACHINE:
    DEVICE: gpu
  CHECKPOINT:
    DIR: "."
    AUTO_RESUME: True
    CHECKPOINT_FREQUENCY: 1
    OVERWRITE_EXISTING: true
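For context on the truncated fragment above: keys such as wave_type, is_adaptive, and restart_interval_length match the cosine_warm_restart learning-rate scheduler used in VISSL's SimCLR configs, while interval_scaling, update_interval, and lengths belong to the enclosing composite scheduler under OPTIMIZER.param_schedulers.lr. The sketch below is only a rough reconstruction of where such a fragment usually sits; the linear-warmup values (0.6, 4.8, 0.0) are illustrative placeholders, not taken from this thread.

    OPTIMIZER:
      param_schedulers:
        lr:
          name: composite
          schedulers:
            - name: linear                 # warmup phase
              start_value: 0.6             # placeholder warmup start LR
              end_value: 4.8               # placeholder peak LR
            - name: cosine_warm_restart    # the fragment above belongs to a scheduler like this
              start_value: 4.8             # placeholder, matches the warmup end
              end_value: 0.0               # placeholder final LR
              wave_type: full
              is_adaptive: True
              restart_interval_length: 0.334
          interval_scaling: [rescaled, rescaled]
          update_interval: step
          lengths: [0.1, 0.9]              # 10% warmup, 90% cosine (100ep schedule)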
Thank you for reaching out. Could you share some details about the machine (GPUs) and the yaml config you are using?
Hi @prigoyal, thanks for your response. I have tested on different GPUs, including A100, GTX600, and Tesla M40. This is the configuration file:
# @package _global_
config:
  VERBOSE: True
  LOG_FREQUENCY: 100
  TEST_ONLY: False
  TEST_MODEL: False
  SEED_VALUE: 0
  MULTI_PROCESSING_METHOD: forkserver
  MONITOR_PERF_STATS: True
  PERF_STAT_FREQUENCY: 10
  ROLLING_BTIME_FREQ: 5
  DATA:
    NUM_DATALOADER_WORKERS: 5
    TRAIN:
      DATA_SOURCES: [disk_folder]
      DATASET_NAMES: [imagenet1k_folder]
      BATCHSIZE_PER_REPLICA: 32
      LABEL_TYPE: sample_index    # isn't used
      TRANSFORMS:
        - name: Resize
          size: 256
        - name: RandomHorizontalFlip
        - name: RandomCrop
          size: 255
        - name: RandomGrayscale
          p: 0.66
        - name: ToTensor
        - name: Normalize
          mean: [0.485, 0.456, 0.406]
          std: [0.229, 0.224, 0.225]
        - name: ImgPatchesFromTensor
          num_patches: 9
          patch_jitter: 21
        - name: ShuffleImgPatches
          perm_file: https://dl.fbaipublicfiles.com/fair_self_supervision_benchmark/jigsaw_permutations/hamming_perms_2000_patches_9_max_avg.npy    # perm 2K
      COLLATE_FUNCTION: siamese_collator
      MMAP_MODE: True
      COPY_TO_LOCAL_DISK: False
  METERS:
    name: accuracy_list_meter
    accuracy_list_meter:
      num_meters: 1
      topk_values: [1]
  TRAINER:
    TRAIN_STEP_NAME: standard_train_step
  MODEL:
    TRUNK:
      NAME: resnet
      RESNETS:
        DEPTH: 50
    HEAD:
      PARAMS: [
        ["mlp", {"dims": [2048, 1000], "use_bn": True, "use_relu": True, "skip_last_layer_relu_bn": False}],
        ["siamese_concat_view", {"num_towers": 9}],
        ["mlp", {"dims": [9000, 2000]}],    # perm 2K
      ]
    SYNC_BN_CONFIG:
      CONVERT_BN_TO_SYNC_BN: True
      SYNC_BN_TYPE: pytorch
    AMP_PARAMS:
      USE_AMP: False
      AMP_ARGS: {"opt_level": "O3", "keep_batchnorm_fp32": True, "master_weights": True, "loss_scale": "dynamic"}
  LOSS:
    name: cross_entropy_multiple_output_single_target
    cross_entropy_multiple_output_single_target:
      ignore_index: -1
  OPTIMIZER:
    name: sgd
    use_larc: True
    larc_config:
      clip: False
      trust_coefficient: 0.001
      eps: 0.000001
    weight_decay: 0.0001
    momentum: 0.9
    num_epochs: 105
    nesterov: False
    regularize_bn: False
    regularize_bias: True
    param_schedulers:
      lr:
        auto_lr_scaling:
          auto_scale: true
          base_value: 0.1
          base_lr_batch_size: 256
        name: composite
        schedulers:
          - name: linear
            start_value: 0.025
            end_value: 0.1
          - name: multistep
            values: [0.1, 0.01, 0.001, 0.0001, 0.00001]
            milestones: [30, 60, 90, 100]
        update_interval: epoch
        interval_scaling: [rescaled, fixed]
        lengths: [0.047619, 0.952381]
  DISTRIBUTED:
    BACKEND: nccl
    NUM_NODES: 1
    NUM_PROC_PER_NODE: 1
    INIT_METHOD: tcp
    RUN_ID: auto
  MACHINE:
    DEVICE: gpu
  CHECKPOINT:
    DIR: "."
    AUTO_RESUME: True
    CHECKPOINT_FREQUENCY: 1
    OVERWRITE_EXISTING: true
Hi there Thac-Thong, can you let us know how long you are expecting the training to take?
Based on the config you provided, your batch size is 32 -- the batch size in the paper is 256. Can you cross-reference all your hyper-params with the original Jigsaw paper and make sure they match, and if possible increase the batch size to 256?
(Also, please note that the ETA reported at the very beginning of a training run is usually longer than the time the run will actually take; it stabilizes after 800+ iterations.)
Just for reference:
BATCHSIZE_PER_REPLICA controls the per-replica (per-GPU) batch size, NUM_NODES: 1 controls the number of nodes, and NUM_PROC_PER_NODE: 1 controls the number of GPUs used per node.
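To make the arithmetic explicit: the effective (global) batch size is BATCHSIZE_PER_REPLICA x NUM_NODES x NUM_PROC_PER_NODE, so the config above trains with a global batch of 32 x 1 x 1 = 32, which is 8x smaller than the paper's 256 and therefore needs 8x more iterations per epoch. Below is a minimal sketch of one way to reach 256, assuming a single machine with 8 GPUs is available (the GPU count is an assumption, not something stated in this thread):

    config:
      DATA:
        TRAIN:
          BATCHSIZE_PER_REPLICA: 32    # 32 images per GPU (per replica)
      DISTRIBUTED:
        NUM_NODES: 1                   # one machine
        NUM_PROC_PER_NODE: 8           # 8 GPUs -> global batch = 32 * 1 * 8 = 256

Alternatively, keep NUM_PROC_PER_NODE: 1 and raise BATCHSIZE_PER_REPLICA to 256 if a single GPU has enough memory; the auto_lr_scaling block in the config is intended to rescale the base LR relative to base_lr_batch_size: 256 in either case.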
I tried to train a Jigsaw model on ImageNet-1K data following the guide here. However, the estimated time to finish training is very long (many days). I have tried running on different types of GPU, on both one and multiple GPUs, but the problem persists. This estimated running time is much longer than the time reported in the Jigsaw paper. Do you have any idea what I have done wrong? Thanks in advance.