facebookresearch / vissl

VISSL is FAIR's library of extensible, modular and scalable components for SOTA Self-Supervised Learning with images.
https://vissl.ai
MIT License

Errors when trying to run nearest neighbor evaluation with DINO & XCiT architecture #515

Open markbarna opened 2 years ago

markbarna commented 2 years ago

Instructions To Reproduce the 🐛 Bug:

  1. what changes you made (git diff) or what code you wrote

    No changes to code.
  2. what exact command you run:

     python /tools/nearest_neighbor_test.py config=benchmark/nearest_neighbor/eval_dino_xcit_kNN

     This is a config I created for the kNN benchmark based on https://github.com/facebookresearch/vissl/blob/main/configs/config/pretrain/dino/dino_16gpus_xcit_small_12_p16.yaml, using the Imagenette2 dataset as a sanity check. The full config is pasted below. I set the feature extraction parameters based on the documentation: https://vissl.readthedocs.io/en/v0.1.5/evaluations/feature_extraction.html#extract-features-of-the-model-head-output-self-supervised-head

  3. what you observed (including full logs):

--- Logging error ---
Traceback (most recent call last):
  File "/home/mbarna/Projects/vissl/vissl/utils/distributed_launcher.py", line 150, in launch_distributed
    _distributed_worker(
  File "/home/mbarna/Projects/vissl/vissl/utils/distributed_launcher.py", line 192, in _distributed_worker
    run_engine(
  File "/home/mbarna/Projects/vissl/vissl/engines/engine_registry.py", line 86, in run_engine
    engine.run_engine(
  File "/home/mbarna/Projects/vissl/vissl/engines/extract_features.py", line 39, in run_engine
    extract_main(
  File "/home/mbarna/Projects/vissl/vissl/engines/extract_features.py", line 106, in extract_main
    trainer.extract(output_folder=cfg.EXTRACT_FEATURES.OUTPUT_DIR or checkpoint_folder)
  File "/home/mbarna/Projects/vissl/vissl/trainer/trainer_main.py", line 365, in extract
    self._extract_split_features(feat_names, self.task, split, output_folder)
  File "/home/mbarna/Projects/vissl/vissl/trainer/trainer_main.py", line 438, in _extract_split_features
    "input": torch.cat(sample["data"]).cuda(non_blocking=True),
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 224 but got size 96 for tensor number 2 in the list.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mbarna/.pyenv/versions/3.8.12/lib/python3.8/logging/__init__.py", line 1085, in emit
    msg = self.format(record)
  File "/home/mbarna/.pyenv/versions/3.8.12/lib/python3.8/logging/__init__.py", line 929, in format
    return fmt.format(record)
  File "/home/mbarna/.pyenv/versions/3.8.12/lib/python3.8/logging/__init__.py", line 668, in format
    record.message = record.getMessage()
  File "/home/mbarna/.pyenv/versions/3.8.12/lib/python3.8/logging/__init__.py", line 373, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "/home/mbarna/Projects/vissl/tools/nearest_neighbor_test.py", line 138, in <module>
    hydra_main(overrides=overrides)
  File "/home/mbarna/Projects/vissl/tools/nearest_neighbor_test.py", line 133, in hydra_main
    main(args, config)
  File "/home/mbarna/Projects/vissl/tools/nearest_neighbor_test.py", line 109, in main
    launch_distributed(
  File "/home/mbarna/Projects/vissl/vissl/utils/distributed_launcher.py", line 162, in launch_distributed
    logging.error("Wrapping up, caught exception: ", e)
Message: 'Wrapping up, caught exception: '
Arguments: (RuntimeError('Sizes of tensors must match except in dimension 0. Expected size 224 but got size 96 for tensor number 2 in the list.'),)
Traceback (most recent call last):
  File "/home/mbarna/Projects/vissl/tools/nearest_neighbor_test.py", line 138, in <module>
    hydra_main(overrides=overrides)
  File "/home/mbarna/Projects/vissl/tools/nearest_neighbor_test.py", line 133, in hydra_main
    main(args, config)
  File "/home/mbarna/Projects/vissl/tools/nearest_neighbor_test.py", line 109, in main
    launch_distributed(
  File "/home/mbarna/Projects/vissl/vissl/utils/distributed_launcher.py", line 164, in launch_distributed
    raise e
  File "/home/mbarna/Projects/vissl/vissl/utils/distributed_launcher.py", line 150, in launch_distributed
    _distributed_worker(
  File "/home/mbarna/Projects/vissl/vissl/utils/distributed_launcher.py", line 192, in _distributed_worker
    run_engine(
  File "/home/mbarna/Projects/vissl/vissl/engines/engine_registry.py", line 86, in run_engine
    engine.run_engine(
  File "/home/mbarna/Projects/vissl/vissl/engines/extract_features.py", line 39, in run_engine
    extract_main(
  File "/home/mbarna/Projects/vissl/vissl/engines/extract_features.py", line 106, in extract_main
    trainer.extract(output_folder=cfg.EXTRACT_FEATURES.OUTPUT_DIR or checkpoint_folder)
  File "/home/mbarna/Projects/vissl/vissl/trainer/trainer_main.py", line 365, in extract
    self._extract_split_features(feat_names, self.task, split, output_folder)
  File "/home/mbarna/Projects/vissl/vissl/trainer/trainer_main.py", line 438, in _extract_split_features
    "input": torch.cat(sample["data"]).cuda(non_blocking=True),
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 224 but got size 96 for tensor number 2 in the list.
  4. please simplify the steps as much as possible so they do not require additional resources to run, such as a private dataset.

     This just requires the imagenette2 dataset and the following config:
CHECKPOINT:
  APPEND_DISTR_RUN_ID: false
  AUTO_RESUME: true
  BACKEND: disk
  CHECKPOINT_FREQUENCY: 1
  CHECKPOINT_ITER_FREQUENCY: -1
  DIR: ./knn-test
  LATEST_CHECKPOINT_RESUME_FILE_NUM: 1
  OVERWRITE_EXISTING: false
  USE_SYMLINK_CHECKPOINT_FOR_RESUME: false
CLUSTERFIT:
  CLUSTER_BACKEND: faiss
  DATA_LIMIT: -1
  DATA_LIMIT_SAMPLING:
    SEED: 0
  FEATURES:
    DATASET_NAME: ''
    DATA_PARTITION: TRAIN
    DIMENSIONALITY_REDUCTION: 0
    EXTRACT: false
    LAYER_NAME: ''
    PATH: .
    TEST_PARTITION: TEST
  NUM_CLUSTERS: 16000
  NUM_ITER: 50
  OUTPUT_DIR: .
DATA:
  DDP_BUCKET_CAP_MB: 25
  ENABLE_ASYNC_GPU_COPY: true
  NUM_DATALOADER_WORKERS: 5
  PIN_MEMORY: true
  TEST:
    BASE_DATASET: generic_ssl
    BATCHSIZE_PER_REPLICA: 16
    COLLATE_FUNCTION: default_collate
    COLLATE_FUNCTION_PARAMS: {}
    COPY_DESTINATION_DIR: ''
    COPY_TO_LOCAL_DISK: false
    DATASET_NAMES:
    - imagenette2
    DATA_LIMIT: -1
    DATA_LIMIT_SAMPLING:
      IS_BALANCED: false
      SEED: 0
      SKIP_NUM_SAMPLES: 0
    DATA_PATHS:
    - /home/mbarna/data/imagenette2/train
    DATA_SOURCES:
    - disk_folder
    DEFAULT_GRAY_IMG_SIZE: 224
    DROP_LAST: false
    ENABLE_QUEUE_DATASET: false
    INPUT_KEY_NAMES:
    - data
    LABEL_PATHS:
    - /home/mbarna/data/imagenette2/train
    LABEL_SOURCES: []
    LABEL_TYPE: sample_index
    MMAP_MODE: false
    NEW_IMG_PATH_PREFIX: ''
    RANDOM_SYNTHETIC_IMAGES: false
    REMOVE_IMG_PATH_PREFIX: ''
    TARGET_KEY_NAMES:
    - label
    TRANSFORMS:
    - name: Resize
      size: 256
    - name: CenterCrop
      size: 224
    - name: ToTensor
    - mean:
      - 0.485
      - 0.456
      - 0.406
      name: Normalize
      std:
      - 0.229
      - 0.224
      - 0.225
    USE_DEBUGGING_SAMPLER: false
    USE_STATEFUL_DISTRIBUTED_SAMPLER: false
  TRAIN:
    BASE_DATASET: generic_ssl
    BATCHSIZE_PER_REPLICA: 16
    COLLATE_FUNCTION: default_collate
    COLLATE_FUNCTION_PARAMS: {}
    COPY_DESTINATION_DIR: ''
    COPY_TO_LOCAL_DISK: false
    DATASET_NAMES:
    - imagenette2
    DATA_LIMIT: -1
    DATA_LIMIT_SAMPLING:
      IS_BALANCED: false
      SEED: 0
      SKIP_NUM_SAMPLES: 0
    DATA_PATHS:
    - /home/mbarna/data/imagenette2/train
    DATA_SOURCES:
    - disk_folder
    DEFAULT_GRAY_IMG_SIZE: 224
    DROP_LAST: false
    ENABLE_QUEUE_DATASET: false
    INPUT_KEY_NAMES:
    - data
    LABEL_PATHS:
    - /home/mbarna/data/imagenette2/train
    LABEL_SOURCES:
    - disk_folder
    LABEL_TYPE: standard
    MMAP_MODE: false
    NEW_IMG_PATH_PREFIX: ''
    RANDOM_SYNTHETIC_IMAGES: false
    REMOVE_IMG_PATH_PREFIX: ''
    TARGET_KEY_NAMES:
    - label
    TRANSFORMS:
    - crop_scales:
      - - 0.3
        - 1
      - - 0.05
        - 0.3
      name: ImgPilToMultiCrop
      num_crops:
      - 2
      - 8
      size_crops:
      - 224
      - 96
      total_num_crops: 10
    - name: RandomHorizontalFlip
      p: 0.5
    - name: ImgPilColorDistortion
      strength: 0.5
    - name: ImgPilMultiCropRandomApply
      prob:
      - 1.0
      - 0.1
      - 0.5
      - 0.5
      - 0.5
      - 0.5
      - 0.5
      - 0.5
      - 0.5
      - 0.5
      transforms:
      - name: ImgPilGaussianBlur
        p: 1.0
        radius_max: 2.0
        radius_min: 0.1
    - name: ImgPilMultiCropRandomApply
      prob:
      - 0.0
      - 0.2
      - 0.0
      - 0.0
      - 0
      - 0
      - 0
      - 0
      - 0
      - 0
      transforms:
      - name: ImgPilRandomSolarize
        p: 1.0
    - name: ToTensor
    - mean:
      - 0.485
      - 0.456
      - 0.406
      name: Normalize
      std:
      - 0.229
      - 0.224
      - 0.225
    USE_DEBUGGING_SAMPLER: false
    USE_STATEFUL_DISTRIBUTED_SAMPLER: false
DISTRIBUTED:
  BACKEND: nccl
  BROADCAST_BUFFERS: true
  INIT_METHOD: tcp
  MANUAL_GRADIENT_REDUCTION: false
  NCCL_DEBUG: false
  NCCL_SOCKET_NTHREADS: ''
  NUM_NODES: 1
  NUM_PROC_PER_NODE: 1
  RUN_ID: auto
EXTRACT_FEATURES:
  CHUNK_THRESHOLD: 0
  OUTPUT_DIR: ''
HOOKS:
  CHECK_NAN: true
  LOG_GPU_STATS: true
  MEMORY_SUMMARY:
    DUMP_MEMORY_ON_EXCEPTION: false
    LOG_ITERATION_NUM: 0
    PRINT_MEMORY_SUMMARY: true
  MODEL_COMPLEXITY:
    COMPUTE_COMPLEXITY: false
    INPUT_SHAPE:
    - 3
    - 224
    - 224
  PERF_STATS:
    MONITOR_PERF_STATS: false
    PERF_STAT_FREQUENCY: -1
    ROLLING_BTIME_FREQ: -1
  TENSORBOARD_SETUP:
    EXPERIMENT_LOG_DIR: tensorboard
    FLUSH_EVERY_N_MIN: 5
    LOG_DIR: .
    LOG_PARAMS: true
    LOG_PARAMS_EVERY_N_ITERS: 310
    LOG_PARAMS_GRADIENTS: true
    USE_TENSORBOARD: false
IMG_RETRIEVAL:
  CROP_QUERY_ROI: false
  DATASET_PATH: ''
  DEBUG_MODE: false
  EVAL_BINARY_PATH: ''
  EVAL_DATASET_NAME: Paris
  FEATS_PROCESSING_TYPE: ''
  GEM_POOL_POWER: 4.0
  IMG_SCALINGS:
  - 1
  NORMALIZE_FEATURES: true
  NUM_DATABASE_SAMPLES: -1
  NUM_QUERY_SAMPLES: -1
  NUM_TRAINING_SAMPLES: -1
  N_PCA: 512
  RESIZE_IMG: 1024
  SAVE_FEATURES: false
  SAVE_RETRIEVAL_RANKINGS_SCORES: true
  SIMILARITY_MEASURE: cosine_similarity
  SPATIAL_LEVELS: 3
  TRAIN_DATASET_NAME: Oxford
  TRAIN_PCA_WHITENING: true
  USE_DISTRACTORS: false
  WHITEN_IMG_LIST: ''
LOG_FREQUENCY: 10
LOSS:
  CrossEntropyLoss:
    ignore_index: -1
  barlow_twins_loss:
    embedding_dim: 8192
    lambda_: 0.0051
    scale_loss: 0.024
  bce_logits_multiple_output_single_target:
    normalize_output: false
    reduction: none
    world_size: 1
  cross_entropy_multiple_output_single_target:
    ignore_index: -1
    normalize_output: false
    reduction: mean
    temperature: 1.0
    weight: null
  deepclusterv2_loss:
    BATCHSIZE_PER_REPLICA: 256
    DROP_LAST: true
    kmeans_iters: 10
    memory_params:
      crops_for_mb:
      - 0
      embedding_dim: 128
    num_clusters:
    - 3000
    - 3000
    - 3000
    num_crops: 2
    num_train_samples: -1
    temperature: 0.1
  dino_loss:
    crops_for_teacher:
    - 0
    - 1
    ema_center: 0.9
    momentum: 0.996
    normalize_last_layer: true
    output_dim: 65536
    student_temp: 0.1
    teacher_temp_max: 0.07
    teacher_temp_min: 0.04
    teacher_temp_warmup_iters: 37500
  moco_loss:
    embedding_dim: 128
    momentum: 0.999
    queue_size: 65536
    temperature: 0.2
  multicrop_simclr_info_nce_loss:
    buffer_params:
      effective_batch_size: 4096
      embedding_dim: 128
      world_size: 64
    num_crops: 2
    temperature: 0.1
  name: CrossEntropyLoss
  nce_loss_with_memory:
    loss_type: nce
    loss_weights:
    - 1.0
    memory_params:
      embedding_dim: 128
      memory_size: -1
      momentum: 0.5
      norm_init: true
      update_mem_on_forward: true
    negative_sampling_params:
      num_negatives: 16000
      type: random
    norm_constant: -1
    norm_embedding: true
    num_train_samples: -1
    temperature: 0.07
    update_mem_with_emb_index: -100
  simclr_info_nce_loss:
    buffer_params:
      effective_batch_size: 4096
      embedding_dim: 128
      world_size: 64
    temperature: 0.1
  swav_loss:
    crops_for_assign:
    - 0
    - 1
    embedding_dim: 128
    epsilon: 0.05
    normalize_last_layer: true
    num_crops: 2
    num_iters: 3
    num_prototypes:
    - 3000
    output_dir: .
    queue:
      local_queue_length: 0
      queue_length: 0
      start_iter: 0
    temp_hard_assignment_iters: 0
    temperature: 0.1
    use_double_precision: false
  swav_momentum_loss:
    crops_for_assign:
    - 0
    - 1
    embedding_dim: 128
    epsilon: 0.05
    momentum: 0.99
    momentum_eval_mode_iter_start: 0
    normalize_last_layer: true
    num_crops: 2
    num_iters: 3
    num_prototypes:
    - 3000
    queue:
      local_queue_length: 0
      queue_length: 0
      start_iter: 0
    temperature: 0.1
    use_double_precision: false
MACHINE:
  DEVICE: gpu
METERS:
  accuracy_list_meter:
    meter_names: []
    num_meters: 1
    topk_values:
    - 1
  enable_training_meter: true
  mean_ap_list_meter:
    max_cpu_capacity: -1
    meter_names: []
    num_classes: 9605
    num_meters: 1
  model_output_mask: false
  name: ''
  names: []
  precision_at_k_list_meter:
    meter_names: []
    num_meters: 1
    topk_values:
    - 1
  recall_at_k_list_meter:
    meter_names: []
    num_meters: 1
    topk_values:
    - 1
MODEL:
  ACTIVATION_CHECKPOINTING:
    NUM_ACTIVATION_CHECKPOINTING_SPLITS: 2
    USE_ACTIVATION_CHECKPOINTING: false
  AMP_PARAMS:
    AMP_ARGS:
      opt_level: O1
    AMP_TYPE: apex
    USE_AMP: false
  BASE_MODEL_NAME: multi_input_output_model
  CUDA_CACHE:
    CLEAR_CUDA_CACHE: false
    CLEAR_FREQ: 100
  FEATURE_EVAL_SETTINGS:
    EVAL_MODE_ON: true
    EVAL_TRUNK_AND_HEAD: true
    EXTRACT_TRUNK_FEATURES_ONLY: false
    FREEZE_TRUNK_AND_HEAD: true
    FREEZE_TRUNK_ONLY: false
    LINEAR_EVAL_FEAT_POOL_OPS_MAP: []
    SHOULD_FLATTEN_FEATS: false
  FSDP_CONFIG:
    AUTO_WRAP_THRESHOLD: 0
    bucket_cap_mb: 0
    clear_autocast_cache: true
    compute_dtype: float32
    flatten_parameters: true
    fp32_reduce_scatter: false
    mixed_precision: true
    verbose: true
  GRAD_CLIP:
    MAX_NORM: 1
    NORM_TYPE: 2
    USE_GRAD_CLIP: false
  HEAD:
    BATCHNORM_EPS: 1.0e-05
    BATCHNORM_MOMENTUM: 0.1
    PARAMS:
    - - swav_head
      - activation_name: GELU
        dims:
        - 384
        - 2048
        - 2048
        - 256
        num_clusters:
        - 65536
        return_embeddings: false
        use_bn: false
        use_weight_norm_prototypes: true
    PARAMS_MULTIPLIER: 1.0
  INPUT_TYPE: rgb
  MULTI_INPUT_HEAD_MAPPING: []
  NON_TRAINABLE_PARAMS: []
  SHARDED_DDP_SETUP:
    USE_SDP: false
    reduce_buffer_size: -1
  SINGLE_PASS_EVERY_CROP: false
  SYNC_BN_CONFIG:
    CONVERT_BN_TO_SYNC_BN: false
    GROUP_SIZE: -1
    SYNC_BN_TYPE: pytorch
  TEMP_FROZEN_PARAMS_ITER_MAP: []
  TRUNK:
    CONVIT:
      CLASS_TOKEN_IN_LOCAL_LAYERS: false
      LOCALITY_DIM: 10
      LOCALITY_STRENGTH: 1.0
      N_GPSA_LAYERS: 10
      USE_LOCAL_INIT: true
    EFFICIENT_NETS: {}
    NAME: xcit
    REGNET: {}
    RESNETS:
      DEPTH: 50
      GROUPNORM_GROUPS: 32
      GROUPS: 1
      LAYER4_STRIDE: 2
      NORM: BatchNorm
      STANDARDIZE_CONVOLUTIONS: false
      WIDTH_MULTIPLIER: 1
      WIDTH_PER_GROUP: 64
      ZERO_INIT_RESIDUAL: false
    VISION_TRANSFORMERS:
      ATTENTION_DROPOUT_RATE: 0
      CLASSIFIER: token
      DROPOUT_RATE: 0
      DROP_PATH_RATE: 0
      HIDDEN_DIM: 768
      IMAGE_SIZE: 224
      MLP_DIM: 3072
      NUM_HEADS: 12
      NUM_LAYERS: 12
      PATCH_SIZE: 16
      QKV_BIAS: false
      QK_SCALE: false
      name: null
    XCIT:
      ATTENTION_DROPOUT_RATE: 0
      DROPOUT_RATE: 0
      DROP_PATH_RATE: 0.05
      ETA: 1
      HIDDEN_DIM: 384
      IMAGE_SIZE: 224
      NUM_HEADS: 8
      NUM_LAYERS: 12
      PATCH_SIZE: 16
      QKV_BIAS: true
      QK_SCALE: false
      TOKENS_NORM: true
      name: null
  WEIGHTS_INIT:
    APPEND_PREFIX: ''
    PARAMS_FILE: /home/mbarna/data/pre_trained_weights/vissl/dino_300ep_xcitsmall16.torch
    REMOVE_PREFIX: ''
    SKIP_LAYERS:
    - num_batches_tracked
    STATE_DICT_KEY_NAME: classy_state_dict
  _MODEL_INIT_SEED: 0
MONITORING:
  MONITOR_ACTIVATION_STATISTICS: 0
MULTI_PROCESSING_METHOD: forkserver
NEAREST_NEIGHBOR:
  L2_NORM_FEATS: false
  SIGMA: 0.1
  TOPK: 200
OPTIMIZER:
  betas:
  - 0.9
  - 0.999
  construct_single_param_group_only: false
  head_optimizer_params:
    use_different_lr: false
    use_different_wd: false
    weight_decay: 0.0001
  larc_config:
    clip: false
    eps: 1.0e-08
    trust_coefficient: 0.001
  momentum: 0.9
  name: sgd
  nesterov: false
  non_regularized_parameters: []
  num_epochs: 90
  param_schedulers:
    lr:
      auto_lr_scaling:
        auto_scale: false
        base_lr_batch_size: 256
        base_value: 0.1
        scaling_type: linear
      end_value: 0.0
      interval_scaling: &id001 []
      lengths: &id002 []
      milestones: &id003
      - 30
      - 60
      name: multistep
      schedulers: &id004 []
      start_value: 0.1
      update_interval: epoch
      value: 0.1
      values: &id005
      - 0.1
      - 0.01
      - 0.001
    lr_head:
      auto_lr_scaling:
        auto_scale: false
        base_lr_batch_size: 256
        base_value: 0.1
        scaling_type: linear
      end_value: 0.0
      interval_scaling: *id001
      lengths: *id002
      milestones: *id003
      name: multistep
      schedulers: *id004
      start_value: 0.1
      update_interval: epoch
      value: 0.1
      values: *id005
  regularize_bias: true
  regularize_bn: false
  use_larc: false
  use_zero: false
  weight_decay: 0.0001
PROFILING:
  MEMORY_PROFILING:
    TRACK_BY_LAYER_MEMORY: false
  NUM_ITERATIONS: 10
  OUTPUT_FOLDER: .
  PROFILED_RANKS:
  - 0
  - 1
  RUNTIME_PROFILING:
    LEGACY_PROFILER: false
    PROFILE_CPU: true
    PROFILE_GPU: true
    USE_PROFILER: false
  START_ITERATION: 0
  STOP_TRAINING_AFTER_PROFILING: false
  WARMUP_ITERATIONS: 0
REPRODUCIBILITY:
  CUDDN_DETERMINISTIC: false
SEED_VALUE: 0
SLURM:
  ADDITIONAL_PARAMETERS: {}
  COMMENT: vissl job
  CONSTRAINT: ''
  LOG_FOLDER: .
  MEM_GB: 250
  NAME: vissl
  NUM_CPU_PER_PROC: 8
  PARTITION: ''
  PORT_ID: 40050
  TIME_HOURS: 72
  TIME_MINUTES: 0
  USE_SLURM: false
SVM:
  cls_list: []
  costs:
    base: -1.0
    costs_list:
    - 0.1
    - 0.01
    power_range:
    - 4
    - 20
  cross_val_folds: 3
  dual: true
  force_retrain: false
  loss: squared_hinge
  low_shot:
    dataset_name: voc
    k_values:
    - 1
    - 2
    - 4
    - 8
    - 16
    - 32
    - 64
    - 96
    sample_inds:
    - 1
    - 2
    - 3
    - 4
    - 5
  max_iter: 2000
  normalize: true
  penalty: l2
TEST_EVERY_NUM_EPOCH: 1
TEST_MODEL: true
TEST_ONLY: false
TRAINER:
  TASK_NAME: self_supervision_task
  TRAIN_STEP_NAME: standard_train_step
VERBOSE: false
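
One side note on the log in step 3: the "--- Logging error ---" block appears to be a separate, secondary issue in the launcher's error handling (the logging.error call in launch_distributed shown in the call stack), not the root cause. A minimal sketch of what seems to trigger it; the message passed to logging.error has no placeholder for the exception argument:

import logging

err = RuntimeError("Sizes of tensors must match except in dimension 0.")

# Same pattern as the call in distributed_launcher.py: the exception is
# passed as a %-formatting argument, but the message has no placeholder,
# so the logging module reports "not all arguments converted during
# string formatting" instead of the intended message.
logging.error("Wrapping up, caught exception: ", err)

# Adding a placeholder avoids the secondary error (a sketch, not
# necessarily how VISSL would fix it):
logging.error("Wrapping up, caught exception: %s", err)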

Expected behavior:

The kNN evaluation should run.

The issue seems to be something with the dataloader, but I haven't been able to track it down yet.
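
To illustrate what I think is happening, here is a minimal standalone sketch outside VISSL; the tensor shapes are assumptions based on the ImgPilToMultiCrop settings (2 crops at 224, 8 crops at 96) and the batch size of 16 in the config above:

import torch

# Assumed stand-in for sample["data"] after collation: one tensor per crop,
# 2 global crops at 224x224 and 8 local crops at 96x96.
data = [torch.randn(16, 3, 224, 224) for _ in range(2)]
data += [torch.randn(16, 3, 96, 96) for _ in range(8)]

try:
    # trainer_main.py concatenates the per-crop tensors along dim 0;
    # torch.cat requires all other dimensions to match, so 224 vs 96 fails.
    torch.cat(data)
except RuntimeError as err:
    print(err)  # Sizes of tensors must match except in dimension 0 ...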

Environment:


sys.platform         linux
Python               3.8.12 (default, Oct 12 2021, 11:33:23) [GCC 7.5.0]
numpy                1.19.5
Pillow               8.4.0
vissl                0.1.7-dev.2 @/home/mbarna/Projects/vissl/vissl
GPU available        True
GPU 0,1              Tesla T4
CUDA_HOME            /usr/local/cuda
torchvision          0.11.1+cu111 @/home/mbarna/.pyenv/versions/vissl/lib/python3.8/site-packages/torchvision
hydra                1.0.7 @/home/mbarna/.pyenv/versions/vissl/lib/python3.8/site-packages/hydra
classy_vision        0.7.0.dev @/home/mbarna/.pyenv/versions/vissl/lib/python3.8/site-packages/classy_vision
tensorboard          2.7.0
apex                 0.1 @/home/mbarna/.pyenv/versions/vissl/lib/python3.8/site-packages/apex
cv2                  4.5.4-dev
PyTorch              1.10.0+cu111 @/home/mbarna/.pyenv/versions/vissl/lib/python3.8/site-packages/torch
PyTorch debug build  False
-------------------  ----------------------------------------------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.0.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

CPU info:
-------------------  --------------------------------
Architecture         x86_64
CPU op-mode(s)       32-bit, 64-bit
Byte Order           Little Endian
CPU(s)               128
On-line CPU(s) list  0-127
Thread(s) per core   2
Core(s) per socket   64
Socket(s)            1
NUMA node(s)         1
Vendor ID            AuthenticAMD
CPU family           23
Model                49
Model name           AMD EPYC 7702P 64-Core Processor
Stepping             0
CPU MHz              1490.776
CPU max MHz          2000.0000
CPU min MHz          1500.0000
BogoMIPS             3999.98
Virtualization       AMD-V
L1d cache            32K
L1i cache            32K
L2 cache             512K
L3 cache             16384K
NUMA node0 CPU(s)    0-127

When to expect Triage

VISSL devs and contributors aim to triage issues as soon as possible. However, as a general guideline, we ask users to expect triaging within 1-2 weeks.

QuentinDuval commented 2 years ago

Hi @markbarna,

First of all, thanks for using VISSL :)

I had a look at the configuration you are using and the culprit is the multi-crop transformation in config.DATA.TRAIN. Multi-crop creates a batch with multiple crop sizes, which is not supported in general, only by the very specific SSL algorithms that leverage it (SwAV, DINO, etc.), and it is not supported in kNN evaluation.

To do kNN evaluation, you should change the transforms to something like this:

    TRANSFORMS:
    - name: Resize
      size: 256
    - name: CenterCrop
      size: 224
    - name: ToTensor
    - mean:
      - 0.485
      - 0.456
      - 0.406
      name: Normalize
      std:
      - 0.229
      - 0.224
      - 0.225
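
For intuition, these map onto the standard torchvision transforms of the same names; roughly (a sketch, with the mean/std values copied from your config):

from torchvision import transforms

# Roughly equivalent torchvision pipeline for the single-crop TRANSFORMS
# above; every image becomes a 3x224x224 tensor, so the feature-extraction
# collate no longer mixes crop sizes.
single_crop = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])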

I hope this helps. Please tell me if this works better now :)

Thank you, Quentin