facebookresearch / vissl

VISSL is FAIR's library of extensible, modular and scalable components for SOTA Self-Supervised Learning with images.
https://vissl.ai
MIT License

Trouble when using moco_loss and moco_collator #534

Closed mcwindy closed 1 year ago

mcwindy commented 2 years ago

Instructions To Reproduce the 🐛 Bug:

  1. what changes you made (git diff) or what code you wrote

Modified the main function in run_distributed_engines.py as follows:

def my_register_data():
    # Register the custom dataset catalog so that the DATASET_NAMES entries
    # in the YAML config (e.g. "dummy_data_folder") resolve to paths on disk.
    from vissl.data import VisslDatasetCatalog
    VisslDatasetCatalog.register_json('/home/mcwindy/vissltest/configs/config/dataset_catalog.json')

if __name__ == "__main__":
    my_register_data()
    """
    Example usage:

    `python tools/run_distributed_engines.py config=test/integration_test/quick_simclr`
    """
    overrides = sys.argv[1:]
    assert is_hydra_available(), "Make sure to install hydra"
    overrides.append("hydra.verbose=true")
    hydra_main(overrides=overrides)
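
A quick way to sanity-check the catalog being registered is to load it with the standard library and verify that every path exists. This is only an illustrative sketch (the catalog path is the one used in this report, and the structure matches the dataset_catalog.json shown near the end of this issue):

import json
import os

# Hypothetical sanity check: confirm every split in the catalog points at a real folder.
catalog_path = "/home/mcwindy/vissltest/configs/config/dataset_catalog.json"
with open(catalog_path) as f:
    catalog = json.load(f)

for dataset_name, splits in catalog.items():   # e.g. "dummy_data_folder"
    for split, paths in splits.items():        # "train" / "val" / "test"
        for path in paths:
            print(dataset_name, split, path, "exists:", os.path.isdir(path))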

Modified supervised_1gpu_resnet_example.yaml as follows:

# @package _global_
config:
  CHECKPOINT:
    DIR: "checkpoints1"
    AUTO_RESUME: True
    CHECKPOINT_FREQUENCY: 1
  DATA:
    NUM_DATALOADER_WORKERS: 5
    TRAIN:
      DATA_SOURCES: [disk_folder]
      DATA_PATHS: []
      LABEL_SOURCES: [disk_folder]
      DATASET_NAMES: [disk_folder]
      BATCHSIZE_PER_REPLICA: 32
      TRANSFORMS:
        - name: RandomResizedCrop
          size: 224
        - name: RandomHorizontalFlip
        - name: ColorJitter
          brightness: 0.4
          contrast: 0.4
          saturation: 0.4
          hue: 0.4
        - name: ToTensor
        - name: Normalize
          mean: [0.485, 0.456, 0.406]
          std: [0.229, 0.224, 0.225]
      COLLATE_FUNCTION: "moco_collator"
      COLLATE_FUNCTION_PARAMS: {}
    TEST:
      DATA_SOURCES: [disk_folder]
      DATA_PATHS: []
      LABEL_SOURCES: [disk_folder]
      DATASET_NAMES: [disk_folder]
      BATCHSIZE_PER_REPLICA: 32
      TRANSFORMS:
        - name: Resize
          size: 256
        - name: CenterCrop
          size: 224
        - name: ToTensor
        - name: Normalize
          mean: [0.485, 0.456, 0.406]
          std: [0.229, 0.224, 0.225]
  MODEL:
    TRUNK:
      NAME: resnet
      TRUNK_PARAMS:
        RESNETS:
          DEPTH: 50
    HEAD:
      PARAMS: [
        ["mlp", {"dims": [2048, 1000]}],
      ]
  LOSS:
    name: moco_loss
    moco_loss:
      embedding_dim: 128
      momentum: 0.999
      queue_size: 65536
      temperature: 0.2
  OPTIMIZER:
      name: sgd
      weight_decay: 0.0001
      momentum: 0.9
      num_epochs: 105
      nesterov: True
      regularize_bn: False
      regularize_bias: True
      param_schedulers:
        lr:
          auto_lr_scaling: # learning rate is automatically scaled based on batch size
            auto_scale: true
            base_value: 0.1
            base_lr_batch_size: 256 # learning rate of 0.1 is used for batch size of 256
          name: multistep
          # We want the learning rate to drop by 1/10
          # at epochs [30, 60, 90, 100]
          milestones: [30, 60, 90, 100] # epochs at which to drop the learning rate (N vals)
          values: [0.1, 0.01, 0.001, 0.0001, 0.00001] # the exact values of learning rate (N+1 vals)
          update_interval: epoch
  METERS:
    name: accuracy_list_meter
    accuracy_list_meter:
      num_meters: 1
      topk_values: [1, 5]
  TRAINER:
    TRAIN_STEP_NAME: standard_train_step
  DISTRIBUTED:
    BACKEND: nccl
    NUM_NODES: 1
    NUM_PROC_PER_NODE: 1 # 1 GPU
    RUN_ID: auto
  MACHINE:
    DEVICE: gpu
  VERBOSE: True
  LOG_FREQUENCY: 100
  TEST_ONLY: False
  TEST_EVERY_NUM_EPOCH: 1
  TEST_MODEL: True
  SEED_VALUE: 0
  MULTI_PROCESSING_METHOD: fork
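
Side note on the schedule above: with auto_scale: true the base learning rate is rescaled linearly by the effective batch size. With BATCHSIZE_PER_REPLICA: 32 on a single GPU, every milestone value is multiplied by 32 / 256 = 0.125, which is why the config dump in the logs below shows values: [0.0125, 0.00125, ...] instead of the values written in the YAML. A minimal sketch of that arithmetic, using only numbers taken from this config:

# Linear auto-LR scaling as configured above (all numbers taken from this config).
base_value = 0.1
base_lr_batch_size = 256
batchsize_per_replica = 32
num_gpus = 1

effective_batch_size = batchsize_per_replica * num_gpus   # 32
scale = effective_batch_size / base_lr_batch_size         # 32 / 256 = 0.125

values = [0.1, 0.01, 0.001, 0.0001, 0.00001]
print([v * scale for v in values])  # [0.0125, 0.00125, 0.000125, 1.25e-05, 1.25e-06]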
  2. what exact command you run:

    python3 ./tools/run_distributed_engines.py hydra.verbose=true config=supervised_1gpu_resnet_example config.HOOKS.TENSORBOARD_SETUP.USE_TENSORBOARD=True config.DATA.TRAIN.DATASET_NAMES="[dummy_data_folder]" config.DATA.TEST.DATASET_NAMES="[dummy_data_folder]"
  3. what you observed (including full logs):

    
    ** fvcore version of PathManager will be deprecated soon. **
    ** Please migrate to the version in iopath repo. **
    https://github.com/facebookresearch/iopath 

/home/mcwindy/.local/lib/python3.9/site-packages/torchvision/transforms/_functional_video.py:6: UserWarning: The 'torchvision.transforms._functional_video' module is deprecated since 0.12 and will be removed in 0.14. Please use the 'torchvision.transforms.functional' module instead. warnings.warn( /home/mcwindy/.local/lib/python3.9/site-packages/torchvision/transforms/_transforms_video.py:25: UserWarning: The 'torchvision.transforms._transforms_video' module is deprecated since 0.12 and will be removed in 0.14. Please use the 'torchvision.transforms' module instead. warnings.warn( ####### overrides: ['hydra.verbose=true', 'config=supervised_1gpu_resnet_example', 'config.HOOKS.TENSORBOARD_SETUP.USE_TENSORBOARD=True', 'config.DATA.TRAIN.DATASET_NAMES=[dummy_data_folder]', 'config.DATA.TEST.DATASET_NAMES=[dummy_data_folder]', 'hydra.verbose=true'] INFO 2022-03-29 23:39:29,052 init.py: 37: Provided Config has latest version: 1 INFO 2022-03-29 23:39:29,053 io.py: 63: Saving data to file: checkpoints1/train_config.yaml INFO 2022-03-29 23:39:29,072 io.py: 89: Saved data to file: checkpoints1/train_config.yaml INFO 2022-03-29 23:39:29,072 run_distributed_engines.py: 162: Spawning process for node_id: 0, local_rank: 0, dist_rank: 0, dist_run_id: localhost:50653 INFO 2022-03-29 23:39:29,072 train.py: 94: Env set for rank: 0, dist_rank: 0 INFO 2022-03-29 23:39:29,073 env.py: 50: ALL_PROXY: INFO 2022-03-29 23:39:29,073 env.py: 50: COLORTERM: truecolor INFO 2022-03-29 23:39:29,073 env.py: 50: CPLUS_INCLUDE_PATH: /usr/local/include/python3.8/ INFO 2022-03-29 23:39:29,073 env.py: 50: CUDA_PATH: /usr/local/cuda-11.5/targets/x86_64-linux/include/ INFO 2022-03-29 23:39:29,073 env.py: 50: C_INCLUDE_PATH: /usr/local/include/python3.8/ INFO 2022-03-29 23:39:29,073 env.py: 50: DISPLAY: :0 INFO 2022-03-29 23:39:29,073 env.py: 50: GIT_ASKPASS: /home/mcwindy/.vscode-server/bin/c722ca6c7eed3d7987c0d5c3df5c45f6b15e77d1/extensions/git/dist/askpass.sh INFO 2022-03-29 23:39:29,073 env.py: 50: HOME: /home/mcwindy INFO 2022-03-29 23:39:29,073 env.py: 50: HOSTTYPE: x86_64 INFO 2022-03-29 23:39:29,073 env.py: 50: HTTPS_proxy: http://172.28.0.1:7890 INFO 2022-03-29 23:39:29,073 env.py: 50: HTTP_PROXY: http://172.28.0.1:7890 INFO 2022-03-29 23:39:29,073 env.py: 50: LANG: C.UTF-8 INFO 2022-03-29 23:39:29,073 env.py: 50: LESS: -R INFO 2022-03-29 23:39:29,073 env.py: 50: LOCAL_RANK: 0 INFO 2022-03-29 23:39:29,073 env.py: 50: LOGNAME: mcwindy INFO 2022-03-29 23:39:29,073 env.py: 50: LSCOLORS: Gxfxcxdxbxegedabagacad INFO 2022-03-29 23:39:29,074 env.py: 50: LS_COLORS: 
rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:.tar=01;31:.tgz=01;31:.arc=01;31:.arj=01;31:.taz=01;31:.lha=01;31:.lz4=01;31:.lzh=01;31:.lzma=01;31:.tlz=01;31:.txz=01;31:.tzo=01;31:.t7z=01;31:.zip=01;31:.z=01;31:.dz=01;31:.gz=01;31:.lrz=01;31:.lz=01;31:.lzo=01;31:.xz=01;31:.zst=01;31:.tzst=01;31:.bz2=01;31:.bz=01;31:.tbz=01;31:.tbz2=01;31:.tz=01;31:.deb=01;31:.rpm=01;31:.jar=01;31:.war=01;31:.ear=01;31:.sar=01;31:.rar=01;31:.alz=01;31:.ace=01;31:.zoo=01;31:.cpio=01;31:.7z=01;31:.rz=01;31:.cab=01;31:.wim=01;31:.swm=01;31:.dwm=01;31:.esd=01;31:.jpg=01;35:.jpeg=01;35:.mjpg=01;35:.mjpeg=01;35:.gif=01;35:.bmp=01;35:.pbm=01;35:.pgm=01;35:.ppm=01;35:.tga=01;35:.xbm=01;35:.xpm=01;35:.tif=01;35:.tiff=01;35:.png=01;35:.svg=01;35:.svgz=01;35:.mng=01;35:.pcx=01;35:.mov=01;35:.mpg=01;35:.mpeg=01;35:.m2v=01;35:.mkv=01;35:.webm=01;35:.webp=01;35:.ogm=01;35:.mp4=01;35:.m4v=01;35:.mp4v=01;35:.vob=01;35:.qt=01;35:.nuv=01;35:.wmv=01;35:.asf=01;35:.rm=01;35:.rmvb=01;35:.flc=01;35:.avi=01;35:.fli=01;35:.flv=01;35:.gl=01;35:.dl=01;35:.xcf=01;35:.xwd=01;35:.yuv=01;35:.cgm=01;35:.emf=01;35:.ogv=01;35:.ogx=01;35:.aac=00;36:.au=00;36:.flac=00;36:.m4a=00;36:.mid=00;36:.midi=00;36:.mka=00;36:.mp3=00;36:.mpc=00;36:.ogg=00;36:.ra=00;36:.wav=00;36:.oga=00;36:.opus=00;36:.spx=00;36:.xspf=00;36: INFO 2022-03-29 23:39:29,074 env.py: 50: NAME: mcwindy_pc INFO 2022-03-29 23:39:29,074 env.py: 50: OLDPWD: /home/mcwindy INFO 2022-03-29 23:39:29,074 env.py: 50: PAGER: less INFO 2022-03-29 23:39:29,074 env.py: 50: PATH: /home/mcwindy/.vscode-server/bin/c722ca6c7eed3d7987c0d5c3df5c45f6b15e77d1/bin/remote-cli:/home/mcwindy/.cargo/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/lib/wsl/lib:/mnt/c/Program Files (x86)/VMware/VMware Workstation/bin/:/mnt/c/WINDOWS/system32:/mnt/c/WINDOWS:/mnt/c/WINDOWS/System32/Wbem/:/mnt/c/WINDOWS/System32/WindowsPowerShell/v1.0/:/mnt/c/WINDOWS/System32/OpenSSH/:/mnt/c/ProgramData/chocolatey/bin/:/mnt/c/tools/adb/:/mnt/c/tools/minio/:/mnt/c/Users/mcwindy/.cargo/bin/:/mnt/c/Users/mcwindy/.jdks/openjdk-17.0.1/bin/:/mnt/c/Users/mcwindy/AppData/Local/Programs/Python/Python38/:/mnt/c/Users/mcwindy/AppData/Local/Programs/Python/Python310/:/mnt/c/Users/mcwindy/AppData/Local/Programs/Microsoft VS Code/bin/:/mnt/c/Users/mcwindy/AppData/Local/Programs/Python/Python38/Lib/site-packages/torch/lib/:/mnt/c/Program Files/dotnet/:/mnt/c/Program Files/Git/cmd/:/mnt/c/Program Files/WireGuard/:/mnt/c/ProgramData/DockerDesktop/version-bin/:/mnt/c/Program Files/Docker/Docker/resources/bin/:/mnt/c/Program Files/Oculus/Support/oculus-runtime/:/mnt/c/Program Files/Common Files/Oracle/Java/javapath/:/mnt/c/Program Files/NVIDIA Corporation/NVIDIA NvDLISR/:/mnt/c/Program Files/Microsoft Visual Studio/2022/Professional/VC/Tools/MSVC/14.31.31103/bin/Hostx64/x64/:/mnt/c/Program Files (x86)/Common Files/Oracle/Java/javapath/:/mnt/c/Program Files (x86)/NVIDIA Corporation/PhysX/Common/:/mnt/c/Users/mcwindy/Desktop/videos/:/mnt/c/tools/ffmpeg 5.0/bin/:/mnt/c/tools/TDM-GCC/bin:/mnt/c/Program Files (x86)/NVIDIA Corporation/PhysX/Common:/mnt/c/Program Files/nodejs/:/mnt/c/Program 
Files/Docker/Docker/resources/bin:/mnt/c/ProgramData/DockerDesktop/version-bin:/mnt/c/Users/mcwindy/AppData/Local/Programs/Python/Python38/Scripts/:/mnt/c/Users/mcwindy/AppData/Local/Programs/Python/Python38/:/mnt/c/Users/mcwindy/AppData/Local/Programs/Python/Python310/Scripts/:/mnt/c/Users/mcwindy/AppData/Local/Programs/Python/Python310/:/mnt/c/Users/mcwindy/AppData/Local/Microsoft/WindowsApps:/mnt/c/Users/mcwindy/AppData/Local/Programs/Microsoft VS Code/bin:/mnt/c/Users/mcwindy/.dotnet/tools:/mnt/c/Program Files/JetBrains/IntelliJ IDEA/bin:/mnt/c/Program Files/JetBrains/PyCharm/bin:/mnt/c/Users/mcwindy/AppData/Local/Programs/Fiddler:/mnt/c/Users/mcwindy/.dotnet/tools:/mnt/c/Users/mcwindy/AppData/Local/Programs/oh-my-posh/bin:/mnt/c/Users/mcwindy/AppData/Roaming/npm INFO 2022-03-29 23:39:29,074 env.py: 50: PULSE_SERVER: /mnt/wslg/PulseServer INFO 2022-03-29 23:39:29,074 env.py: 50: PWD: /home/mcwindy/vissltest INFO 2022-03-29 23:39:29,074 env.py: 50: RANK: 0 INFO 2022-03-29 23:39:29,074 env.py: 50: SHELL: /bin/zsh INFO 2022-03-29 23:39:29,074 env.py: 50: SHLVL: 1 INFO 2022-03-29 23:39:29,074 env.py: 50: TERM: xterm-256color INFO 2022-03-29 23:39:29,074 env.py: 50: TERM_PROGRAM: vscode INFO 2022-03-29 23:39:29,074 env.py: 50: TERM_PROGRAM_VERSION: 1.65.2 INFO 2022-03-29 23:39:29,074 env.py: 50: USER: mcwindy INFO 2022-03-29 23:39:29,074 env.py: 50: VSCODE_GIT_ASKPASS_EXTRA_ARGS: INFO 2022-03-29 23:39:29,074 env.py: 50: VSCODE_GIT_ASKPASS_MAIN: /home/mcwindy/.vscode-server/bin/c722ca6c7eed3d7987c0d5c3df5c45f6b15e77d1/extensions/git/dist/askpass-main.js INFO 2022-03-29 23:39:29,075 env.py: 50: VSCODE_GIT_ASKPASS_NODE: /home/mcwindy/.vscode-server/bin/c722ca6c7eed3d7987c0d5c3df5c45f6b15e77d1/node INFO 2022-03-29 23:39:29,075 env.py: 50: VSCODE_GIT_IPC_HANDLE: /mnt/wslg/runtime-dir/vscode-git-853611fa42.sock INFO 2022-03-29 23:39:29,075 env.py: 50: VSCODE_IPC_HOOK_CLI: /mnt/wslg/runtime-dir/vscode-ipc-244fb306-5360-490b-b817-3e6d21c01b48.sock INFO 2022-03-29 23:39:29,075 env.py: 50: WAYLAND_DISPLAY: wayland-0 INFO 2022-03-29 23:39:29,075 env.py: 50: WORLD_SIZE: 1 INFO 2022-03-29 23:39:29,075 env.py: 50: WSLENV: VSCODE_WSL_EXT_LOCATION/up INFO 2022-03-29 23:39:29,075 env.py: 50: WSL_DISTRO_NAME: Ubuntu INFO 2022-03-29 23:39:29,075 env.py: 50: WSL_INTEROP: /run/WSL/11_interop INFO 2022-03-29 23:39:29,075 env.py: 50: XDG_RUNTIMEDIR: /mnt/wslg/runtime-dir INFO 2022-03-29 23:39:29,075 env.py: 50: ZSH: /home/mcwindy/.oh-my-zsh INFO 2022-03-29 23:39:29,075 env.py: 50: : /usr/bin/python3 INFO 2022-03-29 23:39:29,075 env.py: 50: all_proxy: INFO 2022-03-29 23:39:29,075 env.py: 50: http_proxy: http://172.28.0.1:7890 INFO 2022-03-29 23:39:29,075 env.py: 50: https_proxy: http://172.28.0.1:7890 INFO 2022-03-29 23:39:29,075 misc.py: 161: Set start method of multiprocessing to fork INFO 2022-03-29 23:39:29,075 train.py: 105: Setting seed.... 
INFO 2022-03-29 23:39:29,076 misc.py: 173: MACHINE SEED: 0 INFO 2022-03-29 23:39:29,274 hydra_config.py: 132: Training with config: INFO 2022-03-29 23:39:29,278 hydra_config.py: 141: {'CHECKPOINT': {'APPEND_DISTR_RUN_ID': False, 'AUTO_RESUME': True, 'BACKEND': 'disk', 'CHECKPOINT_FREQUENCY': 1, 'CHECKPOINT_ITER_FREQUENCY': -1, 'DIR': 'checkpoints1', 'LATEST_CHECKPOINT_RESUME_FILE_NUM': 1, 'OVERWRITE_EXISTING': False, 'USE_SYMLINK_CHECKPOINT_FOR_RESUME': False}, 'CLUSTERFIT': {'CLUSTER_BACKEND': 'faiss', 'DATA_LIMIT': -1, 'DATA_LIMIT_SAMPLING': {'SEED': 0}, 'FEATURES': {'DATASET_NAME': '', 'DATA_PARTITION': 'TRAIN', 'DIMENSIONALITY_REDUCTION': 0, 'EXTRACT': False, 'LAYER_NAME': '', 'PATH': '.', 'TEST_PARTITION': 'TEST'}, 'NUM_CLUSTERS': 16000, 'NUM_ITER': 50, 'OUTPUT_DIR': '.'}, 'DATA': {'DDP_BUCKET_CAP_MB': 25, 'ENABLE_ASYNC_GPU_COPY': True, 'NUM_DATALOADER_WORKERS': 5, 'PIN_MEMORY': True, 'TEST': {'BASE_DATASET': 'generic_ssl', 'BATCHSIZE_PER_REPLICA': 32, 'COLLATE_FUNCTION': 'default_collate', 'COLLATE_FUNCTION_PARAMS': {}, 'COPY_DESTINATION_DIR': '', 'COPY_TO_LOCAL_DISK': False, 'DATASET_NAMES': ['dummy_data_folder'], 'DATA_LIMIT': -1, 'DATA_LIMIT_SAMPLING': {'IS_BALANCED': False, 'SEED': 0, 'SKIP_NUM_SAMPLES': 0}, 'DATA_PATHS': [], 'DATA_SOURCES': ['disk_folder'], 'DEFAULT_GRAY_IMG_SIZE': 224, 'DROP_LAST': False, 'ENABLE_QUEUE_DATASET': False, 'INPUT_KEY_NAMES': ['data'], 'LABEL_PATHS': [], 'LABEL_SOURCES': ['disk_folder'], 'LABEL_TYPE': 'standard', 'MMAP_MODE': True, 'NEW_IMG_PATH_PREFIX': '', 'RANDOM_SYNTHETIC_IMAGES': False, 'REMOVE_IMG_PATH_PREFIX': '', 'TARGET_KEY_NAMES': ['label'], 'TRANSFORMS': [{'name': 'Resize', 'size': 256}, {'name': 'CenterCrop', 'size': 224}, {'name': 'ToTensor'}, {'mean': [0.485, 0.456, 0.406], 'name': 'Normalize', 'std': [0.229, 0.224, 0.225]}], 'USE_DEBUGGING_SAMPLER': False, 'USE_STATEFUL_DISTRIBUTED_SAMPLER': False}, 'TRAIN': {'BASE_DATASET': 'generic_ssl', 'BATCHSIZE_PER_REPLICA': 32, 'COLLATE_FUNCTION': 'moco_collator', 'COLLATE_FUNCTION_PARAMS': {}, 'COPY_DESTINATION_DIR': '', 'COPY_TO_LOCAL_DISK': False, 'DATASET_NAMES': ['dummy_data_folder'], 'DATA_LIMIT': -1, 'DATA_LIMIT_SAMPLING': {'IS_BALANCED': False, 'SEED': 0, 'SKIP_NUM_SAMPLES': 0}, 'DATA_PATHS': [], 'DATA_SOURCES': ['disk_folder'], 'DEFAULT_GRAY_IMG_SIZE': 224, 'DROP_LAST': False, 'ENABLE_QUEUE_DATASET': False, 'INPUT_KEY_NAMES': ['data'], 'LABEL_PATHS': [], 'LABEL_SOURCES': ['disk_folder'], 'LABEL_TYPE': 'standard', 'MMAP_MODE': True, 'NEW_IMG_PATH_PREFIX': '', 'RANDOM_SYNTHETIC_IMAGES': False, 'REMOVE_IMG_PATH_PREFIX': '', 'TARGET_KEY_NAMES': ['label'], 'TRANSFORMS': [{'name': 'RandomResizedCrop', 'size': 224}, {'name': 'RandomHorizontalFlip'}, {'brightness': 0.4, 'contrast': 0.4, 'hue': 0.4, 'name': 'ColorJitter', 'saturation': 0.4}, {'name': 'ToTensor'}, {'mean': [0.485, 0.456, 0.406], 'name': 'Normalize', 'std': [0.229, 0.224, 0.225]}], 'USE_DEBUGGING_SAMPLER': False, 'USE_STATEFUL_DISTRIBUTED_SAMPLER': False}}, 'DISTRIBUTED': {'BACKEND': 'nccl', 'BROADCAST_BUFFERS': True, 'INIT_METHOD': 'tcp', 'MANUAL_GRADIENT_REDUCTION': False, 'NCCL_DEBUG': False, 'NCCL_SOCKET_NTHREADS': '', 'NUM_NODES': 1, 'NUM_PROC_PER_NODE': 1, 'RUN_ID': 'auto'}, 'EXTRACT_FEATURES': {'CHUNK_THRESHOLD': 0, 'OUTPUT_DIR': ''}, 'HOOKS': {'CHECK_NAN': True, 'LOG_GPU_STATS': True, 'MEMORY_SUMMARY': {'DUMP_MEMORY_ON_EXCEPTION': False, 'LOG_ITERATION_NUM': 0, 'PRINT_MEMORY_SUMMARY': True}, 'MODEL_COMPLEXITY': {'COMPUTE_COMPLEXITY': False, 'INPUT_SHAPE': [3, 224, 224]}, 'PERF_STATS': {'MONITOR_PERF_STATS': False, 
'PERF_STAT_FREQUENCY': -1, 'ROLLING_BTIME_FREQ': -1}, 'TENSORBOARD_SETUP': {'EXPERIMENT_LOG_DIR': 'tensorboard', 'FLUSH_EVERY_N_MIN': 5, 'LOG_DIR': '.', 'LOG_PARAMS': True, 'LOG_PARAMS_EVERY_N_ITERS': 310, 'LOG_PARAMS_GRADIENTS': True, 'USE_TENSORBOARD': True}}, 'IMG_RETRIEVAL': {'CROP_QUERY_ROI': False, 'DATASET_PATH': '', 'DEBUG_MODE': False, 'EVAL_BINARY_PATH': '', 'EVAL_DATASET_NAME': 'Paris', 'FEATS_PROCESSING_TYPE': '', 'GEM_POOL_POWER': 4.0, 'IMG_SCALINGS': [1], 'NORMALIZE_FEATURES': True, 'NUM_DATABASE_SAMPLES': -1, 'NUM_QUERY_SAMPLES': -1, 'NUM_TRAINING_SAMPLES': -1, 'N_PCA': 512, 'RESIZE_IMG': 1024, 'SAVE_FEATURES': False, 'SAVE_RETRIEVAL_RANKINGS_SCORES': True, 'SIMILARITY_MEASURE': 'cosine_similarity', 'SPATIAL_LEVELS': 3, 'TRAIN_DATASET_NAME': 'Oxford', 'TRAIN_PCA_WHITENING': True, 'USE_DISTRACTORS': False, 'WHITEN_IMG_LIST': ''}, 'LOG_FREQUENCY': 100, 'LOSS': {'CrossEntropyLoss': {'ignore_index': -1}, 'barlow_twins_loss': {'embeddingdim': 8192, 'lambda': 0.0051, 'scale_loss': 0.024}, 'bce_logits_multiple_output_single_target': {'normalize_output': False, 'reduction': 'none', 'world_size': 1}, 'cross_entropy_multiple_output_single_target': {'ignore_index': -1, 'normalize_output': False, 'reduction': 'mean', 'temperature': 1.0, 'weight': None}, 'deepclusterv2_loss': {'BATCHSIZE_PER_REPLICA': 256, 'DROP_LAST': True, 'kmeans_iters': 10, 'memory_params': {'crops_for_mb': [0], 'embedding_dim': 128}, 'num_clusters': [3000, 3000, 3000], 'num_crops': 2, 'num_train_samples': -1, 'temperature': 0.1}, 'dino_loss': {'crops_for_teacher': [0, 1], 'ema_center': 0.9, 'momentum': 0.996, 'normalize_last_layer': True, 'output_dim': 65536, 'student_temp': 0.1, 'teacher_temp_max': 0.07, 'teacher_temp_min': 0.04, 'teacher_temp_warmup_iters': 37500}, 'moco_loss': {'embedding_dim': 128, 'momentum': 0.999, 'queue_size': 65536, 'temperature': 0.2}, 'multicrop_simclr_info_nce_loss': {'buffer_params': {'effective_batch_size': 4096, 'embedding_dim': 128, 'world_size': 64}, 'num_crops': 2, 'temperature': 0.1}, 'name': 'moco_loss', 'nce_loss_with_memory': {'loss_type': 'nce', 'loss_weights': [1.0], 'memory_params': {'embedding_dim': 128, 'memory_size': -1, 'momentum': 0.5, 'norm_init': True, 'update_mem_on_forward': True}, 'negative_sampling_params': {'num_negatives': 16000, 'type': 'random'}, 'norm_constant': -1, 'norm_embedding': True, 'num_train_samples': -1, 'temperature': 0.07, 'update_mem_with_emb_index': -100}, 'simclr_info_nce_loss': {'buffer_params': {'effective_batch_size': 4096, 'embedding_dim': 128, 'world_size': 64}, 'temperature': 0.1}, 'swav_loss': {'crops_for_assign': [0, 1], 'embedding_dim': 128, 'epsilon': 0.05, 'normalize_last_layer': True, 'num_crops': 2, 'num_iters': 3, 'num_prototypes': [3000], 'output_dir': '.', 'queue': {'local_queue_length': 0, 'queue_length': 0, 'start_iter': 0}, 'temp_hard_assignment_iters': 0, 'temperature': 0.1, 'use_double_precision': False}, 'swav_momentum_loss': {'crops_for_assign': [0, 1], 'embedding_dim': 128, 'epsilon': 0.05, 'momentum': 0.99, 'momentum_eval_mode_iter_start': 0, 'normalize_last_layer': True, 'num_crops': 2, 'num_iters': 3, 'num_prototypes': [3000], 'queue': {'local_queue_length': 0, 'queue_length': 0, 'start_iter': 0}, 'temperature': 0.1, 'use_double_precision': False}}, 'MACHINE': {'DEVICE': 'gpu'}, 'METERS': {'accuracy_list_meter': {'meter_names': [], 'num_meters': 1, 'topk_values': [1, 5]}, 'enable_training_meter': True, 'mean_ap_list_meter': {'max_cpu_capacity': -1, 'meter_names': [], 'num_classes': 9605, 'num_meters': 1}, 
'model_output_mask': False, 'name': 'accuracy_list_meter', 'names': ['accuracy_list_meter'], 'precision_at_k_list_meter': {'meter_names': [], 'num_meters': 1, 'topk_values': [1]}, 'recall_at_k_list_meter': {'meter_names': [], 'num_meters': 1, 'topk_values': [1]}}, 'MODEL': {'ACTIVATION_CHECKPOINTING': {'NUM_ACTIVATION_CHECKPOINTING_SPLITS': 2, 'USE_ACTIVATION_CHECKPOINTING': False}, 'AMP_PARAMS': {'AMP_ARGS': {'opt_level': 'O1'}, 'AMP_TYPE': 'apex', 'USE_AMP': False}, 'BASE_MODEL_NAME': 'multi_input_output_model', 'CUDA_CACHE': {'CLEAR_CUDA_CACHE': False, 'CLEAR_FREQ': 100}, 'FEATURE_EVAL_SETTINGS': {'EVAL_MODE_ON': False, 'EVAL_TRUNK_AND_HEAD': False, 'EXTRACT_TRUNK_FEATURES_ONLY': False, 'FREEZE_TRUNK_AND_HEAD': False, 'FREEZE_TRUNK_ONLY': False, 'LINEAR_EVAL_FEAT_POOL_OPS_MAP': [], 'SHOULD_FLATTEN_FEATS': True}, 'FSDP_CONFIG': {'AUTO_WRAP_THRESHOLD': 0, 'bucket_cap_mb': 0, 'clear_autocast_cache': True, 'compute_dtype': torch.float32, 'flatten_parameters': True, 'fp32_reduce_scatter': False, 'mixed_precision': True, 'verbose': True}, 'GRAD_CLIP': {'MAX_NORM': 1, 'NORM_TYPE': 2, 'USE_GRAD_CLIP': False}, 'HEAD': {'BATCHNORM_EPS': 1e-05, 'BATCHNORM_MOMENTUM': 0.1, 'PARAMS': [['mlp', {'dims': [2048, 1000]}]], 'PARAMS_MULTIPLIER': 1.0}, 'INPUT_TYPE': 'rgb', 'MULTI_INPUT_HEAD_MAPPING': [], 'NON_TRAINABLE_PARAMS': [], 'SHARDED_DDP_SETUP': {'USE_SDP': False, 'reduce_buffer_size': -1}, 'SINGLE_PASS_EVERY_CROP': False, 'SYNC_BN_CONFIG': {'CONVERT_BN_TO_SYNC_BN': False, 'GROUP_SIZE': -1, 'SYNC_BN_TYPE': 'pytorch'}, 'TEMP_FROZEN_PARAMS_ITER_MAP': [], 'TRUNK': {'CONVIT': {'CLASS_TOKEN_IN_LOCAL_LAYERS': False, 'LOCALITY_DIM': 10, 'LOCALITY_STRENGTH': 1.0, 'N_GPSA_LAYERS': 10, 'USE_LOCAL_INIT': True}, 'EFFICIENT_NETS': {}, 'NAME': 'resnet', 'REGNET': {}, 'RESNETS': {'DEPTH': 50, 'GROUPNORM_GROUPS': 32, 'GROUPS': 1, 'LAYER4_STRIDE': 2, 'NORM': 'BatchNorm', 'STANDARDIZE_CONVOLUTIONS': False, 'WIDTH_MULTIPLIER': 1, 'WIDTH_PER_GROUP': 64, 'ZERO_INIT_RESIDUAL': False}, 'TRUNK_PARAMS': {'RESNETS': {'DEPTH': 50}}, 'VISION_TRANSFORMERS': {'ATTENTION_DROPOUT_RATE': 0, 'CLASSIFIER': 'token', 'DROPOUT_RATE': 0, 'DROP_PATH_RATE': 0, 'HIDDEN_DIM': 768, 'IMAGE_SIZE': 224, 'MLP_DIM': 3072, 'NUM_HEADS': 12, 'NUM_LAYERS': 12, 'PATCH_SIZE': 16, 'QKV_BIAS': False, 'QK_SCALE': False, 'name': None}, 'XCIT': {'ATTENTION_DROPOUT_RATE': 0, 'DROPOUT_RATE': 0, 'DROP_PATH_RATE': 0.05, 'ETA': 1, 'HIDDEN_DIM': 384, 'IMAGE_SIZE': 224, 'NUM_HEADS': 8, 'NUM_LAYERS': 12, 'PATCH_SIZE': 16, 'QKV_BIAS': True, 'QK_SCALE': False, 'TOKENS_NORM': True, 'name': None}}, 'WEIGHTS_INIT': {'APPEND_PREFIX': '', 'PARAMS_FILE': '', 'REMOVE_PREFIX': '', 'SKIP_LAYERS': ['num_batches_tracked'], 'STATE_DICT_KEY_NAME': 'classy_state_dict'}, '_MODEL_INIT_SEED': 0}, 'MONITORING': {'MONITOR_ACTIVATION_STATISTICS': 0}, 'MULTI_PROCESSING_METHOD': 'fork', 'NEAREST_NEIGHBOR': {'L2_NORM_FEATS': False, 'SIGMA': 0.1, 'TOPK': 200}, 'OPTIMIZER': {'betas': [0.9, 0.999], 'construct_single_param_group_only': False, 'head_optimizer_params': {'use_different_lr': False, 'use_different_wd': False, 'weight_decay': 0.0001}, 'larc_config': {'clip': False, 'eps': 1e-08, 'trust_coefficient': 0.001}, 'momentum': 0.9, 'name': 'sgd', 'nesterov': True, 'non_regularized_parameters': [], 'num_epochs': 105, 'param_schedulers': {'lr': {'auto_lr_scaling': {'auto_scale': True, 'base_lr_batch_size': 256, 'base_value': 0.1, 'scaling_type': 'linear'}, 'end_value': 0.0, 'interval_scaling': [], 'lengths': [], 'milestones': [30, 60, 90, 100], 'name': 'multistep', 'schedulers': [], 
'start_value': 0.1, 'update_interval': 'epoch', 'value': 0.1, 'values': [0.0125, 0.00125, 0.000125, 1.25e-05, 1.25e-06]}, 'lr_head': {'auto_lr_scaling': {'auto_scale': True, 'base_lr_batch_size': 256, 'base_value': 0.1, 'scaling_type': 'linear'}, 'end_value': 0.0, 'interval_scaling': [], 'lengths': [], 'milestones': [30, 60, 90, 100], 'name': 'multistep', 'schedulers': [], 'start_value': 0.1, 'update_interval': 'epoch', 'value': 0.1, 'values': [0.0125, 0.00125, 0.000125, 1.25e-05, 1.25e-06]}}, 'regularize_bias': True, 'regularize_bn': False, 'use_larc': False, 'use_zero': False, 'weight_decay': 0.0001}, 'PROFILING': {'MEMORY_PROFILING': {'TRACK_BY_LAYER_MEMORY': False}, 'NUM_ITERATIONS': 10, 'OUTPUT_FOLDER': '.', 'PROFILED_RANKS': [0, 1], 'RUNTIME_PROFILING': {'LEGACY_PROFILER': False, 'PROFILE_CPU': True, 'PROFILE_GPU': True, 'USE_PROFILER': False}, 'START_ITERATION': 0, 'STOP_TRAINING_AFTER_PROFILING': False, 'WARMUP_ITERATIONS': 0}, 'REPRODUCIBILITY': {'CUDDN_DETERMINISTIC': False}, 'SEED_VALUE': 0, 'SLURM': {'ADDITIONAL_PARAMETERS': {}, 'COMMENT': 'vissl job', 'CONSTRAINT': '', 'LOG_FOLDER': '.', 'MEM_GB': 250, 'NAME': 'vissl', 'NUM_CPU_PER_PROC': 8, 'PARTITION': '', 'PORT_ID': 40050, 'TIME_HOURS': 72, 'TIME_MINUTES': 0, 'USE_SLURM': False}, 'SVM': {'cls_list': [], 'costs': {'base': -1.0, 'costs_list': [0.1, 0.01], 'power_range': [4, 20]}, 'cross_val_folds': 3, 'dual': True, 'force_retrain': False, 'loss': 'squared_hinge', 'low_shot': {'dataset_name': 'voc', 'k_values': [1, 2, 4, 8, 16, 32, 64, 96], 'sample_inds': [1, 2, 3, 4, 5]}, 'max_iter': 2000, 'normalize': True, 'penalty': 'l2'}, 'TEST_EVERY_NUM_EPOCH': 1, 'TEST_MODEL': True, 'TEST_ONLY': False, 'TRAINER': {'TASK_NAME': 'self_supervision_task', 'TRAIN_STEP_NAME': 'standard_train_step'}, 'VERBOSE': True} INFO 2022-03-29 23:39:29,647 train.py: 117: System config:


sys.platform linux Python 3.9.7 (default, Sep 10 2021, 14:59:43) [GCC 11.2.0] numpy 1.19.5 Pillow 9.0.1 vissl 0.1.6 @/home/mcwindy/.local/lib/python3.9/site-packages/vissl GPU available True GPU 0 NVIDIA GeForce RTX 2080 CUDA_HOME /usr/local/cuda-11.5/targets/x86_64-linux/include/ torchvision 0.12.0+cu102 @/home/mcwindy/.local/lib/python3.9/site-packages/torchvision hydra 1.0.7 @/home/mcwindy/.local/lib/python3.9/site-packages/hydra classy_vision 0.7.0.dev @/home/mcwindy/.local/lib/python3.9/site-packages/classy_vision tensorboard 2.8.0 apex 0.1 @/home/mcwindy/.local/lib/python3.9/site-packages/apex cv2 4.5.5 PyTorch 1.11.0+cu102 @/home/mcwindy/.local/lib/python3.9/site-packages/torch PyTorch debug build False


PyTorch built with:

CPU info:


Architecture x86_64 CPU op-mode(s) 32-bit, 64-bit Byte Order Little Endian Address sizes 48 bits physical, 48 bits virtual CPU(s) 24 On-line CPU(s) list 0-23 Thread(s) per core 2 Core(s) per socket 12 Socket(s) 1 Vendor ID AuthenticAMD CPU family 25 Model 33 Model name AMD Ryzen 9 5900X 12-Core Processor Stepping 0 CPU MHz 3900.006 BogoMIPS 7800.01 Virtualization AMD-V Hypervisor vendor Microsoft Virtualization type full L1d cache 384 KiB L1i cache 384 KiB L2 cache 6 MiB L3 cache 32 MiB Vulnerability Itlb multihit Not affected Vulnerability L1tf Not affected Vulnerability Mds Not affected Vulnerability Meltdown Not affected Vulnerability Spec store bypass Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1 Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2 Mitigation; Full AMD retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling Vulnerability Srbds Not affected Vulnerability Tsx async abort Not affected


WARNING 2022-03-29 23:39:29,647 moco_hooks.py: 45: Batch shuffling: True INFO 2022-03-29 23:39:29,647 tensorboard.py: 49: Tensorboard dir: checkpoints1/tb_logs INFO 2022-03-29 23:39:29,648 tensorboard_hook.py: 90: Setting up SSL Tensorboard Hook... INFO 2022-03-29 23:39:29,649 tensorboard_hook.py: 102: Tensorboard config: log_params: True, log_params_freq: 310, log_params_gradients: True, log_activation_statistics: 0 INFO 2022-03-29 23:39:29,649 trainer_main.py: 112: Using Distributed init method: tcp://localhost:50653, world_size: 1, rank: 0 INFO 2022-03-29 23:39:29,650 trainer_main.py: 130: | initialized host mcwindy_pc as rank 0 (0) INFO 2022-03-29 23:39:31,911 train_task.py: 181: Not using Automatic Mixed Precision INFO 2022-03-29 23:39:31,912 train_task.py: 455: Building model.... INFO 2022-03-29 23:39:31,912 resnext.py: 64: ResNeXT trunk, supports activation checkpointing. Deactivated INFO 2022-03-29 23:39:31,912 resnext.py: 87: Building model: ResNeXt50-1x64d-w1-BatchNorm2d INFO 2022-03-29 23:39:32,265 train_task.py: 656: Broadcast model BN buffers from primary on every forward pass INFO 2022-03-29 23:39:32,265 classification_task.py: 387: Synchronized Batch Normalization is disabled INFO 2022-03-29 23:39:32,305 optimizer_helper.py: 293: Trainable params: 161, Non-Trainable params: 0, Trunk Regularized Parameters: 53, Trunk Unregularized Parameters 106, Head Regularized Parameters: 2, Head Unregularized Parameters: 0 Remaining Regularized Parameters: 0 Remaining Unregularized Parameters: 0 INFO 2022-03-29 23:39:32,306 ssl_dataset.py: 156: Rank: 0 split: TEST Data files: ['/home/mcwindy/vissltest/data1/tiny-imagenet-200/val'] INFO 2022-03-29 23:39:32,306 ssl_dataset.py: 159: Rank: 0 split: TEST Label files: ['/home/mcwindy/vissltest/data1/tiny-imagenet-200/val'] INFO 2022-03-29 23:39:32,323 disk_dataset.py: 86: Loaded 10000 samples from folder /home/mcwindy/vissltest/data1/tiny-imagenet-200/val INFO 2022-03-29 23:39:32,323 ssl_dataset.py: 156: Rank: 0 split: TRAIN Data files: ['/home/mcwindy/vissltest/data1/tiny-imagenet-200/train'] INFO 2022-03-29 23:39:32,324 ssl_dataset.py: 159: Rank: 0 split: TRAIN Label files: ['/home/mcwindy/vissltest/data1/tiny-imagenet-200/train'] INFO 2022-03-29 23:39:32,543 disk_dataset.py: 86: Loaded 100000 samples from folder /home/mcwindy/vissltest/data1/tiny-imagenet-200/train INFO 2022-03-29 23:39:32,543 misc.py: 161: Set start method of multiprocessing to fork INFO 2022-03-29 23:39:32,543 init.py: 126: Created the Distributed Sampler.... INFO 2022-03-29 23:39:32,543 init.py: 101: Distributed Sampler config: {'num_replicas': 1, 'rank': 0, 'epoch': 0, 'num_samples': 10000, 'total_size': 10000, 'shuffle': True, 'seed': 0} INFO 2022-03-29 23:39:32,544 init.py: 215: Wrapping the dataloader to async device copies INFO 2022-03-29 23:39:32,544 misc.py: 161: Set start method of multiprocessing to fork INFO 2022-03-29 23:39:32,544 init.py: 126: Created the Distributed Sampler.... INFO 2022-03-29 23:39:32,544 init.py: 101: Distributed Sampler config: {'num_replicas': 1, 'rank': 0, 'epoch': 0, 'num_samples': 100000, 'total_size': 100000, 'shuffle': True, 'seed': 0} INFO 2022-03-29 23:39:32,544 init.py: 215: Wrapping the dataloader to async device copies INFO 2022-03-29 23:39:32,544 train_task.py: 384: Building loss... INFO 2022-03-29 23:39:32,607 trainer_main.py: 268: Training 105 epochs INFO 2022-03-29 23:39:32,607 trainer_main.py: 269: One epoch = 3125 iterations. 
INFO 2022-03-29 23:39:32,607 trainer_main.py: 270: Total 100000 samples in one epoch INFO 2022-03-29 23:39:32,607 trainer_main.py: 276: Total 328125 iterations for training INFO 2022-03-29 23:39:32,674 logger.py: 84: Tue Mar 29 23:39:32 2022
+-----------------------------------------------------------------------------+ | NVIDIA-SMI 510.60.02 Driver Version: 512.15 CUDA Version: 11.6 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |===============================+======================+======================| | 0 NVIDIA GeForce ... On | 00000000:0A:00.0 On | N/A | | 26% 33C P2 45W / 245W | 2460MiB / 8192MiB | 11% Default | | | | N/A | +-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=============================================================================| | 0 N/A N/A 3244 C /python3.9 N/A | +-----------------------------------------------------------------------------+

INFO 2022-03-29 23:39:32,675 trainer_main.py: 173: Model is: Classy <class 'vissl.models.base_ssl_model.BaseSSLMultiInputOutputModel'>: BaseSSLMultiInputOutputModel( (_heads): ModuleDict() (trunk): ResNeXt( (_feature_blocks): ModuleDict( (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False) (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv1_relu): ReLU(inplace=True) (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False) (layer1): Sequential( (0): Bottleneck( (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (downsample): Sequential( (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): Bottleneck( (conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (2): Bottleneck( (conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) ) (layer2): Sequential( (0): Bottleneck( (conv1): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (downsample): Sequential( (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False) (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): Bottleneck( (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(512, 
eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (2): Bottleneck( (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (3): Bottleneck( (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) ) (layer3): Sequential( (0): Bottleneck( (conv1): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (downsample): Sequential( (0): Conv2d(512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False) (1): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (2): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (3): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) 
(relu): ReLU(inplace=True) ) (4): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (5): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) ) (layer4): Sequential( (0): Bottleneck( (conv1): Conv2d(1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(<SUPPORTED_L4_STRIDE.two: 2>, <SUPPORTED_L4_STRIDE.two: 2>), padding=(1, 1), bias=False) (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (downsample): Sequential( (0): Conv2d(1024, 2048, kernel_size=(1, 1), stride=(<SUPPORTED_L4_STRIDE.two: 2>, <SUPPORTED_L4_STRIDE.two: 2>), bias=False) (1): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): Bottleneck( (conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (2): Bottleneck( (conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) ) (avgpool): AdaptiveAvgPool2d(output_size=(1, 1)) (flatten): Flatten() ) ) (heads): ModuleList( (0): MLP( (clf): Sequential( (0): Linear(in_features=2048, out_features=1000, bias=True) ) ) ) ) INFO 2022-03-29 23:39:32,675 trainer_main.py: 174: Loss is: {'name': 'MoCoLoss'} INFO 2022-03-29 23:39:32,675 trainer_main.py: 175: Starting training.... 
INFO 2022-03-29 23:39:32,676 __init__.py: 101: Distributed Sampler config: {'num_replicas': 1, 'rank': 0, 'epoch': 0, 'num_samples': 100000, 'total_size': 100000, 'shuffle': True, 'seed': 0}
INFO 2022-03-29 23:39:32,837 ssl_dataset.py: 238: Using disk_folder labels from /home/mcwindy/vissltest/data1/tiny-imagenet-200/train
INFO 2022-03-29 23:39:32,838 ssl_dataset.py: 238: Using disk_folder labels from /home/mcwindy/vissltest/data1/tiny-imagenet-200/train
INFO 2022-03-29 23:39:32,838 ssl_dataset.py: 238: Using disk_folder labels from /home/mcwindy/vissltest/data1/tiny-imagenet-200/train
INFO 2022-03-29 23:39:32,838 ssl_dataset.py: 238: Using disk_folder labels from /home/mcwindy/vissltest/data1/tiny-imagenet-200/train
INFO 2022-03-29 23:39:32,839 ssl_dataset.py: 238: Using disk_folder labels from /home/mcwindy/vissltest/data1/tiny-imagenet-200/train
Traceback (most recent call last):
  File "/home/mcwindy/vissltest/./tools/run_distributed_engines.py", line 200, in <module>
    hydra_main(overrides=overrides)
  File "/home/mcwindy/vissltest/./tools/run_distributed_engines.py", line 175, in hydra_main
    launch_distributed(
  File "/home/mcwindy/vissltest/./tools/run_distributed_engines.py", line 115, in launch_distributed
    _distributed_worker(
  File "/home/mcwindy/vissltest/./tools/run_distributed_engines.py", line 166, in _distributed_worker
    process_main(cfg, dist_run_id, local_rank=local_rank, node_id=node_id)
  File "/home/mcwindy/vissltest/./tools/run_distributed_engines.py", line 152, in process_main
    train_main(
  File "/home/mcwindy/.local/lib/python3.9/site-packages/vissl/engines/train.py", line 130, in train_main
    trainer.train()
  File "/home/mcwindy/.local/lib/python3.9/site-packages/vissl/trainer/trainer_main.py", line 178, in train
    self._advance_phase(task)  # advances task.phase_idx
  File "/home/mcwindy/.local/lib/python3.9/site-packages/vissl/trainer/trainer_main.py", line 319, in _advance_phase
    task.recreate_data_iterator(
  File "/home/mcwindy/.local/lib/python3.9/site-packages/vissl/trainer/train_task.py", line 564, in recreate_data_iterator
    self.data_iterator = iter(self.dataloaders[phase_type])
  File "/home/mcwindy/.local/lib/python3.9/site-packages/classy_vision/dataset/dataloader_async_gpu_wrapper.py", line 40, in __iter__
    self.preload()
  File "/home/mcwindy/.local/lib/python3.9/site-packages/classy_vision/dataset/dataloader_async_gpu_wrapper.py", line 46, in preload
    self.cache_next = next(self._iter)
  File "/home/mcwindy/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/home/mcwindy/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1224, in _next_data
    return self._process_data(data)
  File "/home/mcwindy/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
    data.reraise()
  File "/home/mcwindy/.local/lib/python3.9/site-packages/torch/_utils.py", line 457, in reraise
    raise exception
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/mcwindy/.local/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/mcwindy/.local/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    return self.collate_fn(data)
  File "/home/mcwindy/.local/lib/python3.9/site-packages/vissl/data/collators/moco_collator.py", line 45, in moco_collator
    "data": [torch.stack(data).squeeze()[:, 0, :, :, :].squeeze()],  # encoder
IndexError: too many indices for tensor of dimension 4
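
The IndexError above is consistent with a crop-count mismatch rather than a broken dataset: moco_collator stacks each sample's "data" field and then indexes the result as a 5-D tensor (batch, crop, channel, height, width), but the TRAIN transforms in this config produce a single crop per sample, so the stacked tensor is only 4-D. A minimal sketch in plain PyTorch (not VISSL code) that reproduces the same shape behaviour:

import torch

batch_size, c, h, w = 32, 3, 224, 224

# One crop per sample, as produced by the single-crop transform pipeline above.
single_crop = [torch.randn(1, c, h, w) for _ in range(batch_size)]
stacked = torch.stack(single_crop).squeeze()   # shape (32, 3, 224, 224): 4-D
try:
    stacked[:, 0, :, :, :]                     # same indexing pattern as moco_collator
except IndexError as err:
    print(err)                                 # too many indices for tensor of dimension 4

# Two replicated crops per sample, as a MoCo-style pipeline is expected to produce.
two_crops = [torch.randn(2, c, h, w) for _ in range(batch_size)]
stacked = torch.stack(two_crops)               # shape (32, 2, 3, 224, 224): 5-D
query = stacked[:, 0, :, :, :]                 # first crop per sample
key = stacked[:, 1, :, :, :]                   # second crop per sample
print(query.shape, key.shape)                  # torch.Size([32, 3, 224, 224]) each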



  4. please simplify the steps as much as possible so they do not require additional resources to run, such as a private dataset.

> tree
<
.
β”œβ”€β”€ checkpoints1
β”‚   β”œβ”€β”€ log.txt
β”‚   β”œβ”€β”€ stdout.json
β”‚   └── train_config.yaml
β”œβ”€β”€ configs
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ __pycache__
β”‚   β”‚   └── __init__.cpython-39.pyc
β”‚   └── config
β”‚       β”œβ”€β”€ dataset_catalog.json
β”‚       └── supervised_1gpu_resnet_example.yaml
β”œβ”€β”€ dummy_data
β”‚   β”œβ”€β”€ train
β”‚   β”‚   β”œβ”€β”€ class1
β”‚   β”‚   β”‚   β”œβ”€β”€ img1.jpg
β”‚   β”‚   β”‚   β”œβ”€β”€ img2.jpg
β”‚   β”‚   β”‚   β”œβ”€β”€ img3.jpg
β”‚   β”‚   β”‚   β”œβ”€β”€ img4.jpg
β”‚   β”‚   β”‚   └── img5.jpg
β”‚   β”‚   └── class2
β”‚   β”‚       β”œβ”€β”€ img1.jpg
β”‚   β”‚       β”œβ”€β”€ img2.jpg
β”‚   β”‚       β”œβ”€β”€ img3.jpg
β”‚   β”‚       β”œβ”€β”€ img4.jpg
β”‚   β”‚       └── img5.jpg
β”‚   └── val
β”‚       β”œβ”€β”€ class1
β”‚       β”‚   β”œβ”€β”€ img1.jpg
β”‚       β”‚   β”œβ”€β”€ img2.jpg
β”‚       β”‚   β”œβ”€β”€ img3.jpg
β”‚       β”‚   β”œβ”€β”€ img4.jpg
β”‚       β”‚   └── img5.jpg
β”‚       └── class2
β”‚           β”œβ”€β”€ img1.jpg
β”‚           β”œβ”€β”€ img2.jpg
β”‚           β”œβ”€β”€ img3.jpg
β”‚           β”œβ”€β”€ img4.jpg
β”‚           └── img5.jpg
β”œβ”€β”€ log.txt
β”œβ”€β”€ stdout.json
β”œβ”€β”€ tmp.ipynb
β”œβ”€β”€ tools
β”‚   └── run_distributed_engines.py
└── train_config.yaml

> cat vissltest/configs/config/dataset_catalog.json
< {"dummy_data_folder": {"test": ["/home/mcwindy/vissltest/data1/tiny-imagenet-200/test", "/home/mcwindy/vissltest/data1/tiny-imagenet-200/test"], "train": ["/home/mcwindy/vissltest/data1/tiny-imagenet-200/train", "/home/mcwindy/vissltest/data1/tiny-imagenet-200/train"], "val": ["/home/mcwindy/vissltest/data1/tiny-imagenet-200/val", "/home/mcwindy/vissltest/data1/tiny-imagenet-200/val"]}}

**If I comment out "COLLATE_FUNCTION" and "LOSS" in the YAML, the program runs normally.**

## When to expect Triage

Within one week.
QuentinDuval commented 2 years ago

Hi @mcwindy,

First of all, thanks a lot for considering VISSL :)

I had a quick look at the configuration you posted (your replacement for supervised_1gpu_resnet_example.yaml) and found a few things that might explain the issue. For instance:

But before you proceed with those changes, please consider using the following configuration as a better starting point for MoCo experiments: configs/config/pretrain/moco/moco_1node_resnet.yaml

This configuration has everything set up for MoCo (augmentations, projection, loss, etc.), and you could start from it instead to avoid having to deal with configuration issues.
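
For example, adapting the command pattern you already used, something along these lines should work (illustrative only; adjust the dataset name override and the number of GPUs to your setup):

    python3 ./tools/run_distributed_engines.py config=pretrain/moco/moco_1node_resnet config.DATA.TRAIN.DATASET_NAMES="[dummy_data_folder]" config.DISTRIBUTED.NUM_PROC_PER_NODE=1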

Please tell me if that works for you,
Quentin