facebookresearch / vissl

VISSL is FAIR's library of extensible, modular and scalable components for SOTA Self-Supervised Learning with images.
https://vissl.ai
MIT License

Training in multi-node GPU setup doesn't work #480

Closed. blazejdolicki closed this issue 2 years ago

blazejdolicki commented 2 years ago

Instructions To Reproduce the 🐛 Bug:

  1. what changes you made (git diff) or what code you wrote:
     I'm trying to run an example pretraining script (SimCLR, synthetic dataset) on SLURM in a multi-node setup. It works on a single node, but not on multiple nodes. I installed vissl with conda following the instructions in "Get started". Here's the job script "train_nct_conda_vissl.job" that I run:
    
    #!/bin/bash
    #SBATCH -N 2 #number of nodes
    #SBATCH -p gpu_titanrtx_short
    #SBATCH --gpus-per-node=titanrtx:4 # use all 4 GPUs in the node
    #SBATCH --job-name=train_nct_dino
    #SBATCH -t 1:00:0
    #SBATCH --output=ssl-histo/job_logs/slurm_output_%x_%j.out

    NUM_WORKERS=2
    NUM_GPUS=1
    NUM_TASKS=1
    NUM_MACHINES=1
    TRAIN=train/
    SOURCE=$HOME/thesis/hissl
    SINGULARITYIMAGE=$HOME/thesis/hissl_20210922_np121_h5py.sif
    CONFIG_PATH=dummy/quick_gpu_resnet50_simclr
    LOGS_DIR=hissl-logs
    EXPERIMENT_DIR=$HOME/thesis/$LOGS_DIR
    EXPERIMENT_DIR_CONTAINER=/$LOGS_DIR
    DATA_ROOT=$HOME"/thesis/ssl-histo/data/NCT-CRC-HE-100K"

    module load 2021
    module load Anaconda3/2021.05
    source activate thesis
    source activate vissl

    cd $SOURCE

    # for multi-machine GPUs: stops the job in case of NCCL ASYNC errors
    export NCCL_ASYNC_ERROR_HANDLING=1
    export NCCL_DEBUG=INFO

    # to silence this error:
    # "ERROR: ld.so: object '/sara/tools/xalt/xalt/lib64/libxalt_init.so'
    # from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored."
    unset LD_PRELOAD

    python3 tools/run_distributed_engines.py \
        hydra.verbose=true \
        config=$CONFIG_PATH\
        config.DATA.TRAIN.DATA_SOURCES=[synthetic] \
        config.DATA.TRAIN.DATA_LIMIT=1000 \
        config.DATA.TRAIN.BATCHSIZE_PER_REPLICA=10 \
        config.CHECKPOINT.DIR=$HOME/thesis/$EXPERIMENT_DIR_CONTAINER/$SLURM_JOB_NAME/checkpoints/$SLURM_JOB_ID \
        config.DISTRIBUTED.NUM_NODES=2 \
        config.DISTRIBUTED.NUM_PROC_PER_NODE=4 \
        config.DISTRIBUTED.RUN_ID=localhost:46357
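The overrides in the command above imply a world size of NUM_NODES × NUM_PROC_PER_NODE = 2 × 4 = 8 and a rendezvous address of `localhost:46357`. The sketch below is not VISSL code and not a confirmed fix for this issue. It only illustrates how one might derive a rendezvous address that names the first allocated node instead of `localhost`, assuming the address has to be reachable from every node. The helpers `expand_nodelist` and `rendezvous` are hypothetical names, and the parser handles only simple node lists such as `r29n[2,5]` (the `SLURM_JOB_NODELIST` value that appears in the logs); on a real cluster `scontrol show hostnames` performs this expansion.

```python
import re
from typing import List, Tuple


def expand_nodelist(nodelist: str) -> List[str]:
    """Expand a simple SLURM node list like 'r29n[2,5]' or 'node[01-03]'.

    Simplified on purpose: one prefix and at most one bracket group.
    """
    match = re.fullmatch(r"([^\[\],]+)(?:\[([^\]]+)\])?", nodelist)
    if match is None:
        raise ValueError(f"unsupported nodelist format: {nodelist!r}")
    prefix, group = match.groups()
    if group is None:
        # A single host name without brackets, e.g. 'r29n2'.
        return [prefix]
    hosts = []
    for item in group.split(","):
        if "-" in item:
            # A range such as '01-03'; keep the zero padding of the start value.
            start, end = item.split("-")
            for index in range(int(start), int(end) + 1):
                hosts.append(prefix + str(index).zfill(len(start)))
        else:
            hosts.append(prefix + item)
    return hosts


def rendezvous(nodelist: str, port: int, num_nodes: int,
               procs_per_node: int) -> Tuple[str, int]:
    """Return ('<first host>:<port>', world_size) for the allocation."""
    first_host = expand_nodelist(nodelist)[0]
    return f"{first_host}:{port}", num_nodes * procs_per_node


if __name__ == "__main__":
    # Values taken from the job script and the logged SLURM_JOB_NODELIST.
    run_id, world_size = rendezvous("r29n[2,5]", 46357, 2, 4)
    print(run_id, world_size)
```

Whether the launcher must additionally be started once per node (for example through `srun`) is not something the issue text settles, so the sketch leaves that out.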

Here's the config file "quick_gpu_resnet50_simclr.yaml" I'm using:

    # @package _global_
    config:
      VERBOSE: False
      LOG_FREQUENCY: 1
      TEST_ONLY: False
      TEST_MODEL: False
      SEED_VALUE: 0
      MULTI_PROCESSING_METHOD: forkserver
      MONITOR_PERF_STATS: True
      PERF_STAT_FREQUENCY: 10
      ROLLING_BTIME_FREQ: 5
      DATA:
        NUM_DATALOADER_WORKERS: 5
        TRAIN:
          DATA_SOURCES: [disk_folder]
          DATASET_NAMES: [dummy_data_folder]
          BATCHSIZE_PER_REPLICA: 2
          LABEL_TYPE: sample_index # just an implementation detail. Label isn't used
          TRANSFORMS:

  2. what exact command you run:
     In my terminal: "sbatch train_nct_dino_conda_vissl.job"
  3. what you observed (including __full logs__):
     With a single node, training finishes successfully in under a minute; with two nodes it seems to hang and is terminated by SLURM once the 1-hour time limit is reached.
     Below I show the part where things seem to go wrong, followed by the full logs.
Single node (snippet):

NUMA node3 CPU(s) 3,7,11,15,19,23


INFO 2021-12-02 18:52:05,682 trainer_main.py: 112: Using Distributed init method: tcp://localhost:33241, world_size: 1, rank: 0
r29n2:27102:27102 [0] NCCL INFO Bootstrap : Using [0]admin0:145.101.32.23<0>
r29n2:27102:27102 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
r29n2:27102:27102 [0] NCCL INFO NET/IB : No device found.
r29n2:27102:27102 [0] NCCL INFO NET/Socket : Using [0]admin0:145.101.32.23<0>
r29n2:27102:27102 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.2
r29n2:27102:27224 [0] NCCL INFO Channel 00/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 01/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 02/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 03/32 : 0

Two nodes (snippet):

NUMA node3 CPU(s) 3,7,11,15,19,23


INFO 2021-12-02 18:53:19,001 trainer_main.py: 112: Using Distributed init method: tcp://localhost:46357, world_size: 8, rank: 0
r29n2:27632:27632 [0] NCCL INFO Bootstrap : Using [0]admin0:145.101.32.23<0>
r29n2:27632:27632 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
r29n2:27632:27632 [0] NCCL INFO NET/IB : No device found.
r29n2:27632:27632 [0] NCCL INFO NET/Socket : Using [0]admin0:145.101.32.23<0>
r29n2:27632:27632 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.2
r29n2:27634:27634 [2] NCCL INFO Bootstrap : Using [0]admin0:145.101.32.23<0>
r29n2:27634:27634 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
r29n2:27633:27633 [1] NCCL INFO Bootstrap : Using [0]admin0:145.101.32.23<0>
r29n2:27633:27633 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
r29n2:27635:27635 [3] NCCL INFO Bootstrap : Using [0]admin0:145.101.32.23<0>
r29n2:27635:27635 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
r29n2:27634:27634 [2] NCCL INFO NET/IB : No device found.
r29n2:27634:27634 [2] NCCL INFO NET/Socket : Using [0]admin0:145.101.32.23<0>
r29n2:27634:27634 [2] NCCL INFO Using network Socket
r29n2:27633:27633 [1] NCCL INFO NET/IB : No device found.
r29n2:27635:27635 [3] NCCL INFO NET/IB : No device found.
r29n2:27635:27635 [3] NCCL INFO NET/Socket : Using [0]admin0:145.101.32.23<0>
r29n2:27635:27635 [3] NCCL INFO Using network Socket
r29n2:27633:27633 [1] NCCL INFO NET/Socket : Using [0]admin0:145.101.32.23<0>
r29n2:27633:27633 [1] NCCL INFO Using network Socket
slurmstepd: error: JOB 8462605 ON r29n2 CANCELLED AT 2021-12-02T19:53:21 DUE TO TIME LIMIT
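One way to read the two-node snippet is to compare the announced `world_size` with the processes that actually show up in the NCCL output. The sketch below is a small, hypothetical log helper (`summarize` is not part of VISSL or NCCL) that assumes NCCL lines start with `host:pid:tid [device]`, as they do in this excerpt. It is an aid for reading the log, not a confirmed diagnosis.

```python
import re
from typing import Set, Tuple

# Assumed line prefix, as seen in the excerpt: "r29n2:27632:27632 [0] NCCL INFO ..."
NCCL_LINE = re.compile(
    r"(?P<host>[\w.-]+):(?P<pid>\d+):(?P<tid>\d+) \[(?P<dev>\d+)\] NCCL INFO"
)
WORLD_SIZE = re.compile(r"world_size: (\d+)")

# Lines copied from the two-node snippet, one per process.
SAMPLE = """\
INFO 2021-12-02 18:53:19,001 trainer_main.py: 112: Using Distributed init method: tcp://localhost:46357, world_size: 8, rank: 0
r29n2:27632:27632 [0] NCCL INFO Using network Socket
r29n2:27634:27634 [2] NCCL INFO Using network Socket
r29n2:27635:27635 [3] NCCL INFO Using network Socket
r29n2:27633:27633 [1] NCCL INFO Using network Socket
"""


def summarize(log: str) -> Tuple[int, Set[Tuple[str, int]]]:
    """Return the announced world size and the (host, pid) pairs seen in NCCL lines."""
    announced = WORLD_SIZE.search(log)
    if announced is None:
        raise ValueError("no world_size found in log")
    processes = {
        (m.group("host"), int(m.group("pid"))) for m in NCCL_LINE.finditer(log)
    }
    return int(announced.group(1)), processes


if __name__ == "__main__":
    world_size, processes = summarize(SAMPLE)
    hosts = sorted({host for host, _ in processes})
    print(f"world_size={world_size}, processes seen={len(processes)}, hosts={hosts}")
```

On this excerpt the helper reports a world size of 8 but only four processes, all on `r29n2`. That matches the snippet, in which no line from the second node (`r29n5` in the logged `SLURM_JOB_NODELIST`) appears before the job is cancelled.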


Single node (full logs):

####### overrides: ['hydra.verbose=true', 'config=dummy/quick_gpu_resnet50_simclr', 'config.DATA.TRAIN.DATA_SOURCES=[synthetic]', 'config.DATA.TRAIN.DATA_LIMIT=1000', 'config.DATA.TRAIN.BATCHSIZE_PER_REPLICA=10', 'config.CHECKPOINT.DIR=/home/bdolicki/thesis//hissl-logs/train_nct_dino/checkpoints/8462603', 'hydra.verbose=true']
INFO 2021-12-02 18:52:05,335 distributed_launcher.py: 183: Spawning process for node_id: 0, local_rank: 0, dist_rank: 0, dist_run_id: localhost:33241
INFO 2021-12-02 18:52:05,336 train.py: 94: Env set for rank: 0, dist_rank: 0
INFO 2021-12-02 18:52:05,336 env.py: 50: BASH_ENV: /opt/lmod/lmod/init/bash
INFO 2021-12-02 18:52:05,336 env.py: 50: BASH_FUNC_ml%%: () { eval $($LMOD_DIR/ml_cmd "$@") }
INFO 2021-12-02 18:52:05,336 env.py: 50: BASH_FUNC_module%%: () { eval $($LMOD_CMD bash "$@") && eval $(${LMOD_SETTARG_CMD:-:} -s sh) }
INFO 2021-12-02 18:52:05,336 env.py: 50: CONDA_DEFAULT_ENV: vissl
INFO 2021-12-02 18:52:05,336 env.py: 50: CONDA_EXE: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/bin/conda
INFO 2021-12-02 18:52:05,336 env.py: 50: CONDA_PREFIX: /home/bdolicki/.conda/envs/vissl
INFO 2021-12-02 18:52:05,336 env.py: 50: CONDA_PREFIX_1: /home/bdolicki/.conda/envs/thesis
INFO 2021-12-02 18:52:05,337 env.py: 50: CONDA_PROMPT_MODIFIER: (vissl)
INFO 2021-12-02 18:52:05,337 env.py: 50: CONDA_PYTHON_EXE: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/bin/python
INFO 2021-12-02 18:52:05,337 env.py: 50: CONDA_SHLVL: 2
INFO 2021-12-02 18:52:05,337 env.py: 50: CUDA_VISIBLE_DEVICES: 0,1,2,3
INFO 2021-12-02 18:52:05,337 env.py: 50: DBUS_SESSION_BUS_ADDRESS: unix:path=/run/user/55916/bus
INFO 2021-12-02 18:52:05,337 env.py: 50: EBDEVELANACONDA3: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/easybuild/Anaconda3-2021.05-easybuild-devel
INFO 2021-12-02 18:52:05,337 env.py: 50: EBROOTANACONDA3: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05
INFO 2021-12-02 18:52:05,337 env.py: 50: EBVERSIONANACONDA3: 2021.05
INFO 2021-12-02 18:52:05,337 env.py: 50: ENVIRONMENT: BATCH
INFO 2021-12-02 18:52:05,337 env.py: 50: FPATH: /opt/lmod/lmod/init/ksh_funcs
INFO 2021-12-02 18:52:05,337 env.py: 50: GPU_DEVICE_ORDINAL: 0,1,2,3
INFO 2021-12-02 18:52:05,337 env.py: 50: HOME: /home/bdolicki
INFO 2021-12-02 18:52:05,337 env.py: 50: HOSTNAME: r29n2
INFO 2021-12-02 18:52:05,337 env.py: 50: LANG: en_US
INFO 2021-12-02 18:52:05,337 env.py: 50: LMOD_CASE_INDEPENDENT_SORTING: yes
INFO 2021-12-02 18:52:05,337 env.py: 50: LMOD_CMD: /opt/lmod/lmod/libexec/lmod
INFO 2021-12-02 18:52:05,337 env.py: 50: LMOD_DIR: /opt/lmod/lmod/libexec
INFO 2021-12-02 18:52:05,337 env.py: 50: LMOD_EXACT_MATCH: yes
INFO 2021-12-02 18:52:05,337 env.py: 50: LMOD_PKG: /opt/lmod/lmod
INFO 2021-12-02 18:52:05,337 env.py: 50: LMOD_ROOT: /opt/lmod
INFO 2021-12-02 18:52:05,337 env.py: 50: LMOD_SETTARG_FULL_SUPPORT: no
INFO 2021-12-02 18:52:05,337 env.py: 50: LMOD_SHORT_TIME: 10000
INFO 2021-12-02 18:52:05,337 env.py: 50: LMOD_VERSION: 8.5.22
INFO 2021-12-02 18:52:05,337 env.py: 50: LMOD_sys: Linux
INFO 2021-12-02 18:52:05,337 env.py: 50: LOADEDMODULES: 2021:Anaconda3/2021.05
INFO 2021-12-02 18:52:05,337 env.py: 50: LOCAL_RANK: 0
INFO 2021-12-02 18:52:05,337 env.py: 50: LOGNAME: bdolicki
INFO 2021-12-02 18:52:05,337 env.py: 50: MAIL: /var/mail/bdolicki
INFO 2021-12-02 18:52:05,337 env.py: 50: MANPATH: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/share/man:/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/man:/opt/lmod/lmod/share/man::/opt/slurm/sw/current/share/man
INFO 2021-12-02 18:52:05,337 env.py: 50: MODULEPATH: /sw/noarch/modulefiles/environment:/sw/arch/Debian10/EB_production/2021/modulefiles/phys:/sw/arch/Debian10/EB_production/2021/modulefiles/perf:/sw/arch/Debian10/EB_production/2021/modulefiles/geo:/sw/arch/Debian10/EB_production/2021/modulefiles/debugger:/sw/arch/Debian10/EB_production/2021/modulefiles/chem:/sw/arch/Debian10/EB_production/2021/modulefiles/data:/sw/arch/Debian10/EB_production/2021/modulefiles/vis:/sw/arch/Debian10/EB_production/2021/modulefiles/bio:/sw/arch/Debian10/EB_production/2021/modulefiles/math:/sw/arch/Debian10/EB_production/2021/modulefiles/cae:/sw/arch/Debian10/EB_production/2021/modulefiles/system:/sw/arch/Debian10/EB_production/2021/modulefiles/toolchain:/sw/arch/Debian10/EB_production/2021/modulefiles/numlib:/sw/arch/Debian10/EB_production/2021/modulefiles/mpi:/sw/arch/Debian10/EB_production/2021/modulefiles/tools:/sw/arch/Debian10/EB_production/2021/modulefiles/lib:/sw/arch/Debian10/EB_production/2021/modulefiles/compiler:/sw/arch/Debian10/EB_production/2021/modulefiles/lang:/sw/arch/Debian10/EB_production/2021/modulefiles/devel:/sw/noarch/Debian10/2021/modulefiles/all
INFO 2021-12-02 18:52:05,337 env.py: 50: MODULEPATH_ROOT: /opt/modulefiles
INFO 2021-12-02 18:52:05,337 env.py: 50: MODULESHOME: /opt/lmod/lmod
INFO 2021-12-02 18:52:05,337 env.py: 50: NCCL_ASYNC_ERROR_HANDLING: 1
INFO 2021-12-02 18:52:05,337 env.py: 50: NCCL_DEBUG: INFO
INFO 2021-12-02 18:52:05,337 env.py: 50: OLDPWD: /home/bdolicki/thesis
INFO 2021-12-02 18:52:05,338 env.py: 50: PATH: /home/bdolicki/.conda/envs/vissl/bin:/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05:/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/sbin:/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/bin:/sw/noarch/Debian10/2021/software/os_binary_wrappers:/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/condabin:/usr/bin:/bin:/usr/bin/X11:/usr/games:/usr/sara/bin:/opt/slurm/bin:/opt/slurm/sbin:/opt/slurm/sw/current/bin
INFO 2021-12-02 18:52:05,338 env.py: 50: PKG_CONFIG_PATH: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/lib/pkgconfig
INFO 2021-12-02 18:52:05,338 env.py: 50: PWD: /home/bdolicki/thesis/hissl
INFO 2021-12-02 18:52:05,338 env.py: 50: RANK: 0
INFO 2021-12-02 18:52:05,338 env.py: 50: ROCR_VISIBLE_DEVICES: 0,1,2,3
INFO 2021-12-02 18:52:05,338 env.py: 50: SHELL: /bin/bash
INFO 2021-12-02 18:52:05,338 env.py: 50: SHLVL: 2
INFO 2021-12-02 18:52:05,338 env.py: 50: SLURMD_NODENAME: r29n2
INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_CLUSTER_NAME: lisa
INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_CONF: /opt/slurm/etc/slurm.conf
INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_CPUS_ON_NODE: 24
INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_GPUS_PER_NODE: titanrtx:4
INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_GTIDS: 0
INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_JOBID: 8462603
INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_JOB_ACCOUNT: bdolicki
INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_JOB_CPUS_PER_NODE: 24(x2)
INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_JOB_GID: 55479
INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_JOB_GPUS: 0,1,2,3
INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_JOB_ID: 8462603
INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_JOB_NAME: train_nct_dino
INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_JOB_NODELIST: r29n[2,5]
INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_JOB_NUM_NODES: 2
INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_JOB_PARTITION: gpu_titanrtx_short
INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_JOB_QOS: default
INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_JOB_UID: 55916
INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_JOB_USER: bdolicki
INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_LOCALID: 0
INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_NNODES: 2
INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_NODEID: 0
INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_NODELIST: r29n[2,5]
INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_NODE_ALIASES: (null)
INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_PRIO_PROCESS: 0
INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_PROCID: 0
INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_SPANK_SURF_EXCLUSIVE: 0
INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_SUBMIT_DIR: /home/bdolicki/thesis
INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_SUBMIT_HOST: login3.lisa.surfsara.nl
INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_TASKS_PER_NODE: 24(x2)
INFO 2021-12-02 18:52:05,339 env.py: 50: SLURM_TASK_PID: 27062
INFO 2021-12-02 18:52:05,339 env.py: 50: SLURM_TOPOLOGY_ADDR: gigabit..gpu.I09_I10_I15_I16_I17_I19.r29n2
INFO 2021-12-02 18:52:05,339 env.py: 50: SLURM_TOPOLOGY_ADDR_PATTERN: switch.switch.switch.switch.node
INFO 2021-12-02 18:52:05,339 env.py: 50: SLURM_WORKING_CLUSTER: lisa:batch4.lisa.surfsara.nl:6817:9216:109
INFO 2021-12-02 18:52:05,339 env.py: 50: SSH_CLIENT: 86.83.160.29 51594 22
INFO 2021-12-02 18:52:05,339 env.py: 50: SSH_CONNECTION: 86.83.160.29 51594 145.101.32.96 22
INFO 2021-12-02 18:52:05,339 env.py: 50: SSH_TTY: /dev/pts/13
INFO 2021-12-02 18:52:05,339 env.py: 50: SURF_EXCLUSIVE: 0
INFO 2021-12-02 18:52:05,339 env.py: 50: TAR: /bin/tar
INFO 2021-12-02 18:52:05,339 env.py: 50: TERM: xterm-256color
INFO 2021-12-02 18:52:05,339 env.py: 50: TMPDIR: /scratch
INFO 2021-12-02 18:52:05,339 env.py: 50: USER: bdolicki
INFO 2021-12-02 18:52:05,339 env.py: 50: WORLD_SIZE: 1
INFO 2021-12-02 18:52:05,339 env.py: 50: XALT_EXECUTABLE_TRACKING: yes
INFO 2021-12-02 18:52:05,339 env.py: 50: XALT_GPU_TRACKING: yes
INFO 2021-12-02 18:52:05,339 env.py: 50: XALT_SAMPLING: yes
INFO 2021-12-02 18:52:05,339 env.py: 50: XDG_RUNTIME_DIR: /run/user/55916
INFO 2021-12-02 18:52:05,339 env.py: 50: XDG_SESSION_CLASS: user
INFO 2021-12-02 18:52:05,339 env.py: 50: XDG_SESSION_ID: c1889
INFO 2021-12-02 18:52:05,339 env.py: 50: XDG_SESSION_TYPE: tty
INFO 2021-12-02 18:52:05,339 env.py: 50: _: /home/bdolicki/.conda/envs/vissl/bin/python3
INFO 2021-12-02 18:52:05,339 env.py: 50: _CE_CONDA:
INFO 2021-12-02 18:52:05,339 env.py: 50: _CE_M:
INFO 2021-12-02 18:52:05,339 env.py: 50: _LMFILES_: /sw/noarch/modulefiles/environment/2021.lua:/sw/arch/Debian10/EB_production/2021/modulefiles/lang/Anaconda3/2021.05.lua
INFO 2021-12-02 18:52:05,339 env.py: 50: _ModuleTable001_: X01vZHVsZVRhYmxlXyA9IHsKTVR2ZXJzaW9uID0gMywKY19yZWJ1aWxkVGltZSA9IGZhbHNlLApjX3Nob3J0VGltZSA9IGZhbHNlLApkZXB0aFQgPSB7fSwKZmFtaWx5ID0ge30sCm1UID0gewpbIjIwMjEiXSA9IHsKZm4gPSAiL3N3L25vYXJjaC9tb2R1bGVmaWxlcy9lbnZpcm9ubWVudC8yMDIxLmx1YSIsCmZ1bGxOYW1lID0gIjIwMjEiLApsb2FkT3JkZXIgPSAxLApwcm9wVCA9IHt9LApzdGFja0RlcHRoID0gMCwKc3RhdHVzID0gImFjdGl2ZSIsCnVzZXJOYW1lID0gIjIwMjEiLAp3ViA9ICJNLip6ZmluYWwiLAp9LApBbmFjb25kYTMgPSB7CmZuID0gIi9zdy9hcmNoL0RlYmlhbjEwL0VCX3Byb2R1Y3Rpb24vMjAyMS9tb2R1bGVmaWxlcy9sYW5nL0FuYWNvbmRhMy8yMDIx
INFO 2021-12-02 18:52:05,339 env.py: 50: _ModuleTable002_: LjA1Lmx1YSIsCmZ1bGxOYW1lID0gIkFuYWNvbmRhMy8yMDIxLjA1IiwKbG9hZE9yZGVyID0gMiwKcHJvcFQgPSB7fSwKc3RhY2tEZXB0aCA9IDAsCnN0YXR1cyA9ICJhY3RpdmUiLAp1c2VyTmFtZSA9ICJBbmFjb25kYTMvMjAyMS4wNSIsCndWID0gIjAwMDAwMjAyMS4wMDAwMDAwMDUuKnpmaW5hbCIsCn0sCn0sCm1wYXRoQSA9IHsKIi9zdy9ub2FyY2gvbW9kdWxlZmlsZXMvZW52aXJvbm1lbnQiCiwgIi9zdy9hcmNoL0RlYmlhbjEwL0VCX3Byb2R1Y3Rpb24vMjAyMS9tb2R1bGVmaWxlcy9waHlzIgosICIvc3cvYXJjaC9EZWJpYW4xMC9FQl9wcm9kdWN0aW9uLzIwMjEvbW9kdWxlZmlsZXMvcGVyZiIKLCAiL3N3L2FyY2gvRGViaWFuMTAvRUJfcHJvZHVjdGlvbi8yMDIxL21v
INFO 2021-12-02 18:52:05,339 env.py: 50: _ModuleTable003_: ZHVsZWZpbGVzL2dlbyIKLCAiL3N3L2FyY2gvRGViaWFuMTAvRUJfcHJvZHVjdGlvbi8yMDIxL21vZHVsZWZpbGVzL2RlYnVnZ2VyIgosICIvc3cvYXJjaC9EZWJpYW4xMC9FQl9wcm9kdWN0aW9uLzIwMjEvbW9kdWxlZmlsZXMvY2hlbSIKLCAiL3N3L2FyY2gvRGViaWFuMTAvRUJfcHJvZHVjdGlvbi8yMDIxL21vZHVsZWZpbGVzL2RhdGEiCiwgIi9zdy9hcmNoL0RlYmlhbjEwL0VCX3Byb2R1Y3Rpb24vMjAyMS9tb2R1bGVmaWxlcy92aXMiCiwgIi9zdy9hcmNoL0RlYmlhbjEwL0VCX3Byb2R1Y3Rpb24vMjAyMS9tb2R1bGVmaWxlcy9iaW8iCiwgIi9zdy9hcmNoL0RlYmlhbjEwL0VCX3Byb2R1Y3Rpb24vMjAyMS9tb2R1bGVmaWxlcy9tYXRoIgosICIvc3cvYXJjaC9EZWJpYW4x
INFO 2021-12-02 18:52:05,339 env.py: 50: _ModuleTable004_: MC9FQl9wcm9kdWN0aW9uLzIwMjEvbW9kdWxlZmlsZXMvY2FlIgosICIvc3cvYXJjaC9EZWJpYW4xMC9FQl9wcm9kdWN0aW9uLzIwMjEvbW9kdWxlZmlsZXMvc3lzdGVtIgosICIvc3cvYXJjaC9EZWJpYW4xMC9FQl9wcm9kdWN0aW9uLzIwMjEvbW9kdWxlZmlsZXMvdG9vbGNoYWluIgosICIvc3cvYXJjaC9EZWJpYW4xMC9FQl9wcm9kdWN0aW9uLzIwMjEvbW9kdWxlZmlsZXMvbnVtbGliIgosICIvc3cvYXJjaC9EZWJpYW4xMC9FQl9wcm9kdWN0aW9uLzIwMjEvbW9kdWxlZmlsZXMvbXBpIgosICIvc3cvYXJjaC9EZWJpYW4xMC9FQl9wcm9kdWN0aW9uLzIwMjEvbW9kdWxlZmlsZXMvdG9vbHMiCiwgIi9zdy9hcmNoL0RlYmlhbjEwL0VCX3Byb2R1Y3Rpb24vMjAyMS9tb2R1bGVm
INFO 2021-12-02 18:52:05,339 env.py: 50: _ModuleTable005_: aWxlcy9saWIiCiwgIi9zdy9hcmNoL0RlYmlhbjEwL0VCX3Byb2R1Y3Rpb24vMjAyMS9tb2R1bGVmaWxlcy9jb21waWxlciIKLCAiL3N3L2FyY2gvRGViaWFuMTAvRUJfcHJvZHVjdGlvbi8yMDIxL21vZHVsZWZpbGVzL2xhbmciCiwgIi9zdy9hcmNoL0RlYmlhbjEwL0VCX3Byb2R1Y3Rpb24vMjAyMS9tb2R1bGVmaWxlcy9kZXZlbCIsICIvc3cvbm9hcmNoL0RlYmlhbjEwLzIwMjEvbW9kdWxlZmlsZXMvYWxsIiwKfSwKc3lzdGVtQmFzZU1QQVRIID0gIi9zdy9ub2FyY2gvbW9kdWxlZmlsZXMvZW52aXJvbm1lbnQiLAp9Cg==
INFO 2021-12-02 18:52:05,339 env.py: 50: _ModuleTable_Sz_: 5
INFO 2021-12-02 18:52:05,339 env.py: 50: __LMOD_REF_COUNT_LOADEDMODULES: 2021:1;Anaconda3/2021.05:1
INFO 2021-12-02 18:52:05,339 env.py: 50: __LMOD_REF_COUNT_MANPATH: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/share/man:1;/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/man:1;/opt/lmod/lmod/share/man:1;/opt/slurm/sw/current/share/man:1
INFO 2021-12-02 18:52:05,339 env.py: 50: __LMOD_REF_COUNT_MODULEPATH: /sw/noarch/modulefiles/environment:1;/sw/arch/Debian10/EB_production/2021/modulefiles/phys:1;/sw/arch/Debian10/EB_production/2021/modulefiles/perf:1;/sw/arch/Debian10/EB_production/2021/modulefiles/geo:1;/sw/arch/Debian10/EB_production/2021/modulefiles/debugger:1;/sw/arch/Debian10/EB_production/2021/modulefiles/chem:1;/sw/arch/Debian10/EB_production/2021/modulefiles/data:1;/sw/arch/Debian10/EB_production/2021/modulefiles/vis:1;/sw/arch/Debian10/EB_production/2021/modulefiles/bio:1;/sw/arch/Debian10/EB_production/2021/modulefiles/math:1;/sw/arch/Debian10/EB_production/2021/modulefiles/cae:1;/sw/arch/Debian10/EB_production/2021/modulefiles/system:1;/sw/arch/Debian10/EB_production/2021/modulefiles/toolchain:1;/sw/arch/Debian10/EB_production/2021/modulefiles/numlib:1;/sw/arch/Debian10/EB_production/2021/modulefiles/mpi:1;/sw/arch/Debian10/EB_production/2021/modulefiles/tools:1;/sw/arch/Debian10/EB_production/2021/modulefiles/lib:1;/sw/arch/Debian10/EB_production/2021/modulefiles/compiler:1;/sw/arch/Debian10/EB_production/2021/modulefiles/lang:1;/sw/arch/Debian10/EB_production/2021/modulefiles/devel:1;/sw/noarch/Debian10/2021/modulefiles/all:1
INFO 2021-12-02 18:52:05,339 env.py: 50: __LMOD_REF_COUNT_PATH: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05:1;/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/sbin:1;/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/bin:1;/sw/noarch/Debian10/2021/software/os_binary_wrappers:1;/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/condabin:1;/usr/bin:1;/bin:1;/usr/bin/X11:1;/usr/games:1;/usr/sara/bin:1;/opt/slurm/bin:1;/opt/slurm/sbin:1;/opt/slurm/sw/current/bin:1
INFO 2021-12-02 18:52:05,339 env.py: 50: __LMOD_REF_COUNT_PKG_CONFIG_PATH: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/lib/pkgconfig:1
INFO 2021-12-02 18:52:05,339 env.py: 50: __LMOD_REF_COUNT__LMFILES_: /sw/noarch/modulefiles/environment/2021.lua:1;/sw/arch/Debian10/EB_production/2021/modulefiles/lang/Anaconda3/2021.05.lua:1
INFO 2021-12-02 18:52:05,339 env.py: 50: __LMOD_SET_FPATH: 1
INFO 2021-12-02 18:52:05,340 misc.py: 161: Set start method of multiprocessing to forkserver
INFO 2021-12-02 18:52:05,340 train.py: 105: Setting seed....
INFO 2021-12-02 18:52:05,340 misc.py: 173: MACHINE SEED: 0
INFO 2021-12-02 18:52:05,346 hydra_config.py: 132: Training with config:
INFO 2021-12-02 18:52:05,352 hydra_config.py: 141: {'CHECKPOINT': {'APPEND_DISTR_RUN_ID': False, 'AUTO_RESUME': True, 'BACKEND': 'disk', 'CHECKPOINT_FREQUENCY': 1, 'CHECKPOINT_ITER_FREQUENCY': -1, 'DIR': '/home/bdolicki/thesis//hissl-logs/train_nct_dino/checkpoints/8462603', 'LATEST_CHECKPOINT_RESUME_FILE_NUM': 1, 'OVERWRITE_EXISTING': True, 'USE_SYMLINK_CHECKPOINT_FOR_RESUME': False}, 'CLUSTERFIT': {'CLUSTER_BACKEND': 'faiss', 'DATA_LIMIT': -1, 'DATA_LIMIT_SAMPLING': {'SEED': 0}, 'FEATURES': {'DATASET_NAME': '', 'DATA_PARTITION': 'TRAIN', 'DIMENSIONALITY_REDUCTION': 0, 'EXTRACT': False, 'LAYER_NAME': '', 'PATH': '.', 'TEST_PARTITION': 'TEST'}, 'NUM_CLUSTERS': 16000, 'NUM_ITER': 50, 'OUTPUT_DIR': '.'}, 'DATA': {'DDP_BUCKET_CAP_MB': 25, 'ENABLE_ASYNC_GPU_COPY': True, 'NUM_DATALOADER_WORKERS': 5, 'PIN_MEMORY': True, 'TEST': {'BASE_DATASET': 'generic_ssl', 'BATCHSIZE_PER_REPLICA': 256, 'COLLATE_FUNCTION': 'default_collate', 'COLLATE_FUNCTION_PARAMS': {}, 'COPY_DESTINATION_DIR': '', 'COPY_TO_LOCAL_DISK': False, 'DATASET_NAMES': ['imagenet1k_folder'], 'DATA_LIMIT': -1, 'DATA_LIMIT_SAMPLING': {'IS_BALANCED': False, 'SEED': 0, 'SKIP_NUM_SAMPLES': 0}, 'DATA_PATHS': [], 'DATA_SOURCES': [], 'DEFAULT_GRAY_IMG_SIZE': 224, 'DROP_LAST': False, 'ENABLE_QUEUE_DATASET': False, 'INPUT_KEY_NAMES': ['data'], 'LABEL_PATHS': [], 'LABEL_SOURCES': [], 'LABEL_TYPE': 'sample_index', 'MMAP_MODE': True, 'NEW_IMG_PATH_PREFIX': '', 'RANDOM_SYNTHETIC_IMAGES': False, 'REMOVE_IMG_PATH_PREFIX': '', 'TARGET_KEY_NAMES': ['label'], 
'TRANSFORMS': [], 'USE_DEBUGGING_SAMPLER': False, 'USE_STATEFUL_DISTRIBUTED_SAMPLER': False}, 'TRAIN': {'BASE_DATASET': 'generic_ssl', 'BATCHSIZE_PER_REPLICA': 10, 'COLLATE_FUNCTION': 'simclr_collator', 'COLLATE_FUNCTION_PARAMS': {}, 'COPY_DESTINATION_DIR': '/tmp/imagenet1k', 'COPY_TO_LOCAL_DISK': False, 'DATASET_NAMES': ['dummy_data_folder'], 'DATA_LIMIT': 1000, 'DATA_LIMIT_SAMPLING': {'IS_BALANCED': False, 'SEED': 0, 'SKIP_NUM_SAMPLES': 0}, 'DATA_PATHS': [], 'DATA_SOURCES': ['synthetic'], 'DEFAULT_GRAY_IMG_SIZE': 224, 'DROP_LAST': True, 'ENABLE_QUEUE_DATASET': False, 'INPUT_KEY_NAMES': ['data'], 'LABEL_PATHS': [], 'LABEL_SOURCES': [], 'LABEL_TYPE': 'sample_index', 'MMAP_MODE': True, 'NEW_IMG_PATH_PREFIX': '', 'RANDOM_SYNTHETIC_IMAGES': False, 'REMOVE_IMG_PATH_PREFIX': '', 'TARGET_KEY_NAMES': ['label'], 'TRANSFORMS': [{'name': 'ImgReplicatePil', 'num_times': 2}, {'name': 'RandomResizedCrop', 'size': 224}, {'name': 'RandomHorizontalFlip', 'p': 0.5}, {'name': 'ImgPilColorDistortion', 'strength': 1.0}, {'name': 'ImgPilGaussianBlur', 'p': 0.5, 'radius_max': 2.0, 'radius_min': 0.1}, {'name': 'ToTensor'}, {'mean': [0.485, 0.456, 0.406], 'name': 'Normalize', 'std': [0.229, 0.224, 0.225]}], 'USE_DEBUGGING_SAMPLER': False, 'USE_STATEFUL_DISTRIBUTED_SAMPLER': False}}, 'DISTRIBUTED': {'BACKEND': 'nccl', 'BROADCAST_BUFFERS': True, 'INIT_METHOD': 'tcp', 'MANUAL_GRADIENT_REDUCTION': False, 'NCCL_DEBUG': False, 'NCCL_SOCKET_NTHREADS': '', 'NUM_NODES': 1, 'NUM_PROC_PER_NODE': 1, 'RUN_ID': 'auto'}, 'EXTRACT_FEATURES': {'CHUNK_THRESHOLD': 0, 'OUTPUT_DIR': ''}, 'HOOKS': {'CHECK_NAN': True, 'LOG_GPU_STATS': True, 'MEMORY_SUMMARY': {'DUMP_MEMORY_ON_EXCEPTION': False, 'LOG_ITERATION_NUM': 0, 'PRINT_MEMORY_SUMMARY': True}, 'MODEL_COMPLEXITY': {'COMPUTE_COMPLEXITY': False, 'INPUT_SHAPE': [3, 224, 224]}, 'PERF_STATS': {'MONITOR_PERF_STATS': False, 'PERF_STAT_FREQUENCY': -1, 'ROLLING_BTIME_FREQ': -1}, 'TENSORBOARD_SETUP': {'EXPERIMENT_LOG_DIR': 'tensorboard', 'FLUSH_EVERY_N_MIN': 5, 
'LOG_DIR': '.', 'LOG_PARAMS': True, 'LOG_PARAMS_EVERY_N_ITERS': 310, 'LOG_PARAMS_GRADIENTS': True, 'USE_TENSORBOARD': False}}, 'IMG_RETRIEVAL': {'CROP_QUERY_ROI': False, 'DATASET_PATH': '', 'DEBUG_MODE': False, 'EVAL_BINARY_PATH': '', 'EVAL_DATASET_NAME': 'Paris', 'FEATS_PROCESSING_TYPE': '', 'GEM_POOL_POWER': 4.0, 'IMG_SCALINGS': [1], 'NORMALIZE_FEATURES': True, 'NUM_DATABASE_SAMPLES': -1, 'NUM_QUERY_SAMPLES': -1, 'NUM_TRAINING_SAMPLES': -1, 'N_PCA': 512, 'RESIZE_IMG': 1024, 'SAVE_FEATURES': False, 'SAVE_RETRIEVAL_RANKINGS_SCORES': True, 'SIMILARITY_MEASURE': 'cosine_similarity', 'SPATIAL_LEVELS': 3, 'TRAIN_DATASET_NAME': 'Oxford', 'TRAIN_PCA_WHITENING': True, 'USE_DISTRACTORS': False, 'WHITEN_IMG_LIST': ''}, 'LOG_FREQUENCY': 1, 'LOSS': {'CrossEntropyLoss': {'ignore_index': -1}, 'barlow_twins_loss': {'embeddingdim': 8192, 'lambda': 0.0051, 'scale_loss': 0.024}, 'bce_logits_multiple_output_single_target': {'normalize_output': False, 'reduction': 'none', 'world_size': 1}, 'cross_entropy_multiple_output_single_target': {'ignore_index': -1, 'normalize_output': False, 'reduction': 'mean', 'temperature': 1.0, 'weight': None}, 'deepclusterv2_loss': {'BATCHSIZE_PER_REPLICA': 256, 'DROP_LAST': True, 'kmeans_iters': 10, 'memory_params': {'crops_for_mb': [0], 'embedding_dim': 128}, 'num_clusters': [3000, 3000, 3000], 'num_crops': 2, 'num_train_samples': -1, 'temperature': 0.1}, 'dino_loss': {'crops_for_teacher': [0, 1], 'ema_center': 0.9, 'momentum': 0.996, 'normalize_last_layer': True, 'output_dim': 65536, 'student_temp': 0.1, 'teacher_temp_max': 0.07, 'teacher_temp_min': 0.04, 'teacher_temp_warmup_iters': 37500}, 'moco_loss': {'embedding_dim': 128, 'momentum': 0.999, 'queue_size': 65536, 'temperature': 0.2}, 'multicrop_simclr_info_nce_loss': {'buffer_params': {'effective_batch_size': 4096, 'embedding_dim': 128, 'world_size': 64}, 'num_crops': 2, 'temperature': 0.1}, 'name': 'simclr_info_nce_loss', 'nce_loss_with_memory': {'loss_type': 'nce', 'loss_weights': [1.0], 
'memory_params': {'embedding_dim': 128, 'memory_size': -1, 'momentum': 0.5, 'norm_init': True, 'update_mem_on_forward': True}, 'negative_sampling_params': {'num_negatives': 16000, 'type': 'random'}, 'norm_constant': -1, 'norm_embedding': True, 'num_train_samples': -1, 'temperature': 0.07, 'update_mem_with_emb_index': -100}, 'simclr_info_nce_loss': {'buffer_params': {'effective_batch_size': 20, 'embedding_dim': 128, 'world_size': 1}, 'temperature': 0.1}, 'swav_loss': {'crops_for_assign': [0, 1], 'embedding_dim': 128, 'epsilon': 0.05, 'normalize_last_layer': True, 'num_crops': 2, 'num_iters': 3, 'num_prototypes': [3000], 'output_dir': '.', 'queue': {'local_queue_length': 0, 'queue_length': 0, 'start_iter': 0}, 'temp_hard_assignment_iters': 0, 'temperature': 0.1, 'use_double_precision': False}, 'swav_momentum_loss': {'crops_for_assign': [0, 1], 'embedding_dim': 128, 'epsilon': 0.05, 'momentum': 0.99, 'momentum_eval_mode_iter_start': 0, 'normalize_last_layer': True, 'num_crops': 2, 'num_iters': 3, 'num_prototypes': [3000], 'queue': {'local_queue_length': 0, 'queue_length': 0, 'start_iter': 0}, 'temperature': 0.1, 'use_double_precision': False}}, 'MACHINE': {'DEVICE': 'gpu'}, 'METERS': {'accuracy_list_meter': {'meter_names': [], 'num_meters': 1, 'topk_values': [1]}, 'enable_training_meter': True, 'mean_ap_list_meter': {'max_cpu_capacity': -1, 'meter_names': [], 'num_classes': 9605, 'num_meters': 1}, 'model_output_mask': False, 'name': '', 'names': [], 'precision_at_k_list_meter': {'meter_names': [], 'num_meters': 1, 'topk_values': [1]}, 'recall_at_k_list_meter': {'meter_names': [], 'num_meters': 1, 'topk_values': [1]}}, 'MODEL': {'ACTIVATION_CHECKPOINTING': {'NUM_ACTIVATION_CHECKPOINTING_SPLITS': 2, 'USE_ACTIVATION_CHECKPOINTING': False}, 'AMP_PARAMS': {'AMP_ARGS': {'keep_batchnorm_fp32': True, 'loss_scale': 'dynamic', 'master_weights': True, 'opt_level': 'O3'}, 'AMP_TYPE': 'apex', 'USE_AMP': False}, 'BASE_MODEL_NAME': 'multi_input_output_model', 'CUDA_CACHE': 
{'CLEAR_CUDA_CACHE': False, 'CLEAR_FREQ': 100}, 'FEATURE_EVAL_SETTINGS': {'EVAL_MODE_ON': False, 'EVAL_TRUNK_AND_HEAD': False, 'EXTRACT_TRUNK_FEATURES_ONLY': False, 'FREEZE_TRUNK_AND_HEAD': False, 'FREEZE_TRUNK_ONLY': False, 'LINEAR_EVAL_FEAT_POOL_OPS_MAP': [], 'SHOULD_FLATTEN_FEATS': True}, 'FSDP_CONFIG': {'AUTO_WRAP_THRESHOLD': 0, 'bucket_cap_mb': 0, 'clear_autocast_cache': True, 'compute_dtype': torch.float32, 'flatten_parameters': True, 'fp32_reduce_scatter': False, 'mixed_precision': True, 'verbose': True}, 'GRAD_CLIP': {'MAX_NORM': 1, 'NORM_TYPE': 2, 'USE_GRAD_CLIP': False}, 'HEAD': {'BATCHNORM_EPS': 1e-05, 'BATCHNORM_MOMENTUM': 0.1, 'PARAMS': [['mlp', {'dims': [2048, 2048], 'use_relu': True}], ['mlp', {'dims': [2048, 128]}]], 'PARAMS_MULTIPLIER': 1.0}, 'INPUT_TYPE': 'rgb', 'MULTI_INPUT_HEAD_MAPPING': [], 'NON_TRAINABLE_PARAMS': [], 'SHARDED_DDP_SETUP': {'USE_SDP': False, 'reduce_buffer_size': -1}, 'SINGLE_PASS_EVERY_CROP': False, 'SYNC_BN_CONFIG': {'CONVERT_BN_TO_SYNC_BN': True, 'GROUP_SIZE': -1, 'SYNC_BN_TYPE': 'pytorch'}, 'TEMP_FROZEN_PARAMS_ITER_MAP': [], 'TRUNK': {'CONVIT': {'CLASS_TOKEN_IN_LOCAL_LAYERS': False, 'LOCALITY_DIM': 10, 'LOCALITY_STRENGTH': 1.0, 'N_GPSA_LAYERS': 10, 'USE_LOCAL_INIT': True}, 'EFFICIENT_NETS': {}, 'NAME': 'resnet', 'REGNET': {}, 'RESNETS': {'DEPTH': 50, 'GROUPNORM_GROUPS': 32, 'GROUPS': 1, 'LAYER4_STRIDE': 2, 'NORM': 'BatchNorm', 'STANDARDIZE_CONVOLUTIONS': False, 'WIDTH_MULTIPLIER': 1, 'WIDTH_PER_GROUP': 64, 'ZERO_INIT_RESIDUAL': False}, 'VISION_TRANSFORMERS': {'ATTENTION_DROPOUT_RATE': 0, 'CLASSIFIER': 'token', 'DROPOUT_RATE': 0, 'DROP_PATH_RATE': 0, 'HIDDEN_DIM': 768, 'IMAGE_SIZE': 224, 'MLP_DIM': 3072, 'NUM_HEADS': 12, 'NUM_LAYERS': 12, 'PATCH_SIZE': 16, 'QKV_BIAS': False, 'QK_SCALE': False, 'name': None}, 'XCIT': {'ATTENTION_DROPOUT_RATE': 0, 'DROPOUT_RATE': 0, 'DROP_PATH_RATE': 0.05, 'ETA': 1, 'HIDDEN_DIM': 384, 'IMAGE_SIZE': 224, 'NUM_HEADS': 8, 'NUM_LAYERS': 12, 'PATCH_SIZE': 16, 'QKV_BIAS': True, 'QK_SCALE': False, 
'TOKENS_NORM': True, 'name': None}}, 'WEIGHTS_INIT': {'APPEND_PREFIX': '', 'PARAMS_FILE': '', 'REMOVE_PREFIX': '', 'SKIP_LAYERS': ['num_batches_tracked'], 'STATE_DICT_KEY_NAME': 'classy_state_dict'}, '_MODEL_INIT_SEED': 0}, 'MONITORING': {'MONITOR_ACTIVATION_STATISTICS': 0}, 'MONITOR_PERF_STATS': True, 'MULTI_PROCESSING_METHOD': 'forkserver', 'NEAREST_NEIGHBOR': {'L2_NORM_FEATS': False, 'SIGMA': 0.1, 'TOPK': 200}, 'OPTIMIZER': {'betas': [0.9, 0.999], 'construct_single_param_group_only': False, 'head_optimizer_params': {'use_different_lr': False, 'use_different_wd': False, 'weight_decay': 1e-06}, 'larc_config': {'clip': False, 'eps': 1e-08, 'trust_coefficient': 0.001}, 'momentum': 0.9, 'name': 'sgd', 'nesterov': False, 'non_regularized_parameters': [], 'num_epochs': 1, 'param_schedulers': {'lr': {'auto_lr_scaling': {'auto_scale': False, 'base_lr_batch_size': 256, 'base_value': 0.3, 'scaling_type': 'linear'}, 'end_value': 0.0, 'interval_scaling': ['rescaled', 'rescaled'], 'lengths': [0.1, 0.9], 'milestones': [30, 60], 'name': 'composite', 'schedulers': [{'end_value': 4.8, 'name': 'linear', 'start_value': 0.6}, {'end_value': 0.0048, 'is_adaptive': True, 'name': 'cosine_warm_restart', 'restart_interval_length': 0.334, 'start_value': 4.8, 'wave_type': 'full'}], 'start_value': 0.1, 'update_interval': 'step', 'value': 0.1, 'values': [0.1, 0.01, 0.001]}, 'lr_head': {'auto_lr_scaling': {'auto_scale': False, 'base_lr_batch_size': 256, 'base_value': 0.3, 'scaling_type': 'linear'}, 'end_value': 0.0, 'interval_scaling': ['rescaled', 'rescaled'], 'lengths': [0.1, 0.9], 'milestones': [30, 60], 'name': 'composite', 'schedulers': [{'end_value': 4.8, 'name': 'linear', 'start_value': 0.6}, {'end_value': 0.0048, 'is_adaptive': True, 'name': 'cosine_warm_restart', 'restart_interval_length': 0.334, 'start_value': 4.8, 'wave_type': 'full'}], 'start_value': 0.1, 'update_interval': 'step', 'value': 0.1, 'values': [0.1, 0.01, 0.001]}}, 'regularize_bias': True, 'regularize_bn': False, 
'use_larc': True, 'use_zero': False, 'weight_decay': 1e-06}, 'PERF_STAT_FREQUENCY': 10, 'PROFILING': {'MEMORY_PROFILING': {'TRACK_BY_LAYER_MEMORY': False}, 'NUM_ITERATIONS': 10, 'OUTPUT_FOLDER': '.', 'PROFILED_RANKS': [0, 1], 'RUNTIME_PROFILING': {'LEGACY_PROFILER': False, 'PROFILE_CPU': True, 'PROFILE_GPU': True, 'USE_PROFILER': False}, 'START_ITERATION': 0, 'STOP_TRAINING_AFTER_PROFILING': False, 'WARMUP_ITERATIONS': 0}, 'REPRODUCIBILITY': {'CUDDN_DETERMINISTIC': False}, 'ROLLING_BTIME_FREQ': 5, 'SEED_VALUE': 0, 'SLURM': {'ADDITIONAL_PARAMETERS': {}, 'COMMENT': 'vissl job', 'CONSTRAINT': '', 'LOG_FOLDER': '.', 'MEM_GB': 250, 'NAME': 'vissl', 'NUM_CPU_PER_PROC': 8, 'PARTITION': '', 'PORT_ID': 40050, 'TIME_HOURS': 72, 'TIME_MINUTES': 0, 'USE_SLURM': False}, 'SVM': {'cls_list': [], 'costs': {'base': -1.0, 'costs_list': [0.1, 0.01], 'power_range': [4, 20]}, 'cross_val_folds': 3, 'dual': True, 'force_retrain': False, 'loss': 'squared_hinge', 'low_shot': {'dataset_name': 'voc', 'k_values': [1, 2, 4, 8, 16, 32, 64, 96], 'sample_inds': [1, 2, 3, 4, 5]}, 'max_iter': 2000, 'normalize': True, 'penalty': 'l2'}, 'TEST_EVERY_NUM_EPOCH': 1, 'TEST_MODEL': False, 'TEST_ONLY': False, 'TRAINER': {'TASK_NAME': 'self_supervision_task', 'TRAIN_STEP_NAME': 'standard_train_step'}, 'VERBOSE': False} INFO 2021-12-02 18:52:05,679 train.py: 117: System config:


sys.platform         linux
Python               3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0]
numpy                1.21.2
Pillow               8.4.0
vissl                0.1.6 @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/vissl
GPU available        True
GPU 0,1,2,3          TITAN RTX
CUDA_HOME            None
torchvision          0.8.0a0 @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/torchvision
hydra                1.1.1 @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/hydra
classy_vision        0.7.0.dev @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/classy_vision
apex                 0.1 @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/apex
PyTorch              1.7.1 @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/torch
PyTorch debug build  False


PyTorch built with:

CPU info:


Architecture         x86_64
CPU op-mode(s)       32-bit, 64-bit
Byte Order           Little Endian
Address sizes        46 bits physical, 48 bits virtual
CPU(s)               24
On-line CPU(s) list  0-23
Thread(s) per core   1
Core(s) per socket   12
Socket(s)            2
NUMA node(s)         4
Vendor ID            GenuineIntel
CPU family           6
Model                85
Model name           Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
Stepping             4
CPU MHz              1000.127
BogoMIPS             4600.00
Virtualization       VT-x
L1d cache            32K
L1i cache            32K
L2 cache             1024K
L3 cache             16896K
NUMA node0 CPU(s)    0,4,8,12,16,20
NUMA node1 CPU(s)    1,5,9,13,17,21
NUMA node2 CPU(s)    2,6,10,14,18,22
NUMA node3 CPU(s)    3,7,11,15,19,23


INFO 2021-12-02 18:52:05,682 trainer_main.py: 112: Using Distributed init method: tcp://localhost:33241, world_size: 1, rank: 0
r29n2:27102:27102 [0] NCCL INFO Bootstrap : Using [0]admin0:145.101.32.23<0>
r29n2:27102:27102 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
r29n2:27102:27102 [0] NCCL INFO NET/IB : No device found.
r29n2:27102:27102 [0] NCCL INFO NET/Socket : Using [0]admin0:145.101.32.23<0>
r29n2:27102:27102 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.2
r29n2:27102:27224 [0] NCCL INFO Channel 00/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 01/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 02/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 03/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 04/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 05/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 06/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 07/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 08/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 09/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 10/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 11/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 12/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 13/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 14/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 15/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 16/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 17/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 18/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 19/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 20/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 21/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 22/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 23/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 24/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 25/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 26/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 27/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 28/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 29/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 30/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 31/32 : 0
r29n2:27102:27224 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [1] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [2] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [3] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [4] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [5] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [6] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [7] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [8] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [9] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [10] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [11] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [12] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [13] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [14] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [15] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [16] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [17] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [18] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [19] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [20] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [21] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [22] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [23] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [24] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [25] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [26] -1/-1/-1->0->-1|-1->0->-1/-1
r29n2:27102:27224 [0] NCCL INFO Setting affinity for GPU 0 to 111111
r29n2:27102:27224 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
r29n2:27102:27224 [0] NCCL INFO comm 0x147b84001060 rank 0 nranks 1 cudaDev 0 busId 3b000 - Init COMPLETE
INFO 2021-12-02 18:52:08,991 trainer_main.py: 130: | initialized host r29n2.lisa.surfsara.nl as rank 0 (0)
INFO 2021-12-02 18:52:08,992 train_task.py: 181: Not using Automatic Mixed Precision
INFO 2021-12-02 18:52:08,993 train_task.py: 455: Building model....
INFO 2021-12-02 18:52:08,993 resnext.py: 64: ResNeXT trunk, supports activation checkpointing. Deactivated
INFO 2021-12-02 18:52:08,993 resnext.py: 87: Building model: ResNeXt50-1x64d-w1-BatchNorm2d
INFO 2021-12-02 18:52:09,666 model_helpers.py: 177: Using SyncBN group size: 1
INFO 2021-12-02 18:52:09,666 model_helpers.py: 192: Converting BN layers to PyTorch SyncBN
INFO 2021-12-02 18:52:09,673 train_task.py: 656: Broadcast model BN buffers from primary on every forward pass
INFO 2021-12-02 18:52:09,673 classification_task.py: 387: Synchronized Batch Normalization is disabled
INFO 2021-12-02 18:52:09,722 optimizer_helper.py: 293: Trainable params: 163, Non-Trainable params: 0, Trunk Regularized Parameters: 53, Trunk Unregularized Parameters 106, Head Regularized Parameters: 4, Head Unregularized Parameters: 0 Remaining Regularized Parameters: 0 Remaining Unregularized Parameters: 0
INFO 2021-12-02 18:52:09,723 img_replicate_pil.py: 52: ImgReplicatePil | Using num_times: 2
INFO 2021-12-02 18:52:09,723 img_pil_color_distortion.py: 56: ImgPilColorDistortion | Using strength: 1.0
INFO 2021-12-02 18:52:09,724 ssl_dataset.py: 156: Rank: 0 split: TRAIN Data files: ['']
INFO 2021-12-02 18:52:09,724 ssl_dataset.py: 159: Rank: 0 split: TRAIN Label files: []
INFO 2021-12-02 18:52:09,724 misc.py: 161: Set start method of multiprocessing to forkserver
INFO 2021-12-02 18:52:09,724 init.py: 126: Created the Distributed Sampler....
INFO 2021-12-02 18:52:09,724 init.py: 101: Distributed Sampler config: {'num_replicas': 1, 'rank': 0, 'epoch': 0, 'num_samples': 1000, 'total_size': 1000, 'shuffle': True, 'seed': 0}
INFO 2021-12-02 18:52:09,724 init.py: 215: Wrapping the dataloader to async device copies
INFO 2021-12-02 18:52:09,724 train_task.py: 384: Building loss...
INFO 2021-12-02 18:52:09,726 simclr_info_nce_loss.py: 91: Creating Info-NCE loss on Rank: 0
INFO 2021-12-02 18:52:09,726 trainer_main.py: 268: Training 1 epochs
INFO 2021-12-02 18:52:09,726 trainer_main.py: 269: One epoch = 100 iterations.
INFO 2021-12-02 18:52:09,726 trainer_main.py: 270: Total 1000 samples in one epoch
INFO 2021-12-02 18:52:09,726 trainer_main.py: 276: Total 100 iterations for training
INFO 2021-12-02 18:52:10,688 logger.py: 84: Thu Dec 2 18:52:10 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  TITAN RTX           On   | 00000000:3B:00.0 Off |                  N/A |
| 40%   40C    P2    67W / 280W |    970MiB / 24220MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  TITAN RTX           On   | 00000000:5E:00.0 Off |                  N/A |
| 40%   33C    P8    10W / 280W |      3MiB / 24220MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  TITAN RTX           On   | 00000000:B1:00.0 Off |                  N/A |
| 41%   32C    P8    23W / 280W |      3MiB / 24220MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  TITAN RTX           On   | 00000000:D9:00.0 Off |                  N/A |
| 41%   33C    P8    19W / 280W |      3MiB / 24220MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     27102      C   python3                           967MiB |
+-----------------------------------------------------------------------------+
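
A quick sanity check on the numbers in the log above, sketched in Python. The job requested `NUM_NODES=2` and `NUM_PROC_PER_NODE=4`, yet the log reports `world_size: 1`, `nranks 1` and only one GPU in use. The helper names below are hypothetical (they are not VISSL functions), and the per-replica sample count assumes the usual ceiling split of a distributed sampler without `drop_last`.

```python
import math


def expected_world_size(num_nodes: int, num_proc_per_node: int) -> int:
    """Number of ranks one would expect from the requested layout."""
    return num_nodes * num_proc_per_node


def samples_per_replica(num_samples: int, world_size: int) -> int:
    """Assumed ceiling split of the dataset across replicas."""
    return math.ceil(num_samples / world_size)


def iterations_per_epoch(num_samples: int, batch_per_replica: int, world_size: int) -> int:
    """Iterations per epoch for one replica (floor division, an assumption)."""
    return samples_per_replica(num_samples, world_size) // batch_per_replica


# Values taken from the job script and the log in this issue.
requested = expected_world_size(num_nodes=2, num_proc_per_node=4)
logged_world_size = 1

# With the logged world size, the numbers match the log: 1000 samples, 100 iterations.
logged_iterations = iterations_per_epoch(1000, 10, logged_world_size)

print(f"requested ranks: {requested}, logged ranks: {logged_world_size}")
print(f"iterations per epoch at logged world size: {logged_iterations}")
```

The mismatch between the requested 8 ranks and the logged 1 rank is consistent with the distributed settings not being applied to this run, which also explains why only GPU 0 shows any memory usage.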

INFO 2021-12-02 18:52:10,693 trainer_main.py: 173: Model is: Classy <class 'vissl.models.base_ssl_model.BaseSSLMultiInputOutputModel'>: BaseSSLMultiInputOutputModel( (_heads): ModuleDict() (trunk): ResNeXt( (_feature_blocks): ModuleDict( (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False) (bn1): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv1_relu): ReLU(inplace=True) (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False) (layer1): Sequential( (0): Bottleneck( (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (downsample): Sequential( (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (1): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): Bottleneck( (conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (2): Bottleneck( (conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, 
track_running_stats=True) (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) ) (layer2): Sequential( (0): Bottleneck( (conv1): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (downsample): Sequential( (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False) (1): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): Bottleneck( (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (2): Bottleneck( (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(128, 
eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (3): Bottleneck( (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) ) (layer3): Sequential( (0): Bottleneck( (conv1): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (downsample): Sequential( (0): Conv2d(512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False) (1): SyncBatchNorm(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), 
bias=False) (bn3): SyncBatchNorm(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (2): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (3): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (4): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (5): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 
256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) ) (layer4): Sequential( (0): Bottleneck( (conv1): Conv2d(1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(<SUPPORTED_L4_STRIDE.two: 2>, <SUPPORTED_L4_STRIDE.two: 2>), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (downsample): Sequential( (0): Conv2d(1024, 2048, kernel_size=(1, 1), stride=(<SUPPORTED_L4_STRIDE.two: 2>, <SUPPORTED_L4_STRIDE.two: 2>), bias=False) (1): SyncBatchNorm(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): Bottleneck( (conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (2): Bottleneck( (conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(512, 512, 
kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): SyncBatchNorm(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) ) (avgpool): AdaptiveAvgPool2d(output_size=(1, 1)) (flatten): Flatten() ) ) (heads): ModuleList( (0): MLP( (clf): Sequential( (0): Linear(in_features=2048, out_features=2048, bias=True) ) ) (1): MLP( (clf): Sequential( (0): Linear(in_features=2048, out_features=128, bias=True) ) ) ) ) INFO 2021-12-02 18:52:10,694 trainer_main.py: 174: Loss is: { 'info_average': { 'dist_rank': 0, 'name': 'SimclrInfoNCECriterion', 'num_negatives': 18, 'num_pos': 2, 'temperature': 0.1}, 'name': 'SimclrInfoNCELoss'} INFO 2021-12-02 18:52:10,694 trainer_main.py: 175: Starting training.... INFO 2021-12-02 18:52:10,695 init.py: 101: Distributed Sampler config: {'num_replicas': 1, 'rank': 0, 'epoch': 0, 'num_samples': 1000, 'total_size': 1000, 'shuffle': True, 'seed': 0} /home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/hydra/experimental/initialize.py:67: UserWarning: hydra.experimental.initialize_config_module() is no longer experimental. Use hydra.initialize_config_module(). deprecation_warning( /home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/hydra/experimental/compose.py:18: UserWarning: hydra.experimental.compose() is no longer experimental. Use hydra.compose() deprecation_warning( INFO 2021-12-02 18:52:12,065 trainer_main.py: 333: Phase advanced. Rank: 0 INFO 2021-12-02 18:52:12,067 log_hooks.py: 76: ========= Memory Summary at on_phase_start ======= =========================================================================== PyTorch CUDA memory summary, device ID 0
CUDA OOMs: 0 cudaMalloc retries: 0
===========================================================================
Metric Cur Usage Peak Usage Tot Alloc Tot Freed
---------------------------------------------------------------------------
Allocated memory 121779 KB 121779 KB 121781 KB 1536 B
from large pool 102912 KB 102912 KB 102912 KB 0 B
from small pool 18867 KB 18867 KB 18869 KB 1536 B
---------------------------------------------------------------------------
Active memory 121779 KB 121779 KB 121781 KB 1536 B
from large pool 102912 KB 102912 KB 102912 KB 0 B
from small pool 18867 KB 18867 KB 18869 KB 1536 B
---------------------------------------------------------------------------
GPU reserved memory 137216 KB 137216 KB 137216 KB 0 B
from large pool 114688 KB 114688 KB 114688 KB 0 B
from small pool 22528 KB 22528 KB 22528 KB 0 B
---------------------------------------------------------------------------
Non-releasable memory 15436 KB 28795 KB 95385 KB 79948 KB
from large pool 11776 KB 28160 KB 74496 KB 62720 KB
from small pool 3660 KB 3661 KB 20889 KB 17228 KB
---------------------------------------------------------------------------
Allocations 328 328 331 3
from large pool 19 19 19 0
from small pool 309 309 312 3
---------------------------------------------------------------------------
Active allocs 328 328 331 3
from large pool 19 19 19 0
from small pool 309 309 312 3
---------------------------------------------------------------------------
GPU reserved segments 17 17 17 0
from large pool 6 6 6 0
from small pool 11 11 11 0
---------------------------------------------------------------------------
Non-releasable allocs 8 8 19 11
from large pool 4 5 5 1
from small pool 4 5 14 10
===========================================================================
INFO 2021-12-02 18:52:12,067 state_update_hooks.py: 115: Starting phase 0 [train] INFO 2021-12-02 18:52:12,703 log_hooks.py: 76: ========= Memory Summary at on_forward ======= =========================================================================== PyTorch CUDA memory summary, device ID 0
CUDA OOMs: 0 cudaMalloc retries: 0
===========================================================================
Metric Cur Usage Peak Usage Tot Alloc Tot Freed
---------------------------------------------------------------------------
Allocated memory 1760 MB 4113 MB 17969 MB 16209 MB
from large pool 1741 MB 4095 MB 17948 MB 16207 MB
from small pool 18 MB 18 MB 20 MB 1 MB
---------------------------------------------------------------------------
Active memory 1760 MB 4113 MB 17969 MB 16209 MB
from large pool 1741 MB 4095 MB 17948 MB 16207 MB
from small pool 18 MB 18 MB 20 MB 1 MB
---------------------------------------------------------------------------
GPU reserved memory 4714 MB 5848 MB 10288 MB 5574 MB
from large pool 4692 MB 5826 MB 10266 MB 5574 MB
from small pool 22 MB 22 MB 22 MB 0 MB
---------------------------------------------------------------------------
Non-releasable memory 569238 KB 1821 MB 8165 MB 7609 MB
from large pool 566120 KB 1818 MB 8143 MB 7590 MB
from small pool 3118 KB 3 MB 21 MB 18 MB
---------------------------------------------------------------------------
Allocations 545 545 680 135
from large pool 124 124 161 37
from small pool 421 421 519 98
---------------------------------------------------------------------------
Active allocs 545 545 680 135
from large pool 124 124 161 37
from small pool 421 421 519 98
---------------------------------------------------------------------------
GPU reserved segments 24 25 29 5
from large pool 13 14 18 5
from small pool 11 11 11 0
---------------------------------------------------------------------------
Non-releasable allocs 11 13 92 81
from large pool 6 8 18 12
from small pool 5 5 74 69
===========================================================================
INFO 2021-12-02 18:52:13,159 log_hooks.py: 76: ========= Memory Summary at on_backward ======= =========================================================================== PyTorch CUDA memory summary, device ID 0
CUDA OOMs: 0 cudaMalloc retries: 0
===========================================================================
Metric Cur Usage Peak Usage Tot Alloc Tot Freed
---------------------------------------------------------------------------
Allocated memory 246616 KB 4176 MB 49826 MB 49585 MB
from large pool 209112 KB 4157 MB 49785 MB 49581 MB
from small pool 37504 KB 36 MB 40 MB 4 MB
---------------------------------------------------------------------------
Active memory 246616 KB 4176 MB 49826 MB 49585 MB
from large pool 209112 KB 4157 MB 49785 MB 49581 MB
from small pool 37504 KB 36 MB 40 MB 4 MB
---------------------------------------------------------------------------
GPU reserved memory 2098 MB 5848 MB 16086 MB 13988 MB
from large pool 2058 MB 5826 MB 16046 MB 13988 MB
from small pool 40 MB 40 MB 40 MB 0 MB
---------------------------------------------------------------------------
Non-releasable memory 1857 MB 2355 MB 23134 MB 21276 MB
from large pool 1853 MB 2351 MB 23097 MB 21243 MB
from small pool 3 MB 4 MB 36 MB 33 MB
---------------------------------------------------------------------------
Allocations 498 560 1189 691
from large pool 38 128 406 368
from small pool 460 462 783 323
---------------------------------------------------------------------------
Active allocs 498 560 1189 691
from large pool 38 128 406 368
from small pool 460 462 783 323
---------------------------------------------------------------------------
GPU reserved segments 30 32 44 14
from large pool 10 14 24 14
from small pool 20 20 20 0
---------------------------------------------------------------------------
Non-releasable allocs 19 23 336 317
from large pool 9 13 153 144
from small pool 10 11 183 173
===========================================================================
INFO 2021-12-02 18:52:13,223 log_hooks.py: 76: ========= Memory Summary at on_update ======= =========================================================================== PyTorch CUDA memory summary, device ID 0
CUDA OOMs: 0 cudaMalloc retries: 0
===========================================================================
Metric Cur Usage Peak Usage Tot Alloc Tot Freed
---------------------------------------------------------------------------
Allocated memory 356513 KB 4176 MB 50040 MB 49692 MB
from large pool 300384 KB 4157 MB 49963 MB 49669 MB
from small pool 56129 KB 54 MB 77 MB 22 MB
---------------------------------------------------------------------------
Active memory 356513 KB 4176 MB 50040 MB 49692 MB
from large pool 300384 KB 4157 MB 49963 MB 49669 MB
from small pool 56129 KB 54 MB 77 MB 22 MB
---------------------------------------------------------------------------
GPU reserved memory 2116 MB 5848 MB 16104 MB 13988 MB
from large pool 2058 MB 5826 MB 16046 MB 13988 MB
from small pool 58 MB 58 MB 58 MB 0 MB
---------------------------------------------------------------------------
Non-releasable memory 1767 MB 2355 MB 23252 MB 21484 MB
from large pool 1764 MB 2351 MB 23186 MB 21421 MB
from small pool 3 MB 4 MB 65 MB 62 MB
---------------------------------------------------------------------------
Allocations 661 664 3081 2420
from large pool 56 128 442 386
from small pool 605 608 2639 2034
---------------------------------------------------------------------------
Active allocs 661 664 3081 2420
from large pool 56 128 442 386
from small pool 605 608 2639 2034
---------------------------------------------------------------------------
GPU reserved segments 39 39 53 14
from large pool 10 14 24 14
from small pool 29 29 29 0
---------------------------------------------------------------------------
Non-releasable allocs 16 23 1859 1843
from large pool 8 13 153 145
from small pool 8 13 1706 1698
===========================================================================
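
The learning rates in the iteration log that follows can be checked against the scheduler settings printed in the config dump (`lengths: [0.1, 0.9]`, first scheduler `linear` from `0.6` to `4.8`) and the logged 100 total iterations. The sketch below assumes plain linear interpolation over the first 10 iterations; the function name is hypothetical and this is a reading of the logged values, not VISSL's scheduler code.

```python
def linear_warmup_lr(step: int, warmup_steps: int = 10,
                     start: float = 0.6, end: float = 4.8) -> float:
    """Linearly interpolate from start to end over warmup_steps (assumed form)."""
    return start + (end - start) * step / warmup_steps


# 0.1 * 100 iterations = 10 warmup steps, per the config dump and the log.
warmup_values = [round(linear_warmup_lr(i), 2) for i in range(10)]
print(warmup_values)
```

Under that assumption the values for iterations 0 through 9 agree with the logged `lr` column (0.6, 1.02, 1.44, and so on up to 4.38). From iteration 10 onward the second scheduler (`cosine_warm_restart`) takes over and is not modelled here.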

INFO 2021-12-02 18:52:13,224 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 0; lr: 0.6; loss: 3.07831; btime(ms): 0; eta: 0:00:00; peak_mem(M): 4176;
INFO 2021-12-02 18:52:13,351 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 1; lr: 1.02; loss: 2.98483; btime(ms): 3498; eta: 0:05:46; peak_mem(M): 4176; max_iterations: 100;
INFO 2021-12-02 18:52:13,480 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 2; lr: 1.44; loss: 2.98822; btime(ms): 1812; eta: 0:02:57; peak_mem(M): 4176;
INFO 2021-12-02 18:52:13,609 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 3; lr: 1.86; loss: 3.02691; btime(ms): 1251; eta: 0:02:01; peak_mem(M): 4176;
INFO 2021-12-02 18:52:13,736 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 4; lr: 2.28; loss: 2.97272; btime(ms): 970; eta: 0:01:33; peak_mem(M): 4176;
INFO 2021-12-02 18:52:13,874 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 5; lr: 2.7; loss: 2.90045; btime(ms): 801; eta: 0:01:16; peak_mem(M): 4176;
INFO 2021-12-02 18:52:14,004 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 6; lr: 3.12; loss: 3.00003; btime(ms): 691; eta: 0:01:04; peak_mem(M): 4176;
INFO 2021-12-02 18:52:14,139 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 7; lr: 3.54; loss: 2.9386; btime(ms): 611; eta: 0:00:56; peak_mem(M): 4176;
INFO 2021-12-02 18:52:14,263 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 8; lr: 3.96; loss: 3.04946; btime(ms): 551; eta: 0:00:50; peak_mem(M): 4176;
INFO 2021-12-02 18:52:14,395 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 9; lr: 4.38; loss: 2.93521; btime(ms): 504; eta: 0:00:45; peak_mem(M): 4176;
INFO 2021-12-02 18:52:14,527 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 10; lr: 3.59685; loss: 2.9578; btime(ms): 466; eta: 0:00:42; peak_mem(M): 4176;
INFO 2021-12-02 18:52:14,662 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 11; lr: 3.58705; loss: 2.9544; btime(ms): 436; eta: 0:00:38; peak_mem(M): 4176;
INFO 2021-12-02 18:52:14,793 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 12; lr: 3.55776; loss: 2.97437; btime(ms): 411; eta: 0:00:36; peak_mem(M): 4176;
INFO 2021-12-02 18:52:14,918 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 13; lr: 3.50929; loss: 2.94129; btime(ms): 389; eta: 0:00:33; peak_mem(M): 4176;
INFO 2021-12-02 18:52:15,051 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 14; lr: 3.44218; loss: 2.93352; btime(ms): 370; eta: 0:00:31; peak_mem(M): 4176;
INFO 2021-12-02 18:52:15,182 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 15; lr: 3.35716; loss: 2.94921; btime(ms): 355; eta: 0:00:30; peak_mem(M): 4176;
INFO 2021-12-02 18:52:15,316 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 16; lr: 3.25516; loss: 2.95404; btime(ms): 341; eta: 0:00:28; peak_mem(M): 4176;
INFO 2021-12-02 18:52:15,444 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 17; lr: 3.13729; loss: 2.87209; btime(ms): 328; eta: 0:00:27; peak_mem(M): 4176;
INFO 2021-12-02 18:52:15,571 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 18; lr: 3.00483; loss: 2.97899; btime(ms): 317; eta: 0:00:26; peak_mem(M): 4176;
INFO 2021-12-02 18:52:15,705 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 19; lr: 2.85923; loss: 3.27561; btime(ms): 307; eta: 0:00:24; peak_mem(M): 4176;
INFO 2021-12-02 18:52:15,833 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 20; lr: 2.70209; loss: 2.87969; btime(ms): 298; eta: 0:00:23; peak_mem(M): 4176;
INFO 2021-12-02 18:52:15,965 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 21; lr: 2.5351; loss: 2.99437; btime(ms): 290; eta: 0:00:22; peak_mem(M): 4176;
INFO 2021-12-02 18:52:16,096 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 22; lr: 2.36011; loss: 2.98848; btime(ms): 283; eta: 0:00:22; peak_mem(M): 4176;
INFO 2021-12-02 18:52:16,223 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 23; lr: 2.17901; loss: 2.95309; btime(ms): 276; eta: 0:00:21; peak_mem(M): 4176;
INFO 2021-12-02 18:52:16,348 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 24; lr: 1.99379; loss: 2.96843; btime(ms): 270; eta: 0:00:20; peak_mem(M): 4176;
INFO 2021-12-02 18:52:16,477 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 25; lr: 1.80646; loss: 2.93082; btime(ms): 264; eta: 0:00:19; peak_mem(M): 4176;
INFO 2021-12-02 18:52:16,603 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 26; lr: 1.61906; loss: 2.93418; btime(ms): 259; eta: 0:00:19; peak_mem(M): 4176;
INFO 2021-12-02 18:52:16,737 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 27; lr: 1.43365; loss: 2.96253; btime(ms): 254; eta: 0:00:18; peak_mem(M): 4176;
INFO 2021-12-02 18:52:16,865 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 28; lr: 1.25225; loss: 2.96734; btime(ms): 250; eta: 0:00:18; peak_mem(M): 4176;
INFO 2021-12-02 18:52:17,004 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 29; lr: 1.07684; loss: 2.95698; btime(ms): 246; eta: 0:00:17; peak_mem(M): 4176;
INFO 2021-12-02 18:52:17,139 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 30; lr: 0.90932; loss: 2.95171; btime(ms): 242; eta: 0:00:16; peak_mem(M): 4176;
INFO 2021-12-02 18:52:17,269 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 31; lr: 0.75154; loss: 2.94174; btime(ms): 239; eta: 0:00:16; peak_mem(M): 4176;
INFO 2021-12-02 18:52:17,404 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 32; lr: 0.6052; loss: 2.9391; btime(ms): 235; eta: 0:00:16; peak_mem(M): 4176;
INFO 2021-12-02 18:52:17,532 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 33; lr: 0.47191; loss: 2.95017; btime(ms): 232; eta: 0:00:15; peak_mem(M): 4176;
INFO 2021-12-02 18:52:17,659 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 34; lr: 0.35312; loss: 2.94415; btime(ms): 229; eta: 0:00:15; peak_mem(M): 4176;
INFO 2021-12-02 18:52:17,793 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 35; lr: 0.25014; loss: 2.94375; btime(ms): 226; eta: 0:00:14; peak_mem(M): 4176;
INFO 2021-12-02 18:52:17,924 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 36; lr: 0.16407; loss: 2.94641; btime(ms): 224; eta: 0:00:14; peak_mem(M): 4176;
INFO 2021-12-02 18:52:18,058 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 37; lr: 0.09586; loss: 2.94746; btime(ms): 221; eta: 0:00:13; peak_mem(M): 4176;
INFO 2021-12-02 18:52:18,183 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 38; lr: 0.04626; loss: 2.93896; btime(ms): 219; eta: 0:00:13; peak_mem(M): 4176;
INFO 2021-12-02 18:52:18,320 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 39; lr: 0.01581; loss: 2.94771; btime(ms): 216; eta: 0:00:13; peak_mem(M): 4176;
INFO 2021-12-02 18:52:18,450 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 40; lr: 0.00484; loss: 2.94257; btime(ms): 214; eta: 0:00:12; peak_mem(M): 4176;
INFO 2021-12-02 18:52:18,577 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 41; lr: 0.00767; loss: 2.94951; btime(ms): 212; eta: 0:00:12; peak_mem(M): 4176;
INFO 2021-12-02 18:52:18,710 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 42; lr: 0.01699; loss: 2.9473; btime(ms): 210; eta: 0:00:12; peak_mem(M): 4176;
INFO 2021-12-02 18:52:18,843 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 43; lr: 0.03267; loss: 2.94617; btime(ms): 208; eta: 0:00:11; peak_mem(M): 4176;
INFO 2021-12-02 18:52:18,972 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 44; lr: 0.05454; loss: 2.94794; btime(ms): 207; eta: 0:00:11; peak_mem(M): 4176;
INFO 2021-12-02 18:52:19,103 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 45; lr: 0.08236; loss: 2.93979; btime(ms): 205; eta: 0:00:11; peak_mem(M): 4176;
INFO 2021-12-02 18:52:19,230 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 46; lr: 0.11583; loss: 2.94335; btime(ms): 203; eta: 0:00:11; peak_mem(M): 4176;
INFO 2021-12-02 18:52:19,366 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 47; lr: 0.15458; loss: 2.95152; btime(ms): 202; eta: 0:00:10; peak_mem(M): 4176;
INFO 2021-12-02 18:52:19,490 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 48; lr: 0.19819; loss: 2.9448; btime(ms): 200; eta: 0:00:10; peak_mem(M): 4176;
INFO 2021-12-02 18:52:19,624 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 49; lr: 0.24618; loss: 2.94277; btime(ms): 199; eta: 0:00:10; peak_mem(M): 4176;
INFO 2021-12-02 18:52:19,731 logger.py: 84: Thu Dec 2 18:52:19 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  TITAN RTX           On   | 00000000:3B:00.0 Off |                  N/A |
| 41%   46C    P2    97W / 280W |   3064MiB / 24220MiB |     68%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  TITAN RTX           On   | 00000000:5E:00.0 Off |                  N/A |
| 41%   33C    P8    15W / 280W |      3MiB / 24220MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  TITAN RTX           On   | 00000000:B1:00.0 Off |                  N/A |
| 41%   32C    P8    22W / 280W |      3MiB / 24220MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  TITAN RTX           On   | 00000000:D9:00.0 Off |                  N/A |
| 41%   33C    P8    18W / 280W |      3MiB / 24220MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     27102      C   python3                          3061MiB |
+-----------------------------------------------------------------------------+

INFO 2021-12-02 18:52:19,872 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 50; lr: 0.29803; loss: 2.94613; btime(ms): 197; eta: 0:00:09; peak_mem(M): 4176; INFO 2021-12-02 18:52:20,005 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 51; lr: 0.35318; loss: 2.94412; btime(ms): 198; eta: 0:00:09; peak_mem(M): 4176; INFO 2021-12-02 18:52:20,131 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 52; lr: 0.41101; loss: 2.94424; btime(ms): 197; eta: 0:00:09; peak_mem(M): 4176; INFO 2021-12-02 18:52:20,264 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 53; lr: 0.47091; loss: 2.94942; btime(ms): 196; eta: 0:00:09; peak_mem(M): 4176; INFO 2021-12-02 18:52:20,399 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 54; lr: 0.53222; loss: 2.9432; btime(ms): 195; eta: 0:00:08; peak_mem(M): 4176; INFO 2021-12-02 18:52:20,539 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 55; lr: 0.59426; loss: 2.94541; btime(ms): 194; eta: 0:00:08; peak_mem(M): 4176; INFO 2021-12-02 18:52:20,671 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 56; lr: 0.65636; loss: 2.94043; btime(ms): 193; eta: 0:00:08; peak_mem(M): 4176; INFO 2021-12-02 18:52:20,806 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 57; lr: 0.71785; loss: 2.94718; btime(ms): 192; eta: 0:00:08; peak_mem(M): 4176; INFO 2021-12-02 18:52:20,935 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 58; lr: 0.77805; loss: 2.93593; btime(ms): 191; eta: 0:00:08; peak_mem(M): 4176; INFO 2021-12-02 18:52:21,071 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 59; lr: 0.83631; loss: 2.9267; btime(ms): 189; eta: 0:00:07; peak_mem(M): 4176; INFO 2021-12-02 18:52:21,197 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 60; lr: 0.89198; loss: 2.93109; btime(ms): 189; eta: 0:00:07; peak_mem(M): 4176; INFO 2021-12-02 18:52:21,328 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 61; lr: 0.94447; loss: 2.9667; btime(ms): 188; eta: 0:00:07; peak_mem(M): 4176; INFO 2021-12-02 18:52:21,457 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 62; lr: 0.9932; loss: 2.92827; btime(ms): 187; eta: 0:00:07; peak_mem(M): 4176; INFO 2021-12-02 
18:52:21,589 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 63; lr: 1.03763; loss: 2.93745; btime(ms): 186; eta: 0:00:06; peak_mem(M): 4176; INFO 2021-12-02 18:52:21,722 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 64; lr: 1.07729; loss: 2.95133; btime(ms): 185; eta: 0:00:06; peak_mem(M): 4176; INFO 2021-12-02 18:52:21,856 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 65; lr: 1.11174; loss: 2.86927; btime(ms): 184; eta: 0:00:06; peak_mem(M): 4176; INFO 2021-12-02 18:52:21,987 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 66; lr: 1.1406; loss: 2.97953; btime(ms): 183; eta: 0:00:06; peak_mem(M): 4176; INFO 2021-12-02 18:52:22,113 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 67; lr: 1.16356; loss: 2.86067; btime(ms): 183; eta: 0:00:06; peak_mem(M): 4176; INFO 2021-12-02 18:52:22,244 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 68; lr: 1.18037; loss: 3.0333; btime(ms): 182; eta: 0:00:05; peak_mem(M): 4176; INFO 2021-12-02 18:52:22,378 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 69; lr: 1.19084; loss: 3.08738; btime(ms): 181; eta: 0:00:05; peak_mem(M): 4176; INFO 2021-12-02 18:52:22,507 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 70; lr: 1.19487; loss: 3.02461; btime(ms): 180; eta: 0:00:05; peak_mem(M): 4176; INFO 2021-12-02 18:52:22,635 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 71; lr: 1.1924; loss: 2.90518; btime(ms): 180; eta: 0:00:05; peak_mem(M): 4176; INFO 2021-12-02 18:52:22,771 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 72; lr: 1.18346; loss: 2.99066; btime(ms): 179; eta: 0:00:05; peak_mem(M): 4176; INFO 2021-12-02 18:52:22,896 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 73; lr: 1.16816; loss: 3.01944; btime(ms): 178; eta: 0:00:04; peak_mem(M): 4176; INFO 2021-12-02 18:52:23,035 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 74; lr: 1.14666; loss: 2.97204; btime(ms): 177; eta: 0:00:04; peak_mem(M): 4176; INFO 2021-12-02 18:52:23,171 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 75; lr: 1.11919; loss: 2.93396; btime(ms): 177; eta: 0:00:04; peak_mem(M): 4176; INFO 2021-12-02 18:52:23,302 log_hooks.py: 
277: Rank: 0; [ep: 0] iter: 76; lr: 1.08605; loss: 2.9222; btime(ms): 176; eta: 0:00:04; peak_mem(M): 4176; INFO 2021-12-02 18:52:23,436 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 77; lr: 1.0476; loss: 2.92474; btime(ms): 176; eta: 0:00:04; peak_mem(M): 4176; INFO 2021-12-02 18:52:23,569 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 78; lr: 1.00427; loss: 2.92866; btime(ms): 175; eta: 0:00:03; peak_mem(M): 4176; INFO 2021-12-02 18:52:23,702 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 79; lr: 0.95653; loss: 2.92798; btime(ms): 175; eta: 0:00:03; peak_mem(M): 4176; INFO 2021-12-02 18:52:23,830 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 80; lr: 0.90489; loss: 2.94017; btime(ms): 174; eta: 0:00:03; peak_mem(M): 4176; INFO 2021-12-02 18:52:23,962 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 81; lr: 0.84993; loss: 2.90543; btime(ms): 174; eta: 0:00:03; peak_mem(M): 4176; INFO 2021-12-02 18:52:24,096 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 82; lr: 0.79223; loss: 2.94644; btime(ms): 173; eta: 0:00:03; peak_mem(M): 4176; INFO 2021-12-02 18:52:24,221 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 83; lr: 0.73244; loss: 2.91061; btime(ms): 173; eta: 0:00:02; peak_mem(M): 4176; INFO 2021-12-02 18:52:24,352 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 84; lr: 0.6712; loss: 2.90573; btime(ms): 172; eta: 0:00:02; peak_mem(M): 4176; INFO 2021-12-02 18:52:24,491 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 85; lr: 0.60918; loss: 2.97565; btime(ms): 172; eta: 0:00:02; peak_mem(M): 4176; INFO 2021-12-02 18:52:24,627 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 86; lr: 0.54706; loss: 2.86883; btime(ms): 171; eta: 0:00:02; peak_mem(M): 4176; INFO 2021-12-02 18:52:24,762 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 87; lr: 0.48552; loss: 2.97494; btime(ms): 171; eta: 0:00:02; peak_mem(M): 4176; INFO 2021-12-02 18:52:24,893 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 88; lr: 0.42522; loss: 2.96146; btime(ms): 170; eta: 0:00:02; peak_mem(M): 4176; INFO 2021-12-02 18:52:25,019 log_hooks.py: 277: Rank: 0; [ep: 0] 
iter: 89; lr: 0.36683; loss: 2.96612; btime(ms): 170; eta: 0:00:01; peak_mem(M): 4176; INFO 2021-12-02 18:52:25,144 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 90; lr: 0.31099; loss: 3.02852; btime(ms): 169; eta: 0:00:01; peak_mem(M): 4176; INFO 2021-12-02 18:52:25,268 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 91; lr: 0.25829; loss: 2.87463; btime(ms): 169; eta: 0:00:01; peak_mem(M): 4176; INFO 2021-12-02 18:52:25,391 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 92; lr: 0.20932; loss: 2.91708; btime(ms): 168; eta: 0:00:01; peak_mem(M): 4176; INFO 2021-12-02 18:52:25,515 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 93; lr: 0.16462; loss: 2.93636; btime(ms): 168; eta: 0:00:01; peak_mem(M): 4176; INFO 2021-12-02 18:52:25,641 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 94; lr: 0.12466; loss: 3.03869; btime(ms): 167; eta: 0:00:01; peak_mem(M): 4176; INFO 2021-12-02 18:52:25,765 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 95; lr: 0.08989; loss: 2.86524; btime(ms): 167; eta: 0:00:00; peak_mem(M): 4176; INFO 2021-12-02 18:52:25,889 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 96; lr: 0.06068; loss: 2.95951; btime(ms): 167; eta: 0:00:00; peak_mem(M): 4176; INFO 2021-12-02 18:52:26,013 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 97; lr: 0.03736; loss: 3.05665; btime(ms): 166; eta: 0:00:00; peak_mem(M): 4176; INFO 2021-12-02 18:52:26,136 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 98; lr: 0.02018; loss: 3.05182; btime(ms): 166; eta: 0:00:00; peak_mem(M): 4176; INFO 2021-12-02 18:52:26,296 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 99; lr: 0.00932; loss: 3.02523; btime(ms): 165; eta: 0:00:00; peak_mem(M): 4176; INFO 2021-12-02 18:52:26,297 trainer_main.py: 214: Meters synced INFO 2021-12-02 18:52:26,297 io.py: 63: Saving data to file: /home/bdolicki/thesis//hissl-logs/train_nct_dino/checkpoints/8462603/metrics.json INFO 2021-12-02 18:52:26,299 io.py: 89: Saved data to file: /home/bdolicki/thesis//hissl-logs/train_nct_dino/checkpoints/8462603/metrics.json INFO 2021-12-02 18:52:26,299 log_hooks.py: 
425: [phase: 0] Saving checkpoint to /home/bdolicki/thesis//hissl-logs/train_nct_dino/checkpoints/8462603
INFO 2021-12-02 18:52:26,958 checkpoint.py: 131: Saved checkpoint: /home/bdolicki/thesis//hissl-logs/train_nct_dino/checkpoints/8462603/model_final_checkpoint_phase0.torch
INFO 2021-12-02 18:52:26,958 checkpoint.py: 140: Creating symlink...
INFO 2021-12-02 18:52:26,959 checkpoint.py: 144: Created symlink: /home/bdolicki/thesis//hissl-logs/train_nct_dino/checkpoints/8462603/checkpoint.torch
INFO 2021-12-02 18:52:27,071 train.py: 131: All Done!
INFO 2021-12-02 18:52:27,071 logger.py: 73: Shutting down loggers...
INFO 2021-12-02 18:52:27,072 distributed_launcher.py: 168: All Done!
INFO 2021-12-02 18:52:27,072 logger.py: 73: Shutting down loggers...
/var/spool/slurm/slurmd/job8462603/slurm_script: line 46: config.DISTRIBUTED.NUM_NODES=2: command not found


Two nodes:

####### overrides: ['hydra.verbose=true', 'config=dummy/quick_gpu_resnet50_simclr', 'config.DATA.TRAIN.DATA_SOURCES=[synthetic]', 'config.DATA.TRAIN.DATA_LIMIT=1000', 'config.DATA.TRAIN.BATCHSIZE_PER_REPLICA=10', 'config.CHECKPOINT.DIR=/home/bdolicki/thesis//hissl-logs/train_nct_dino/checkpoints/8462605', 'config.DISTRIBUTED.NUM_NODES=2', 'config.DISTRIBUTED.NUM_PROC_PER_NODE=4', 'config.DISTRIBUTED.RUN_ID=localhost:46357', 'hydra.verbose=true']
/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/hydra/experimental/initialize.py:67: UserWarning: hydra.experimental.initialize_config_module() is no longer experimental. Use hydra.initialize_config_module().
  deprecation_warning(
/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/hydra/experimental/compose.py:18: UserWarning: hydra.experimental.compose() is no longer experimental. Use hydra.compose()
  deprecation_warning(
INFO 2021-12-02 18:53:17,135 train.py: 94: Env set for rank: 1, dist_rank: 1
INFO 2021-12-02 18:53:17,135 train.py: 94: Env set for rank: 3, dist_rank: 3
INFO 2021-12-02 18:53:17,135 misc.py: 161: Set start method of multiprocessing to forkserver
INFO 2021-12-02 18:53:17,135 misc.py: 161: Set start method of multiprocessing to forkserver
INFO 2021-12-02 18:53:17,135 train.py: 105: Setting seed....
INFO 2021-12-02 18:53:17,135 train.py: 105: Setting seed....
INFO 2021-12-02 18:53:17,135 misc.py: 173: MACHINE SEED: 1
INFO 2021-12-02 18:53:17,135 misc.py: 173: MACHINE SEED: 3
INFO 2021-12-02 18:53:17,145 train.py: 94: Env set for rank: 2, dist_rank: 2
INFO 2021-12-02 18:53:17,145 misc.py: 161: Set start method of multiprocessing to forkserver
INFO 2021-12-02 18:53:17,145 train.py: 105: Setting seed....
INFO 2021-12-02 18:53:17,145 misc.py: 173: MACHINE SEED: 2 INFO 2021-12-02 18:53:18,565 train.py: 94: Env set for rank: 0, dist_rank: 0 INFO 2021-12-02 18:53:18,566 trainer_main.py: 112: Using Distributed init method: tcp://localhost:46357, world_size: 8, rank: 1 INFO 2021-12-02 18:53:18,566 trainer_main.py: 112: Using Distributed init method: tcp://localhost:46357, world_size: 8, rank: 3 INFO 2021-12-02 18:53:18,566 env.py: 50: BASH_ENV: /opt/lmod/lmod/init/bash INFO 2021-12-02 18:53:18,566 trainer_main.py: 112: Using Distributed init method: tcp://localhost:46357, world_size: 8, rank: 2 INFO 2021-12-02 18:53:18,567 env.py: 50: BASH_FUNC_ml%%: () { eval $($LMOD_DIR/ml_cmd "$@") } INFO 2021-12-02 18:53:18,567 env.py: 50: BASH_FUNC_module%%: () { eval $($LMOD_CMD bash "$@") && eval $(${LMOD_SETTARG_CMD:-:} -s sh) } INFO 2021-12-02 18:53:18,567 env.py: 50: CONDA_DEFAULT_ENV: vissl INFO 2021-12-02 18:53:18,567 env.py: 50: CONDA_EXE: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/bin/conda INFO 2021-12-02 18:53:18,567 env.py: 50: CONDA_PREFIX: /home/bdolicki/.conda/envs/vissl INFO 2021-12-02 18:53:18,567 env.py: 50: CONDA_PREFIX_1: /home/bdolicki/.conda/envs/thesis INFO 2021-12-02 18:53:18,567 env.py: 50: CONDA_PROMPT_MODIFIER: (vissl) INFO 2021-12-02 18:53:18,567 env.py: 50: CONDA_PYTHON_EXE: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/bin/python INFO 2021-12-02 18:53:18,567 env.py: 50: CONDA_SHLVL: 2 INFO 2021-12-02 18:53:18,567 env.py: 50: CUDA_VISIBLE_DEVICES: 0,1,2,3 INFO 2021-12-02 18:53:18,568 env.py: 50: DBUS_SESSION_BUS_ADDRESS: unix:path=/run/user/55916/bus INFO 2021-12-02 18:53:18,568 env.py: 50: EBDEVELANACONDA3: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/easybuild/Anaconda3-2021.05-easybuild-devel INFO 2021-12-02 18:53:18,568 env.py: 50: EBROOTANACONDA3: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05 INFO 2021-12-02 18:53:18,568 env.py: 50: EBVERSIONANACONDA3: 2021.05 INFO 
2021-12-02 18:53:18,568 env.py: 50: ENVIRONMENT: BATCH INFO 2021-12-02 18:53:18,568 env.py: 50: FPATH: /opt/lmod/lmod/init/ksh_funcs INFO 2021-12-02 18:53:18,568 env.py: 50: GPU_DEVICE_ORDINAL: 0,1,2,3 INFO 2021-12-02 18:53:18,568 env.py: 50: HOME: /home/bdolicki INFO 2021-12-02 18:53:18,568 env.py: 50: HOSTNAME: r29n2 INFO 2021-12-02 18:53:18,568 env.py: 50: LANG: en_US INFO 2021-12-02 18:53:18,568 env.py: 50: LMOD_CASE_INDEPENDENT_SORTING: yes INFO 2021-12-02 18:53:18,569 env.py: 50: LMOD_CMD: /opt/lmod/lmod/libexec/lmod INFO 2021-12-02 18:53:18,569 env.py: 50: LMOD_DIR: /opt/lmod/lmod/libexec INFO 2021-12-02 18:53:18,569 env.py: 50: LMOD_EXACT_MATCH: yes INFO 2021-12-02 18:53:18,569 env.py: 50: LMOD_PKG: /opt/lmod/lmod INFO 2021-12-02 18:53:18,569 env.py: 50: LMOD_ROOT: /opt/lmod INFO 2021-12-02 18:53:18,569 env.py: 50: LMOD_SETTARG_FULL_SUPPORT: no INFO 2021-12-02 18:53:18,569 env.py: 50: LMOD_SHORT_TIME: 10000 INFO 2021-12-02 18:53:18,569 env.py: 50: LMOD_VERSION: 8.5.22 INFO 2021-12-02 18:53:18,569 env.py: 50: LMOD_sys: Linux INFO 2021-12-02 18:53:18,569 env.py: 50: LOADEDMODULES: 2021:Anaconda3/2021.05 INFO 2021-12-02 18:53:18,569 env.py: 50: LOCAL_RANK: 0 INFO 2021-12-02 18:53:18,570 env.py: 50: LOGNAME: bdolicki INFO 2021-12-02 18:53:18,570 env.py: 50: MAIL: /var/mail/bdolicki INFO 2021-12-02 18:53:18,570 env.py: 50: MANPATH: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/share/man:/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/man:/opt/lmod/lmod/share/man::/opt/slurm/sw/current/share/man INFO 2021-12-02 18:53:18,570 env.py: 50: MODULEPATH: 
/sw/noarch/modulefiles/environment:/sw/arch/Debian10/EB_production/2021/modulefiles/phys:/sw/arch/Debian10/EB_production/2021/modulefiles/perf:/sw/arch/Debian10/EB_production/2021/modulefiles/geo:/sw/arch/Debian10/EB_production/2021/modulefiles/debugger:/sw/arch/Debian10/EB_production/2021/modulefiles/chem:/sw/arch/Debian10/EB_production/2021/modulefiles/data:/sw/arch/Debian10/EB_production/2021/modulefiles/vis:/sw/arch/Debian10/EB_production/2021/modulefiles/bio:/sw/arch/Debian10/EB_production/2021/modulefiles/math:/sw/arch/Debian10/EB_production/2021/modulefiles/cae:/sw/arch/Debian10/EB_production/2021/modulefiles/system:/sw/arch/Debian10/EB_production/2021/modulefiles/toolchain:/sw/arch/Debian10/EB_production/2021/modulefiles/numlib:/sw/arch/Debian10/EB_production/2021/modulefiles/mpi:/sw/arch/Debian10/EB_production/2021/modulefiles/tools:/sw/arch/Debian10/EB_production/2021/modulefiles/lib:/sw/arch/Debian10/EB_production/2021/modulefiles/compiler:/sw/arch/Debian10/EB_production/2021/modulefiles/lang:/sw/arch/Debian10/EB_production/2021/modulefiles/devel:/sw/noarch/Debian10/2021/modulefiles/all INFO 2021-12-02 18:53:18,570 env.py: 50: MODULEPATH_ROOT: /opt/modulefiles INFO 2021-12-02 18:53:18,570 env.py: 50: MODULESHOME: /opt/lmod/lmod INFO 2021-12-02 18:53:18,570 env.py: 50: NCCL_ASYNC_ERROR_HANDLING: 1 INFO 2021-12-02 18:53:18,570 env.py: 50: NCCL_DEBUG: INFO INFO 2021-12-02 18:53:18,570 env.py: 50: OLDPWD: /home/bdolicki/thesis INFO 2021-12-02 18:53:18,570 env.py: 50: PATH: /home/bdolicki/.conda/envs/vissl/bin:/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05:/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/sbin:/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/bin:/sw/noarch/Debian10/2021/software/os_binary_wrappers:/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/condabin:/usr/bin:/bin:/usr/bin/X11:/usr/games:/usr/sara/bin:/opt/slurm/bin:/opt/slurm/sbin:/opt/slurm/sw/current/bin INFO 2021-12-02 
18:53:18,570 env.py: 50: PKG_CONFIG_PATH: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/lib/pkgconfig INFO 2021-12-02 18:53:18,571 env.py: 50: PWD: /home/bdolicki/thesis/hissl INFO 2021-12-02 18:53:18,571 env.py: 50: RANK: 0 INFO 2021-12-02 18:53:18,571 env.py: 50: ROCR_VISIBLE_DEVICES: 0,1,2,3 INFO 2021-12-02 18:53:18,571 env.py: 50: SHELL: /bin/bash INFO 2021-12-02 18:53:18,571 env.py: 50: SHLVL: 2 INFO 2021-12-02 18:53:18,571 env.py: 50: SLURMD_NODENAME: r29n2 INFO 2021-12-02 18:53:18,571 env.py: 50: SLURM_CLUSTER_NAME: lisa INFO 2021-12-02 18:53:18,571 env.py: 50: SLURM_CONF: /opt/slurm/etc/slurm.conf INFO 2021-12-02 18:53:18,571 env.py: 50: SLURM_CPUS_ON_NODE: 24 INFO 2021-12-02 18:53:18,571 env.py: 50: SLURM_GPUS_PER_NODE: titanrtx:4 INFO 2021-12-02 18:53:18,571 env.py: 50: SLURM_GTIDS: 0 INFO 2021-12-02 18:53:18,572 env.py: 50: SLURM_JOBID: 8462605 INFO 2021-12-02 18:53:18,572 env.py: 50: SLURM_JOB_ACCOUNT: bdolicki INFO 2021-12-02 18:53:18,572 env.py: 50: SLURM_JOB_CPUS_PER_NODE: 24(x2) INFO 2021-12-02 18:53:18,572 env.py: 50: SLURM_JOB_GID: 55479 INFO 2021-12-02 18:53:18,572 env.py: 50: SLURM_JOB_GPUS: 0,1,2,3 INFO 2021-12-02 18:53:18,572 env.py: 50: SLURM_JOB_ID: 8462605 INFO 2021-12-02 18:53:18,572 env.py: 50: SLURM_JOB_NAME: train_nct_dino INFO 2021-12-02 18:53:18,572 env.py: 50: SLURM_JOB_NODELIST: r29n[2,5] INFO 2021-12-02 18:53:18,572 env.py: 50: SLURM_JOB_NUM_NODES: 2 INFO 2021-12-02 18:53:18,572 env.py: 50: SLURM_JOB_PARTITION: gpu_titanrtx_short INFO 2021-12-02 18:53:18,572 env.py: 50: SLURM_JOB_QOS: default INFO 2021-12-02 18:53:18,573 env.py: 50: SLURM_JOB_UID: 55916 INFO 2021-12-02 18:53:18,573 env.py: 50: SLURM_JOB_USER: bdolicki INFO 2021-12-02 18:53:18,573 env.py: 50: SLURM_LOCALID: 0 INFO 2021-12-02 18:53:18,573 env.py: 50: SLURM_NNODES: 2 INFO 2021-12-02 18:53:18,573 env.py: 50: SLURM_NODEID: 0 INFO 2021-12-02 18:53:18,573 env.py: 50: SLURM_NODELIST: r29n[2,5] INFO 2021-12-02 18:53:18,573 env.py: 50: SLURM_NODE_ALIASES: 
(null) INFO 2021-12-02 18:53:18,573 env.py: 50: SLURM_PRIO_PROCESS: 0 INFO 2021-12-02 18:53:18,573 env.py: 50: SLURM_PROCID: 0 INFO 2021-12-02 18:53:18,573 env.py: 50: SLURM_SPANK_SURF_EXCLUSIVE: 0 INFO 2021-12-02 18:53:18,573 env.py: 50: SLURM_SUBMIT_DIR: /home/bdolicki/thesis INFO 2021-12-02 18:53:18,574 env.py: 50: SLURM_SUBMIT_HOST: login3.lisa.surfsara.nl INFO 2021-12-02 18:53:18,574 env.py: 50: SLURM_TASKS_PER_NODE: 24(x2) INFO 2021-12-02 18:53:18,574 env.py: 50: SLURM_TASK_PID: 27583 INFO 2021-12-02 18:53:18,574 env.py: 50: SLURM_TOPOLOGY_ADDR: gigabit..gpu.I09_I10_I15_I16_I17_I19.r29n2 INFO 2021-12-02 18:53:18,574 env.py: 50: SLURM_TOPOLOGY_ADDR_PATTERN: switch.switch.switch.switch.node INFO 2021-12-02 18:53:18,574 env.py: 50: SLURM_WORKING_CLUSTER: lisa:batch4.lisa.surfsara.nl:6817:9216:109 INFO 2021-12-02 18:53:18,574 env.py: 50: SSH_CLIENT: 86.83.160.29 51594 22 INFO 2021-12-02 18:53:18,574 env.py: 50: SSH_CONNECTION: 86.83.160.29 51594 145.101.32.96 22 INFO 2021-12-02 18:53:18,574 env.py: 50: SSH_TTY: /dev/pts/13 INFO 2021-12-02 18:53:18,574 env.py: 50: SURF_EXCLUSIVE: 0 INFO 2021-12-02 18:53:18,574 env.py: 50: TAR: /bin/tar INFO 2021-12-02 18:53:18,575 env.py: 50: TERM: xterm-256color INFO 2021-12-02 18:53:18,575 env.py: 50: TMPDIR: /scratch INFO 2021-12-02 18:53:18,575 env.py: 50: USER: bdolicki INFO 2021-12-02 18:53:18,575 env.py: 50: WORLD_SIZE: 8 INFO 2021-12-02 18:53:18,575 env.py: 50: XALT_EXECUTABLE_TRACKING: yes INFO 2021-12-02 18:53:18,575 env.py: 50: XALT_GPU_TRACKING: yes INFO 2021-12-02 18:53:18,575 env.py: 50: XALT_SAMPLING: yes INFO 2021-12-02 18:53:18,575 env.py: 50: XDG_RUNTIME_DIR: /run/user/55916 INFO 2021-12-02 18:53:18,575 env.py: 50: XDG_SESSION_CLASS: user INFO 2021-12-02 18:53:18,575 env.py: 50: XDG_SESSION_ID: c1889 INFO 2021-12-02 18:53:18,575 env.py: 50: XDG_SESSIONTYPE: tty INFO 2021-12-02 18:53:18,575 env.py: 50: : /home/bdolicki/.conda/envs/vissl/bin/python3 INFO 2021-12-02 18:53:18,576 env.py: 50: _CE_CONDA:
INFO 2021-12-02 18:53:18,576 env.py: 50: _CE_M:
INFO 2021-12-02 18:53:18,576 env.py: 50: LMFILES: /sw/noarch/modulefiles/environment/2021.lua:/sw/arch/Debian10/EB_production/2021/modulefiles/lang/Anaconda3/2021.05.lua INFO 2021-12-02 18:53:18,576 env.py: 50: ModuleTable001: X01vZHVsZVRhYmxlXyA9IHsKTVR2ZXJzaW9uID0gMywKY19yZWJ1aWxkVGltZSA9IGZhbHNlLApjX3Nob3J0VGltZSA9IGZhbHNlLApkZXB0aFQgPSB7fSwKZmFtaWx5ID0ge30sCm1UID0gewpbIjIwMjEiXSA9IHsKZm4gPSAiL3N3L25vYXJjaC9tb2R1bGVmaWxlcy9lbnZpcm9ubWVudC8yMDIxLmx1YSIsCmZ1bGxOYW1lID0gIjIwMjEiLApsb2FkT3JkZXIgPSAxLApwcm9wVCA9IHt9LApzdGFja0RlcHRoID0gMCwKc3RhdHVzID0gImFjdGl2ZSIsCnVzZXJOYW1lID0gIjIwMjEiLAp3ViA9ICJNLip6ZmluYWwiLAp9LApBbmFjb25kYTMgPSB7CmZuID0gIi9zdy9hcmNoL0RlYmlhbjEwL0VCX3Byb2R1Y3Rpb24vMjAyMS9tb2R1bGVmaWxlcy9sYW5nL0FuYWNvbmRhMy8yMDIx INFO 2021-12-02 18:53:18,576 env.py: 50: ModuleTable002: LjA1Lmx1YSIsCmZ1bGxOYW1lID0gIkFuYWNvbmRhMy8yMDIxLjA1IiwKbG9hZE9yZGVyID0gMiwKcHJvcFQgPSB7fSwKc3RhY2tEZXB0aCA9IDAsCnN0YXR1cyA9ICJhY3RpdmUiLAp1c2VyTmFtZSA9ICJBbmFjb25kYTMvMjAyMS4wNSIsCndWID0gIjAwMDAwMjAyMS4wMDAwMDAwMDUuKnpmaW5hbCIsCn0sCn0sCm1wYXRoQSA9IHsKIi9zdy9ub2FyY2gvbW9kdWxlZmlsZXMvZW52aXJvbm1lbnQiCiwgIi9zdy9hcmNoL0RlYmlhbjEwL0VCX3Byb2R1Y3Rpb24vMjAyMS9tb2R1bGVmaWxlcy9waHlzIgosICIvc3cvYXJjaC9EZWJpYW4xMC9FQl9wcm9kdWN0aW9uLzIwMjEvbW9kdWxlZmlsZXMvcGVyZiIKLCAiL3N3L2FyY2gvRGViaWFuMTAvRUJfcHJvZHVjdGlvbi8yMDIxL21v INFO 2021-12-02 18:53:18,576 env.py: 50: ModuleTable003: ZHVsZWZpbGVzL2dlbyIKLCAiL3N3L2FyY2gvRGViaWFuMTAvRUJfcHJvZHVjdGlvbi8yMDIxL21vZHVsZWZpbGVzL2RlYnVnZ2VyIgosICIvc3cvYXJjaC9EZWJpYW4xMC9FQl9wcm9kdWN0aW9uLzIwMjEvbW9kdWxlZmlsZXMvY2hlbSIKLCAiL3N3L2FyY2gvRGViaWFuMTAvRUJfcHJvZHVjdGlvbi8yMDIxL21vZHVsZWZpbGVzL2RhdGEiCiwgIi9zdy9hcmNoL0RlYmlhbjEwL0VCX3Byb2R1Y3Rpb24vMjAyMS9tb2R1bGVmaWxlcy92aXMiCiwgIi9zdy9hcmNoL0RlYmlhbjEwL0VCX3Byb2R1Y3Rpb24vMjAyMS9tb2R1bGVmaWxlcy9iaW8iCiwgIi9zdy9hcmNoL0RlYmlhbjEwL0VCX3Byb2R1Y3Rpb24vMjAyMS9tb2R1bGVmaWxlcy9tYXRoIgosICIvc3cvYXJjaC9EZWJpYW4x INFO 2021-12-02 18:53:18,576 env.py: 50: ModuleTable004: 
MC9FQl9wcm9kdWN0aW9uLzIwMjEvbW9kdWxlZmlsZXMvY2FlIgosICIvc3cvYXJjaC9EZWJpYW4xMC9FQl9wcm9kdWN0aW9uLzIwMjEvbW9kdWxlZmlsZXMvc3lzdGVtIgosICIvc3cvYXJjaC9EZWJpYW4xMC9FQl9wcm9kdWN0aW9uLzIwMjEvbW9kdWxlZmlsZXMvdG9vbGNoYWluIgosICIvc3cvYXJjaC9EZWJpYW4xMC9FQl9wcm9kdWN0aW9uLzIwMjEvbW9kdWxlZmlsZXMvbnVtbGliIgosICIvc3cvYXJjaC9EZWJpYW4xMC9FQl9wcm9kdWN0aW9uLzIwMjEvbW9kdWxlZmlsZXMvbXBpIgosICIvc3cvYXJjaC9EZWJpYW4xMC9FQl9wcm9kdWN0aW9uLzIwMjEvbW9kdWxlZmlsZXMvdG9vbHMiCiwgIi9zdy9hcmNoL0RlYmlhbjEwL0VCX3Byb2R1Y3Rpb24vMjAyMS9tb2R1bGVm INFO 2021-12-02 18:53:18,576 env.py: 50: ModuleTable005: aWxlcy9saWIiCiwgIi9zdy9hcmNoL0RlYmlhbjEwL0VCX3Byb2R1Y3Rpb24vMjAyMS9tb2R1bGVmaWxlcy9jb21waWxlciIKLCAiL3N3L2FyY2gvRGViaWFuMTAvRUJfcHJvZHVjdGlvbi8yMDIxL21vZHVsZWZpbGVzL2xhbmciCiwgIi9zdy9hcmNoL0RlYmlhbjEwL0VCX3Byb2R1Y3Rpb24vMjAyMS9tb2R1bGVmaWxlcy9kZXZlbCIsICIvc3cvbm9hcmNoL0RlYmlhbjEwLzIwMjEvbW9kdWxlZmlsZXMvYWxsIiwKfSwKc3lzdGVtQmFzZU1QQVRIID0gIi9zdy9ub2FyY2gvbW9kdWxlZmlsZXMvZW52aXJvbm1lbnQiLAp9Cg== INFO 2021-12-02 18:53:18,576 env.py: 50: _ModuleTableSz: 5 INFO 2021-12-02 18:53:18,576 env.py: 50: LMOD_REF_COUNT_LOADEDMODULES: 2021:1;Anaconda3/2021.05:1 INFO 2021-12-02 18:53:18,576 env.py: 50: __LMOD_REF_COUNT_MANPATH: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/share/man:1;/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/man:1;/opt/lmod/lmod/share/man:1;/opt/slurm/sw/current/share/man:1 INFO 2021-12-02 18:53:18,577 env.py: 50: LMOD_REF_COUNT_MODULEPATH: 
/sw/noarch/modulefiles/environment:1;/sw/arch/Debian10/EB_production/2021/modulefiles/phys:1;/sw/arch/Debian10/EB_production/2021/modulefiles/perf:1;/sw/arch/Debian10/EB_production/2021/modulefiles/geo:1;/sw/arch/Debian10/EB_production/2021/modulefiles/debugger:1;/sw/arch/Debian10/EB_production/2021/modulefiles/chem:1;/sw/arch/Debian10/EB_production/2021/modulefiles/data:1;/sw/arch/Debian10/EB_production/2021/modulefiles/vis:1;/sw/arch/Debian10/EB_production/2021/modulefiles/bio:1;/sw/arch/Debian10/EB_production/2021/modulefiles/math:1;/sw/arch/Debian10/EB_production/2021/modulefiles/cae:1;/sw/arch/Debian10/EB_production/2021/modulefiles/system:1;/sw/arch/Debian10/EB_production/2021/modulefiles/toolchain:1;/sw/arch/Debian10/EB_production/2021/modulefiles/numlib:1;/sw/arch/Debian10/EB_production/2021/modulefiles/mpi:1;/sw/arch/Debian10/EB_production/2021/modulefiles/tools:1;/sw/arch/Debian10/EB_production/2021/modulefiles/lib:1;/sw/arch/Debian10/EB_production/2021/modulefiles/compiler:1;/sw/arch/Debian10/EB_production/2021/modulefiles/lang:1;/sw/arch/Debian10/EB_production/2021/modulefiles/devel:1;/sw/noarch/Debian10/2021/modulefiles/all:1 INFO 2021-12-02 18:53:18,577 env.py: 50: __LMOD_REF_COUNT_PATH: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05:1;/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/sbin:1;/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/bin:1;/sw/noarch/Debian10/2021/software/os_binary_wrappers:1;/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/condabin:1;/usr/bin:1;/bin:1;/usr/bin/X11:1;/usr/games:1;/usr/sara/bin:1;/opt/slurm/bin:1;/opt/slurm/sbin:1;/opt/slurm/sw/current/bin:1 INFO 2021-12-02 18:53:18,577 env.py: 50: LMOD_REF_COUNT_PKG_CONFIG_PATH: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/lib/pkgconfig:1 INFO 2021-12-02 18:53:18,577 env.py: 50: LMOD_REF_COUNTLMFILES_: 
/sw/noarch/modulefiles/environment/2021.lua:1;/sw/arch/Debian10/EB_production/2021/modulefiles/lang/Anaconda3/2021.05.lua:1 INFO 2021-12-02 18:53:18,577 env.py: 50: LMOD_SET_FPATH: 1 INFO 2021-12-02 18:53:18,577 misc.py: 161: Set start method of multiprocessing to forkserver INFO 2021-12-02 18:53:18,577 train.py: 105: Setting seed.... INFO 2021-12-02 18:53:18,577 misc.py: 173: MACHINE SEED: 0 INFO 2021-12-02 18:53:18,633 hydra_config.py: 132: Training with config: INFO 2021-12-02 18:53:18,639 hydra_config.py: 141: {'CHECKPOINT': {'APPEND_DISTR_RUN_ID': False, 'AUTO_RESUME': True, 'BACKEND': 'disk', 'CHECKPOINT_FREQUENCY': 1, 'CHECKPOINT_ITER_FREQUENCY': -1, 'DIR': '/home/bdolicki/thesis//hissl-logs/train_nct_dino/checkpoints/8462605', 'LATEST_CHECKPOINT_RESUME_FILE_NUM': 1, 'OVERWRITE_EXISTING': True, 'USE_SYMLINK_CHECKPOINT_FOR_RESUME': False}, 'CLUSTERFIT': {'CLUSTER_BACKEND': 'faiss', 'DATA_LIMIT': -1, 'DATA_LIMIT_SAMPLING': {'SEED': 0}, 'FEATURES': {'DATASET_NAME': '', 'DATA_PARTITION': 'TRAIN', 'DIMENSIONALITY_REDUCTION': 0, 'EXTRACT': False, 'LAYER_NAME': '', 'PATH': '.', 'TEST_PARTITION': 'TEST'}, 'NUM_CLUSTERS': 16000, 'NUM_ITER': 50, 'OUTPUT_DIR': '.'}, 'DATA': {'DDP_BUCKET_CAP_MB': 25, 'ENABLE_ASYNC_GPU_COPY': True, 'NUM_DATALOADER_WORKERS': 5, 'PIN_MEMORY': True, 'TEST': {'BASE_DATASET': 'generic_ssl', 'BATCHSIZE_PER_REPLICA': 256, 'COLLATE_FUNCTION': 'default_collate', 'COLLATE_FUNCTION_PARAMS': {}, 'COPY_DESTINATION_DIR': '', 'COPY_TO_LOCAL_DISK': False, 'DATASET_NAMES': ['imagenet1k_folder'], 'DATA_LIMIT': -1, 'DATA_LIMIT_SAMPLING': {'IS_BALANCED': False, 'SEED': 0, 'SKIP_NUM_SAMPLES': 0}, 'DATA_PATHS': [], 'DATA_SOURCES': [], 'DEFAULT_GRAY_IMG_SIZE': 224, 'DROP_LAST': False, 'ENABLE_QUEUE_DATASET': False, 'INPUT_KEY_NAMES': ['data'], 'LABEL_PATHS': [], 'LABEL_SOURCES': [], 'LABEL_TYPE': 'sample_index', 'MMAP_MODE': True, 'NEW_IMG_PATH_PREFIX': '', 'RANDOM_SYNTHETIC_IMAGES': False, 'REMOVE_IMG_PATH_PREFIX': '', 'TARGET_KEY_NAMES': ['label'], 
'TRANSFORMS': [], 'USE_DEBUGGING_SAMPLER': False, 'USE_STATEFUL_DISTRIBUTED_SAMPLER': False}, 'TRAIN': {'BASE_DATASET': 'generic_ssl', 'BATCHSIZE_PER_REPLICA': 10, 'COLLATE_FUNCTION': 'simclr_collator', 'COLLATE_FUNCTION_PARAMS': {}, 'COPY_DESTINATION_DIR': '/tmp/imagenet1k', 'COPY_TO_LOCAL_DISK': False, 'DATASET_NAMES': ['dummy_data_folder'], 'DATA_LIMIT': 1000, 'DATA_LIMIT_SAMPLING': {'IS_BALANCED': False, 'SEED': 0, 'SKIP_NUM_SAMPLES': 0}, 'DATA_PATHS': [], 'DATA_SOURCES': ['synthetic'], 'DEFAULT_GRAY_IMG_SIZE': 224, 'DROP_LAST': True, 'ENABLE_QUEUE_DATASET': False, 'INPUT_KEY_NAMES': ['data'], 'LABEL_PATHS': [], 'LABEL_SOURCES': [], 'LABEL_TYPE': 'sample_index', 'MMAP_MODE': True, 'NEW_IMG_PATH_PREFIX': '', 'RANDOM_SYNTHETIC_IMAGES': False, 'REMOVE_IMG_PATH_PREFIX': '', 'TARGET_KEY_NAMES': ['label'], 'TRANSFORMS': [{'name': 'ImgReplicatePil', 'num_times': 2}, {'name': 'RandomResizedCrop', 'size': 224}, {'name': 'RandomHorizontalFlip', 'p': 0.5}, {'name': 'ImgPilColorDistortion', 'strength': 1.0}, {'name': 'ImgPilGaussianBlur', 'p': 0.5, 'radius_max': 2.0, 'radius_min': 0.1}, {'name': 'ToTensor'}, {'mean': [0.485, 0.456, 0.406], 'name': 'Normalize', 'std': [0.229, 0.224, 0.225]}], 'USE_DEBUGGING_SAMPLER': False, 'USE_STATEFUL_DISTRIBUTED_SAMPLER': False}}, 'DISTRIBUTED': {'BACKEND': 'nccl', 'BROADCAST_BUFFERS': True, 'INIT_METHOD': 'tcp', 'MANUAL_GRADIENT_REDUCTION': False, 'NCCL_DEBUG': False, 'NCCL_SOCKET_NTHREADS': '', 'NUM_NODES': 2, 'NUM_PROC_PER_NODE': 4, 'RUN_ID': 'localhost:46357'}, 'EXTRACT_FEATURES': {'CHUNK_THRESHOLD': 0, 'OUTPUT_DIR': ''}, 'HOOKS': {'CHECK_NAN': True, 'LOG_GPU_STATS': True, 'MEMORY_SUMMARY': {'DUMP_MEMORY_ON_EXCEPTION': False, 'LOG_ITERATION_NUM': 0, 'PRINT_MEMORY_SUMMARY': True}, 'MODEL_COMPLEXITY': {'COMPUTE_COMPLEXITY': False, 'INPUT_SHAPE': [3, 224, 224]}, 'PERF_STATS': {'MONITOR_PERF_STATS': False, 'PERF_STAT_FREQUENCY': -1, 'ROLLING_BTIME_FREQ': -1}, 'TENSORBOARD_SETUP': {'EXPERIMENT_LOG_DIR': 'tensorboard', 
'FLUSH_EVERY_N_MIN': 5, 'LOG_DIR': '.', 'LOG_PARAMS': True, 'LOG_PARAMS_EVERY_N_ITERS': 310, 'LOG_PARAMS_GRADIENTS': True, 'USE_TENSORBOARD': False}}, 'IMG_RETRIEVAL': {'CROP_QUERY_ROI': False, 'DATASET_PATH': '', 'DEBUG_MODE': False, 'EVAL_BINARY_PATH': '', 'EVAL_DATASET_NAME': 'Paris', 'FEATS_PROCESSING_TYPE': '', 'GEM_POOL_POWER': 4.0, 'IMG_SCALINGS': [1], 'NORMALIZE_FEATURES': True, 'NUM_DATABASE_SAMPLES': -1, 'NUM_QUERY_SAMPLES': -1, 'NUM_TRAINING_SAMPLES': -1, 'N_PCA': 512, 'RESIZE_IMG': 1024, 'SAVE_FEATURES': False, 'SAVE_RETRIEVAL_RANKINGS_SCORES': True, 'SIMILARITY_MEASURE': 'cosine_similarity', 'SPATIAL_LEVELS': 3, 'TRAIN_DATASET_NAME': 'Oxford', 'TRAIN_PCA_WHITENING': True, 'USE_DISTRACTORS': False, 'WHITEN_IMG_LIST': ''}, 'LOG_FREQUENCY': 1, 'LOSS': {'CrossEntropyLoss': {'ignore_index': -1}, 'barlow_twins_loss': {'embeddingdim': 8192, 'lambda': 0.0051, 'scale_loss': 0.024}, 'bce_logits_multiple_output_single_target': {'normalize_output': False, 'reduction': 'none', 'world_size': 1}, 'cross_entropy_multiple_output_single_target': {'ignore_index': -1, 'normalize_output': False, 'reduction': 'mean', 'temperature': 1.0, 'weight': None}, 'deepclusterv2_loss': {'BATCHSIZE_PER_REPLICA': 256, 'DROP_LAST': True, 'kmeans_iters': 10, 'memory_params': {'crops_for_mb': [0], 'embedding_dim': 128}, 'num_clusters': [3000, 3000, 3000], 'num_crops': 2, 'num_train_samples': -1, 'temperature': 0.1}, 'dino_loss': {'crops_for_teacher': [0, 1], 'ema_center': 0.9, 'momentum': 0.996, 'normalize_last_layer': True, 'output_dim': 65536, 'student_temp': 0.1, 'teacher_temp_max': 0.07, 'teacher_temp_min': 0.04, 'teacher_temp_warmup_iters': 37500}, 'moco_loss': {'embedding_dim': 128, 'momentum': 0.999, 'queue_size': 65536, 'temperature': 0.2}, 'multicrop_simclr_info_nce_loss': {'buffer_params': {'effective_batch_size': 4096, 'embedding_dim': 128, 'world_size': 64}, 'num_crops': 2, 'temperature': 0.1}, 'name': 'simclr_info_nce_loss', 'nce_loss_with_memory': {'loss_type': 'nce', 
'loss_weights': [1.0], 'memory_params': {'embedding_dim': 128, 'memory_size': -1, 'momentum': 0.5, 'norm_init': True, 'update_mem_on_forward': True}, 'negative_sampling_params': {'num_negatives': 16000, 'type': 'random'}, 'norm_constant': -1, 'norm_embedding': True, 'num_train_samples': -1, 'temperature': 0.07, 'update_mem_with_emb_index': -100}, 'simclr_info_nce_loss': {'buffer_params': {'effective_batch_size': 160, 'embedding_dim': 128, 'world_size': 8}, 'temperature': 0.1}, 'swav_loss': {'crops_for_assign': [0, 1], 'embedding_dim': 128, 'epsilon': 0.05, 'normalize_last_layer': True, 'num_crops': 2, 'num_iters': 3, 'num_prototypes': [3000], 'output_dir': '.', 'queue': {'local_queue_length': 0, 'queue_length': 0, 'start_iter': 0}, 'temp_hard_assignment_iters': 0, 'temperature': 0.1, 'use_double_precision': False}, 'swav_momentum_loss': {'crops_for_assign': [0, 1], 'embedding_dim': 128, 'epsilon': 0.05, 'momentum': 0.99, 'momentum_eval_mode_iter_start': 0, 'normalize_last_layer': True, 'num_crops': 2, 'num_iters': 3, 'num_prototypes': [3000], 'queue': {'local_queue_length': 0, 'queue_length': 0, 'start_iter': 0}, 'temperature': 0.1, 'use_double_precision': False}}, 'MACHINE': {'DEVICE': 'gpu'}, 'METERS': {'accuracy_list_meter': {'meter_names': [], 'num_meters': 1, 'topk_values': [1]}, 'enable_training_meter': True, 'mean_ap_list_meter': {'max_cpu_capacity': -1, 'meter_names': [], 'num_classes': 9605, 'num_meters': 1}, 'model_output_mask': False, 'name': '', 'names': [], 'precision_at_k_list_meter': {'meter_names': [], 'num_meters': 1, 'topk_values': [1]}, 'recall_at_k_list_meter': {'meter_names': [], 'num_meters': 1, 'topk_values': [1]}}, 'MODEL': {'ACTIVATION_CHECKPOINTING': {'NUM_ACTIVATION_CHECKPOINTING_SPLITS': 2, 'USE_ACTIVATION_CHECKPOINTING': False}, 'AMP_PARAMS': {'AMP_ARGS': {'keep_batchnorm_fp32': True, 'loss_scale': 'dynamic', 'master_weights': True, 'opt_level': 'O3'}, 'AMP_TYPE': 'apex', 'USE_AMP': False}, 'BASE_MODEL_NAME': 'multi_input_output_model', 
'CUDA_CACHE': {'CLEAR_CUDA_CACHE': False, 'CLEAR_FREQ': 100}, 'FEATURE_EVAL_SETTINGS': {'EVAL_MODE_ON': False, 'EVAL_TRUNK_AND_HEAD': False, 'EXTRACT_TRUNK_FEATURES_ONLY': False, 'FREEZE_TRUNK_AND_HEAD': False, 'FREEZE_TRUNK_ONLY': False, 'LINEAR_EVAL_FEAT_POOL_OPS_MAP': [], 'SHOULD_FLATTEN_FEATS': True}, 'FSDP_CONFIG': {'AUTO_WRAP_THRESHOLD': 0, 'bucket_cap_mb': 0, 'clear_autocast_cache': True, 'compute_dtype': torch.float32, 'flatten_parameters': True, 'fp32_reduce_scatter': False, 'mixed_precision': True, 'verbose': True}, 'GRAD_CLIP': {'MAX_NORM': 1, 'NORM_TYPE': 2, 'USE_GRAD_CLIP': False}, 'HEAD': {'BATCHNORM_EPS': 1e-05, 'BATCHNORM_MOMENTUM': 0.1, 'PARAMS': [['mlp', {'dims': [2048, 2048], 'use_relu': True}], ['mlp', {'dims': [2048, 128]}]], 'PARAMS_MULTIPLIER': 1.0}, 'INPUT_TYPE': 'rgb', 'MULTI_INPUT_HEAD_MAPPING': [], 'NON_TRAINABLE_PARAMS': [], 'SHARDED_DDP_SETUP': {'USE_SDP': False, 'reduce_buffer_size': -1}, 'SINGLE_PASS_EVERY_CROP': False, 'SYNC_BN_CONFIG': {'CONVERT_BN_TO_SYNC_BN': True, 'GROUP_SIZE': -1, 'SYNC_BN_TYPE': 'pytorch'}, 'TEMP_FROZEN_PARAMS_ITER_MAP': [], 'TRUNK': {'CONVIT': {'CLASS_TOKEN_IN_LOCAL_LAYERS': False, 'LOCALITY_DIM': 10, 'LOCALITY_STRENGTH': 1.0, 'N_GPSA_LAYERS': 10, 'USE_LOCAL_INIT': True}, 'EFFICIENT_NETS': {}, 'NAME': 'resnet', 'REGNET': {}, 'RESNETS': {'DEPTH': 50, 'GROUPNORM_GROUPS': 32, 'GROUPS': 1, 'LAYER4_STRIDE': 2, 'NORM': 'BatchNorm', 'STANDARDIZE_CONVOLUTIONS': False, 'WIDTH_MULTIPLIER': 1, 'WIDTH_PER_GROUP': 64, 'ZERO_INIT_RESIDUAL': False}, 'VISION_TRANSFORMERS': {'ATTENTION_DROPOUT_RATE': 0, 'CLASSIFIER': 'token', 'DROPOUT_RATE': 0, 'DROP_PATH_RATE': 0, 'HIDDEN_DIM': 768, 'IMAGE_SIZE': 224, 'MLP_DIM': 3072, 'NUM_HEADS': 12, 'NUM_LAYERS': 12, 'PATCH_SIZE': 16, 'QKV_BIAS': False, 'QK_SCALE': False, 'name': None}, 'XCIT': {'ATTENTION_DROPOUT_RATE': 0, 'DROPOUT_RATE': 0, 'DROP_PATH_RATE': 0.05, 'ETA': 1, 'HIDDEN_DIM': 384, 'IMAGE_SIZE': 224, 'NUM_HEADS': 8, 'NUM_LAYERS': 12, 'PATCH_SIZE': 16, 'QKV_BIAS': True, 
'QK_SCALE': False, 'TOKENS_NORM': True, 'name': None}}, 'WEIGHTS_INIT': {'APPEND_PREFIX': '', 'PARAMS_FILE': '', 'REMOVE_PREFIX': '', 'SKIP_LAYERS': ['num_batches_tracked'], 'STATE_DICT_KEY_NAME': 'classy_state_dict'}, '_MODEL_INIT_SEED': 0}, 'MONITORING': {'MONITOR_ACTIVATION_STATISTICS': 0}, 'MONITOR_PERF_STATS': True, 'MULTI_PROCESSING_METHOD': 'forkserver', 'NEAREST_NEIGHBOR': {'L2_NORM_FEATS': False, 'SIGMA': 0.1, 'TOPK': 200}, 'OPTIMIZER': {'betas': [0.9, 0.999], 'construct_single_param_group_only': False, 'head_optimizer_params': {'use_different_lr': False, 'use_different_wd': False, 'weight_decay': 1e-06}, 'larc_config': {'clip': False, 'eps': 1e-08, 'trust_coefficient': 0.001}, 'momentum': 0.9, 'name': 'sgd', 'nesterov': False, 'non_regularized_parameters': [], 'num_epochs': 1, 'param_schedulers': {'lr': {'auto_lr_scaling': {'auto_scale': False, 'base_lr_batch_size': 256, 'base_value': 0.3, 'scaling_type': 'linear'}, 'end_value': 0.0, 'interval_scaling': ['rescaled', 'rescaled'], 'lengths': [0.1, 0.9], 'milestones': [30, 60], 'name': 'composite', 'schedulers': [{'end_value': 4.8, 'name': 'linear', 'start_value': 0.6}, {'end_value': 0.0048, 'is_adaptive': True, 'name': 'cosine_warm_restart', 'restart_interval_length': 0.334, 'start_value': 4.8, 'wave_type': 'full'}], 'start_value': 0.1, 'update_interval': 'step', 'value': 0.1, 'values': [0.1, 0.01, 0.001]}, 'lr_head': {'auto_lr_scaling': {'auto_scale': False, 'base_lr_batch_size': 256, 'base_value': 0.3, 'scaling_type': 'linear'}, 'end_value': 0.0, 'interval_scaling': ['rescaled', 'rescaled'], 'lengths': [0.1, 0.9], 'milestones': [30, 60], 'name': 'composite', 'schedulers': [{'end_value': 4.8, 'name': 'linear', 'start_value': 0.6}, {'end_value': 0.0048, 'is_adaptive': True, 'name': 'cosine_warm_restart', 'restart_interval_length': 0.334, 'start_value': 4.8, 'wave_type': 'full'}], 'start_value': 0.1, 'update_interval': 'step', 'value': 0.1, 'values': [0.1, 0.01, 0.001]}}, 'regularize_bias': True, 
'regularize_bn': False, 'use_larc': True, 'use_zero': False, 'weight_decay': 1e-06}, 'PERF_STAT_FREQUENCY': 10, 'PROFILING': {'MEMORY_PROFILING': {'TRACK_BY_LAYER_MEMORY': False}, 'NUM_ITERATIONS': 10, 'OUTPUT_FOLDER': '.', 'PROFILED_RANKS': [0, 1], 'RUNTIME_PROFILING': {'LEGACY_PROFILER': False, 'PROFILE_CPU': True, 'PROFILE_GPU': True, 'USE_PROFILER': False}, 'START_ITERATION': 0, 'STOP_TRAINING_AFTER_PROFILING': False, 'WARMUP_ITERATIONS': 0}, 'REPRODUCIBILITY': {'CUDDN_DETERMINISTIC': False}, 'ROLLING_BTIME_FREQ': 5, 'SEED_VALUE': 0, 'SLURM': {'ADDITIONAL_PARAMETERS': {}, 'COMMENT': 'vissl job', 'CONSTRAINT': '', 'LOG_FOLDER': '.', 'MEM_GB': 250, 'NAME': 'vissl', 'NUM_CPU_PER_PROC': 8, 'PARTITION': '', 'PORT_ID': 40050, 'TIME_HOURS': 72, 'TIME_MINUTES': 0, 'USE_SLURM': False}, 'SVM': {'cls_list': [], 'costs': {'base': -1.0, 'costs_list': [0.1, 0.01], 'power_range': [4, 20]}, 'cross_val_folds': 3, 'dual': True, 'force_retrain': False, 'loss': 'squared_hinge', 'low_shot': {'dataset_name': 'voc', 'k_values': [1, 2, 4, 8, 16, 32, 64, 96], 'sample_inds': [1, 2, 3, 4, 5]}, 'max_iter': 2000, 'normalize': True, 'penalty': 'l2'}, 'TEST_EVERY_NUM_EPOCH': 1, 'TEST_MODEL': False, 'TEST_ONLY': False, 'TRAINER': {'TASK_NAME': 'self_supervision_task', 'TRAIN_STEP_NAME': 'standard_train_step'}, 'VERBOSE': False} INFO 2021-12-02 18:53:19,000 train.py: 117: System config:


sys.platform linux Python 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0] numpy 1.21.2 Pillow 8.4.0 vissl 0.1.6 @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/vissl GPU available True GPU 0,1,2,3 TITAN RTX CUDA_HOME None torchvision 0.8.0a0 @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/torchvision hydra 1.1.1 @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/hydra classy_vision 0.7.0.dev @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/classy_vision apex 0.1 @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/apex PyTorch 1.7.1 @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/torch PyTorch debug build False


PyTorch built with:

CPU info:


Architecture x86_64 CPU op-mode(s) 32-bit, 64-bit Byte Order Little Endian Address sizes 46 bits physical, 48 bits virtual CPU(s) 24 On-line CPU(s) list 0-23 Thread(s) per core 1 Core(s) per socket 12 Socket(s) 2 NUMA node(s) 4 Vendor ID GenuineIntel CPU family 6 Model 85 Model name Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz Stepping 4 CPU MHz 2957.695 BogoMIPS 4600.00 Virtualization VT-x L1d cache 32K L1i cache 32K L2 cache 1024K L3 cache 16896K NUMA node0 CPU(s) 0,4,8,12,16,20 NUMA node1 CPU(s) 1,5,9,13,17,21 NUMA node2 CPU(s) 2,6,10,14,18,22 NUMA node3 CPU(s) 3,7,11,15,19,23


INFO 2021-12-02 18:53:19,001 trainer_main.py: 112: Using Distributed init method: tcp://localhost:46357, world_size: 8, rank: 0
r29n2:27632:27632 [0] NCCL INFO Bootstrap : Using [0]admin0:145.101.32.23<0>
r29n2:27632:27632 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
r29n2:27632:27632 [0] NCCL INFO NET/IB : No device found.
r29n2:27632:27632 [0] NCCL INFO NET/Socket : Using [0]admin0:145.101.32.23<0>
r29n2:27632:27632 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.2
r29n2:27634:27634 [2] NCCL INFO Bootstrap : Using [0]admin0:145.101.32.23<0>
r29n2:27634:27634 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
r29n2:27633:27633 [1] NCCL INFO Bootstrap : Using [0]admin0:145.101.32.23<0>
r29n2:27633:27633 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
r29n2:27635:27635 [3] NCCL INFO Bootstrap : Using [0]admin0:145.101.32.23<0>
r29n2:27635:27635 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
r29n2:27634:27634 [2] NCCL INFO NET/IB : No device found.
r29n2:27634:27634 [2] NCCL INFO NET/Socket : Using [0]admin0:145.101.32.23<0>
r29n2:27634:27634 [2] NCCL INFO Using network Socket
r29n2:27633:27633 [1] NCCL INFO NET/IB : No device found.
r29n2:27635:27635 [3] NCCL INFO NET/IB : No device found.
r29n2:27635:27635 [3] NCCL INFO NET/Socket : Using [0]admin0:145.101.32.23<0>
r29n2:27635:27635 [3] NCCL INFO Using network Socket
r29n2:27633:27633 [1] NCCL INFO NET/Socket : Using [0]admin0:145.101.32.23<0>
r29n2:27633:27633 [1] NCCL INFO Using network Socket
slurmstepd: error: JOB 8462605 ON r29n2 CANCELLED AT 2021-12-02T19:53:21 DUE TO TIME LIMIT

4. please simplify the steps as much as possible so they do not require additional resources to
   run, such as a private dataset.
Done - I'm using a synthetic dataset, and the config and script are copied above.

## Expected behavior:

If there are no obvious errors in "what you observed" provided above,
please tell us the expected behavior.

The multi-node script should finish in less than 1 minute; instead, it hangs until the job hits its time limit.
## Environment:

Provide your environment information using the following command:

wget -nc -q https://github.com/facebookresearch/vissl/raw/main/vissl/utils/collect_env.py && python collect_env.py


sys.platform linux Python 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0] numpy 1.21.2 Pillow 8.4.0 vissl 0.1.6 @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/vissl GPU available True GPU 0,1,2,3 TITAN RTX CUDA_HOME None torchvision 0.8.0a0 @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/torchvision hydra 1.1.1 @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/hydra classy_vision 0.7.0.dev @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/classy_vision apex 0.1 @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/apex PyTorch 1.7.1 @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/torch PyTorch debug build False


PyTorch built with:

CPU info:


Architecture x86_64 CPU op-mode(s) 32-bit, 64-bit Byte Order Little Endian Address sizes 46 bits physical, 48 bits virtual CPU(s) 24 On-line CPU(s) list 0-23 Thread(s) per core 1 Core(s) per socket 12 Socket(s) 2 NUMA node(s) 4 Vendor ID GenuineIntel CPU family 6 Model 85 Model name Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz Stepping 4 CPU MHz 1000.080 BogoMIPS 4600.00 Virtualization VT-x L1d cache 32K L1i cache 32K L2 cache 1024K L3 cache 16896K NUMA node0 CPU(s) 0,4,8,12,16,20 NUMA node1 CPU(s) 1,5,9,13,17,21 NUMA node2 CPU(s) 2,6,10,14,18,22 NUMA node3 CPU(s) 3,7,11,15,19,23



## When to expect Triage

VISSL devs and contributors aim to triage issues as soon as possible. However, as a general guideline, we ask users to expect triaging within 1-2 weeks.
iseessel commented 2 years ago

Hi @blazejdolicki, glad to hear you're using VISSL with SLURM, and thank you for all the logs/code.

I would recommend using our slurm launch script if possible. See our docs on it here.

More specifically, I think your issue is due to not specifying the node_id for each node. With SLURM you can either:

  1. Use the SLURM launch script (recommended). It uses submitit, which should take care of node_id automatically.
  2. Set config.SLURM.USE_SLURM=true, as the SLURM launch script does. This will also set node_id automatically. See here for how this is done.
  3. Set node_id manually, as in the past issue here.
    python3 tools/run_distributed_engines.py \
    config.node_id=$NODE_ID_ENV_VARIABLE \
    ...
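
A minimal sketch of option 3, assuming the training command is started once per node (for example through `srun` with one task per node), so that SLURM exports `SLURM_NODEID` to each copy. The helper name `build_overrides`, the `master_host` argument and the port are illustrative only. The `config.node_id` override is copied from the snippet above and the `config.DISTRIBUTED.*` keys from the original job script. Using the first node's hostname instead of `localhost` for `RUN_ID` is an assumption: with the `tcp` init method, every node needs an address it can actually reach.

```python
import os


def build_overrides(master_host, port, procs_per_node, env=None):
    """Build per-node Hydra overrides from SLURM's environment (illustrative helper)."""
    env = os.environ if env is None else env
    # SLURM_NODEID is the relative index (0..N-1) of the node this task runs on.
    node_id = int(env["SLURM_NODEID"])
    # SLURM_JOB_NUM_NODES is the number of nodes allocated to the job.
    num_nodes = int(env["SLURM_JOB_NUM_NODES"])
    return [
        f"config.node_id={node_id}",  # override name as written in this thread
        f"config.DISTRIBUTED.NUM_NODES={num_nodes}",
        f"config.DISTRIBUTED.NUM_PROC_PER_NODE={procs_per_node}",
        # Assumption: the rendezvous address must be reachable from every node.
        f"config.DISTRIBUTED.RUN_ID={master_host}:{port}",
    ]


if __name__ == "__main__":
    # Simulate what the second of two nodes would see.
    fake_env = {"SLURM_NODEID": "1", "SLURM_JOB_NUM_NODES": "2"}
    print(" ".join(build_overrides("r29n2", 46357, 4, env=fake_env)))
```

The resulting strings can be appended to the `python3 tools/run_distributed_engines.py` command line on each node; only `config.node_id` differs between nodes.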
blazejdolicki commented 2 years ago

Thank you for your swift response. Below I've summarized my developments and thought process. I haven't used multi-node setups before so my apologies if the root of the problem is some basic mistake in my understanding of the setup.

Setting config.SLURM.USE_SLURM=true

I managed to get the above simple example working with the second method you proposed i.e. running a shell script in the terminal of my cluster that includes python run_distributed_engines.py <config args> config.SLURM.USE_SLURM=true which submits the job with submitit. Unfortunately, ultimately for my experiments I need to use a singularity container inside SLURM and that still doesn't work. Here's what I think is happening:

Before using SLURM.USE_SLURM=true, I would write a .job file and submit it to the job scheduler with sbatch from the cluster's terminal. The .job file would then be executed on the GPU node, where it would start the container and run python run_distributed_engines.py <config>. So in that workflow I first submit the job, and python run_distributed_engines.py is then executed inside the GPU node. For the multi-node setting, however, this order is reversed: I first execute python run_distributed_engines.py <config args> config.SLURM.USE_SLURM=true, and that script then schedules a job.

Inside run_distributed_engines.py, submitit uses AutoExecutor. This class automatically detects whether the job is run on the cluster or locally; in the latter case, LocalExecutor is used. That explains the error I'm getting when using the 2nd method with a container:

Traceback (most recent call last):
  File "tools/run_distributed_engines_hissl.py", line 73, in <module>
    hydra_main(overrides=overrides)
  File "tools/run_distributed_engines_hissl.py", line 42, in hydra_main
    launch_distributed_on_slurm(engine_name=args.engine_name, cfg=config)
  File "/hissl/third_party/vissl/vissl/utils/distributed_launcher.py", line 255, in launch_distributed_on_slurm
    executor.update_parameters(
  File "/users/hissl/miniconda3/lib/python3.8/site-packages/submitit/core/core.py", line 670, in update_parameters
    self._internal_update_parameters(**kwargs)
  File "/users/hissl/miniconda3/lib/python3.8/site-packages/submitit/auto/auto.py", line 208, in _internal_update_parameters
    self._executor._internal_update_parameters(**parameters)
  File "/users/hissl/miniconda3/lib/python3.8/site-packages/submitit/local/local.py", line 142, in _internal_update_parameters
    raise ValueError("LocalExecutor can use only one node. Use nodes=1")
ValueError: LocalExecutor can use only one node. Use nodes=1

i.e. the script selects LocalExecutor while we want it to select SlurmExecutor. So, in plain English, somehow the container does not realize that it's inside a cluster and that it can schedule jobs to GPU nodes and tries to schedule them locally.

Do you know how to fix that problem?

Setting node_id manually

I think I initially misunderstood this part, because I thought I could just set node_id in the script without any further changes. However, if I don't set USE_SLURM=true (which has the problems described in the first section), the script run_distributed_engines.py executes launch_distributed instead of launch_distributed_on_slurm. In the comments of the former, I read:

If more than 1 nodes are needed for training, this function should be called on each
of the different nodes, each time with an unique node_id in the range [0..N-1] if N
is the total number of nodes to take part in training.

This made me realize that just changing the node_id doesn't help. I would have to call launch_distributed() on every node I'm using, but I don't have a good idea of how to do that in practice.
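
To make the hang concrete, here is a sketch of the rank layout, assuming the common convention that a process's global rank is `node_id * NUM_PROC_PER_NODE + local_rank` (an assumption about the launcher, not quoted from VISSL's code). The function name `global_ranks` is made up for illustration. With 2 nodes and 4 processes per node the world size is 8, matching `world_size: 8` in the log, so a launch that only happens on node 0 supplies ranks 0 to 3 and the process group keeps waiting for the rest.

```python
def global_ranks(node_id, procs_per_node):
    """Global ranks contributed by one node (assumed layout, see the note above)."""
    start = node_id * procs_per_node
    return list(range(start, start + procs_per_node))


num_nodes, procs_per_node = 2, 4
world_size = num_nodes * procs_per_node

# Only node 0 runs the launcher: half of the expected ranks never join.
joined = set(global_ranks(0, procs_per_node))
missing = sorted(set(range(world_size)) - joined)
print(missing)  # [4, 5, 6, 7]

# Launching once per node with a unique node_id covers every rank exactly once.
all_ranks = [r for n in range(num_nodes) for r in global_ranks(n, procs_per_node)]
print(all_ranks == list(range(world_size)))  # True
```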

iseessel commented 2 years ago
> Unfortunately, ultimately for my experiments I need to use a singularity container inside SLURM and that still doesn't work

Can you expand upon what you mean by this?

> Which explains the error I'm getting when using the 2nd method with a container:

Do you also get this error with option 1, using slurm launch script?

> This made me realize that just changing the node_id doesn't help. I would have to call launch_distributed() on every node I'm using, but I don't have a good idea how to do it in practice.

This really is a SLURM question. See e.g. this or the docs. You should then be able to set node_id manually for all nodes.

CC: @QuentinDuval If you have any ideas.

iseessel commented 2 years ago

For this question:

> i.e. the script selects LocalExecutor while we want it to select SlurmExecutor. So, in plain English, somehow the container does not realize that it's inside a cluster and that it can schedule jobs to GPU nodes and tries to schedule them locally.
>
> Do you know how to fix that problem?

You could also try replacing AutoExecutor with SlurmExecutor or adding cluster keyword: https://github.com/facebookincubator/submitit/blob/main/submitit/auto/auto.py#L48 (Assuming you are building VISSL from source).

Do you have the SLURM CLI installed in your image? If not, can you install it? See e.g. https://slurm.schedmd.com/download.html. This may also solve the problem.
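
A quick way to check this from inside the container, using only the Python standard library. The assumption here is that submitit decides whether SLURM is available by looking for `srun` on `PATH`; `sbatch` is included because job submission needs it as well. The function name `missing_slurm_tools` is illustrative.

```python
import shutil


def missing_slurm_tools(tools=("srun", "sbatch"), path=None):
    """Return the SLURM commands that cannot be found on PATH (or on `path`)."""
    return [tool for tool in tools if shutil.which(tool, path=path) is None]


if __name__ == "__main__":
    missing = missing_slurm_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("SLURM client commands found")
```

If either command is reported as missing when run inside the image, a fallback to the local executor would be the expected outcome under that assumption.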

miriamrebekah commented 2 years ago

@blazejdolicki would you be willing to share your requirements.txt for your environment? I'm trying to set my environment up with 3.8 but only have a 3.7 environment set up right now.

blazejdolicki commented 2 years ago

Sorry for my late reply.

@iseessel thank you for your help, I agree that the question is now more about slurm and submitit than vissl itself, so the issue can be closed

@miriamrebekah I followed the installation instructions described here: https://vissl.ai/#quickstart

iseessel commented 2 years ago

@blazejdolicki Were you able to get it working? Plz let me know if you need any additional support -- would love to get you up and running on cluster with VISSL!