Closed blazejdolicki closed 2 years ago
Hi @blazejdolicki, glad to hear you're using VISSL with SLURM, and thank you for all the logs/code.
I would recommend using our slurm launch script if possible. See our docs on it here.
More specifically, I think your issue is due to not specifying the node_id for each node. With SLURM you can either:
- Set config.SLURM.USE_SLURM=true, as in the slurm launch script. This will also automatically set the node_id. See here for how this is done.
- Pass the node_id yourself when launching on each node:
python3 tools/run_distributed_engines.py \
config.node_id=$NODE_ID_ENV_VARIABLE \
...
Thank you for your swift response. Below I've summarized my developments and thought process. I haven't used multi-node setups before so my apologies if the root of the problem is some basic mistake in my understanding of the setup.
I managed to get the above simple example working with the second method you proposed, i.e. running a shell script in the terminal of my cluster that includes python run_distributed_engines.py <config args> config.SLURM.USE_SLURM=true, which submits the job with submitit. Unfortunately, ultimately for my experiments I need to use a singularity container inside SLURM and that still doesn't work. Here's what I think is happening:
Before using SLURM.USE_SLURM=true, I would make a .job file which I would submit to the job scheduler with sbatch inside the terminal of the cluster. Once inside the GPU node, the .job file would be executed, so it would run the container and then execute python run_distributed_engines.py <config>. So here I first submit the job and then python run_distributed_engines.py is executed inside the GPU node. However, for the multi-node setting, this order is reversed: I first execute python run_distributed_engines.py <config args> config.SLURM.USE_SLURM=true and then that script schedules a job.
Inside run_distributed_engines.py, submitit uses AutoExecutor. This class automatically detects whether the job is run in the cluster or locally. If the latter is the case, the LocalExecutor is used, which explains the error I'm getting when using the 2nd method with a container:
Traceback (most recent call last):
File "tools/run_distributed_engines_hissl.py", line 73, in <module>
hydra_main(overrides=overrides)
File "tools/run_distributed_engines_hissl.py", line 42, in hydra_main
launch_distributed_on_slurm(engine_name=args.engine_name, cfg=config)
File "/hissl/third_party/vissl/vissl/utils/distributed_launcher.py", line 255, in launch_distributed_on_slurm
executor.update_parameters(
File "/users/hissl/miniconda3/lib/python3.8/site-packages/submitit/core/core.py", line 670, in update_parameters
self._internal_update_parameters(**kwargs)
File "/users/hissl/miniconda3/lib/python3.8/site-packages/submitit/auto/auto.py", line 208, in _internal_update_parameters
self._executor._internal_update_parameters(**parameters)
File "/users/hissl/miniconda3/lib/python3.8/site-packages/submitit/local/local.py", line 142, in _internal_update_parameters
raise ValueError("LocalExecutor can use only one node. Use nodes=1")
ValueError: LocalExecutor can use only one node. Use nodes=1
i.e. the script selects LocalExecutor while we want it to select SlurmExecutor. So, in plain English, somehow the container does not realize that it's inside a cluster and that it can schedule jobs to GPU nodes and tries to schedule them locally.
Do you know how to fix that problem?
I think I initially misunderstood this part, because I thought I could just set node_id in the script without any further changes. However, if I don't set USE_SLURM=true (which has the problems described in the first paragraph), the script run_distributed_engines.py executes launch_distributed instead of launch_distributed_on_slurm. In the comments of the former, I read:
If more than 1 nodes are needed for training, this function should be called on each
of the different nodes, each time with an unique node_id in the range [0..N-1] if N
is the total number of nodes to take part in training.
This made me realize that just changing the node_id doesn't help. I would have to call launch_distributed() on every node I'm using, but I don't have a good idea how to do it in practice.
> Unfortunately, ultimately for my experiments I need to use a singularity container inside SLURM and that still doesn't work
Can you expand upon what you mean by this?
> Which explains the error I'm getting when using the 2nd method with a container:
Do you also get this error with option 1, using slurm launch script?
> This made me realize that just changing the node_id doesn't help. I would have to call launch_distributed() on every node I'm using, but I don't have a good idea how to do it in practice.
This really is a SLURM question. See e.g. this or the docs. You should then be able to set node_id manually for all nodes.
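One way to call the launcher once per node, sketched here under assumptions rather than taken from VISSL: start the same command on every node (for example with srun inside the allocation) and derive the node id from the SLURM_NODEID and SLURM_NNODES environment variables that SLURM sets for each task. The helper name node_overrides is hypothetical, and the config.node_id override key is simply copied from the command earlier in this thread.

```python
import os


def node_overrides(env=None):
    """Build per-node command-line overrides from SLURM environment variables.

    `env` defaults to os.environ; passing a dict makes the helper easy to test.
    The override keys are assumptions based on the command shown in this thread.
    """
    env = os.environ if env is None else env
    # SLURM_NODEID is the zero-based index of this node within the allocation.
    node_id = int(env.get("SLURM_NODEID", "0"))
    # SLURM_NNODES is the total number of nodes in the allocation.
    num_nodes = int(env.get("SLURM_NNODES", "1"))
    return [
        f"config.DISTRIBUTED.NUM_NODES={num_nodes}",
        f"config.node_id={node_id}",
    ]


if __name__ == "__main__":
    # Print the overrides so they can be appended to the launch command.
    print(" ".join(node_overrides()))
```

Each node would then append these overrides to its own python3 tools/run_distributed_engines.py call. The rendezvous address in config.DISTRIBUTED.RUN_ID would presumably also need to be a host and port reachable from every node rather than localhost.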
CC: @QuentinDuval If you have any ideas.
For this question:
> i.e. the script selects LocalExecutor while we want it to select SlurmExecutor. So, in plain English, somehow the container does not realize that it's inside a cluster and that it can schedule jobs to GPU nodes and tries to schedule them locally.
> Do you know how to fix that problem?
You could also try replacing AutoExecutor with SlurmExecutor or adding cluster keyword: https://github.com/facebookincubator/submitit/blob/main/submitit/auto/auto.py#L48 (Assuming you are building VISSL from source).
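A minimal sketch of the cluster keyword idea, assuming a source checkout where the AutoExecutor call can be edited. submitit.AutoExecutor accepts a cluster argument, and passing "slurm" skips the auto-detection. The helper names executor_kwargs and make_executor and the log folder value are hypothetical; submitit is imported lazily so the snippet can be read and tested without it installed.

```python
def executor_kwargs(log_folder, force_slurm=True):
    """Build keyword arguments for submitit.AutoExecutor."""
    kwargs = {"folder": log_folder}
    if force_slurm:
        # Naming the cluster explicitly bypasses AutoExecutor's detection,
        # so it no longer falls back to the local executor.
        kwargs["cluster"] = "slurm"
    return kwargs


def make_executor(log_folder, force_slurm=True):
    """Create the executor; requires submitit to be installed."""
    import submitit  # lazy import: only needed when an executor is built

    return submitit.AutoExecutor(**executor_kwargs(log_folder, force_slurm))
```

Forcing the Slurm executor only changes which executor is chosen. Submission will presumably still fail if sbatch cannot be found inside the container.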
Do you have the SLURM CLI installed in your image? If not, can you install it? See for ex. https://slurm.schedmd.com/download.html This may also solve the problem.
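To check from inside the image whether the SLURM command-line tools are visible, something like the following could be run in the container. It uses only the Python standard library. The assumption, not verified against submitit's source here, is that executor auto-detection depends on commands such as srun being on PATH. The function names are made up for illustration.

```python
import shutil

# Commands a job-submitting process would typically need to find.
SLURM_COMMANDS = ("srun", "sbatch", "squeue")


def find_slurm_commands(path=None):
    """Map each SLURM command name to its resolved location, or None if absent.

    `path` overrides the PATH search string; None means use the environment.
    """
    return {cmd: shutil.which(cmd, path=path) for cmd in SLURM_COMMANDS}


def slurm_cli_visible(path=None):
    """Return True only if every listed SLURM command can be found."""
    return all(loc is not None for loc in find_slurm_commands(path).values())


if __name__ == "__main__":
    for cmd, loc in find_slurm_commands().items():
        print(f"{cmd}: {loc or 'NOT FOUND'}")
```

If the commands show up on the login node but not inside the container, that would be consistent with the LocalExecutor fallback described above.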
@blazejdolicki would you be willing to share your requirements.txt for your environment? I'm trying to set my environment up with 3.8 but only have a 3.7 environment set up right now.
Sorry for my late reply.
@iseessel thank you for your help, I agree that the question is now more about slurm and submitit than vissl itself, so the issue can be closed
@miriamrebekah I followed the installation instructions described here: https://vissl.ai/#quickstart
@blazejdolicki Were you able to get it working? Plz let me know if you need any additional support -- would love to get you up and running on cluster with VISSL!
Instructions To Reproduce the 🐛 Bug:
I'm trying to run an example pretraining script (simclr, synthetic dataset) on SLURM with a multi-node setup. It works for single-node, but doesn't with multi-node. I installed vissl with conda according to the instructions in "Get started". Here's the job script "train_nct_conda_vissl.job" that I run:
NUM_WORKERS=2
NUM_GPUS=1
NUM_TASKS=1
NUM_MACHINES=1
TRAIN=train/
SOURCE=$HOME/thesis/hissl
SINGULARITYIMAGE=$HOME/thesis/hissl_20210922_np121_h5py.sif
CONFIG_PATH=dummy/quick_gpu_resnet50_simclr
LOGS_DIR=hissl-logs
EXPERIMENT_DIR=$HOME/thesis/$LOGS_DIR
EXPERIMENT_DIR_CONTAINER=/$LOGS_DIR
DATA_ROOT=$HOME"/thesis/ssl-histo/data/NCT-CRC-HE-100K"
module load 2021
module load Anaconda3/2021.05
source activate thesis
source activate vissl
cd $SOURCE
# for multi-machine GPUs: stops the job in case of NCCL ASYNC errors
export NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_DEBUG=INFO
# to silence this error:
# "ERROR: ld.so: object '/sara/tools/xalt/xalt/lib64/libxalt_init.so'
# from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored."
unset LD_PRELOAD
python3 tools/run_distributed_engines.py \
hydra.verbose=true \
config=$CONFIG_PATH \
config.DATA.TRAIN.DATA_SOURCES=[synthetic] \
config.DATA.TRAIN.DATA_LIMIT=1000 \
config.DATA.TRAIN.BATCHSIZE_PER_REPLICA=10 \
config.CHECKPOINT.DIR=$HOME/thesis/$EXPERIMENT_DIR_CONTAINER/$SLURM_JOB_NAME/checkpoints/$SLURM_JOB_ID \
config.DISTRIBUTED.NUM_NODES=2 \
config.DISTRIBUTED.NUM_PROC_PER_NODE=4 \
config.DISTRIBUTED.RUN_ID=localhost:46357
# @package _global_
config:
  VERBOSE: False
  LOG_FREQUENCY: 1
  TEST_ONLY: False
  TEST_MODEL: False
  SEED_VALUE: 0
  MULTI_PROCESSING_METHOD: forkserver
  MONITOR_PERF_STATS: True
  PERF_STAT_FREQUENCY: 10
  ROLLING_BTIME_FREQ: 5
  DATA:
    NUM_DATALOADER_WORKERS: 5
    TRAIN:
      DATA_SOURCES: [disk_folder]
      DATASET_NAMES: [dummy_data_folder]
      BATCHSIZE_PER_REPLICA: 2
      LABEL_TYPE: sample_index # just an implementation detail. Label isn't used
      TRANSFORMS:

            wave_type: half
        interval_scaling: [rescaled, rescaled]
        update_interval: step
        lengths: [0.1, 0.9] # 100ep
  DISTRIBUTED:
    BACKEND: nccl
    NUM_NODES: 1
    NUM_PROC_PER_NODE: 1
    INIT_METHOD: tcp
    RUN_ID: auto
  MACHINE:
    DEVICE: gpu
  CHECKPOINT:
    DIR: ""
    AUTO_RESUME: True
    CHECKPOINT_FREQUENCY: 1
    OVERWRITE_EXISTING: true
NUMA node3 CPU(s) 3,7,11,15,19,23
INFO 2021-12-02 18:52:05,682 trainer_main.py: 112: Using Distributed init method: tcp://localhost:33241, world_size: 1, rank: 0
r29n2:27102:27102 [0] NCCL INFO Bootstrap : Using [0]admin0:145.101.32.23<0>
r29n2:27102:27102 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
r29n2:27102:27102 [0] NCCL INFO NET/IB : No device found.
r29n2:27102:27102 [0] NCCL INFO NET/Socket : Using [0]admin0:145.101.32.23<0>
r29n2:27102:27102 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.2
r29n2:27102:27224 [0] NCCL INFO Channel 00/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 01/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 02/32 : 0
r29n2:27102:27224 [0] NCCL INFO Channel 03/32 : 0
NUMA node3 CPU(s) 3,7,11,15,19,23
INFO 2021-12-02 18:53:19,001 trainer_main.py: 112: Using Distributed init method: tcp://localhost:46357, world_size: 8, rank: 0
r29n2:27632:27632 [0] NCCL INFO Bootstrap : Using [0]admin0:145.101.32.23<0>
r29n2:27632:27632 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
r29n2:27632:27632 [0] NCCL INFO NET/IB : No device found.
r29n2:27632:27632 [0] NCCL INFO NET/Socket : Using [0]admin0:145.101.32.23<0>
r29n2:27632:27632 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.2
r29n2:27634:27634 [2] NCCL INFO Bootstrap : Using [0]admin0:145.101.32.23<0>
r29n2:27634:27634 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
r29n2:27633:27633 [1] NCCL INFO Bootstrap : Using [0]admin0:145.101.32.23<0>
r29n2:27633:27633 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
r29n2:27635:27635 [3] NCCL INFO Bootstrap : Using [0]admin0:145.101.32.23<0>
r29n2:27635:27635 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
r29n2:27634:27634 [2] NCCL INFO NET/IB : No device found.
r29n2:27634:27634 [2] NCCL INFO NET/Socket : Using [0]admin0:145.101.32.23<0>
r29n2:27634:27634 [2] NCCL INFO Using network Socket
r29n2:27633:27633 [1] NCCL INFO NET/IB : No device found.
r29n2:27635:27635 [3] NCCL INFO NET/IB : No device found.
r29n2:27635:27635 [3] NCCL INFO NET/Socket : Using [0]admin0:145.101.32.23<0>
r29n2:27635:27635 [3] NCCL INFO Using network Socket
r29n2:27633:27633 [1] NCCL INFO NET/Socket : Using [0]admin0:145.101.32.23<0>
r29n2:27633:27633 [1] NCCL INFO Using network Socket
slurmstepd: error: JOB 8462605 ON r29n2 CANCELLED AT 2021-12-02T19:53:21 DUE TO TIME LIMIT
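One possible reading of this second log (an interpretation, not something confirmed in the thread): NUM_NODES=2 times NUM_PROC_PER_NODE=4 gives world_size 8, yet only four processes, all on r29n2, show up before the job hits its time limit. If the script ran on only one node, the remaining ranks would never join the rendezvous at localhost:46357. The sketch below just does that arithmetic. The function names are hypothetical, and the rank formula (node_id * procs_per_node + local_rank) is an assumption consistent with the "node_id: 0, local_rank: 0, dist_rank: 0" log line.

```python
def expected_world_size(num_nodes, num_proc_per_node):
    """Total number of ranks the rendezvous waits for."""
    return num_nodes * num_proc_per_node


def ranks_for_node(node_id, num_proc_per_node):
    """Global ranks started by one node (assumed contiguous per node)."""
    start = node_id * num_proc_per_node
    return list(range(start, start + num_proc_per_node))


def missing_ranks(num_nodes, num_proc_per_node, launched_node_ids):
    """Ranks that never start if the launcher ran only on the given nodes."""
    started = set()
    for node_id in launched_node_ids:
        started.update(ranks_for_node(node_id, num_proc_per_node))
    world = expected_world_size(num_nodes, num_proc_per_node)
    return sorted(set(range(world)) - started)


if __name__ == "__main__":
    # Values from the job script: 2 nodes, 4 processes each, launched on node 0 only.
    print(expected_world_size(2, 4))
    print(missing_ranks(2, 4, launched_node_ids=[0]))
```

With the values from the job script, four of the eight ranks are never started, which would leave the started ones waiting until the scheduler cancels the job.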
####### overrides: ['hydra.verbose=true', 'config=dummy/quick_gpu_resnet50_simclr', 'config.DATA.TRAIN.DATA_SOURCES=[synthetic]', 'config.DATA.TRAIN.DATA_LIMIT=1000', 'config.DATA.TRAIN.BATCHSIZE_PER_REPLICA=10', 'config.CHECKPOINT.DIR=/home/bdolicki/thesis//hissl-logs/train_nct_dino/checkpoints/8462603', 'hydra.verbose=true'] INFO 2021-12-02 18:52:05,335 distributed_launcher.py: 183: Spawning process for node_id: 0, local_rank: 0, dist_rank: 0, dist_run_id: localhost:33241 INFO 2021-12-02 18:52:05,336 train.py: 94: Env set for rank: 0, dist_rank: 0 INFO 2021-12-02 18:52:05,336 env.py: 50: BASH_ENV: /opt/lmod/lmod/init/bash INFO 2021-12-02 18:52:05,336 env.py: 50: BASH_FUNC_ml%%: () { eval $($LMOD_DIR/ml_cmd "$@") } INFO 2021-12-02 18:52:05,336 env.py: 50: BASH_FUNC_module%%: () { eval $($LMOD_CMD bash "$@") && eval $(${LMOD_SETTARG_CMD:-:} -s sh) } INFO 2021-12-02 18:52:05,336 env.py: 50: CONDA_DEFAULT_ENV: vissl INFO 2021-12-02 18:52:05,336 env.py: 50: CONDA_EXE: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/bin/conda INFO 2021-12-02 18:52:05,336 env.py: 50: CONDA_PREFIX: /home/bdolicki/.conda/envs/vissl INFO 2021-12-02 18:52:05,336 env.py: 50: CONDA_PREFIX_1: /home/bdolicki/.conda/envs/thesis INFO 2021-12-02 18:52:05,337 env.py: 50: CONDA_PROMPT_MODIFIER: (vissl) INFO 2021-12-02 18:52:05,337 env.py: 50: CONDA_PYTHON_EXE: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/bin/python INFO 2021-12-02 18:52:05,337 env.py: 50: CONDA_SHLVL: 2 INFO 2021-12-02 18:52:05,337 env.py: 50: CUDA_VISIBLE_DEVICES: 0,1,2,3 INFO 2021-12-02 18:52:05,337 env.py: 50: DBUS_SESSION_BUS_ADDRESS: unix:path=/run/user/55916/bus INFO 2021-12-02 18:52:05,337 env.py: 50: EBDEVELANACONDA3: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/easybuild/Anaconda3-2021.05-easybuild-devel INFO 2021-12-02 18:52:05,337 env.py: 50: EBROOTANACONDA3: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05 INFO 2021-12-02 18:52:05,337 env.py: 50: 
EBVERSIONANACONDA3: 2021.05 INFO 2021-12-02 18:52:05,337 env.py: 50: ENVIRONMENT: BATCH INFO 2021-12-02 18:52:05,337 env.py: 50: FPATH: /opt/lmod/lmod/init/ksh_funcs INFO 2021-12-02 18:52:05,337 env.py: 50: GPU_DEVICE_ORDINAL: 0,1,2,3 INFO 2021-12-02 18:52:05,337 env.py: 50: HOME: /home/bdolicki INFO 2021-12-02 18:52:05,337 env.py: 50: HOSTNAME: r29n2 INFO 2021-12-02 18:52:05,337 env.py: 50: LANG: en_US INFO 2021-12-02 18:52:05,337 env.py: 50: LMOD_CASE_INDEPENDENT_SORTING: yes INFO 2021-12-02 18:52:05,337 env.py: 50: LMOD_CMD: /opt/lmod/lmod/libexec/lmod INFO 2021-12-02 18:52:05,337 env.py: 50: LMOD_DIR: /opt/lmod/lmod/libexec INFO 2021-12-02 18:52:05,337 env.py: 50: LMOD_EXACT_MATCH: yes INFO 2021-12-02 18:52:05,337 env.py: 50: LMOD_PKG: /opt/lmod/lmod INFO 2021-12-02 18:52:05,337 env.py: 50: LMOD_ROOT: /opt/lmod INFO 2021-12-02 18:52:05,337 env.py: 50: LMOD_SETTARG_FULL_SUPPORT: no INFO 2021-12-02 18:52:05,337 env.py: 50: LMOD_SHORT_TIME: 10000 INFO 2021-12-02 18:52:05,337 env.py: 50: LMOD_VERSION: 8.5.22 INFO 2021-12-02 18:52:05,337 env.py: 50: LMOD_sys: Linux INFO 2021-12-02 18:52:05,337 env.py: 50: LOADEDMODULES: 2021:Anaconda3/2021.05 INFO 2021-12-02 18:52:05,337 env.py: 50: LOCAL_RANK: 0 INFO 2021-12-02 18:52:05,337 env.py: 50: LOGNAME: bdolicki INFO 2021-12-02 18:52:05,337 env.py: 50: MAIL: /var/mail/bdolicki INFO 2021-12-02 18:52:05,337 env.py: 50: MANPATH: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/share/man:/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/man:/opt/lmod/lmod/share/man::/opt/slurm/sw/current/share/man INFO 2021-12-02 18:52:05,337 env.py: 50: MODULEPATH: 
/sw/noarch/modulefiles/environment:/sw/arch/Debian10/EB_production/2021/modulefiles/phys:/sw/arch/Debian10/EB_production/2021/modulefiles/perf:/sw/arch/Debian10/EB_production/2021/modulefiles/geo:/sw/arch/Debian10/EB_production/2021/modulefiles/debugger:/sw/arch/Debian10/EB_production/2021/modulefiles/chem:/sw/arch/Debian10/EB_production/2021/modulefiles/data:/sw/arch/Debian10/EB_production/2021/modulefiles/vis:/sw/arch/Debian10/EB_production/2021/modulefiles/bio:/sw/arch/Debian10/EB_production/2021/modulefiles/math:/sw/arch/Debian10/EB_production/2021/modulefiles/cae:/sw/arch/Debian10/EB_production/2021/modulefiles/system:/sw/arch/Debian10/EB_production/2021/modulefiles/toolchain:/sw/arch/Debian10/EB_production/2021/modulefiles/numlib:/sw/arch/Debian10/EB_production/2021/modulefiles/mpi:/sw/arch/Debian10/EB_production/2021/modulefiles/tools:/sw/arch/Debian10/EB_production/2021/modulefiles/lib:/sw/arch/Debian10/EB_production/2021/modulefiles/compiler:/sw/arch/Debian10/EB_production/2021/modulefiles/lang:/sw/arch/Debian10/EB_production/2021/modulefiles/devel:/sw/noarch/Debian10/2021/modulefiles/all INFO 2021-12-02 18:52:05,337 env.py: 50: MODULEPATH_ROOT: /opt/modulefiles INFO 2021-12-02 18:52:05,337 env.py: 50: MODULESHOME: /opt/lmod/lmod INFO 2021-12-02 18:52:05,337 env.py: 50: NCCL_ASYNC_ERROR_HANDLING: 1 INFO 2021-12-02 18:52:05,337 env.py: 50: NCCL_DEBUG: INFO INFO 2021-12-02 18:52:05,337 env.py: 50: OLDPWD: /home/bdolicki/thesis INFO 2021-12-02 18:52:05,338 env.py: 50: PATH: /home/bdolicki/.conda/envs/vissl/bin:/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05:/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/sbin:/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/bin:/sw/noarch/Debian10/2021/software/os_binary_wrappers:/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/condabin:/usr/bin:/bin:/usr/bin/X11:/usr/games:/usr/sara/bin:/opt/slurm/bin:/opt/slurm/sbin:/opt/slurm/sw/current/bin INFO 2021-12-02 
18:52:05,338 env.py: 50: PKG_CONFIG_PATH: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/lib/pkgconfig INFO 2021-12-02 18:52:05,338 env.py: 50: PWD: /home/bdolicki/thesis/hissl INFO 2021-12-02 18:52:05,338 env.py: 50: RANK: 0 INFO 2021-12-02 18:52:05,338 env.py: 50: ROCR_VISIBLE_DEVICES: 0,1,2,3 INFO 2021-12-02 18:52:05,338 env.py: 50: SHELL: /bin/bash INFO 2021-12-02 18:52:05,338 env.py: 50: SHLVL: 2 INFO 2021-12-02 18:52:05,338 env.py: 50: SLURMD_NODENAME: r29n2 INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_CLUSTER_NAME: lisa INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_CONF: /opt/slurm/etc/slurm.conf INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_CPUS_ON_NODE: 24 INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_GPUS_PER_NODE: titanrtx:4 INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_GTIDS: 0 INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_JOBID: 8462603 INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_JOB_ACCOUNT: bdolicki INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_JOB_CPUS_PER_NODE: 24(x2) INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_JOB_GID: 55479 INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_JOB_GPUS: 0,1,2,3 INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_JOB_ID: 8462603 INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_JOB_NAME: train_nct_dino INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_JOB_NODELIST: r29n[2,5] INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_JOB_NUM_NODES: 2 INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_JOB_PARTITION: gpu_titanrtx_short INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_JOB_QOS: default INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_JOB_UID: 55916 INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_JOB_USER: bdolicki INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_LOCALID: 0 INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_NNODES: 2 INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_NODEID: 0 INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_NODELIST: r29n[2,5] INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_NODE_ALIASES: 
(null) INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_PRIO_PROCESS: 0 INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_PROCID: 0 INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_SPANK_SURF_EXCLUSIVE: 0 INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_SUBMIT_DIR: /home/bdolicki/thesis INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_SUBMIT_HOST: login3.lisa.surfsara.nl INFO 2021-12-02 18:52:05,338 env.py: 50: SLURM_TASKS_PER_NODE: 24(x2) INFO 2021-12-02 18:52:05,339 env.py: 50: SLURM_TASK_PID: 27062 INFO 2021-12-02 18:52:05,339 env.py: 50: SLURM_TOPOLOGY_ADDR: gigabit..gpu.I09_I10_I15_I16_I17_I19.r29n2 INFO 2021-12-02 18:52:05,339 env.py: 50: SLURM_TOPOLOGY_ADDR_PATTERN: switch.switch.switch.switch.node INFO 2021-12-02 18:52:05,339 env.py: 50: SLURM_WORKING_CLUSTER: lisa:batch4.lisa.surfsara.nl:6817:9216:109 INFO 2021-12-02 18:52:05,339 env.py: 50: SSH_CLIENT: 86.83.160.29 51594 22 INFO 2021-12-02 18:52:05,339 env.py: 50: SSH_CONNECTION: 86.83.160.29 51594 145.101.32.96 22 INFO 2021-12-02 18:52:05,339 env.py: 50: SSH_TTY: /dev/pts/13 INFO 2021-12-02 18:52:05,339 env.py: 50: SURF_EXCLUSIVE: 0 INFO 2021-12-02 18:52:05,339 env.py: 50: TAR: /bin/tar INFO 2021-12-02 18:52:05,339 env.py: 50: TERM: xterm-256color INFO 2021-12-02 18:52:05,339 env.py: 50: TMPDIR: /scratch INFO 2021-12-02 18:52:05,339 env.py: 50: USER: bdolicki INFO 2021-12-02 18:52:05,339 env.py: 50: WORLD_SIZE: 1 INFO 2021-12-02 18:52:05,339 env.py: 50: XALT_EXECUTABLE_TRACKING: yes INFO 2021-12-02 18:52:05,339 env.py: 50: XALT_GPU_TRACKING: yes INFO 2021-12-02 18:52:05,339 env.py: 50: XALT_SAMPLING: yes INFO 2021-12-02 18:52:05,339 env.py: 50: XDG_RUNTIME_DIR: /run/user/55916 INFO 2021-12-02 18:52:05,339 env.py: 50: XDG_SESSION_CLASS: user INFO 2021-12-02 18:52:05,339 env.py: 50: XDG_SESSION_ID: c1889 INFO 2021-12-02 18:52:05,339 env.py: 50: XDG_SESSIONTYPE: tty INFO 2021-12-02 18:52:05,339 env.py: 50: : /home/bdolicki/.conda/envs/vissl/bin/python3 INFO 2021-12-02 18:52:05,339 env.py: 50: _CE_CONDA:
INFO 2021-12-02 18:52:05,339 env.py: 50: _CE_M:
INFO 2021-12-02 18:52:05,339 env.py: 50: LMFILES: /sw/noarch/modulefiles/environment/2021.lua:/sw/arch/Debian10/EB_production/2021/modulefiles/lang/Anaconda3/2021.05.lua INFO 2021-12-02 18:52:05,339 env.py: 50: ModuleTable001: X01vZHVsZVRhYmxlXyA9IHsKTVR2ZXJzaW9uID0gMywKY19yZWJ1aWxkVGltZSA9IGZhbHNlLApjX3Nob3J0VGltZSA9IGZhbHNlLApkZXB0aFQgPSB7fSwKZmFtaWx5ID0ge30sCm1UID0gewpbIjIwMjEiXSA9IHsKZm4gPSAiL3N3L25vYXJjaC9tb2R1bGVmaWxlcy9lbnZpcm9ubWVudC8yMDIxLmx1YSIsCmZ1bGxOYW1lID0gIjIwMjEiLApsb2FkT3JkZXIgPSAxLApwcm9wVCA9IHt9LApzdGFja0RlcHRoID0gMCwKc3RhdHVzID0gImFjdGl2ZSIsCnVzZXJOYW1lID0gIjIwMjEiLAp3ViA9ICJNLip6ZmluYWwiLAp9LApBbmFjb25kYTMgPSB7CmZuID0gIi9zdy9hcmNoL0RlYmlhbjEwL0VCX3Byb2R1Y3Rpb24vMjAyMS9tb2R1bGVmaWxlcy9sYW5nL0FuYWNvbmRhMy8yMDIx INFO 2021-12-02 18:52:05,339 env.py: 50: ModuleTable002: LjA1Lmx1YSIsCmZ1bGxOYW1lID0gIkFuYWNvbmRhMy8yMDIxLjA1IiwKbG9hZE9yZGVyID0gMiwKcHJvcFQgPSB7fSwKc3RhY2tEZXB0aCA9IDAsCnN0YXR1cyA9ICJhY3RpdmUiLAp1c2VyTmFtZSA9ICJBbmFjb25kYTMvMjAyMS4wNSIsCndWID0gIjAwMDAwMjAyMS4wMDAwMDAwMDUuKnpmaW5hbCIsCn0sCn0sCm1wYXRoQSA9IHsKIi9zdy9ub2FyY2gvbW9kdWxlZmlsZXMvZW52aXJvbm1lbnQiCiwgIi9zdy9hcmNoL0RlYmlhbjEwL0VCX3Byb2R1Y3Rpb24vMjAyMS9tb2R1bGVmaWxlcy9waHlzIgosICIvc3cvYXJjaC9EZWJpYW4xMC9FQl9wcm9kdWN0aW9uLzIwMjEvbW9kdWxlZmlsZXMvcGVyZiIKLCAiL3N3L2FyY2gvRGViaWFuMTAvRUJfcHJvZHVjdGlvbi8yMDIxL21v INFO 2021-12-02 18:52:05,339 env.py: 50: ModuleTable003: ZHVsZWZpbGVzL2dlbyIKLCAiL3N3L2FyY2gvRGViaWFuMTAvRUJfcHJvZHVjdGlvbi8yMDIxL21vZHVsZWZpbGVzL2RlYnVnZ2VyIgosICIvc3cvYXJjaC9EZWJpYW4xMC9FQl9wcm9kdWN0aW9uLzIwMjEvbW9kdWxlZmlsZXMvY2hlbSIKLCAiL3N3L2FyY2gvRGViaWFuMTAvRUJfcHJvZHVjdGlvbi8yMDIxL21vZHVsZWZpbGVzL2RhdGEiCiwgIi9zdy9hcmNoL0RlYmlhbjEwL0VCX3Byb2R1Y3Rpb24vMjAyMS9tb2R1bGVmaWxlcy92aXMiCiwgIi9zdy9hcmNoL0RlYmlhbjEwL0VCX3Byb2R1Y3Rpb24vMjAyMS9tb2R1bGVmaWxlcy9iaW8iCiwgIi9zdy9hcmNoL0RlYmlhbjEwL0VCX3Byb2R1Y3Rpb24vMjAyMS9tb2R1bGVmaWxlcy9tYXRoIgosICIvc3cvYXJjaC9EZWJpYW4x INFO 2021-12-02 18:52:05,339 env.py: 50: ModuleTable004: 
MC9FQl9wcm9kdWN0aW9uLzIwMjEvbW9kdWxlZmlsZXMvY2FlIgosICIvc3cvYXJjaC9EZWJpYW4xMC9FQl9wcm9kdWN0aW9uLzIwMjEvbW9kdWxlZmlsZXMvc3lzdGVtIgosICIvc3cvYXJjaC9EZWJpYW4xMC9FQl9wcm9kdWN0aW9uLzIwMjEvbW9kdWxlZmlsZXMvdG9vbGNoYWluIgosICIvc3cvYXJjaC9EZWJpYW4xMC9FQl9wcm9kdWN0aW9uLzIwMjEvbW9kdWxlZmlsZXMvbnVtbGliIgosICIvc3cvYXJjaC9EZWJpYW4xMC9FQl9wcm9kdWN0aW9uLzIwMjEvbW9kdWxlZmlsZXMvbXBpIgosICIvc3cvYXJjaC9EZWJpYW4xMC9FQl9wcm9kdWN0aW9uLzIwMjEvbW9kdWxlZmlsZXMvdG9vbHMiCiwgIi9zdy9hcmNoL0RlYmlhbjEwL0VCX3Byb2R1Y3Rpb24vMjAyMS9tb2R1bGVm INFO 2021-12-02 18:52:05,339 env.py: 50: ModuleTable005: aWxlcy9saWIiCiwgIi9zdy9hcmNoL0RlYmlhbjEwL0VCX3Byb2R1Y3Rpb24vMjAyMS9tb2R1bGVmaWxlcy9jb21waWxlciIKLCAiL3N3L2FyY2gvRGViaWFuMTAvRUJfcHJvZHVjdGlvbi8yMDIxL21vZHVsZWZpbGVzL2xhbmciCiwgIi9zdy9hcmNoL0RlYmlhbjEwL0VCX3Byb2R1Y3Rpb24vMjAyMS9tb2R1bGVmaWxlcy9kZXZlbCIsICIvc3cvbm9hcmNoL0RlYmlhbjEwLzIwMjEvbW9kdWxlZmlsZXMvYWxsIiwKfSwKc3lzdGVtQmFzZU1QQVRIID0gIi9zdy9ub2FyY2gvbW9kdWxlZmlsZXMvZW52aXJvbm1lbnQiLAp9Cg== INFO 2021-12-02 18:52:05,339 env.py: 50: _ModuleTableSz: 5 INFO 2021-12-02 18:52:05,339 env.py: 50: LMOD_REF_COUNT_LOADEDMODULES: 2021:1;Anaconda3/2021.05:1 INFO 2021-12-02 18:52:05,339 env.py: 50: __LMOD_REF_COUNT_MANPATH: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/share/man:1;/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/man:1;/opt/lmod/lmod/share/man:1;/opt/slurm/sw/current/share/man:1 INFO 2021-12-02 18:52:05,339 env.py: 50: LMOD_REF_COUNT_MODULEPATH: 
/sw/noarch/modulefiles/environment:1;/sw/arch/Debian10/EB_production/2021/modulefiles/phys:1;/sw/arch/Debian10/EB_production/2021/modulefiles/perf:1;/sw/arch/Debian10/EB_production/2021/modulefiles/geo:1;/sw/arch/Debian10/EB_production/2021/modulefiles/debugger:1;/sw/arch/Debian10/EB_production/2021/modulefiles/chem:1;/sw/arch/Debian10/EB_production/2021/modulefiles/data:1;/sw/arch/Debian10/EB_production/2021/modulefiles/vis:1;/sw/arch/Debian10/EB_production/2021/modulefiles/bio:1;/sw/arch/Debian10/EB_production/2021/modulefiles/math:1;/sw/arch/Debian10/EB_production/2021/modulefiles/cae:1;/sw/arch/Debian10/EB_production/2021/modulefiles/system:1;/sw/arch/Debian10/EB_production/2021/modulefiles/toolchain:1;/sw/arch/Debian10/EB_production/2021/modulefiles/numlib:1;/sw/arch/Debian10/EB_production/2021/modulefiles/mpi:1;/sw/arch/Debian10/EB_production/2021/modulefiles/tools:1;/sw/arch/Debian10/EB_production/2021/modulefiles/lib:1;/sw/arch/Debian10/EB_production/2021/modulefiles/compiler:1;/sw/arch/Debian10/EB_production/2021/modulefiles/lang:1;/sw/arch/Debian10/EB_production/2021/modulefiles/devel:1;/sw/noarch/Debian10/2021/modulefiles/all:1 INFO 2021-12-02 18:52:05,339 env.py: 50: __LMOD_REF_COUNT_PATH: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05:1;/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/sbin:1;/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/bin:1;/sw/noarch/Debian10/2021/software/os_binary_wrappers:1;/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/condabin:1;/usr/bin:1;/bin:1;/usr/bin/X11:1;/usr/games:1;/usr/sara/bin:1;/opt/slurm/bin:1;/opt/slurm/sbin:1;/opt/slurm/sw/current/bin:1 INFO 2021-12-02 18:52:05,339 env.py: 50: LMOD_REF_COUNT_PKG_CONFIG_PATH: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/lib/pkgconfig:1 INFO 2021-12-02 18:52:05,339 env.py: 50: LMOD_REF_COUNTLMFILES_: 
/sw/noarch/modulefiles/environment/2021.lua:1;/sw/arch/Debian10/EB_production/2021/modulefiles/lang/Anaconda3/2021.05.lua:1 INFO 2021-12-02 18:52:05,339 env.py: 50: LMOD_SET_FPATH: 1 INFO 2021-12-02 18:52:05,340 misc.py: 161: Set start method of multiprocessing to forkserver INFO 2021-12-02 18:52:05,340 train.py: 105: Setting seed.... INFO 2021-12-02 18:52:05,340 misc.py: 173: MACHINE SEED: 0 INFO 2021-12-02 18:52:05,346 hydra_config.py: 132: Training with config: INFO 2021-12-02 18:52:05,352 hydra_config.py: 141: {'CHECKPOINT': {'APPEND_DISTR_RUN_ID': False, 'AUTO_RESUME': True, 'BACKEND': 'disk', 'CHECKPOINT_FREQUENCY': 1, 'CHECKPOINT_ITER_FREQUENCY': -1, 'DIR': '/home/bdolicki/thesis//hissl-logs/train_nct_dino/checkpoints/8462603', 'LATEST_CHECKPOINT_RESUME_FILE_NUM': 1, 'OVERWRITE_EXISTING': True, 'USE_SYMLINK_CHECKPOINT_FOR_RESUME': False}, 'CLUSTERFIT': {'CLUSTER_BACKEND': 'faiss', 'DATA_LIMIT': -1, 'DATA_LIMIT_SAMPLING': {'SEED': 0}, 'FEATURES': {'DATASET_NAME': '', 'DATA_PARTITION': 'TRAIN', 'DIMENSIONALITY_REDUCTION': 0, 'EXTRACT': False, 'LAYER_NAME': '', 'PATH': '.', 'TEST_PARTITION': 'TEST'}, 'NUM_CLUSTERS': 16000, 'NUM_ITER': 50, 'OUTPUT_DIR': '.'}, 'DATA': {'DDP_BUCKET_CAP_MB': 25, 'ENABLE_ASYNC_GPU_COPY': True, 'NUM_DATALOADER_WORKERS': 5, 'PIN_MEMORY': True, 'TEST': {'BASE_DATASET': 'generic_ssl', 'BATCHSIZE_PER_REPLICA': 256, 'COLLATE_FUNCTION': 'default_collate', 'COLLATE_FUNCTION_PARAMS': {}, 'COPY_DESTINATION_DIR': '', 'COPY_TO_LOCAL_DISK': False, 'DATASET_NAMES': ['imagenet1k_folder'], 'DATA_LIMIT': -1, 'DATA_LIMIT_SAMPLING': {'IS_BALANCED': False, 'SEED': 0, 'SKIP_NUM_SAMPLES': 0}, 'DATA_PATHS': [], 'DATA_SOURCES': [], 'DEFAULT_GRAY_IMG_SIZE': 224, 'DROP_LAST': False, 'ENABLE_QUEUE_DATASET': False, 'INPUT_KEY_NAMES': ['data'], 'LABEL_PATHS': [], 'LABEL_SOURCES': [], 'LABEL_TYPE': 'sample_index', 'MMAP_MODE': True, 'NEW_IMG_PATH_PREFIX': '', 'RANDOM_SYNTHETIC_IMAGES': False, 'REMOVE_IMG_PATH_PREFIX': '', 'TARGET_KEY_NAMES': ['label'], 
'TRANSFORMS': [], 'USE_DEBUGGING_SAMPLER': False, 'USE_STATEFUL_DISTRIBUTED_SAMPLER': False}, 'TRAIN': {'BASE_DATASET': 'generic_ssl', 'BATCHSIZE_PER_REPLICA': 10, 'COLLATE_FUNCTION': 'simclr_collator', 'COLLATE_FUNCTION_PARAMS': {}, 'COPY_DESTINATION_DIR': '/tmp/imagenet1k', 'COPY_TO_LOCAL_DISK': False, 'DATASET_NAMES': ['dummy_data_folder'], 'DATA_LIMIT': 1000, 'DATA_LIMIT_SAMPLING': {'IS_BALANCED': False, 'SEED': 0, 'SKIP_NUM_SAMPLES': 0}, 'DATA_PATHS': [], 'DATA_SOURCES': ['synthetic'], 'DEFAULT_GRAY_IMG_SIZE': 224, 'DROP_LAST': True, 'ENABLE_QUEUE_DATASET': False, 'INPUT_KEY_NAMES': ['data'], 'LABEL_PATHS': [], 'LABEL_SOURCES': [], 'LABEL_TYPE': 'sample_index', 'MMAP_MODE': True, 'NEW_IMG_PATH_PREFIX': '', 'RANDOM_SYNTHETIC_IMAGES': False, 'REMOVE_IMG_PATH_PREFIX': '', 'TARGET_KEY_NAMES': ['label'], 'TRANSFORMS': [{'name': 'ImgReplicatePil', 'num_times': 2}, {'name': 'RandomResizedCrop', 'size': 224}, {'name': 'RandomHorizontalFlip', 'p': 0.5}, {'name': 'ImgPilColorDistortion', 'strength': 1.0}, {'name': 'ImgPilGaussianBlur', 'p': 0.5, 'radius_max': 2.0, 'radius_min': 0.1}, {'name': 'ToTensor'}, {'mean': [0.485, 0.456, 0.406], 'name': 'Normalize', 'std': [0.229, 0.224, 0.225]}], 'USE_DEBUGGING_SAMPLER': False, 'USE_STATEFUL_DISTRIBUTED_SAMPLER': False}}, 'DISTRIBUTED': {'BACKEND': 'nccl', 'BROADCAST_BUFFERS': True, 'INIT_METHOD': 'tcp', 'MANUAL_GRADIENT_REDUCTION': False, 'NCCL_DEBUG': False, 'NCCL_SOCKET_NTHREADS': '', 'NUM_NODES': 1, 'NUM_PROC_PER_NODE': 1, 'RUN_ID': 'auto'}, 'EXTRACT_FEATURES': {'CHUNK_THRESHOLD': 0, 'OUTPUT_DIR': ''}, 'HOOKS': {'CHECK_NAN': True, 'LOG_GPU_STATS': True, 'MEMORY_SUMMARY': {'DUMP_MEMORY_ON_EXCEPTION': False, 'LOG_ITERATION_NUM': 0, 'PRINT_MEMORY_SUMMARY': True}, 'MODEL_COMPLEXITY': {'COMPUTE_COMPLEXITY': False, 'INPUT_SHAPE': [3, 224, 224]}, 'PERF_STATS': {'MONITOR_PERF_STATS': False, 'PERF_STAT_FREQUENCY': -1, 'ROLLING_BTIME_FREQ': -1}, 'TENSORBOARD_SETUP': {'EXPERIMENT_LOG_DIR': 'tensorboard', 'FLUSH_EVERY_N_MIN': 5, 
'LOG_DIR': '.', 'LOG_PARAMS': True, 'LOG_PARAMS_EVERY_N_ITERS': 310, 'LOG_PARAMS_GRADIENTS': True, 'USE_TENSORBOARD': False}}, 'IMG_RETRIEVAL': {'CROP_QUERY_ROI': False, 'DATASET_PATH': '', 'DEBUG_MODE': False, 'EVAL_BINARY_PATH': '', 'EVAL_DATASET_NAME': 'Paris', 'FEATS_PROCESSING_TYPE': '', 'GEM_POOL_POWER': 4.0, 'IMG_SCALINGS': [1], 'NORMALIZE_FEATURES': True, 'NUM_DATABASE_SAMPLES': -1, 'NUM_QUERY_SAMPLES': -1, 'NUM_TRAINING_SAMPLES': -1, 'N_PCA': 512, 'RESIZE_IMG': 1024, 'SAVE_FEATURES': False, 'SAVE_RETRIEVAL_RANKINGS_SCORES': True, 'SIMILARITY_MEASURE': 'cosine_similarity', 'SPATIAL_LEVELS': 3, 'TRAIN_DATASET_NAME': 'Oxford', 'TRAIN_PCA_WHITENING': True, 'USE_DISTRACTORS': False, 'WHITEN_IMG_LIST': ''}, 'LOG_FREQUENCY': 1, 'LOSS': {'CrossEntropyLoss': {'ignore_index': -1}, 'barlow_twins_loss': {'embeddingdim': 8192, 'lambda': 0.0051, 'scale_loss': 0.024}, 'bce_logits_multiple_output_single_target': {'normalize_output': False, 'reduction': 'none', 'world_size': 1}, 'cross_entropy_multiple_output_single_target': {'ignore_index': -1, 'normalize_output': False, 'reduction': 'mean', 'temperature': 1.0, 'weight': None}, 'deepclusterv2_loss': {'BATCHSIZE_PER_REPLICA': 256, 'DROP_LAST': True, 'kmeans_iters': 10, 'memory_params': {'crops_for_mb': [0], 'embedding_dim': 128}, 'num_clusters': [3000, 3000, 3000], 'num_crops': 2, 'num_train_samples': -1, 'temperature': 0.1}, 'dino_loss': {'crops_for_teacher': [0, 1], 'ema_center': 0.9, 'momentum': 0.996, 'normalize_last_layer': True, 'output_dim': 65536, 'student_temp': 0.1, 'teacher_temp_max': 0.07, 'teacher_temp_min': 0.04, 'teacher_temp_warmup_iters': 37500}, 'moco_loss': {'embedding_dim': 128, 'momentum': 0.999, 'queue_size': 65536, 'temperature': 0.2}, 'multicrop_simclr_info_nce_loss': {'buffer_params': {'effective_batch_size': 4096, 'embedding_dim': 128, 'world_size': 64}, 'num_crops': 2, 'temperature': 0.1}, 'name': 'simclr_info_nce_loss', 'nce_loss_with_memory': {'loss_type': 'nce', 'loss_weights': [1.0], 
'memory_params': {'embedding_dim': 128, 'memory_size': -1, 'momentum': 0.5, 'norm_init': True, 'update_mem_on_forward': True}, 'negative_sampling_params': {'num_negatives': 16000, 'type': 'random'}, 'norm_constant': -1, 'norm_embedding': True, 'num_train_samples': -1, 'temperature': 0.07, 'update_mem_with_emb_index': -100}, 'simclr_info_nce_loss': {'buffer_params': {'effective_batch_size': 20, 'embedding_dim': 128, 'world_size': 1}, 'temperature': 0.1}, 'swav_loss': {'crops_for_assign': [0, 1], 'embedding_dim': 128, 'epsilon': 0.05, 'normalize_last_layer': True, 'num_crops': 2, 'num_iters': 3, 'num_prototypes': [3000], 'output_dir': '.', 'queue': {'local_queue_length': 0, 'queue_length': 0, 'start_iter': 0}, 'temp_hard_assignment_iters': 0, 'temperature': 0.1, 'use_double_precision': False}, 'swav_momentum_loss': {'crops_for_assign': [0, 1], 'embedding_dim': 128, 'epsilon': 0.05, 'momentum': 0.99, 'momentum_eval_mode_iter_start': 0, 'normalize_last_layer': True, 'num_crops': 2, 'num_iters': 3, 'num_prototypes': [3000], 'queue': {'local_queue_length': 0, 'queue_length': 0, 'start_iter': 0}, 'temperature': 0.1, 'use_double_precision': False}}, 'MACHINE': {'DEVICE': 'gpu'}, 'METERS': {'accuracy_list_meter': {'meter_names': [], 'num_meters': 1, 'topk_values': [1]}, 'enable_training_meter': True, 'mean_ap_list_meter': {'max_cpu_capacity': -1, 'meter_names': [], 'num_classes': 9605, 'num_meters': 1}, 'model_output_mask': False, 'name': '', 'names': [], 'precision_at_k_list_meter': {'meter_names': [], 'num_meters': 1, 'topk_values': [1]}, 'recall_at_k_list_meter': {'meter_names': [], 'num_meters': 1, 'topk_values': [1]}}, 'MODEL': {'ACTIVATION_CHECKPOINTING': {'NUM_ACTIVATION_CHECKPOINTING_SPLITS': 2, 'USE_ACTIVATION_CHECKPOINTING': False}, 'AMP_PARAMS': {'AMP_ARGS': {'keep_batchnorm_fp32': True, 'loss_scale': 'dynamic', 'master_weights': True, 'opt_level': 'O3'}, 'AMP_TYPE': 'apex', 'USE_AMP': False}, 'BASE_MODEL_NAME': 'multi_input_output_model', 'CUDA_CACHE': 
{'CLEAR_CUDA_CACHE': False, 'CLEAR_FREQ': 100}, 'FEATURE_EVAL_SETTINGS': {'EVAL_MODE_ON': False, 'EVAL_TRUNK_AND_HEAD': False, 'EXTRACT_TRUNK_FEATURES_ONLY': False, 'FREEZE_TRUNK_AND_HEAD': False, 'FREEZE_TRUNK_ONLY': False, 'LINEAR_EVAL_FEAT_POOL_OPS_MAP': [], 'SHOULD_FLATTEN_FEATS': True}, 'FSDP_CONFIG': {'AUTO_WRAP_THRESHOLD': 0, 'bucket_cap_mb': 0, 'clear_autocast_cache': True, 'compute_dtype': torch.float32, 'flatten_parameters': True, 'fp32_reduce_scatter': False, 'mixed_precision': True, 'verbose': True}, 'GRAD_CLIP': {'MAX_NORM': 1, 'NORM_TYPE': 2, 'USE_GRAD_CLIP': False}, 'HEAD': {'BATCHNORM_EPS': 1e-05, 'BATCHNORM_MOMENTUM': 0.1, 'PARAMS': [['mlp', {'dims': [2048, 2048], 'use_relu': True}], ['mlp', {'dims': [2048, 128]}]], 'PARAMS_MULTIPLIER': 1.0}, 'INPUT_TYPE': 'rgb', 'MULTI_INPUT_HEAD_MAPPING': [], 'NON_TRAINABLE_PARAMS': [], 'SHARDED_DDP_SETUP': {'USE_SDP': False, 'reduce_buffer_size': -1}, 'SINGLE_PASS_EVERY_CROP': False, 'SYNC_BN_CONFIG': {'CONVERT_BN_TO_SYNC_BN': True, 'GROUP_SIZE': -1, 'SYNC_BN_TYPE': 'pytorch'}, 'TEMP_FROZEN_PARAMS_ITER_MAP': [], 'TRUNK': {'CONVIT': {'CLASS_TOKEN_IN_LOCAL_LAYERS': False, 'LOCALITY_DIM': 10, 'LOCALITY_STRENGTH': 1.0, 'N_GPSA_LAYERS': 10, 'USE_LOCAL_INIT': True}, 'EFFICIENT_NETS': {}, 'NAME': 'resnet', 'REGNET': {}, 'RESNETS': {'DEPTH': 50, 'GROUPNORM_GROUPS': 32, 'GROUPS': 1, 'LAYER4_STRIDE': 2, 'NORM': 'BatchNorm', 'STANDARDIZE_CONVOLUTIONS': False, 'WIDTH_MULTIPLIER': 1, 'WIDTH_PER_GROUP': 64, 'ZERO_INIT_RESIDUAL': False}, 'VISION_TRANSFORMERS': {'ATTENTION_DROPOUT_RATE': 0, 'CLASSIFIER': 'token', 'DROPOUT_RATE': 0, 'DROP_PATH_RATE': 0, 'HIDDEN_DIM': 768, 'IMAGE_SIZE': 224, 'MLP_DIM': 3072, 'NUM_HEADS': 12, 'NUM_LAYERS': 12, 'PATCH_SIZE': 16, 'QKV_BIAS': False, 'QK_SCALE': False, 'name': None}, 'XCIT': {'ATTENTION_DROPOUT_RATE': 0, 'DROPOUT_RATE': 0, 'DROP_PATH_RATE': 0.05, 'ETA': 1, 'HIDDEN_DIM': 384, 'IMAGE_SIZE': 224, 'NUM_HEADS': 8, 'NUM_LAYERS': 12, 'PATCH_SIZE': 16, 'QKV_BIAS': True, 'QK_SCALE': False, 
'TOKENS_NORM': True, 'name': None}}, 'WEIGHTS_INIT': {'APPEND_PREFIX': '', 'PARAMS_FILE': '', 'REMOVE_PREFIX': '', 'SKIP_LAYERS': ['num_batches_tracked'], 'STATE_DICT_KEY_NAME': 'classy_state_dict'}, '_MODEL_INIT_SEED': 0}, 'MONITORING': {'MONITOR_ACTIVATION_STATISTICS': 0}, 'MONITOR_PERF_STATS': True, 'MULTI_PROCESSING_METHOD': 'forkserver', 'NEAREST_NEIGHBOR': {'L2_NORM_FEATS': False, 'SIGMA': 0.1, 'TOPK': 200}, 'OPTIMIZER': {'betas': [0.9, 0.999], 'construct_single_param_group_only': False, 'head_optimizer_params': {'use_different_lr': False, 'use_different_wd': False, 'weight_decay': 1e-06}, 'larc_config': {'clip': False, 'eps': 1e-08, 'trust_coefficient': 0.001}, 'momentum': 0.9, 'name': 'sgd', 'nesterov': False, 'non_regularized_parameters': [], 'num_epochs': 1, 'param_schedulers': {'lr': {'auto_lr_scaling': {'auto_scale': False, 'base_lr_batch_size': 256, 'base_value': 0.3, 'scaling_type': 'linear'}, 'end_value': 0.0, 'interval_scaling': ['rescaled', 'rescaled'], 'lengths': [0.1, 0.9], 'milestones': [30, 60], 'name': 'composite', 'schedulers': [{'end_value': 4.8, 'name': 'linear', 'start_value': 0.6}, {'end_value': 0.0048, 'is_adaptive': True, 'name': 'cosine_warm_restart', 'restart_interval_length': 0.334, 'start_value': 4.8, 'wave_type': 'full'}], 'start_value': 0.1, 'update_interval': 'step', 'value': 0.1, 'values': [0.1, 0.01, 0.001]}, 'lr_head': {'auto_lr_scaling': {'auto_scale': False, 'base_lr_batch_size': 256, 'base_value': 0.3, 'scaling_type': 'linear'}, 'end_value': 0.0, 'interval_scaling': ['rescaled', 'rescaled'], 'lengths': [0.1, 0.9], 'milestones': [30, 60], 'name': 'composite', 'schedulers': [{'end_value': 4.8, 'name': 'linear', 'start_value': 0.6}, {'end_value': 0.0048, 'is_adaptive': True, 'name': 'cosine_warm_restart', 'restart_interval_length': 0.334, 'start_value': 4.8, 'wave_type': 'full'}], 'start_value': 0.1, 'update_interval': 'step', 'value': 0.1, 'values': [0.1, 0.01, 0.001]}}, 'regularize_bias': True, 'regularize_bn': False, 
'use_larc': True, 'use_zero': False, 'weight_decay': 1e-06}, 'PERF_STAT_FREQUENCY': 10, 'PROFILING': {'MEMORY_PROFILING': {'TRACK_BY_LAYER_MEMORY': False}, 'NUM_ITERATIONS': 10, 'OUTPUT_FOLDER': '.', 'PROFILED_RANKS': [0, 1], 'RUNTIME_PROFILING': {'LEGACY_PROFILER': False, 'PROFILE_CPU': True, 'PROFILE_GPU': True, 'USE_PROFILER': False}, 'START_ITERATION': 0, 'STOP_TRAINING_AFTER_PROFILING': False, 'WARMUP_ITERATIONS': 0}, 'REPRODUCIBILITY': {'CUDDN_DETERMINISTIC': False}, 'ROLLING_BTIME_FREQ': 5, 'SEED_VALUE': 0, 'SLURM': {'ADDITIONAL_PARAMETERS': {}, 'COMMENT': 'vissl job', 'CONSTRAINT': '', 'LOG_FOLDER': '.', 'MEM_GB': 250, 'NAME': 'vissl', 'NUM_CPU_PER_PROC': 8, 'PARTITION': '', 'PORT_ID': 40050, 'TIME_HOURS': 72, 'TIME_MINUTES': 0, 'USE_SLURM': False}, 'SVM': {'cls_list': [], 'costs': {'base': -1.0, 'costs_list': [0.1, 0.01], 'power_range': [4, 20]}, 'cross_val_folds': 3, 'dual': True, 'force_retrain': False, 'loss': 'squared_hinge', 'low_shot': {'dataset_name': 'voc', 'k_values': [1, 2, 4, 8, 16, 32, 64, 96], 'sample_inds': [1, 2, 3, 4, 5]}, 'max_iter': 2000, 'normalize': True, 'penalty': 'l2'}, 'TEST_EVERY_NUM_EPOCH': 1, 'TEST_MODEL': False, 'TEST_ONLY': False, 'TRAINER': {'TASK_NAME': 'self_supervision_task', 'TRAIN_STEP_NAME': 'standard_train_step'}, 'VERBOSE': False} INFO 2021-12-02 18:52:05,679 train.py: 117: System config:
sys.platform         linux
Python               3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0]
numpy                1.21.2
Pillow               8.4.0
vissl                0.1.6 @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/vissl
GPU available        True
GPU 0,1,2,3          TITAN RTX
CUDA_HOME            None
torchvision          0.8.0a0 @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/torchvision
hydra                1.1.1 @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/hydra
classy_vision        0.7.0.dev @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/classy_vision
apex                 0.1 @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/apex
PyTorch              1.7.1 @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/torch
PyTorch debug build  False
PyTorch built with:
CPU info:
Architecture         x86_64
CPU op-mode(s)       32-bit, 64-bit
Byte Order           Little Endian
Address sizes        46 bits physical, 48 bits virtual
CPU(s)               24
On-line CPU(s) list  0-23
Thread(s) per core   1
Core(s) per socket   12
Socket(s)            2
NUMA node(s)         4
Vendor ID            GenuineIntel
CPU family           6
Model                85
Model name           Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
Stepping             4
CPU MHz              1000.127
BogoMIPS             4600.00
Virtualization       VT-x
L1d cache            32K
L1i cache            32K
L2 cache             1024K
L3 cache             16896K
NUMA node0 CPU(s)    0,4,8,12,16,20
NUMA node1 CPU(s)    1,5,9,13,17,21
NUMA node2 CPU(s)    2,6,10,14,18,22
NUMA node3 CPU(s)    3,7,11,15,19,23
INFO 2021-12-02 18:52:05,682 trainer_main.py: 112: Using Distributed init method: tcp://localhost:33241, world_size: 1, rank: 0 r29n2:27102:27102 [0] NCCL INFO Bootstrap : Using [0]admin0:145.101.32.23<0> r29n2:27102:27102 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation r29n2:27102:27102 [0] NCCL INFO NET/IB : No device found. r29n2:27102:27102 [0] NCCL INFO NET/Socket : Using [0]admin0:145.101.32.23<0> r29n2:27102:27102 [0] NCCL INFO Using network Socket NCCL version 2.7.8+cuda10.2 r29n2:27102:27224 [0] NCCL INFO Channel 00/32 : 0 r29n2:27102:27224 [0] NCCL INFO Channel 01/32 : 0 r29n2:27102:27224 [0] NCCL INFO Channel 02/32 : 0 r29n2:27102:27224 [0] NCCL INFO Channel 03/32 : 0 r29n2:27102:27224 [0] NCCL INFO Channel 04/32 : 0 r29n2:27102:27224 [0] NCCL INFO Channel 05/32 : 0 r29n2:27102:27224 [0] NCCL INFO Channel 06/32 : 0 r29n2:27102:27224 [0] NCCL INFO Channel 07/32 : 0 r29n2:27102:27224 [0] NCCL INFO Channel 08/32 : 0 r29n2:27102:27224 [0] NCCL INFO Channel 09/32 : 0 r29n2:27102:27224 [0] NCCL INFO Channel 10/32 : 0 r29n2:27102:27224 [0] NCCL INFO Channel 11/32 : 0 r29n2:27102:27224 [0] NCCL INFO Channel 12/32 : 0 r29n2:27102:27224 [0] NCCL INFO Channel 13/32 : 0 r29n2:27102:27224 [0] NCCL INFO Channel 14/32 : 0 r29n2:27102:27224 [0] NCCL INFO Channel 15/32 : 0 r29n2:27102:27224 [0] NCCL INFO Channel 16/32 : 0 r29n2:27102:27224 [0] NCCL INFO Channel 17/32 : 0 r29n2:27102:27224 [0] NCCL INFO Channel 18/32 : 0 r29n2:27102:27224 [0] NCCL INFO Channel 19/32 : 0 r29n2:27102:27224 [0] NCCL INFO Channel 20/32 : 0 r29n2:27102:27224 [0] NCCL INFO Channel 21/32 : 0 r29n2:27102:27224 [0] NCCL INFO Channel 22/32 : 0 r29n2:27102:27224 [0] NCCL INFO Channel 23/32 : 0 r29n2:27102:27224 [0] NCCL INFO Channel 24/32 : 0 r29n2:27102:27224 [0] NCCL INFO Channel 25/32 : 0 r29n2:27102:27224 [0] NCCL INFO Channel 26/32 : 0 r29n2:27102:27224 [0] NCCL INFO Channel 27/32 : 0 r29n2:27102:27224 [0] NCCL INFO Channel 28/32 : 0 r29n2:27102:27224 
[0] NCCL INFO Channel 29/32 : 0 r29n2:27102:27224 [0] NCCL INFO Channel 30/32 : 0 r29n2:27102:27224 [0] NCCL INFO Channel 31/32 : 0 r29n2:27102:27224 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [1] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [2] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [3] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [4] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [5] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [6] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [7] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [8] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [9] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [10] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [11] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [12] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [13] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [14] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [15] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [16] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [17] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [18] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [19] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [20] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [21] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [22] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [23] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [24] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [25] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [26] -1/-1/-1->0->-1|-1->0->-1/-1 r29n2:27102:27224 [0] NCCL INFO Setting affinity for GPU 0 to 111111 r29n2:27102:27224 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer r29n2:27102:27224 [0] NCCL INFO comm 0x147b84001060 rank 0 nranks 1 cudaDev 0 busId 3b000 - Init COMPLETE INFO 2021-12-02 18:52:08,991 trainer_main.py: 130: | initialized host r29n2.lisa.surfsara.nl as rank 0 (0) INFO 2021-12-02 18:52:08,992 train_task.py: 181: Not using Automatic Mixed Precision INFO 2021-12-02 18:52:08,993 train_task.py: 455: Building model.... INFO 2021-12-02 18:52:08,993 resnext.py: 64: ResNeXT trunk, supports activation checkpointing. 
Deactivated INFO 2021-12-02 18:52:08,993 resnext.py: 87: Building model: ResNeXt50-1x64d-w1-BatchNorm2d INFO 2021-12-02 18:52:09,666 model_helpers.py: 177: Using SyncBN group size: 1 INFO 2021-12-02 18:52:09,666 model_helpers.py: 192: Converting BN layers to PyTorch SyncBN INFO 2021-12-02 18:52:09,673 train_task.py: 656: Broadcast model BN buffers from primary on every forward pass INFO 2021-12-02 18:52:09,673 classification_task.py: 387: Synchronized Batch Normalization is disabled INFO 2021-12-02 18:52:09,722 optimizer_helper.py: 293: Trainable params: 163, Non-Trainable params: 0, Trunk Regularized Parameters: 53, Trunk Unregularized Parameters 106, Head Regularized Parameters: 4, Head Unregularized Parameters: 0 Remaining Regularized Parameters: 0 Remaining Unregularized Parameters: 0 INFO 2021-12-02 18:52:09,723 img_replicate_pil.py: 52: ImgReplicatePil | Using num_times: 2 INFO 2021-12-02 18:52:09,723 img_pil_color_distortion.py: 56: ImgPilColorDistortion | Using strength: 1.0 INFO 2021-12-02 18:52:09,724 ssl_dataset.py: 156: Rank: 0 split: TRAIN Data files: [''] INFO 2021-12-02 18:52:09,724 ssl_dataset.py: 159: Rank: 0 split: TRAIN Label files: [] INFO 2021-12-02 18:52:09,724 misc.py: 161: Set start method of multiprocessing to forkserver INFO 2021-12-02 18:52:09,724 init.py: 126: Created the Distributed Sampler.... INFO 2021-12-02 18:52:09,724 init.py: 101: Distributed Sampler config: {'num_replicas': 1, 'rank': 0, 'epoch': 0, 'num_samples': 1000, 'total_size': 1000, 'shuffle': True, 'seed': 0} INFO 2021-12-02 18:52:09,724 init.py: 215: Wrapping the dataloader to async device copies INFO 2021-12-02 18:52:09,724 train_task.py: 384: Building loss... INFO 2021-12-02 18:52:09,726 simclr_info_nce_loss.py: 91: Creating Info-NCE loss on Rank: 0 INFO 2021-12-02 18:52:09,726 trainer_main.py: 268: Training 1 epochs INFO 2021-12-02 18:52:09,726 trainer_main.py: 269: One epoch = 100 iterations. 
INFO 2021-12-02 18:52:09,726 trainer_main.py: 270: Total 1000 samples in one epoch INFO 2021-12-02 18:52:09,726 trainer_main.py: 276: Total 100 iterations for training INFO 2021-12-02 18:52:10,688 logger.py: 84: Thu Dec 2 18:52:10 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  TITAN RTX           On   | 00000000:3B:00.0 Off |                  N/A |
| 40%   40C    P2    67W / 280W |    970MiB / 24220MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  TITAN RTX           On   | 00000000:5E:00.0 Off |                  N/A |
| 40%   33C    P8    10W / 280W |      3MiB / 24220MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  TITAN RTX           On   | 00000000:B1:00.0 Off |                  N/A |
| 41%   32C    P8    23W / 280W |      3MiB / 24220MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  TITAN RTX           On   | 00000000:D9:00.0 Off |                  N/A |
| 41%   33C    P8    19W / 280W |      3MiB / 24220MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     27102      C   python3                           967MiB |
+-----------------------------------------------------------------------------+
INFO 2021-12-02 18:52:13,224 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 0; lr: 0.6; loss: 3.07831; btime(ms): 0; eta: 0:00:00; peak_mem(M): 4176; INFO 2021-12-02 18:52:13,351 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 1; lr: 1.02; loss: 2.98483; btime(ms): 3498; eta: 0:05:46; peak_mem(M): 4176; max_iterations: 100; INFO 2021-12-02 18:52:13,480 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 2; lr: 1.44; loss: 2.98822; btime(ms): 1812; eta: 0:02:57; peak_mem(M): 4176; INFO 2021-12-02 18:52:13,609 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 3; lr: 1.86; loss: 3.02691; btime(ms): 1251; eta: 0:02:01; peak_mem(M): 4176; INFO 2021-12-02 18:52:13,736 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 4; lr: 2.28; loss: 2.97272; btime(ms): 970; eta: 0:01:33; peak_mem(M): 4176; INFO 2021-12-02 18:52:13,874 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 5; lr: 2.7; loss: 2.90045; btime(ms): 801; eta: 0:01:16; peak_mem(M): 4176; INFO 2021-12-02 18:52:14,004 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 6; lr: 3.12; loss: 3.00003; btime(ms): 691; eta: 0:01:04; peak_mem(M): 4176; INFO 2021-12-02 18:52:14,139 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 7; lr: 3.54; loss: 2.9386; btime(ms): 611; eta: 0:00:56; peak_mem(M): 4176; INFO 2021-12-02 18:52:14,263 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 8; lr: 3.96; loss: 3.04946; btime(ms): 551; eta: 0:00:50; peak_mem(M): 4176; INFO 2021-12-02 18:52:14,395 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 9; lr: 4.38; loss: 2.93521; btime(ms): 504; eta: 0:00:45; peak_mem(M): 4176; INFO 2021-12-02 18:52:14,527 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 10; lr: 3.59685; loss: 2.9578; btime(ms): 466; eta: 0:00:42; peak_mem(M): 4176; INFO 2021-12-02 18:52:14,662 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 11; lr: 3.58705; loss: 2.9544; btime(ms): 436; eta: 0:00:38; peak_mem(M): 4176; INFO 2021-12-02 18:52:14,793 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 12; lr: 3.55776; loss: 2.97437; btime(ms): 411; eta: 0:00:36; peak_mem(M): 4176; INFO 2021-12-02 18:52:14,918 log_hooks.py: 
277: Rank: 0; [ep: 0] iter: 13; lr: 3.50929; loss: 2.94129; btime(ms): 389; eta: 0:00:33; peak_mem(M): 4176; INFO 2021-12-02 18:52:15,051 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 14; lr: 3.44218; loss: 2.93352; btime(ms): 370; eta: 0:00:31; peak_mem(M): 4176; INFO 2021-12-02 18:52:15,182 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 15; lr: 3.35716; loss: 2.94921; btime(ms): 355; eta: 0:00:30; peak_mem(M): 4176; INFO 2021-12-02 18:52:15,316 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 16; lr: 3.25516; loss: 2.95404; btime(ms): 341; eta: 0:00:28; peak_mem(M): 4176; INFO 2021-12-02 18:52:15,444 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 17; lr: 3.13729; loss: 2.87209; btime(ms): 328; eta: 0:00:27; peak_mem(M): 4176; INFO 2021-12-02 18:52:15,571 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 18; lr: 3.00483; loss: 2.97899; btime(ms): 317; eta: 0:00:26; peak_mem(M): 4176; INFO 2021-12-02 18:52:15,705 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 19; lr: 2.85923; loss: 3.27561; btime(ms): 307; eta: 0:00:24; peak_mem(M): 4176; INFO 2021-12-02 18:52:15,833 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 20; lr: 2.70209; loss: 2.87969; btime(ms): 298; eta: 0:00:23; peak_mem(M): 4176; INFO 2021-12-02 18:52:15,965 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 21; lr: 2.5351; loss: 2.99437; btime(ms): 290; eta: 0:00:22; peak_mem(M): 4176; INFO 2021-12-02 18:52:16,096 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 22; lr: 2.36011; loss: 2.98848; btime(ms): 283; eta: 0:00:22; peak_mem(M): 4176; INFO 2021-12-02 18:52:16,223 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 23; lr: 2.17901; loss: 2.95309; btime(ms): 276; eta: 0:00:21; peak_mem(M): 4176; INFO 2021-12-02 18:52:16,348 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 24; lr: 1.99379; loss: 2.96843; btime(ms): 270; eta: 0:00:20; peak_mem(M): 4176; INFO 2021-12-02 18:52:16,477 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 25; lr: 1.80646; loss: 2.93082; btime(ms): 264; eta: 0:00:19; peak_mem(M): 4176; INFO 2021-12-02 18:52:16,603 log_hooks.py: 277: Rank: 0; [ep: 0] 
iter: 26; lr: 1.61906; loss: 2.93418; btime(ms): 259; eta: 0:00:19; peak_mem(M): 4176; INFO 2021-12-02 18:52:16,737 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 27; lr: 1.43365; loss: 2.96253; btime(ms): 254; eta: 0:00:18; peak_mem(M): 4176; INFO 2021-12-02 18:52:16,865 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 28; lr: 1.25225; loss: 2.96734; btime(ms): 250; eta: 0:00:18; peak_mem(M): 4176; INFO 2021-12-02 18:52:17,004 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 29; lr: 1.07684; loss: 2.95698; btime(ms): 246; eta: 0:00:17; peak_mem(M): 4176; INFO 2021-12-02 18:52:17,139 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 30; lr: 0.90932; loss: 2.95171; btime(ms): 242; eta: 0:00:16; peak_mem(M): 4176; INFO 2021-12-02 18:52:17,269 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 31; lr: 0.75154; loss: 2.94174; btime(ms): 239; eta: 0:00:16; peak_mem(M): 4176; INFO 2021-12-02 18:52:17,404 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 32; lr: 0.6052; loss: 2.9391; btime(ms): 235; eta: 0:00:16; peak_mem(M): 4176; INFO 2021-12-02 18:52:17,532 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 33; lr: 0.47191; loss: 2.95017; btime(ms): 232; eta: 0:00:15; peak_mem(M): 4176; INFO 2021-12-02 18:52:17,659 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 34; lr: 0.35312; loss: 2.94415; btime(ms): 229; eta: 0:00:15; peak_mem(M): 4176; INFO 2021-12-02 18:52:17,793 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 35; lr: 0.25014; loss: 2.94375; btime(ms): 226; eta: 0:00:14; peak_mem(M): 4176; INFO 2021-12-02 18:52:17,924 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 36; lr: 0.16407; loss: 2.94641; btime(ms): 224; eta: 0:00:14; peak_mem(M): 4176; INFO 2021-12-02 18:52:18,058 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 37; lr: 0.09586; loss: 2.94746; btime(ms): 221; eta: 0:00:13; peak_mem(M): 4176; INFO 2021-12-02 18:52:18,183 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 38; lr: 0.04626; loss: 2.93896; btime(ms): 219; eta: 0:00:13; peak_mem(M): 4176; INFO 2021-12-02 18:52:18,320 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 39; lr: 0.01581; 
loss: 2.94771; btime(ms): 216; eta: 0:00:13; peak_mem(M): 4176; INFO 2021-12-02 18:52:18,450 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 40; lr: 0.00484; loss: 2.94257; btime(ms): 214; eta: 0:00:12; peak_mem(M): 4176; INFO 2021-12-02 18:52:18,577 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 41; lr: 0.00767; loss: 2.94951; btime(ms): 212; eta: 0:00:12; peak_mem(M): 4176; INFO 2021-12-02 18:52:18,710 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 42; lr: 0.01699; loss: 2.9473; btime(ms): 210; eta: 0:00:12; peak_mem(M): 4176; INFO 2021-12-02 18:52:18,843 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 43; lr: 0.03267; loss: 2.94617; btime(ms): 208; eta: 0:00:11; peak_mem(M): 4176; INFO 2021-12-02 18:52:18,972 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 44; lr: 0.05454; loss: 2.94794; btime(ms): 207; eta: 0:00:11; peak_mem(M): 4176; INFO 2021-12-02 18:52:19,103 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 45; lr: 0.08236; loss: 2.93979; btime(ms): 205; eta: 0:00:11; peak_mem(M): 4176; INFO 2021-12-02 18:52:19,230 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 46; lr: 0.11583; loss: 2.94335; btime(ms): 203; eta: 0:00:11; peak_mem(M): 4176; INFO 2021-12-02 18:52:19,366 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 47; lr: 0.15458; loss: 2.95152; btime(ms): 202; eta: 0:00:10; peak_mem(M): 4176; INFO 2021-12-02 18:52:19,490 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 48; lr: 0.19819; loss: 2.9448; btime(ms): 200; eta: 0:00:10; peak_mem(M): 4176; INFO 2021-12-02 18:52:19,624 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 49; lr: 0.24618; loss: 2.94277; btime(ms): 199; eta: 0:00:10; peak_mem(M): 4176; INFO 2021-12-02 18:52:19,731 logger.py: 84: Thu Dec 2 18:52:19 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  TITAN RTX           On   | 00000000:3B:00.0 Off |                  N/A |
| 41%   46C    P2    97W / 280W |   3064MiB / 24220MiB |     68%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  TITAN RTX           On   | 00000000:5E:00.0 Off |                  N/A |
| 41%   33C    P8    15W / 280W |      3MiB / 24220MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  TITAN RTX           On   | 00000000:B1:00.0 Off |                  N/A |
| 41%   32C    P8    22W / 280W |      3MiB / 24220MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  TITAN RTX           On   | 00000000:D9:00.0 Off |                  N/A |
| 41%   33C    P8    18W / 280W |      3MiB / 24220MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     27102      C   python3                          3061MiB |
+-----------------------------------------------------------------------------+
INFO 2021-12-02 18:52:19,872 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 50; lr: 0.29803; loss: 2.94613; btime(ms): 197; eta: 0:00:09; peak_mem(M): 4176; INFO 2021-12-02 18:52:20,005 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 51; lr: 0.35318; loss: 2.94412; btime(ms): 198; eta: 0:00:09; peak_mem(M): 4176; INFO 2021-12-02 18:52:20,131 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 52; lr: 0.41101; loss: 2.94424; btime(ms): 197; eta: 0:00:09; peak_mem(M): 4176; INFO 2021-12-02 18:52:20,264 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 53; lr: 0.47091; loss: 2.94942; btime(ms): 196; eta: 0:00:09; peak_mem(M): 4176; INFO 2021-12-02 18:52:20,399 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 54; lr: 0.53222; loss: 2.9432; btime(ms): 195; eta: 0:00:08; peak_mem(M): 4176; INFO 2021-12-02 18:52:20,539 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 55; lr: 0.59426; loss: 2.94541; btime(ms): 194; eta: 0:00:08; peak_mem(M): 4176; INFO 2021-12-02 18:52:20,671 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 56; lr: 0.65636; loss: 2.94043; btime(ms): 193; eta: 0:00:08; peak_mem(M): 4176; INFO 2021-12-02 18:52:20,806 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 57; lr: 0.71785; loss: 2.94718; btime(ms): 192; eta: 0:00:08; peak_mem(M): 4176; INFO 2021-12-02 18:52:20,935 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 58; lr: 0.77805; loss: 2.93593; btime(ms): 191; eta: 0:00:08; peak_mem(M): 4176; INFO 2021-12-02 18:52:21,071 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 59; lr: 0.83631; loss: 2.9267; btime(ms): 189; eta: 0:00:07; peak_mem(M): 4176; INFO 2021-12-02 18:52:21,197 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 60; lr: 0.89198; loss: 2.93109; btime(ms): 189; eta: 0:00:07; peak_mem(M): 4176; INFO 2021-12-02 18:52:21,328 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 61; lr: 0.94447; loss: 2.9667; btime(ms): 188; eta: 0:00:07; peak_mem(M): 4176; INFO 2021-12-02 18:52:21,457 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 62; lr: 0.9932; loss: 2.92827; btime(ms): 187; eta: 0:00:07; peak_mem(M): 4176; INFO 2021-12-02 
18:52:21,589 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 63; lr: 1.03763; loss: 2.93745; btime(ms): 186; eta: 0:00:06; peak_mem(M): 4176; INFO 2021-12-02 18:52:21,722 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 64; lr: 1.07729; loss: 2.95133; btime(ms): 185; eta: 0:00:06; peak_mem(M): 4176; INFO 2021-12-02 18:52:21,856 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 65; lr: 1.11174; loss: 2.86927; btime(ms): 184; eta: 0:00:06; peak_mem(M): 4176; INFO 2021-12-02 18:52:21,987 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 66; lr: 1.1406; loss: 2.97953; btime(ms): 183; eta: 0:00:06; peak_mem(M): 4176; INFO 2021-12-02 18:52:22,113 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 67; lr: 1.16356; loss: 2.86067; btime(ms): 183; eta: 0:00:06; peak_mem(M): 4176; INFO 2021-12-02 18:52:22,244 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 68; lr: 1.18037; loss: 3.0333; btime(ms): 182; eta: 0:00:05; peak_mem(M): 4176; INFO 2021-12-02 18:52:22,378 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 69; lr: 1.19084; loss: 3.08738; btime(ms): 181; eta: 0:00:05; peak_mem(M): 4176; INFO 2021-12-02 18:52:22,507 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 70; lr: 1.19487; loss: 3.02461; btime(ms): 180; eta: 0:00:05; peak_mem(M): 4176; INFO 2021-12-02 18:52:22,635 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 71; lr: 1.1924; loss: 2.90518; btime(ms): 180; eta: 0:00:05; peak_mem(M): 4176; INFO 2021-12-02 18:52:22,771 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 72; lr: 1.18346; loss: 2.99066; btime(ms): 179; eta: 0:00:05; peak_mem(M): 4176; INFO 2021-12-02 18:52:22,896 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 73; lr: 1.16816; loss: 3.01944; btime(ms): 178; eta: 0:00:04; peak_mem(M): 4176; INFO 2021-12-02 18:52:23,035 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 74; lr: 1.14666; loss: 2.97204; btime(ms): 177; eta: 0:00:04; peak_mem(M): 4176; INFO 2021-12-02 18:52:23,171 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 75; lr: 1.11919; loss: 2.93396; btime(ms): 177; eta: 0:00:04; peak_mem(M): 4176; INFO 2021-12-02 18:52:23,302 log_hooks.py: 
277: Rank: 0; [ep: 0] iter: 76; lr: 1.08605; loss: 2.9222; btime(ms): 176; eta: 0:00:04; peak_mem(M): 4176; INFO 2021-12-02 18:52:23,436 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 77; lr: 1.0476; loss: 2.92474; btime(ms): 176; eta: 0:00:04; peak_mem(M): 4176; INFO 2021-12-02 18:52:23,569 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 78; lr: 1.00427; loss: 2.92866; btime(ms): 175; eta: 0:00:03; peak_mem(M): 4176; INFO 2021-12-02 18:52:23,702 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 79; lr: 0.95653; loss: 2.92798; btime(ms): 175; eta: 0:00:03; peak_mem(M): 4176; INFO 2021-12-02 18:52:23,830 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 80; lr: 0.90489; loss: 2.94017; btime(ms): 174; eta: 0:00:03; peak_mem(M): 4176; INFO 2021-12-02 18:52:23,962 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 81; lr: 0.84993; loss: 2.90543; btime(ms): 174; eta: 0:00:03; peak_mem(M): 4176; INFO 2021-12-02 18:52:24,096 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 82; lr: 0.79223; loss: 2.94644; btime(ms): 173; eta: 0:00:03; peak_mem(M): 4176; INFO 2021-12-02 18:52:24,221 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 83; lr: 0.73244; loss: 2.91061; btime(ms): 173; eta: 0:00:02; peak_mem(M): 4176; INFO 2021-12-02 18:52:24,352 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 84; lr: 0.6712; loss: 2.90573; btime(ms): 172; eta: 0:00:02; peak_mem(M): 4176; INFO 2021-12-02 18:52:24,491 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 85; lr: 0.60918; loss: 2.97565; btime(ms): 172; eta: 0:00:02; peak_mem(M): 4176; INFO 2021-12-02 18:52:24,627 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 86; lr: 0.54706; loss: 2.86883; btime(ms): 171; eta: 0:00:02; peak_mem(M): 4176; INFO 2021-12-02 18:52:24,762 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 87; lr: 0.48552; loss: 2.97494; btime(ms): 171; eta: 0:00:02; peak_mem(M): 4176; INFO 2021-12-02 18:52:24,893 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 88; lr: 0.42522; loss: 2.96146; btime(ms): 170; eta: 0:00:02; peak_mem(M): 4176; INFO 2021-12-02 18:52:25,019 log_hooks.py: 277: Rank: 0; [ep: 0] 
iter: 89; lr: 0.36683; loss: 2.96612; btime(ms): 170; eta: 0:00:01; peak_mem(M): 4176; INFO 2021-12-02 18:52:25,144 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 90; lr: 0.31099; loss: 3.02852; btime(ms): 169; eta: 0:00:01; peak_mem(M): 4176; INFO 2021-12-02 18:52:25,268 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 91; lr: 0.25829; loss: 2.87463; btime(ms): 169; eta: 0:00:01; peak_mem(M): 4176; INFO 2021-12-02 18:52:25,391 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 92; lr: 0.20932; loss: 2.91708; btime(ms): 168; eta: 0:00:01; peak_mem(M): 4176; INFO 2021-12-02 18:52:25,515 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 93; lr: 0.16462; loss: 2.93636; btime(ms): 168; eta: 0:00:01; peak_mem(M): 4176; INFO 2021-12-02 18:52:25,641 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 94; lr: 0.12466; loss: 3.03869; btime(ms): 167; eta: 0:00:01; peak_mem(M): 4176; INFO 2021-12-02 18:52:25,765 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 95; lr: 0.08989; loss: 2.86524; btime(ms): 167; eta: 0:00:00; peak_mem(M): 4176; INFO 2021-12-02 18:52:25,889 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 96; lr: 0.06068; loss: 2.95951; btime(ms): 167; eta: 0:00:00; peak_mem(M): 4176; INFO 2021-12-02 18:52:26,013 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 97; lr: 0.03736; loss: 3.05665; btime(ms): 166; eta: 0:00:00; peak_mem(M): 4176; INFO 2021-12-02 18:52:26,136 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 98; lr: 0.02018; loss: 3.05182; btime(ms): 166; eta: 0:00:00; peak_mem(M): 4176; INFO 2021-12-02 18:52:26,296 log_hooks.py: 277: Rank: 0; [ep: 0] iter: 99; lr: 0.00932; loss: 3.02523; btime(ms): 165; eta: 0:00:00; peak_mem(M): 4176; INFO 2021-12-02 18:52:26,297 trainer_main.py: 214: Meters synced INFO 2021-12-02 18:52:26,297 io.py: 63: Saving data to file: /home/bdolicki/thesis//hissl-logs/train_nct_dino/checkpoints/8462603/metrics.json INFO 2021-12-02 18:52:26,299 io.py: 89: Saved data to file: /home/bdolicki/thesis//hissl-logs/train_nct_dino/checkpoints/8462603/metrics.json INFO 2021-12-02 18:52:26,299 log_hooks.py: 
425: [phase: 0] Saving checkpoint to /home/bdolicki/thesis//hissl-logs/train_nct_dino/checkpoints/8462603
INFO 2021-12-02 18:52:26,958 checkpoint.py: 131: Saved checkpoint: /home/bdolicki/thesis//hissl-logs/train_nct_dino/checkpoints/8462603/model_final_checkpoint_phase0.torch
INFO 2021-12-02 18:52:26,958 checkpoint.py: 140: Creating symlink...
INFO 2021-12-02 18:52:26,959 checkpoint.py: 144: Created symlink: /home/bdolicki/thesis//hissl-logs/train_nct_dino/checkpoints/8462603/checkpoint.torch
INFO 2021-12-02 18:52:27,071 train.py: 131: All Done!
INFO 2021-12-02 18:52:27,071 logger.py: 73: Shutting down loggers...
INFO 2021-12-02 18:52:27,072 distributed_launcher.py: 168: All Done!
INFO 2021-12-02 18:52:27,072 logger.py: 73: Shutting down loggers...
/var/spool/slurm/slurmd/job8462603/slurm_script: line 46: config.DISTRIBUTED.NUM_NODES=2: command not found
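A note on the `command not found` line that ends the run above: the shell prints this when an override such as `config.DISTRIBUTED.NUM_NODES=2` sits on a line of its own and gets executed as a separate command. In a job script that usually means the previous line is missing its trailing backslash. That would also fit the `world_size: 1` reported earlier in this run, since the override never reached the launcher. The job script itself is not shown here, so the sketch below is only an illustration: `launch` is a hypothetical stand-in that echoes its arguments and is not the real `run_distributed_engines.py`.

```shell
# Hypothetical stand-in for the real launcher: prints one received argument per line.
launch() { printf '%s\n' "$@"; }

# Broken: no trailing backslash, so the command ends after the first override
# and the shell then tries to run the second line as a program.
broken=$(
  {
    launch config=dummy/quick_gpu_resnet50_simclr
    config.DISTRIBUTED.NUM_NODES=2
  } 2>&1
) || true

# Fixed: the trailing backslash joins both lines into a single command,
# so the launcher receives both overrides.
fixed=$(
  launch config=dummy/quick_gpu_resnet50_simclr \
    config.DISTRIBUTED.NUM_NODES=2
)

printf 'broken run:\n%s\n\n' "$broken"
printf 'fixed run:\n%s\n' "$fixed"
```

In the broken variant the launcher still starts, just without the second override, which is why a run like this can finish with "All Done!" and only report the error afterwards.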
####### overrides: ['hydra.verbose=true', 'config=dummy/quick_gpu_resnet50_simclr', 'config.DATA.TRAIN.DATA_SOURCES=[synthetic]', 'config.DATA.TRAIN.DATA_LIMIT=1000', 'config.DATA.TRAIN.BATCHSIZE_PER_REPLICA=10', 'config.CHECKPOINT.DIR=/home/bdolicki/thesis//hissl-logs/train_nct_dino/checkpoints/8462605', 'config.DISTRIBUTED.NUM_NODES=2', 'config.DISTRIBUTED.NUM_PROC_PER_NODE=4', 'config.DISTRIBUTED.RUN_ID=localhost:46357', 'hydra.verbose=true'] /home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/hydra/experimental/initialize.py:67: UserWarning: hydra.experimental.initialize_config_module() is no longer experimental. Use hydra.initialize_config_module(). deprecation_warning( /home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/hydra/experimental/compose.py:18: UserWarning: hydra.experimental.compose() is no longer experimental. Use hydra.compose() deprecation_warning( INFO 2021-12-02 18:53:17,135 train.py: 94: Env set for rank: 1, dist_rank: 1 INFO 2021-12-02 18:53:17,135 train.py: 94: Env set for rank: 3, dist_rank: 3 INFO 2021-12-02 18:53:17,135 misc.py: 161: Set start method of multiprocessing to forkserver INFO 2021-12-02 18:53:17,135 misc.py: 161: Set start method of multiprocessing to forkserver INFO 2021-12-02 18:53:17,135 train.py: 105: Setting seed.... INFO 2021-12-02 18:53:17,135 train.py: 105: Setting seed.... INFO 2021-12-02 18:53:17,135 misc.py: 173: MACHINE SEED: 1 INFO 2021-12-02 18:53:17,135 misc.py: 173: MACHINE SEED: 3 INFO 2021-12-02 18:53:17,145 train.py: 94: Env set for rank: 2, dist_rank: 2 INFO 2021-12-02 18:53:17,145 misc.py: 161: Set start method of multiprocessing to forkserver INFO 2021-12-02 18:53:17,145 train.py: 105: Setting seed.... 
INFO 2021-12-02 18:53:17,145 misc.py: 173: MACHINE SEED: 2 INFO 2021-12-02 18:53:18,565 train.py: 94: Env set for rank: 0, dist_rank: 0 INFO 2021-12-02 18:53:18,566 trainer_main.py: 112: Using Distributed init method: tcp://localhost:46357, world_size: 8, rank: 1 INFO 2021-12-02 18:53:18,566 trainer_main.py: 112: Using Distributed init method: tcp://localhost:46357, world_size: 8, rank: 3 INFO 2021-12-02 18:53:18,566 env.py: 50: BASH_ENV: /opt/lmod/lmod/init/bash INFO 2021-12-02 18:53:18,566 trainer_main.py: 112: Using Distributed init method: tcp://localhost:46357, world_size: 8, rank: 2 INFO 2021-12-02 18:53:18,567 env.py: 50: BASH_FUNC_ml%%: () { eval $($LMOD_DIR/ml_cmd "$@") } INFO 2021-12-02 18:53:18,567 env.py: 50: BASH_FUNC_module%%: () { eval $($LMOD_CMD bash "$@") && eval $(${LMOD_SETTARG_CMD:-:} -s sh) } INFO 2021-12-02 18:53:18,567 env.py: 50: CONDA_DEFAULT_ENV: vissl INFO 2021-12-02 18:53:18,567 env.py: 50: CONDA_EXE: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/bin/conda INFO 2021-12-02 18:53:18,567 env.py: 50: CONDA_PREFIX: /home/bdolicki/.conda/envs/vissl INFO 2021-12-02 18:53:18,567 env.py: 50: CONDA_PREFIX_1: /home/bdolicki/.conda/envs/thesis INFO 2021-12-02 18:53:18,567 env.py: 50: CONDA_PROMPT_MODIFIER: (vissl) INFO 2021-12-02 18:53:18,567 env.py: 50: CONDA_PYTHON_EXE: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/bin/python INFO 2021-12-02 18:53:18,567 env.py: 50: CONDA_SHLVL: 2 INFO 2021-12-02 18:53:18,567 env.py: 50: CUDA_VISIBLE_DEVICES: 0,1,2,3 INFO 2021-12-02 18:53:18,568 env.py: 50: DBUS_SESSION_BUS_ADDRESS: unix:path=/run/user/55916/bus INFO 2021-12-02 18:53:18,568 env.py: 50: EBDEVELANACONDA3: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/easybuild/Anaconda3-2021.05-easybuild-devel INFO 2021-12-02 18:53:18,568 env.py: 50: EBROOTANACONDA3: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05 INFO 2021-12-02 18:53:18,568 env.py: 50: EBVERSIONANACONDA3: 2021.05 INFO 
2021-12-02 18:53:18,568 env.py: 50: ENVIRONMENT: BATCH INFO 2021-12-02 18:53:18,568 env.py: 50: FPATH: /opt/lmod/lmod/init/ksh_funcs INFO 2021-12-02 18:53:18,568 env.py: 50: GPU_DEVICE_ORDINAL: 0,1,2,3 INFO 2021-12-02 18:53:18,568 env.py: 50: HOME: /home/bdolicki INFO 2021-12-02 18:53:18,568 env.py: 50: HOSTNAME: r29n2 INFO 2021-12-02 18:53:18,568 env.py: 50: LANG: en_US INFO 2021-12-02 18:53:18,568 env.py: 50: LMOD_CASE_INDEPENDENT_SORTING: yes INFO 2021-12-02 18:53:18,569 env.py: 50: LMOD_CMD: /opt/lmod/lmod/libexec/lmod INFO 2021-12-02 18:53:18,569 env.py: 50: LMOD_DIR: /opt/lmod/lmod/libexec INFO 2021-12-02 18:53:18,569 env.py: 50: LMOD_EXACT_MATCH: yes INFO 2021-12-02 18:53:18,569 env.py: 50: LMOD_PKG: /opt/lmod/lmod INFO 2021-12-02 18:53:18,569 env.py: 50: LMOD_ROOT: /opt/lmod INFO 2021-12-02 18:53:18,569 env.py: 50: LMOD_SETTARG_FULL_SUPPORT: no INFO 2021-12-02 18:53:18,569 env.py: 50: LMOD_SHORT_TIME: 10000 INFO 2021-12-02 18:53:18,569 env.py: 50: LMOD_VERSION: 8.5.22 INFO 2021-12-02 18:53:18,569 env.py: 50: LMOD_sys: Linux INFO 2021-12-02 18:53:18,569 env.py: 50: LOADEDMODULES: 2021:Anaconda3/2021.05 INFO 2021-12-02 18:53:18,569 env.py: 50: LOCAL_RANK: 0 INFO 2021-12-02 18:53:18,570 env.py: 50: LOGNAME: bdolicki INFO 2021-12-02 18:53:18,570 env.py: 50: MAIL: /var/mail/bdolicki INFO 2021-12-02 18:53:18,570 env.py: 50: MANPATH: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/share/man:/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/man:/opt/lmod/lmod/share/man::/opt/slurm/sw/current/share/man INFO 2021-12-02 18:53:18,570 env.py: 50: MODULEPATH: 
/sw/noarch/modulefiles/environment:/sw/arch/Debian10/EB_production/2021/modulefiles/phys:/sw/arch/Debian10/EB_production/2021/modulefiles/perf:/sw/arch/Debian10/EB_production/2021/modulefiles/geo:/sw/arch/Debian10/EB_production/2021/modulefiles/debugger:/sw/arch/Debian10/EB_production/2021/modulefiles/chem:/sw/arch/Debian10/EB_production/2021/modulefiles/data:/sw/arch/Debian10/EB_production/2021/modulefiles/vis:/sw/arch/Debian10/EB_production/2021/modulefiles/bio:/sw/arch/Debian10/EB_production/2021/modulefiles/math:/sw/arch/Debian10/EB_production/2021/modulefiles/cae:/sw/arch/Debian10/EB_production/2021/modulefiles/system:/sw/arch/Debian10/EB_production/2021/modulefiles/toolchain:/sw/arch/Debian10/EB_production/2021/modulefiles/numlib:/sw/arch/Debian10/EB_production/2021/modulefiles/mpi:/sw/arch/Debian10/EB_production/2021/modulefiles/tools:/sw/arch/Debian10/EB_production/2021/modulefiles/lib:/sw/arch/Debian10/EB_production/2021/modulefiles/compiler:/sw/arch/Debian10/EB_production/2021/modulefiles/lang:/sw/arch/Debian10/EB_production/2021/modulefiles/devel:/sw/noarch/Debian10/2021/modulefiles/all INFO 2021-12-02 18:53:18,570 env.py: 50: MODULEPATH_ROOT: /opt/modulefiles INFO 2021-12-02 18:53:18,570 env.py: 50: MODULESHOME: /opt/lmod/lmod INFO 2021-12-02 18:53:18,570 env.py: 50: NCCL_ASYNC_ERROR_HANDLING: 1 INFO 2021-12-02 18:53:18,570 env.py: 50: NCCL_DEBUG: INFO INFO 2021-12-02 18:53:18,570 env.py: 50: OLDPWD: /home/bdolicki/thesis INFO 2021-12-02 18:53:18,570 env.py: 50: PATH: /home/bdolicki/.conda/envs/vissl/bin:/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05:/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/sbin:/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/bin:/sw/noarch/Debian10/2021/software/os_binary_wrappers:/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/condabin:/usr/bin:/bin:/usr/bin/X11:/usr/games:/usr/sara/bin:/opt/slurm/bin:/opt/slurm/sbin:/opt/slurm/sw/current/bin INFO 2021-12-02 
18:53:18,570 env.py: 50: PKG_CONFIG_PATH: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/lib/pkgconfig INFO 2021-12-02 18:53:18,571 env.py: 50: PWD: /home/bdolicki/thesis/hissl INFO 2021-12-02 18:53:18,571 env.py: 50: RANK: 0 INFO 2021-12-02 18:53:18,571 env.py: 50: ROCR_VISIBLE_DEVICES: 0,1,2,3 INFO 2021-12-02 18:53:18,571 env.py: 50: SHELL: /bin/bash INFO 2021-12-02 18:53:18,571 env.py: 50: SHLVL: 2 INFO 2021-12-02 18:53:18,571 env.py: 50: SLURMD_NODENAME: r29n2 INFO 2021-12-02 18:53:18,571 env.py: 50: SLURM_CLUSTER_NAME: lisa INFO 2021-12-02 18:53:18,571 env.py: 50: SLURM_CONF: /opt/slurm/etc/slurm.conf INFO 2021-12-02 18:53:18,571 env.py: 50: SLURM_CPUS_ON_NODE: 24 INFO 2021-12-02 18:53:18,571 env.py: 50: SLURM_GPUS_PER_NODE: titanrtx:4 INFO 2021-12-02 18:53:18,571 env.py: 50: SLURM_GTIDS: 0 INFO 2021-12-02 18:53:18,572 env.py: 50: SLURM_JOBID: 8462605 INFO 2021-12-02 18:53:18,572 env.py: 50: SLURM_JOB_ACCOUNT: bdolicki INFO 2021-12-02 18:53:18,572 env.py: 50: SLURM_JOB_CPUS_PER_NODE: 24(x2) INFO 2021-12-02 18:53:18,572 env.py: 50: SLURM_JOB_GID: 55479 INFO 2021-12-02 18:53:18,572 env.py: 50: SLURM_JOB_GPUS: 0,1,2,3 INFO 2021-12-02 18:53:18,572 env.py: 50: SLURM_JOB_ID: 8462605 INFO 2021-12-02 18:53:18,572 env.py: 50: SLURM_JOB_NAME: train_nct_dino INFO 2021-12-02 18:53:18,572 env.py: 50: SLURM_JOB_NODELIST: r29n[2,5] INFO 2021-12-02 18:53:18,572 env.py: 50: SLURM_JOB_NUM_NODES: 2 INFO 2021-12-02 18:53:18,572 env.py: 50: SLURM_JOB_PARTITION: gpu_titanrtx_short INFO 2021-12-02 18:53:18,572 env.py: 50: SLURM_JOB_QOS: default INFO 2021-12-02 18:53:18,573 env.py: 50: SLURM_JOB_UID: 55916 INFO 2021-12-02 18:53:18,573 env.py: 50: SLURM_JOB_USER: bdolicki INFO 2021-12-02 18:53:18,573 env.py: 50: SLURM_LOCALID: 0 INFO 2021-12-02 18:53:18,573 env.py: 50: SLURM_NNODES: 2 INFO 2021-12-02 18:53:18,573 env.py: 50: SLURM_NODEID: 0 INFO 2021-12-02 18:53:18,573 env.py: 50: SLURM_NODELIST: r29n[2,5] INFO 2021-12-02 18:53:18,573 env.py: 50: SLURM_NODE_ALIASES: 
(null) INFO 2021-12-02 18:53:18,573 env.py: 50: SLURM_PRIO_PROCESS: 0 INFO 2021-12-02 18:53:18,573 env.py: 50: SLURM_PROCID: 0 INFO 2021-12-02 18:53:18,573 env.py: 50: SLURM_SPANK_SURF_EXCLUSIVE: 0 INFO 2021-12-02 18:53:18,573 env.py: 50: SLURM_SUBMIT_DIR: /home/bdolicki/thesis INFO 2021-12-02 18:53:18,574 env.py: 50: SLURM_SUBMIT_HOST: login3.lisa.surfsara.nl INFO 2021-12-02 18:53:18,574 env.py: 50: SLURM_TASKS_PER_NODE: 24(x2) INFO 2021-12-02 18:53:18,574 env.py: 50: SLURM_TASK_PID: 27583 INFO 2021-12-02 18:53:18,574 env.py: 50: SLURM_TOPOLOGY_ADDR: gigabit..gpu.I09_I10_I15_I16_I17_I19.r29n2 INFO 2021-12-02 18:53:18,574 env.py: 50: SLURM_TOPOLOGY_ADDR_PATTERN: switch.switch.switch.switch.node INFO 2021-12-02 18:53:18,574 env.py: 50: SLURM_WORKING_CLUSTER: lisa:batch4.lisa.surfsara.nl:6817:9216:109 INFO 2021-12-02 18:53:18,574 env.py: 50: SSH_CLIENT: 86.83.160.29 51594 22 INFO 2021-12-02 18:53:18,574 env.py: 50: SSH_CONNECTION: 86.83.160.29 51594 145.101.32.96 22 INFO 2021-12-02 18:53:18,574 env.py: 50: SSH_TTY: /dev/pts/13 INFO 2021-12-02 18:53:18,574 env.py: 50: SURF_EXCLUSIVE: 0 INFO 2021-12-02 18:53:18,574 env.py: 50: TAR: /bin/tar INFO 2021-12-02 18:53:18,575 env.py: 50: TERM: xterm-256color INFO 2021-12-02 18:53:18,575 env.py: 50: TMPDIR: /scratch INFO 2021-12-02 18:53:18,575 env.py: 50: USER: bdolicki INFO 2021-12-02 18:53:18,575 env.py: 50: WORLD_SIZE: 8 INFO 2021-12-02 18:53:18,575 env.py: 50: XALT_EXECUTABLE_TRACKING: yes INFO 2021-12-02 18:53:18,575 env.py: 50: XALT_GPU_TRACKING: yes INFO 2021-12-02 18:53:18,575 env.py: 50: XALT_SAMPLING: yes INFO 2021-12-02 18:53:18,575 env.py: 50: XDG_RUNTIME_DIR: /run/user/55916 INFO 2021-12-02 18:53:18,575 env.py: 50: XDG_SESSION_CLASS: user INFO 2021-12-02 18:53:18,575 env.py: 50: XDG_SESSION_ID: c1889 INFO 2021-12-02 18:53:18,575 env.py: 50: XDG_SESSIONTYPE: tty INFO 2021-12-02 18:53:18,575 env.py: 50: : /home/bdolicki/.conda/envs/vissl/bin/python3 INFO 2021-12-02 18:53:18,576 env.py: 50: _CE_CONDA:
INFO 2021-12-02 18:53:18,576 env.py: 50: _CE_M:
INFO 2021-12-02 18:53:18,576 env.py: 50: LMFILES: /sw/noarch/modulefiles/environment/2021.lua:/sw/arch/Debian10/EB_production/2021/modulefiles/lang/Anaconda3/2021.05.lua INFO 2021-12-02 18:53:18,576 env.py: 50: ModuleTable001: X01vZHVsZVRhYmxlXyA9IHsKTVR2ZXJzaW9uID0gMywKY19yZWJ1aWxkVGltZSA9IGZhbHNlLApjX3Nob3J0VGltZSA9IGZhbHNlLApkZXB0aFQgPSB7fSwKZmFtaWx5ID0ge30sCm1UID0gewpbIjIwMjEiXSA9IHsKZm4gPSAiL3N3L25vYXJjaC9tb2R1bGVmaWxlcy9lbnZpcm9ubWVudC8yMDIxLmx1YSIsCmZ1bGxOYW1lID0gIjIwMjEiLApsb2FkT3JkZXIgPSAxLApwcm9wVCA9IHt9LApzdGFja0RlcHRoID0gMCwKc3RhdHVzID0gImFjdGl2ZSIsCnVzZXJOYW1lID0gIjIwMjEiLAp3ViA9ICJNLip6ZmluYWwiLAp9LApBbmFjb25kYTMgPSB7CmZuID0gIi9zdy9hcmNoL0RlYmlhbjEwL0VCX3Byb2R1Y3Rpb24vMjAyMS9tb2R1bGVmaWxlcy9sYW5nL0FuYWNvbmRhMy8yMDIx INFO 2021-12-02 18:53:18,576 env.py: 50: ModuleTable002: LjA1Lmx1YSIsCmZ1bGxOYW1lID0gIkFuYWNvbmRhMy8yMDIxLjA1IiwKbG9hZE9yZGVyID0gMiwKcHJvcFQgPSB7fSwKc3RhY2tEZXB0aCA9IDAsCnN0YXR1cyA9ICJhY3RpdmUiLAp1c2VyTmFtZSA9ICJBbmFjb25kYTMvMjAyMS4wNSIsCndWID0gIjAwMDAwMjAyMS4wMDAwMDAwMDUuKnpmaW5hbCIsCn0sCn0sCm1wYXRoQSA9IHsKIi9zdy9ub2FyY2gvbW9kdWxlZmlsZXMvZW52aXJvbm1lbnQiCiwgIi9zdy9hcmNoL0RlYmlhbjEwL0VCX3Byb2R1Y3Rpb24vMjAyMS9tb2R1bGVmaWxlcy9waHlzIgosICIvc3cvYXJjaC9EZWJpYW4xMC9FQl9wcm9kdWN0aW9uLzIwMjEvbW9kdWxlZmlsZXMvcGVyZiIKLCAiL3N3L2FyY2gvRGViaWFuMTAvRUJfcHJvZHVjdGlvbi8yMDIxL21v INFO 2021-12-02 18:53:18,576 env.py: 50: ModuleTable003: ZHVsZWZpbGVzL2dlbyIKLCAiL3N3L2FyY2gvRGViaWFuMTAvRUJfcHJvZHVjdGlvbi8yMDIxL21vZHVsZWZpbGVzL2RlYnVnZ2VyIgosICIvc3cvYXJjaC9EZWJpYW4xMC9FQl9wcm9kdWN0aW9uLzIwMjEvbW9kdWxlZmlsZXMvY2hlbSIKLCAiL3N3L2FyY2gvRGViaWFuMTAvRUJfcHJvZHVjdGlvbi8yMDIxL21vZHVsZWZpbGVzL2RhdGEiCiwgIi9zdy9hcmNoL0RlYmlhbjEwL0VCX3Byb2R1Y3Rpb24vMjAyMS9tb2R1bGVmaWxlcy92aXMiCiwgIi9zdy9hcmNoL0RlYmlhbjEwL0VCX3Byb2R1Y3Rpb24vMjAyMS9tb2R1bGVmaWxlcy9iaW8iCiwgIi9zdy9hcmNoL0RlYmlhbjEwL0VCX3Byb2R1Y3Rpb24vMjAyMS9tb2R1bGVmaWxlcy9tYXRoIgosICIvc3cvYXJjaC9EZWJpYW4x INFO 2021-12-02 18:53:18,576 env.py: 50: ModuleTable004: 
MC9FQl9wcm9kdWN0aW9uLzIwMjEvbW9kdWxlZmlsZXMvY2FlIgosICIvc3cvYXJjaC9EZWJpYW4xMC9FQl9wcm9kdWN0aW9uLzIwMjEvbW9kdWxlZmlsZXMvc3lzdGVtIgosICIvc3cvYXJjaC9EZWJpYW4xMC9FQl9wcm9kdWN0aW9uLzIwMjEvbW9kdWxlZmlsZXMvdG9vbGNoYWluIgosICIvc3cvYXJjaC9EZWJpYW4xMC9FQl9wcm9kdWN0aW9uLzIwMjEvbW9kdWxlZmlsZXMvbnVtbGliIgosICIvc3cvYXJjaC9EZWJpYW4xMC9FQl9wcm9kdWN0aW9uLzIwMjEvbW9kdWxlZmlsZXMvbXBpIgosICIvc3cvYXJjaC9EZWJpYW4xMC9FQl9wcm9kdWN0aW9uLzIwMjEvbW9kdWxlZmlsZXMvdG9vbHMiCiwgIi9zdy9hcmNoL0RlYmlhbjEwL0VCX3Byb2R1Y3Rpb24vMjAyMS9tb2R1bGVm INFO 2021-12-02 18:53:18,576 env.py: 50: ModuleTable005: aWxlcy9saWIiCiwgIi9zdy9hcmNoL0RlYmlhbjEwL0VCX3Byb2R1Y3Rpb24vMjAyMS9tb2R1bGVmaWxlcy9jb21waWxlciIKLCAiL3N3L2FyY2gvRGViaWFuMTAvRUJfcHJvZHVjdGlvbi8yMDIxL21vZHVsZWZpbGVzL2xhbmciCiwgIi9zdy9hcmNoL0RlYmlhbjEwL0VCX3Byb2R1Y3Rpb24vMjAyMS9tb2R1bGVmaWxlcy9kZXZlbCIsICIvc3cvbm9hcmNoL0RlYmlhbjEwLzIwMjEvbW9kdWxlZmlsZXMvYWxsIiwKfSwKc3lzdGVtQmFzZU1QQVRIID0gIi9zdy9ub2FyY2gvbW9kdWxlZmlsZXMvZW52aXJvbm1lbnQiLAp9Cg== INFO 2021-12-02 18:53:18,576 env.py: 50: _ModuleTableSz: 5 INFO 2021-12-02 18:53:18,576 env.py: 50: LMOD_REF_COUNT_LOADEDMODULES: 2021:1;Anaconda3/2021.05:1 INFO 2021-12-02 18:53:18,576 env.py: 50: __LMOD_REF_COUNT_MANPATH: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/share/man:1;/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/man:1;/opt/lmod/lmod/share/man:1;/opt/slurm/sw/current/share/man:1 INFO 2021-12-02 18:53:18,577 env.py: 50: LMOD_REF_COUNT_MODULEPATH: 
/sw/noarch/modulefiles/environment:1;/sw/arch/Debian10/EB_production/2021/modulefiles/phys:1;/sw/arch/Debian10/EB_production/2021/modulefiles/perf:1;/sw/arch/Debian10/EB_production/2021/modulefiles/geo:1;/sw/arch/Debian10/EB_production/2021/modulefiles/debugger:1;/sw/arch/Debian10/EB_production/2021/modulefiles/chem:1;/sw/arch/Debian10/EB_production/2021/modulefiles/data:1;/sw/arch/Debian10/EB_production/2021/modulefiles/vis:1;/sw/arch/Debian10/EB_production/2021/modulefiles/bio:1;/sw/arch/Debian10/EB_production/2021/modulefiles/math:1;/sw/arch/Debian10/EB_production/2021/modulefiles/cae:1;/sw/arch/Debian10/EB_production/2021/modulefiles/system:1;/sw/arch/Debian10/EB_production/2021/modulefiles/toolchain:1;/sw/arch/Debian10/EB_production/2021/modulefiles/numlib:1;/sw/arch/Debian10/EB_production/2021/modulefiles/mpi:1;/sw/arch/Debian10/EB_production/2021/modulefiles/tools:1;/sw/arch/Debian10/EB_production/2021/modulefiles/lib:1;/sw/arch/Debian10/EB_production/2021/modulefiles/compiler:1;/sw/arch/Debian10/EB_production/2021/modulefiles/lang:1;/sw/arch/Debian10/EB_production/2021/modulefiles/devel:1;/sw/noarch/Debian10/2021/modulefiles/all:1 INFO 2021-12-02 18:53:18,577 env.py: 50: __LMOD_REF_COUNT_PATH: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05:1;/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/sbin:1;/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/bin:1;/sw/noarch/Debian10/2021/software/os_binary_wrappers:1;/sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/condabin:1;/usr/bin:1;/bin:1;/usr/bin/X11:1;/usr/games:1;/usr/sara/bin:1;/opt/slurm/bin:1;/opt/slurm/sbin:1;/opt/slurm/sw/current/bin:1 INFO 2021-12-02 18:53:18,577 env.py: 50: LMOD_REF_COUNT_PKG_CONFIG_PATH: /sw/arch/Debian10/EB_production/2021/software/Anaconda3/2021.05/lib/pkgconfig:1 INFO 2021-12-02 18:53:18,577 env.py: 50: LMOD_REF_COUNTLMFILES_: 
/sw/noarch/modulefiles/environment/2021.lua:1;/sw/arch/Debian10/EB_production/2021/modulefiles/lang/Anaconda3/2021.05.lua:1 INFO 2021-12-02 18:53:18,577 env.py: 50: LMOD_SET_FPATH: 1 INFO 2021-12-02 18:53:18,577 misc.py: 161: Set start method of multiprocessing to forkserver INFO 2021-12-02 18:53:18,577 train.py: 105: Setting seed.... INFO 2021-12-02 18:53:18,577 misc.py: 173: MACHINE SEED: 0 INFO 2021-12-02 18:53:18,633 hydra_config.py: 132: Training with config: INFO 2021-12-02 18:53:18,639 hydra_config.py: 141: {'CHECKPOINT': {'APPEND_DISTR_RUN_ID': False, 'AUTO_RESUME': True, 'BACKEND': 'disk', 'CHECKPOINT_FREQUENCY': 1, 'CHECKPOINT_ITER_FREQUENCY': -1, 'DIR': '/home/bdolicki/thesis//hissl-logs/train_nct_dino/checkpoints/8462605', 'LATEST_CHECKPOINT_RESUME_FILE_NUM': 1, 'OVERWRITE_EXISTING': True, 'USE_SYMLINK_CHECKPOINT_FOR_RESUME': False}, 'CLUSTERFIT': {'CLUSTER_BACKEND': 'faiss', 'DATA_LIMIT': -1, 'DATA_LIMIT_SAMPLING': {'SEED': 0}, 'FEATURES': {'DATASET_NAME': '', 'DATA_PARTITION': 'TRAIN', 'DIMENSIONALITY_REDUCTION': 0, 'EXTRACT': False, 'LAYER_NAME': '', 'PATH': '.', 'TEST_PARTITION': 'TEST'}, 'NUM_CLUSTERS': 16000, 'NUM_ITER': 50, 'OUTPUT_DIR': '.'}, 'DATA': {'DDP_BUCKET_CAP_MB': 25, 'ENABLE_ASYNC_GPU_COPY': True, 'NUM_DATALOADER_WORKERS': 5, 'PIN_MEMORY': True, 'TEST': {'BASE_DATASET': 'generic_ssl', 'BATCHSIZE_PER_REPLICA': 256, 'COLLATE_FUNCTION': 'default_collate', 'COLLATE_FUNCTION_PARAMS': {}, 'COPY_DESTINATION_DIR': '', 'COPY_TO_LOCAL_DISK': False, 'DATASET_NAMES': ['imagenet1k_folder'], 'DATA_LIMIT': -1, 'DATA_LIMIT_SAMPLING': {'IS_BALANCED': False, 'SEED': 0, 'SKIP_NUM_SAMPLES': 0}, 'DATA_PATHS': [], 'DATA_SOURCES': [], 'DEFAULT_GRAY_IMG_SIZE': 224, 'DROP_LAST': False, 'ENABLE_QUEUE_DATASET': False, 'INPUT_KEY_NAMES': ['data'], 'LABEL_PATHS': [], 'LABEL_SOURCES': [], 'LABEL_TYPE': 'sample_index', 'MMAP_MODE': True, 'NEW_IMG_PATH_PREFIX': '', 'RANDOM_SYNTHETIC_IMAGES': False, 'REMOVE_IMG_PATH_PREFIX': '', 'TARGET_KEY_NAMES': ['label'], 
'TRANSFORMS': [], 'USE_DEBUGGING_SAMPLER': False, 'USE_STATEFUL_DISTRIBUTED_SAMPLER': False}, 'TRAIN': {'BASE_DATASET': 'generic_ssl', 'BATCHSIZE_PER_REPLICA': 10, 'COLLATE_FUNCTION': 'simclr_collator', 'COLLATE_FUNCTION_PARAMS': {}, 'COPY_DESTINATION_DIR': '/tmp/imagenet1k', 'COPY_TO_LOCAL_DISK': False, 'DATASET_NAMES': ['dummy_data_folder'], 'DATA_LIMIT': 1000, 'DATA_LIMIT_SAMPLING': {'IS_BALANCED': False, 'SEED': 0, 'SKIP_NUM_SAMPLES': 0}, 'DATA_PATHS': [], 'DATA_SOURCES': ['synthetic'], 'DEFAULT_GRAY_IMG_SIZE': 224, 'DROP_LAST': True, 'ENABLE_QUEUE_DATASET': False, 'INPUT_KEY_NAMES': ['data'], 'LABEL_PATHS': [], 'LABEL_SOURCES': [], 'LABEL_TYPE': 'sample_index', 'MMAP_MODE': True, 'NEW_IMG_PATH_PREFIX': '', 'RANDOM_SYNTHETIC_IMAGES': False, 'REMOVE_IMG_PATH_PREFIX': '', 'TARGET_KEY_NAMES': ['label'], 'TRANSFORMS': [{'name': 'ImgReplicatePil', 'num_times': 2}, {'name': 'RandomResizedCrop', 'size': 224}, {'name': 'RandomHorizontalFlip', 'p': 0.5}, {'name': 'ImgPilColorDistortion', 'strength': 1.0}, {'name': 'ImgPilGaussianBlur', 'p': 0.5, 'radius_max': 2.0, 'radius_min': 0.1}, {'name': 'ToTensor'}, {'mean': [0.485, 0.456, 0.406], 'name': 'Normalize', 'std': [0.229, 0.224, 0.225]}], 'USE_DEBUGGING_SAMPLER': False, 'USE_STATEFUL_DISTRIBUTED_SAMPLER': False}}, 'DISTRIBUTED': {'BACKEND': 'nccl', 'BROADCAST_BUFFERS': True, 'INIT_METHOD': 'tcp', 'MANUAL_GRADIENT_REDUCTION': False, 'NCCL_DEBUG': False, 'NCCL_SOCKET_NTHREADS': '', 'NUM_NODES': 2, 'NUM_PROC_PER_NODE': 4, 'RUN_ID': 'localhost:46357'}, 'EXTRACT_FEATURES': {'CHUNK_THRESHOLD': 0, 'OUTPUT_DIR': ''}, 'HOOKS': {'CHECK_NAN': True, 'LOG_GPU_STATS': True, 'MEMORY_SUMMARY': {'DUMP_MEMORY_ON_EXCEPTION': False, 'LOG_ITERATION_NUM': 0, 'PRINT_MEMORY_SUMMARY': True}, 'MODEL_COMPLEXITY': {'COMPUTE_COMPLEXITY': False, 'INPUT_SHAPE': [3, 224, 224]}, 'PERF_STATS': {'MONITOR_PERF_STATS': False, 'PERF_STAT_FREQUENCY': -1, 'ROLLING_BTIME_FREQ': -1}, 'TENSORBOARD_SETUP': {'EXPERIMENT_LOG_DIR': 'tensorboard', 
'FLUSH_EVERY_N_MIN': 5, 'LOG_DIR': '.', 'LOG_PARAMS': True, 'LOG_PARAMS_EVERY_N_ITERS': 310, 'LOG_PARAMS_GRADIENTS': True, 'USE_TENSORBOARD': False}}, 'IMG_RETRIEVAL': {'CROP_QUERY_ROI': False, 'DATASET_PATH': '', 'DEBUG_MODE': False, 'EVAL_BINARY_PATH': '', 'EVAL_DATASET_NAME': 'Paris', 'FEATS_PROCESSING_TYPE': '', 'GEM_POOL_POWER': 4.0, 'IMG_SCALINGS': [1], 'NORMALIZE_FEATURES': True, 'NUM_DATABASE_SAMPLES': -1, 'NUM_QUERY_SAMPLES': -1, 'NUM_TRAINING_SAMPLES': -1, 'N_PCA': 512, 'RESIZE_IMG': 1024, 'SAVE_FEATURES': False, 'SAVE_RETRIEVAL_RANKINGS_SCORES': True, 'SIMILARITY_MEASURE': 'cosine_similarity', 'SPATIAL_LEVELS': 3, 'TRAIN_DATASET_NAME': 'Oxford', 'TRAIN_PCA_WHITENING': True, 'USE_DISTRACTORS': False, 'WHITEN_IMG_LIST': ''}, 'LOG_FREQUENCY': 1, 'LOSS': {'CrossEntropyLoss': {'ignore_index': -1}, 'barlow_twins_loss': {'embeddingdim': 8192, 'lambda': 0.0051, 'scale_loss': 0.024}, 'bce_logits_multiple_output_single_target': {'normalize_output': False, 'reduction': 'none', 'world_size': 1}, 'cross_entropy_multiple_output_single_target': {'ignore_index': -1, 'normalize_output': False, 'reduction': 'mean', 'temperature': 1.0, 'weight': None}, 'deepclusterv2_loss': {'BATCHSIZE_PER_REPLICA': 256, 'DROP_LAST': True, 'kmeans_iters': 10, 'memory_params': {'crops_for_mb': [0], 'embedding_dim': 128}, 'num_clusters': [3000, 3000, 3000], 'num_crops': 2, 'num_train_samples': -1, 'temperature': 0.1}, 'dino_loss': {'crops_for_teacher': [0, 1], 'ema_center': 0.9, 'momentum': 0.996, 'normalize_last_layer': True, 'output_dim': 65536, 'student_temp': 0.1, 'teacher_temp_max': 0.07, 'teacher_temp_min': 0.04, 'teacher_temp_warmup_iters': 37500}, 'moco_loss': {'embedding_dim': 128, 'momentum': 0.999, 'queue_size': 65536, 'temperature': 0.2}, 'multicrop_simclr_info_nce_loss': {'buffer_params': {'effective_batch_size': 4096, 'embedding_dim': 128, 'world_size': 64}, 'num_crops': 2, 'temperature': 0.1}, 'name': 'simclr_info_nce_loss', 'nce_loss_with_memory': {'loss_type': 'nce', 
'loss_weights': [1.0], 'memory_params': {'embedding_dim': 128, 'memory_size': -1, 'momentum': 0.5, 'norm_init': True, 'update_mem_on_forward': True}, 'negative_sampling_params': {'num_negatives': 16000, 'type': 'random'}, 'norm_constant': -1, 'norm_embedding': True, 'num_train_samples': -1, 'temperature': 0.07, 'update_mem_with_emb_index': -100}, 'simclr_info_nce_loss': {'buffer_params': {'effective_batch_size': 160, 'embedding_dim': 128, 'world_size': 8}, 'temperature': 0.1}, 'swav_loss': {'crops_for_assign': [0, 1], 'embedding_dim': 128, 'epsilon': 0.05, 'normalize_last_layer': True, 'num_crops': 2, 'num_iters': 3, 'num_prototypes': [3000], 'output_dir': '.', 'queue': {'local_queue_length': 0, 'queue_length': 0, 'start_iter': 0}, 'temp_hard_assignment_iters': 0, 'temperature': 0.1, 'use_double_precision': False}, 'swav_momentum_loss': {'crops_for_assign': [0, 1], 'embedding_dim': 128, 'epsilon': 0.05, 'momentum': 0.99, 'momentum_eval_mode_iter_start': 0, 'normalize_last_layer': True, 'num_crops': 2, 'num_iters': 3, 'num_prototypes': [3000], 'queue': {'local_queue_length': 0, 'queue_length': 0, 'start_iter': 0}, 'temperature': 0.1, 'use_double_precision': False}}, 'MACHINE': {'DEVICE': 'gpu'}, 'METERS': {'accuracy_list_meter': {'meter_names': [], 'num_meters': 1, 'topk_values': [1]}, 'enable_training_meter': True, 'mean_ap_list_meter': {'max_cpu_capacity': -1, 'meter_names': [], 'num_classes': 9605, 'num_meters': 1}, 'model_output_mask': False, 'name': '', 'names': [], 'precision_at_k_list_meter': {'meter_names': [], 'num_meters': 1, 'topk_values': [1]}, 'recall_at_k_list_meter': {'meter_names': [], 'num_meters': 1, 'topk_values': [1]}}, 'MODEL': {'ACTIVATION_CHECKPOINTING': {'NUM_ACTIVATION_CHECKPOINTING_SPLITS': 2, 'USE_ACTIVATION_CHECKPOINTING': False}, 'AMP_PARAMS': {'AMP_ARGS': {'keep_batchnorm_fp32': True, 'loss_scale': 'dynamic', 'master_weights': True, 'opt_level': 'O3'}, 'AMP_TYPE': 'apex', 'USE_AMP': False}, 'BASE_MODEL_NAME': 'multi_input_output_model', 
'CUDA_CACHE': {'CLEAR_CUDA_CACHE': False, 'CLEAR_FREQ': 100}, 'FEATURE_EVAL_SETTINGS': {'EVAL_MODE_ON': False, 'EVAL_TRUNK_AND_HEAD': False, 'EXTRACT_TRUNK_FEATURES_ONLY': False, 'FREEZE_TRUNK_AND_HEAD': False, 'FREEZE_TRUNK_ONLY': False, 'LINEAR_EVAL_FEAT_POOL_OPS_MAP': [], 'SHOULD_FLATTEN_FEATS': True}, 'FSDP_CONFIG': {'AUTO_WRAP_THRESHOLD': 0, 'bucket_cap_mb': 0, 'clear_autocast_cache': True, 'compute_dtype': torch.float32, 'flatten_parameters': True, 'fp32_reduce_scatter': False, 'mixed_precision': True, 'verbose': True}, 'GRAD_CLIP': {'MAX_NORM': 1, 'NORM_TYPE': 2, 'USE_GRAD_CLIP': False}, 'HEAD': {'BATCHNORM_EPS': 1e-05, 'BATCHNORM_MOMENTUM': 0.1, 'PARAMS': [['mlp', {'dims': [2048, 2048], 'use_relu': True}], ['mlp', {'dims': [2048, 128]}]], 'PARAMS_MULTIPLIER': 1.0}, 'INPUT_TYPE': 'rgb', 'MULTI_INPUT_HEAD_MAPPING': [], 'NON_TRAINABLE_PARAMS': [], 'SHARDED_DDP_SETUP': {'USE_SDP': False, 'reduce_buffer_size': -1}, 'SINGLE_PASS_EVERY_CROP': False, 'SYNC_BN_CONFIG': {'CONVERT_BN_TO_SYNC_BN': True, 'GROUP_SIZE': -1, 'SYNC_BN_TYPE': 'pytorch'}, 'TEMP_FROZEN_PARAMS_ITER_MAP': [], 'TRUNK': {'CONVIT': {'CLASS_TOKEN_IN_LOCAL_LAYERS': False, 'LOCALITY_DIM': 10, 'LOCALITY_STRENGTH': 1.0, 'N_GPSA_LAYERS': 10, 'USE_LOCAL_INIT': True}, 'EFFICIENT_NETS': {}, 'NAME': 'resnet', 'REGNET': {}, 'RESNETS': {'DEPTH': 50, 'GROUPNORM_GROUPS': 32, 'GROUPS': 1, 'LAYER4_STRIDE': 2, 'NORM': 'BatchNorm', 'STANDARDIZE_CONVOLUTIONS': False, 'WIDTH_MULTIPLIER': 1, 'WIDTH_PER_GROUP': 64, 'ZERO_INIT_RESIDUAL': False}, 'VISION_TRANSFORMERS': {'ATTENTION_DROPOUT_RATE': 0, 'CLASSIFIER': 'token', 'DROPOUT_RATE': 0, 'DROP_PATH_RATE': 0, 'HIDDEN_DIM': 768, 'IMAGE_SIZE': 224, 'MLP_DIM': 3072, 'NUM_HEADS': 12, 'NUM_LAYERS': 12, 'PATCH_SIZE': 16, 'QKV_BIAS': False, 'QK_SCALE': False, 'name': None}, 'XCIT': {'ATTENTION_DROPOUT_RATE': 0, 'DROPOUT_RATE': 0, 'DROP_PATH_RATE': 0.05, 'ETA': 1, 'HIDDEN_DIM': 384, 'IMAGE_SIZE': 224, 'NUM_HEADS': 8, 'NUM_LAYERS': 12, 'PATCH_SIZE': 16, 'QKV_BIAS': True, 
'QK_SCALE': False, 'TOKENS_NORM': True, 'name': None}}, 'WEIGHTS_INIT': {'APPEND_PREFIX': '', 'PARAMS_FILE': '', 'REMOVE_PREFIX': '', 'SKIP_LAYERS': ['num_batches_tracked'], 'STATE_DICT_KEY_NAME': 'classy_state_dict'}, '_MODEL_INIT_SEED': 0}, 'MONITORING': {'MONITOR_ACTIVATION_STATISTICS': 0}, 'MONITOR_PERF_STATS': True, 'MULTI_PROCESSING_METHOD': 'forkserver', 'NEAREST_NEIGHBOR': {'L2_NORM_FEATS': False, 'SIGMA': 0.1, 'TOPK': 200}, 'OPTIMIZER': {'betas': [0.9, 0.999], 'construct_single_param_group_only': False, 'head_optimizer_params': {'use_different_lr': False, 'use_different_wd': False, 'weight_decay': 1e-06}, 'larc_config': {'clip': False, 'eps': 1e-08, 'trust_coefficient': 0.001}, 'momentum': 0.9, 'name': 'sgd', 'nesterov': False, 'non_regularized_parameters': [], 'num_epochs': 1, 'param_schedulers': {'lr': {'auto_lr_scaling': {'auto_scale': False, 'base_lr_batch_size': 256, 'base_value': 0.3, 'scaling_type': 'linear'}, 'end_value': 0.0, 'interval_scaling': ['rescaled', 'rescaled'], 'lengths': [0.1, 0.9], 'milestones': [30, 60], 'name': 'composite', 'schedulers': [{'end_value': 4.8, 'name': 'linear', 'start_value': 0.6}, {'end_value': 0.0048, 'is_adaptive': True, 'name': 'cosine_warm_restart', 'restart_interval_length': 0.334, 'start_value': 4.8, 'wave_type': 'full'}], 'start_value': 0.1, 'update_interval': 'step', 'value': 0.1, 'values': [0.1, 0.01, 0.001]}, 'lr_head': {'auto_lr_scaling': {'auto_scale': False, 'base_lr_batch_size': 256, 'base_value': 0.3, 'scaling_type': 'linear'}, 'end_value': 0.0, 'interval_scaling': ['rescaled', 'rescaled'], 'lengths': [0.1, 0.9], 'milestones': [30, 60], 'name': 'composite', 'schedulers': [{'end_value': 4.8, 'name': 'linear', 'start_value': 0.6}, {'end_value': 0.0048, 'is_adaptive': True, 'name': 'cosine_warm_restart', 'restart_interval_length': 0.334, 'start_value': 4.8, 'wave_type': 'full'}], 'start_value': 0.1, 'update_interval': 'step', 'value': 0.1, 'values': [0.1, 0.01, 0.001]}}, 'regularize_bias': True, 
'regularize_bn': False, 'use_larc': True, 'use_zero': False, 'weight_decay': 1e-06}, 'PERF_STAT_FREQUENCY': 10, 'PROFILING': {'MEMORY_PROFILING': {'TRACK_BY_LAYER_MEMORY': False}, 'NUM_ITERATIONS': 10, 'OUTPUT_FOLDER': '.', 'PROFILED_RANKS': [0, 1], 'RUNTIME_PROFILING': {'LEGACY_PROFILER': False, 'PROFILE_CPU': True, 'PROFILE_GPU': True, 'USE_PROFILER': False}, 'START_ITERATION': 0, 'STOP_TRAINING_AFTER_PROFILING': False, 'WARMUP_ITERATIONS': 0}, 'REPRODUCIBILITY': {'CUDDN_DETERMINISTIC': False}, 'ROLLING_BTIME_FREQ': 5, 'SEED_VALUE': 0, 'SLURM': {'ADDITIONAL_PARAMETERS': {}, 'COMMENT': 'vissl job', 'CONSTRAINT': '', 'LOG_FOLDER': '.', 'MEM_GB': 250, 'NAME': 'vissl', 'NUM_CPU_PER_PROC': 8, 'PARTITION': '', 'PORT_ID': 40050, 'TIME_HOURS': 72, 'TIME_MINUTES': 0, 'USE_SLURM': False}, 'SVM': {'cls_list': [], 'costs': {'base': -1.0, 'costs_list': [0.1, 0.01], 'power_range': [4, 20]}, 'cross_val_folds': 3, 'dual': True, 'force_retrain': False, 'loss': 'squared_hinge', 'low_shot': {'dataset_name': 'voc', 'k_values': [1, 2, 4, 8, 16, 32, 64, 96], 'sample_inds': [1, 2, 3, 4, 5]}, 'max_iter': 2000, 'normalize': True, 'penalty': 'l2'}, 'TEST_EVERY_NUM_EPOCH': 1, 'TEST_MODEL': False, 'TEST_ONLY': False, 'TRAINER': {'TASK_NAME': 'self_supervision_task', 'TRAIN_STEP_NAME': 'standard_train_step'}, 'VERBOSE': False} INFO 2021-12-02 18:53:19,000 train.py: 117: System config:
sys.platform         linux
Python               3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0]
numpy                1.21.2
Pillow               8.4.0
vissl                0.1.6 @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/vissl
GPU available        True
GPU 0,1,2,3          TITAN RTX
CUDA_HOME            None
torchvision          0.8.0a0 @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/torchvision
hydra                1.1.1 @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/hydra
classy_vision        0.7.0.dev @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/classy_vision
apex                 0.1 @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/apex
PyTorch              1.7.1 @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/torch
PyTorch debug build  False
PyTorch built with:
CPU info:
Architecture         x86_64
CPU op-mode(s)       32-bit, 64-bit
Byte Order           Little Endian
Address sizes        46 bits physical, 48 bits virtual
CPU(s)               24
On-line CPU(s) list  0-23
Thread(s) per core   1
Core(s) per socket   12
Socket(s)            2
NUMA node(s)         4
Vendor ID            GenuineIntel
CPU family           6
Model                85
Model name           Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
Stepping             4
CPU MHz              2957.695
BogoMIPS             4600.00
Virtualization       VT-x
L1d cache            32K
L1i cache            32K
L2 cache             1024K
L3 cache             16896K
NUMA node0 CPU(s)    0,4,8,12,16,20
NUMA node1 CPU(s)    1,5,9,13,17,21
NUMA node2 CPU(s)    2,6,10,14,18,22
NUMA node3 CPU(s)    3,7,11,15,19,23
INFO 2021-12-02 18:53:19,001 trainer_main.py: 112: Using Distributed init method: tcp://localhost:46357, world_size: 8, rank: 0 r29n2:27632:27632 [0] NCCL INFO Bootstrap : Using [0]admin0:145.101.32.23<0> r29n2:27632:27632 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation r29n2:27632:27632 [0] NCCL INFO NET/IB : No device found. r29n2:27632:27632 [0] NCCL INFO NET/Socket : Using [0]admin0:145.101.32.23<0> r29n2:27632:27632 [0] NCCL INFO Using network Socket NCCL version 2.7.8+cuda10.2 r29n2:27634:27634 [2] NCCL INFO Bootstrap : Using [0]admin0:145.101.32.23<0> r29n2:27634:27634 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation r29n2:27633:27633 [1] NCCL INFO Bootstrap : Using [0]admin0:145.101.32.23<0> r29n2:27633:27633 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation r29n2:27635:27635 [3] NCCL INFO Bootstrap : Using [0]admin0:145.101.32.23<0> r29n2:27635:27635 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation r29n2:27634:27634 [2] NCCL INFO NET/IB : No device found. r29n2:27634:27634 [2] NCCL INFO NET/Socket : Using [0]admin0:145.101.32.23<0> r29n2:27634:27634 [2] NCCL INFO Using network Socket r29n2:27633:27633 [1] NCCL INFO NET/IB : No device found. r29n2:27635:27635 [3] NCCL INFO NET/IB : No device found. r29n2:27635:27635 [3] NCCL INFO NET/Socket : Using [0]admin0:145.101.32.23<0> r29n2:27635:27635 [3] NCCL INFO Using network Socket r29n2:27633:27633 [1] NCCL INFO NET/Socket : Using [0]admin0:145.101.32.23<0> r29n2:27633:27633 [1] NCCL INFO Using network Socket slurmstepd: error: JOB 8462605 ON r29n2 CANCELLED AT 2021-12-02T19:53:21 DUE TO TIME LIMIT
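Two details in the logs above seem worth pulling out. In job 8462603 the shell reports `config.DISTRIBUTED.NUM_NODES=2: command not found` at line 46 of the job script, which usually points to a missing trailing `\` on the preceding line, so that run apparently never received the multi-node override. In job 8462605 every rank reports `tcp://localhost:46357` with `world_size: 8`, yet only ranks 0 to 3 (all on `r29n2`) ever show up, and the job sits idle until the time limit. One plausible reading (an assumption, not something the logs confirm) is that the four ranks expected on `r29n5` either never started or could not reach a rendezvous address of `localhost`, which only resolves on the node itself.

The sketch below shows one way to derive a shared rendezvous address and a per-node id from the Slurm variables printed in the log (`SLURM_JOB_NODELIST`, `SLURM_NODEID`). The helpers `expand_simple_nodelist` and `rendezvous_settings` are hypothetical names introduced here for illustration, not part of VISSL or submitit, and the node-list parser only handles a single bracket group; `scontrol show hostnames` covers the full syntax.

```python
import re


def expand_simple_nodelist(nodelist):
    """Expand a simple Slurm node list such as 'r29n[2,5]' or 'r29n[2-4]'.

    Hypothetical, simplified helper: it handles at most one bracket group.
    """
    match = re.fullmatch(r"([^\[\],]+)\[([^\]]+)\]", nodelist)
    if match is None:
        # No bracket group: treat the value as comma-separated host names.
        return [name for name in nodelist.split(",") if name]
    prefix, body = match.groups()
    hosts = []
    for part in body.split(","):
        if "-" in part:
            start, end = part.split("-")
            width = len(start)  # preserve zero padding such as 01-03
            for i in range(int(start), int(end) + 1):
                hosts.append(f"{prefix}{i:0{width}d}")
        else:
            hosts.append(f"{prefix}{part}")
    return hosts


def rendezvous_settings(env, port=46357):
    """Return (node_id, run_id) for the node described by `env`.

    Assumption: the first host in the node list acts as the rendezvous
    host, and `port` is free on it (46357 is simply the port from the log).
    """
    hosts = expand_simple_nodelist(env["SLURM_JOB_NODELIST"])
    node_id = int(env.get("SLURM_NODEID", "0"))
    run_id = f"{hosts[0]}:{port}"
    return node_id, run_id


if __name__ == "__main__":
    # Values copied from the environment dump in the log above.
    example_env = {"SLURM_JOB_NODELIST": "r29n[2,5]", "SLURM_NODEID": "0"}
    node_id, run_id = rendezvous_settings(example_env)
    print(node_id, run_id)  # prints: 0 r29n2:46357
```

In a real job one would pass `os.environ` instead of `example_env`, run the launcher once per node (for example through `srun`), and hand the resulting `host:port` to `config.DISTRIBUTED.RUN_ID` together with the node id, as suggested earlier in the thread.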
wget -nc -q https://github.com/facebookresearch/vissl/raw/main/vissl/utils/collect_env.py && python collect_env.py
sys.platform         linux
Python               3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0]
numpy                1.21.2
Pillow               8.4.0
vissl                0.1.6 @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/vissl
GPU available        True
GPU 0,1,2,3          TITAN RTX
CUDA_HOME            None
torchvision          0.8.0a0 @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/torchvision
hydra                1.1.1 @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/hydra
classy_vision        0.7.0.dev @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/classy_vision
apex                 0.1 @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/apex
PyTorch              1.7.1 @/home/bdolicki/.conda/envs/vissl/lib/python3.8/site-packages/torch
PyTorch debug build  False
PyTorch built with:
CPU info:
Architecture         x86_64
CPU op-mode(s)       32-bit, 64-bit
Byte Order           Little Endian
Address sizes        46 bits physical, 48 bits virtual
CPU(s)               24
On-line CPU(s) list  0-23
Thread(s) per core   1
Core(s) per socket   12
Socket(s)            2
NUMA node(s)         4
Vendor ID            GenuineIntel
CPU family           6
Model                85
Model name           Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
Stepping             4
CPU MHz              1000.080
BogoMIPS             4600.00
Virtualization       VT-x
L1d cache            32K
L1i cache            32K
L2 cache             1024K
L3 cache             16896K
NUMA node0 CPU(s)    0,4,8,12,16,20
NUMA node1 CPU(s)    1,5,9,13,17,21
NUMA node2 CPU(s)    2,6,10,14,18,22
NUMA node3 CPU(s)    3,7,11,15,19,23