facebookresearch / vissl

VISSL is FAIR's library of extensible, modular and scalable components for SOTA Self-Supervised Learning with images.
https://vissl.ai
MIT License

How to register imagenet1k with prebuilt vissl #546

Closed VicaYang closed 2 years ago

VicaYang commented 2 years ago

If you do not know the root cause of the problem, and wish someone to help you, please post according to this template:

Instructions To Reproduce the Issue:

Check https://stackoverflow.com/help/minimal-reproducible-example for how to ask good questions. Simplify the steps to reproduce the issue using suggestions from the above link, and provide them below:

  1. full code you wrote or full changes you made (git diff): I copied tools/run_distributed_engines.py and the configs folder, and modified configs/config/dataset_catalog.json
    diff --git a/../vissl/configs/config/dataset_catalog.json b/configs/config/dataset_catalog.json
    index 57e9dd4..dd6fa34 100644
    --- a/../vissl/configs/config/dataset_catalog.json
    +++ b/configs/config/dataset_catalog.json
    @@ -4,8 +4,8 @@
         "val": ["airstore://flashblade_imagenet_val", "<unused>"]
     },
     "imagenet1k_folder": {
    -        "train": ["<img_path>", "<lbl_path>"],
    -        "val": ["<img_path>", "<lbl_path>"]
    +        "train": ["/data/vica/fastdisk/ILSVRC2012/train", "/data/vica/fastdisk/ILSVRC2012/train"],
    +        "val": ["/data/vica/fastdisk/ILSVRC2012/val", "/data/vica/fastdisk/ILSVRC2012/val"]
     },
     "imagenet_a_filelist": {
         "train": ["<not_used>", "<not_used>"],
  2. what exact command you run:
    python tools/run_distributed_engines.py \
    hydra.verbose=true \
    config=benchmark/linear_image_classification/imagenet1k/eval_resnet_8gpu_transfer_in1k_linear \
    config.CHECKPOINT.DIR="supervised_RN50" \
    config.MODEL.WEIGHTS_INIT.PARAMS_FILE="../weights/resnet50-19c8e357.pth" \
    config.MODEL.WEIGHTS_INIT.APPEND_PREFIX="trunk.base_model._feature_blocks." \
    config.MODEL.WEIGHTS_INIT.STATE_DICT_KEY_NAME="" 
  3. full logs you observed: I removed some of the weight-loading logs and the verbose config dump because the post exceeded the 65536-character limit.
    
    ####### overrides: ['hydra.verbose=true', 'config=benchmark/linear_image_classification/imagenet1k/eval_resnet_8gpu_transfer_in1k_linear', 'config.DATA.TRAIN.DATA_PATHS=[/data/vica/fastdisk/ILSVRC2012/train]', 'config.DATA.TEST.DATA_PATHS=[/data/vica/fastdisk/ILSVRC2012/val]', 'config.CHECKPOINT.DIR=supervised_RN50', 'config.MODEL.WEIGHTS_INIT.PARAMS_FILE=../weights/resnet50-19c8e357.pth', 'config.MODEL.WEIGHTS_INIT.APPEND_PREFIX=trunk.base_model._feature_blocks.', 'config.MODEL.WEIGHTS_INIT.STATE_DICT_KEY_NAME=']
    INFO 2022-05-12 09:02:58,411 train.py:  94: Env set for rank: 2, dist_rank: 2
    INFO 2022-05-12 09:02:58,412 misc.py: 161: Set start method of multiprocessing to forkserver
    INFO 2022-05-12 09:02:58,412 train.py: 105: Setting seed....
    INFO 2022-05-12 09:02:58,412 misc.py: 173: MACHINE SEED: 84
    INFO 2022-05-12 09:02:58,423 train.py:  94: Env set for rank: 5, dist_rank: 5
    INFO 2022-05-12 09:02:58,423 misc.py: 161: Set start method of multiprocessing to forkserver
    INFO 2022-05-12 09:02:58,423 train.py: 105: Setting seed....
    INFO 2022-05-12 09:02:58,423 misc.py: 173: MACHINE SEED: 168
    INFO 2022-05-12 09:02:58,424 train.py:  94: Env set for rank: 6, dist_rank: 6
    INFO 2022-05-12 09:02:58,424 misc.py: 161: Set start method of multiprocessing to forkserver
    INFO 2022-05-12 09:02:58,424 train.py: 105: Setting seed....
    INFO 2022-05-12 09:02:58,424 misc.py: 173: MACHINE SEED: 196
    INFO 2022-05-12 09:02:58,440 train.py:  94: Env set for rank: 7, dist_rank: 7
    INFO 2022-05-12 09:02:58,440 misc.py: 161: Set start method of multiprocessing to forkserver
    INFO 2022-05-12 09:02:58,440 train.py: 105: Setting seed....
    INFO 2022-05-12 09:02:58,440 misc.py: 173: MACHINE SEED: 224
    INFO 2022-05-12 09:02:58,447 train.py:  94: Env set for rank: 1, dist_rank: 1
    INFO 2022-05-12 09:02:58,447 misc.py: 161: Set start method of multiprocessing to forkserver
    INFO 2022-05-12 09:02:58,447 train.py: 105: Setting seed....
    INFO 2022-05-12 09:02:58,447 misc.py: 173: MACHINE SEED: 56
    INFO 2022-05-12 09:02:58,457 train.py:  94: Env set for rank: 0, dist_rank: 0
    INFO 2022-05-12 09:02:58,457 env.py:  50: BROWSER:  /home/vica/.vscode-server/bin/8908a9ca0f221f36507231afb39d2d8d1e182702/bin/helpers/browser.sh
    INFO 2022-05-12 09:02:58,457 env.py:  50: COLORTERM:    truecolor
    INFO 2022-05-12 09:02:58,457 env.py:  50: CONDA_DEFAULT_ENV:    vissl
    INFO 2022-05-12 09:02:58,457 env.py:  50: CONDA_EXE:    /home/vica/anaconda3/bin/conda
    INFO 2022-05-12 09:02:58,457 env.py:  50: CONDA_PREFIX: /home/vica/anaconda3/envs/vissl
    INFO 2022-05-12 09:02:58,457 env.py:  50: CONDA_PREFIX_1:   /home/vica/anaconda3
    INFO 2022-05-12 09:02:58,457 env.py:  50: CONDA_PROMPT_MODIFIER:    (vissl) 
    INFO 2022-05-12 09:02:58,457 env.py:  50: CONDA_PYTHON_EXE: /home/vica/anaconda3/bin/python
    INFO 2022-05-12 09:02:58,457 env.py:  50: CONDA_SHLVL:  2
    INFO 2022-05-12 09:02:58,457 env.py:  50: CUDA_VISIBLE_DEVICES: 2,3,4,5,6,7,8,9
    INFO 2022-05-12 09:02:58,457 env.py:  50: DBUS_SESSION_BUS_ADDRESS: unix:path=/run/user/1001/bus
    INFO 2022-05-12 09:02:58,457 env.py:  50: GOODLS_APIKEY:    AIzaSyDxRdqDutdNqVGHrMR1oHCw7czaJHUcdrQ
    INFO 2022-05-12 09:02:58,457 env.py:  50: HOME: /home/vica
    INFO 2022-05-12 09:02:58,458 env.py:  50: LANG: en_US.UTF-8
    INFO 2022-05-12 09:02:58,458 env.py:  50: LESSCLOSE:    /usr/bin/lesspipe %s %s
    INFO 2022-05-12 09:02:58,458 env.py:  50: LESSOPEN: | /usr/bin/lesspipe %s
    INFO 2022-05-12 09:02:58,458 env.py:  50: LOCAL_RANK:   0
    INFO 2022-05-12 09:02:58,458 env.py:  50: LOGNAME:  vica
    INFO 2022-05-12 09:02:58,458 env.py:  50: LS_COLORS:    rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:
    INFO 2022-05-12 09:02:58,458 env.py:  50: MAIL: /var/mail/vica
    INFO 2022-05-12 09:02:58,458 env.py:  50: PATH: /home/vica/bin:/home/vica/.local/bin:/home/vica/perl5/bin:/home/vica/anaconda3/envs/vissl/bin:/home/vica/perl5/bin:/home/vica/anaconda3/envs/vissl/bin:/home/vica/anaconda3/condabin:/home/vica/.vscode-server/bin/dfd34e8260c270da74b5c2d86d61aee4b6d56977/bin/remote-cli:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/vica/.local/bin:/home/vica/.local/bin
    INFO 2022-05-12 09:02:58,458 env.py:  50: PERL5LIB: /home/vica/perl5/lib/perl5:/home/vica/perl5/lib/perl5:/home/vica/perl5/lib/perl5
    INFO 2022-05-12 09:02:58,458 env.py:  50: PERL_LOCAL_LIB_ROOT:  /home/vica/perl5:/home/vica/perl5:/home/vica/perl5
    INFO 2022-05-12 09:02:58,458 env.py:  50: PERL_MB_OPT:  --install_base "/home/vica/perl5"
    INFO 2022-05-12 09:02:58,458 env.py:  50: PERL_MM_OPT:  INSTALL_BASE=/home/vica/perl5
    INFO 2022-05-12 09:02:58,458 env.py:  50: PWD:  /data/vica/feature/exp
    INFO 2022-05-12 09:02:58,458 env.py:  50: RANK: 0
    INFO 2022-05-12 09:02:58,458 env.py:  50: SHELL:    /bin/bash
    INFO 2022-05-12 09:02:58,458 env.py:  50: SHLVL:    5
    INFO 2022-05-12 09:02:58,458 env.py:  50: SSH_CLIENT:   59.66.17.55 10987 22
    INFO 2022-05-12 09:02:58,458 env.py:  50: SSH_CONNECTION:   166.111.81.74 6738 192.168.0.3 22
    INFO 2022-05-12 09:02:58,458 env.py:  50: TERM: screen
    INFO 2022-05-12 09:02:58,458 env.py:  50: TERM_PROGRAM: vscode
    INFO 2022-05-12 09:02:58,458 env.py:  50: TERM_PROGRAM_VERSION: 1.65.1
    INFO 2022-05-12 09:02:58,458 env.py:  50: TMUX: /tmp/tmux-1001/default,19128,3
    INFO 2022-05-12 09:02:58,458 env.py:  50: TMUX_PANE:    %6
    INFO 2022-05-12 09:02:58,458 env.py:  50: TORCH_HOME:   ~/.torch
    INFO 2022-05-12 09:02:58,458 env.py:  50: USER: vica
    INFO 2022-05-12 09:02:58,459 env.py:  50: VSCODE_IPC_HOOK_CLI:  /run/user/1001/vscode-ipc-8ecaa714-547f-47d5-b813-53778ba6ed80.sock
    INFO 2022-05-12 09:02:58,459 env.py:  50: WORLD_SIZE:   8
    INFO 2022-05-12 09:02:58,459 env.py:  50: XDG_DATA_DIRS:    /usr/local/share:/usr/share:/var/lib/snapd/desktop
    INFO 2022-05-12 09:02:58,459 env.py:  50: XDG_RUNTIME_DIR:  /run/user/1001
    INFO 2022-05-12 09:02:58,459 env.py:  50: XDG_SESSION_ID:   21
    INFO 2022-05-12 09:02:58,459 env.py:  50: _:    /home/vica/anaconda3/envs/vissl/bin/python
    INFO 2022-05-12 09:02:58,459 env.py:  50: _CE_CONDA:    
    INFO 2022-05-12 09:02:58,459 env.py:  50: _CE_M:    
    INFO 2022-05-12 09:02:58,459 misc.py: 161: Set start method of multiprocessing to forkserver
    INFO 2022-05-12 09:02:58,459 train.py: 105: Setting seed....
    INFO 2022-05-12 09:02:58,459 misc.py: 173: MACHINE SEED: 28
    INFO 2022-05-12 09:02:58,459 train.py:  94: Env set for rank: 3, dist_rank: 3
    INFO 2022-05-12 09:02:58,459 misc.py: 161: Set start method of multiprocessing to forkserver
    INFO 2022-05-12 09:02:58,459 train.py: 105: Setting seed....
    INFO 2022-05-12 09:02:58,459 misc.py: 173: MACHINE SEED: 112
    INFO 2022-05-12 09:02:58,468 train.py:  94: Env set for rank: 4, dist_rank: 4
    INFO 2022-05-12 09:02:58,468 misc.py: 161: Set start method of multiprocessing to forkserver
    INFO 2022-05-12 09:02:58,468 train.py: 105: Setting seed....
    INFO 2022-05-12 09:02:58,468 misc.py: 173: MACHINE SEED: 140
    INFO 2022-05-12 09:03:01,262 trainer_main.py: 112: Using Distributed init method: tcp://localhost:35561, world_size: 8, rank: 2
    INFO 2022-05-12 09:03:01,337 trainer_main.py: 112: Using Distributed init method: tcp://localhost:35561, world_size: 8, rank: 6
    INFO 2022-05-12 09:03:01,448 trainer_main.py: 112: Using Distributed init method: tcp://localhost:35561, world_size: 8, rank: 3
    INFO 2022-05-12 09:03:01,578 trainer_main.py: 112: Using Distributed init method: tcp://localhost:35561, world_size: 8, rank: 7
    INFO 2022-05-12 09:03:01,588 trainer_main.py: 112: Using Distributed init method: tcp://localhost:35561, world_size: 8, rank: 5
    INFO 2022-05-12 09:03:01,595 trainer_main.py: 112: Using Distributed init method: tcp://localhost:35561, world_size: 8, rank: 1
    INFO 2022-05-12 09:03:01,602 trainer_main.py: 112: Using Distributed init method: tcp://localhost:35561, world_size: 8, rank: 4
    INFO 2022-05-12 09:03:01,602 hydra_config.py: 132: Training with config:
    INFO 2022-05-12 09:03:02,204 train.py: 117: System config:
    -------------------  ------------------------------------------------------------------------------------
    sys.platform         linux
    Python               3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]
    numpy                1.21.5
    Pillow               9.0.1
    vissl                0.1.6 @/home/vica/anaconda3/envs/vissl/lib/python3.8/site-packages/vissl
    GPU available        True
    GPU 0,1,2,3,4,5,6,7  NVIDIA GeForce RTX 2080 Ti
    CUDA_HOME            /usr/local/cuda
    torchvision          0.9.1 @/home/vica/anaconda3/envs/vissl/lib/python3.8/site-packages/torchvision
    hydra                1.1.1 @/home/vica/anaconda3/envs/vissl/lib/python3.8/site-packages/hydra
    classy_vision        0.7.0.dev @/home/vica/anaconda3/envs/vissl/lib/python3.8/site-packages/classy_vision
    apex                 0.1 @/home/vica/anaconda3/envs/vissl/lib/python3.8/site-packages/apex
    PyTorch              1.8.1 @/home/vica/anaconda3/envs/vissl/lib/python3.8/site-packages/torch
    PyTorch debug build  False
    -------------------  ------------------------------------------------------------------------------------
    PyTorch built with:
    - GCC 7.3
    - C++ Version: 201402
    - Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
    - Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
    - OpenMP 201511 (a.k.a. OpenMP 4.5)
    - NNPACK is enabled
    - CPU capability usage: AVX2
    - CUDA Runtime 10.2
    - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
    - CuDNN 7.6.5
    - Magma 2.5.2
    - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

CPU info:


Architecture         x86_64
CPU op-mode(s)       32-bit, 64-bit
Byte Order           Little Endian
CPU(s)               56
On-line CPU(s) list  0-55
Thread(s) per core   2
Core(s) per socket   14
Socket(s)            2
NUMA node(s)         2
Vendor ID            GenuineIntel
CPU family           6
Model                79
Model name           Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
Stepping             1
CPU MHz              1202.994
CPU max MHz          3300.0000
CPU min MHz          1200.0000
BogoMIPS             4800.00
Virtualization       VT-x
L1d cache            32K
L1i cache            32K
L2 cache             256K
L3 cache             35840K
NUMA node0 CPU(s)    0-13,28-41
NUMA node1 CPU(s)    14-27,42-55


INFO 2022-05-12 09:03:02,206 trainer_main.py: 112: Using Distributed init method: tcp://localhost:35561, world_size: 8, rank: 0 INFO 2022-05-12 09:03:02,264 distributed_c10d.py: 187: Added key: store_based_barrier_key:1 to store for rank: 2 INFO 2022-05-12 09:03:02,339 distributed_c10d.py: 187: Added key: store_based_barrier_key:1 to store for rank: 6 INFO 2022-05-12 09:03:02,450 distributed_c10d.py: 187: Added key: store_based_barrier_key:1 to store for rank: 3 INFO 2022-05-12 09:03:02,580 distributed_c10d.py: 187: Added key: store_based_barrier_key:1 to store for rank: 7 INFO 2022-05-12 09:03:02,590 distributed_c10d.py: 187: Added key: store_based_barrier_key:1 to store for rank: 5 INFO 2022-05-12 09:03:02,596 distributed_c10d.py: 187: Added key: store_based_barrier_key:1 to store for rank: 1 INFO 2022-05-12 09:03:02,604 distributed_c10d.py: 187: Added key: store_based_barrier_key:1 to store for rank: 4 INFO 2022-05-12 09:03:02,612 distributed_c10d.py: 187: Added key: store_based_barrier_key:1 to store for rank: 0 INFO 2022-05-12 09:03:02,612 trainer_main.py: 130: | initialized host test as rank 0 (0) INFO 2022-05-12 09:03:02,615 trainer_main.py: 130: | initialized host test as rank 4 (4) INFO 2022-05-12 09:03:02,615 trainer_main.py: 130: | initialized host test as rank 3 (3) INFO 2022-05-12 09:03:02,617 trainer_main.py: 130: | initialized host test as rank 2 (2) INFO 2022-05-12 09:03:02,617 trainer_main.py: 130: | initialized host test as rank 6 (6) INFO 2022-05-12 09:03:02,617 trainer_main.py: 130: | initialized host test as rank 1 (1) INFO 2022-05-12 09:03:02,621 trainer_main.py: 130: | initialized host test as rank 5 (5) INFO 2022-05-12 09:03:02,622 trainer_main.py: 130: | initialized host test as rank 7 (7) INFO 2022-05-12 09:03:11,236 train_task.py: 181: Not using Automatic Mixed Precision INFO 2022-05-12 09:03:11,237 train_task.py: 455: Building model.... INFO 2022-05-12 09:03:11,237 feature_extractor.py: 27: Creating Feature extractor trunk... INFO 2022-05-12 09:03:11,237 resnext.py: 64: ResNeXT trunk, supports activation checkpointing. Deactivated INFO 2022-05-12 09:03:11,237 resnext.py: 87: Building model: ResNeXt50-1x64d-w1-BatchNorm2d INFO 2022-05-12 09:03:11,240 train_task.py: 181: Not using Automatic Mixed Precision INFO 2022-05-12 09:03:11,241 train_task.py: 455: Building model.... INFO 2022-05-12 09:03:11,241 feature_extractor.py: 27: Creating Feature extractor trunk... INFO 2022-05-12 09:03:11,241 resnext.py: 64: ResNeXT trunk, supports activation checkpointing. Deactivated INFO 2022-05-12 09:03:11,241 resnext.py: 87: Building model: ResNeXt50-1x64d-w1-BatchNorm2d INFO 2022-05-12 09:03:11,245 train_task.py: 181: Not using Automatic Mixed Precision INFO 2022-05-12 09:03:11,245 train_task.py: 455: Building model.... INFO 2022-05-12 09:03:11,246 train_task.py: 181: Not using Automatic Mixed Precision INFO 2022-05-12 09:03:11,246 feature_extractor.py: 27: Creating Feature extractor trunk... INFO 2022-05-12 09:03:11,246 resnext.py: 64: ResNeXT trunk, supports activation checkpointing. Deactivated INFO 2022-05-12 09:03:11,246 train_task.py: 455: Building model.... INFO 2022-05-12 09:03:11,246 resnext.py: 87: Building model: ResNeXt50-1x64d-w1-BatchNorm2d INFO 2022-05-12 09:03:11,247 feature_extractor.py: 27: Creating Feature extractor trunk... INFO 2022-05-12 09:03:11,247 resnext.py: 64: ResNeXT trunk, supports activation checkpointing. 
Deactivated INFO 2022-05-12 09:03:11,247 resnext.py: 87: Building model: ResNeXt50-1x64d-w1-BatchNorm2d INFO 2022-05-12 09:03:11,248 train_task.py: 181: Not using Automatic Mixed Precision INFO 2022-05-12 09:03:11,249 train_task.py: 455: Building model.... INFO 2022-05-12 09:03:11,249 feature_extractor.py: 27: Creating Feature extractor trunk... INFO 2022-05-12 09:03:11,249 train_task.py: 181: Not using Automatic Mixed Precision INFO 2022-05-12 09:03:11,249 resnext.py: 64: ResNeXT trunk, supports activation checkpointing. Deactivated INFO 2022-05-12 09:03:11,250 resnext.py: 87: Building model: ResNeXt50-1x64d-w1-BatchNorm2d INFO 2022-05-12 09:03:11,250 train_task.py: 455: Building model.... INFO 2022-05-12 09:03:11,249 train_task.py: 181: Not using Automatic Mixed Precision INFO 2022-05-12 09:03:11,250 feature_extractor.py: 27: Creating Feature extractor trunk... INFO 2022-05-12 09:03:11,250 resnext.py: 64: ResNeXT trunk, supports activation checkpointing. Deactivated INFO 2022-05-12 09:03:11,250 resnext.py: 87: Building model: ResNeXt50-1x64d-w1-BatchNorm2d INFO 2022-05-12 09:03:11,250 train_task.py: 181: Not using Automatic Mixed Precision INFO 2022-05-12 09:03:11,251 train_task.py: 455: Building model.... INFO 2022-05-12 09:03:11,252 train_task.py: 455: Building model.... INFO 2022-05-12 09:03:11,252 feature_extractor.py: 27: Creating Feature extractor trunk... INFO 2022-05-12 09:03:11,252 resnext.py: 64: ResNeXT trunk, supports activation checkpointing. Deactivated INFO 2022-05-12 09:03:11,253 resnext.py: 87: Building model: ResNeXt50-1x64d-w1-BatchNorm2d INFO 2022-05-12 09:03:11,253 feature_extractor.py: 27: Creating Feature extractor trunk... INFO 2022-05-12 09:03:11,253 resnext.py: 64: ResNeXT trunk, supports activation checkpointing. Deactivated INFO 2022-05-12 09:03:11,254 resnext.py: 87: Building model: ResNeXt50-1x64d-w1-BatchNorm2d INFO 2022-05-12 09:03:11,920 feature_extractor.py: 50: Freezing model trunk... INFO 2022-05-12 09:03:11,922 feature_extractor.py: 50: Freezing model trunk... INFO 2022-05-12 09:03:11,923 feature_extractor.py: 50: Freezing model trunk... INFO 2022-05-12 09:03:11,925 feature_extractor.py: 50: Freezing model trunk... INFO 2022-05-12 09:03:11,925 feature_extractor.py: 50: Freezing model trunk... INFO 2022-05-12 09:03:11,955 feature_extractor.py: 50: Freezing model trunk... INFO 2022-05-12 09:03:11,956 feature_extractor.py: 50: Freezing model trunk... INFO 2022-05-12 09:03:11,978 feature_extractor.py: 50: Freezing model trunk... 
INFO 2022-05-12 09:03:12,368 model_helpers.py: 177: Using SyncBN group size: 8 INFO 2022-05-12 09:03:12,368 model_helpers.py: 181: Converting BN layers to Apex SyncBN INFO 2022-05-12 09:03:12,369 distributed_c10d.py: 187: Added key: store_based_barrier_key:2 to store for rank: 4 INFO 2022-05-12 09:03:12,389 model_helpers.py: 177: Using SyncBN group size: 8 INFO 2022-05-12 09:03:12,389 model_helpers.py: 181: Converting BN layers to Apex SyncBN INFO 2022-05-12 09:03:12,390 distributed_c10d.py: 187: Added key: store_based_barrier_key:2 to store for rank: 2 INFO 2022-05-12 09:03:12,401 model_helpers.py: 177: Using SyncBN group size: 8 INFO 2022-05-12 09:03:12,401 model_helpers.py: 181: Converting BN layers to Apex SyncBN INFO 2022-05-12 09:03:12,401 distributed_c10d.py: 187: Added key: store_based_barrier_key:2 to store for rank: 3 INFO 2022-05-12 09:03:12,412 model_helpers.py: 177: Using SyncBN group size: 8 INFO 2022-05-12 09:03:12,413 model_helpers.py: 181: Converting BN layers to Apex SyncBN INFO 2022-05-12 09:03:12,413 distributed_c10d.py: 187: Added key: store_based_barrier_key:2 to store for rank: 6 INFO 2022-05-12 09:03:12,415 model_helpers.py: 177: Using SyncBN group size: 8 INFO 2022-05-12 09:03:12,415 model_helpers.py: 181: Converting BN layers to Apex SyncBN INFO 2022-05-12 09:03:12,416 distributed_c10d.py: 187: Added key: store_based_barrier_key:2 to store for rank: 1 INFO 2022-05-12 09:03:12,425 model_helpers.py: 177: Using SyncBN group size: 8 INFO 2022-05-12 09:03:12,426 model_helpers.py: 181: Converting BN layers to Apex SyncBN INFO 2022-05-12 09:03:12,426 distributed_c10d.py: 187: Added key: store_based_barrier_key:2 to store for rank: 0 INFO 2022-05-12 09:03:12,434 model_helpers.py: 177: Using SyncBN group size: 8 INFO 2022-05-12 09:03:12,434 model_helpers.py: 181: Converting BN layers to Apex SyncBN INFO 2022-05-12 09:03:12,434 distributed_c10d.py: 187: Added key: store_based_barrier_key:2 to store for rank: 5 INFO 2022-05-12 09:03:12,442 model_helpers.py: 177: Using SyncBN group size: 8 INFO 2022-05-12 09:03:12,442 model_helpers.py: 181: Converting BN layers to Apex SyncBN INFO 2022-05-12 09:03:12,442 distributed_c10d.py: 187: Added key: store_based_barrier_key:2 to store for rank: 7 INFO 2022-05-12 09:03:12,454 train_task.py: 472: config.MODEL.FEATURE_EVAL_SETTINGS.FREEZE_TRUNK_ONLY=True, will freeze trunk... INFO 2022-05-12 09:03:12,454 train_task.py: 472: config.MODEL.FEATURE_EVAL_SETTINGS.FREEZE_TRUNK_ONLY=True, will freeze trunk... INFO 2022-05-12 09:03:12,454 base_ssl_model.py: 195: Freezing model trunk... INFO 2022-05-12 09:03:12,454 base_ssl_model.py: 195: Freezing model trunk... INFO 2022-05-12 09:03:12,455 train_task.py: 429: Initializing model from: ../weights/resnet50-19c8e357.pth INFO 2022-05-12 09:03:12,455 util.py: 240: Broadcasting checkpoint loaded from ../weights/resnet50-19c8e357.pth INFO 2022-05-12 09:03:12,455 train_task.py: 429: Initializing model from: ../weights/resnet50-19c8e357.pth INFO 2022-05-12 09:03:12,455 util.py: 240: Broadcasting checkpoint loaded from ../weights/resnet50-19c8e357.pth INFO 2022-05-12 09:03:12,458 train_task.py: 472: config.MODEL.FEATURE_EVAL_SETTINGS.FREEZE_TRUNK_ONLY=True, will freeze trunk... INFO 2022-05-12 09:03:12,459 base_ssl_model.py: 195: Freezing model trunk... INFO 2022-05-12 09:03:12,459 train_task.py: 472: config.MODEL.FEATURE_EVAL_SETTINGS.FREEZE_TRUNK_ONLY=True, will freeze trunk... INFO 2022-05-12 09:03:12,459 base_ssl_model.py: 195: Freezing model trunk... 
INFO 2022-05-12 09:03:12,459 train_task.py: 472: config.MODEL.FEATURE_EVAL_SETTINGS.FREEZE_TRUNK_ONLY=True, will freeze trunk... INFO 2022-05-12 09:03:12,459 base_ssl_model.py: 195: Freezing model trunk... INFO 2022-05-12 09:03:12,459 train_task.py: 429: Initializing model from: ../weights/resnet50-19c8e357.pth INFO 2022-05-12 09:03:12,460 util.py: 240: Broadcasting checkpoint loaded from ../weights/resnet50-19c8e357.pth INFO 2022-05-12 09:03:12,460 train_task.py: 429: Initializing model from: ../weights/resnet50-19c8e357.pth INFO 2022-05-12 09:03:12,460 util.py: 240: Broadcasting checkpoint loaded from ../weights/resnet50-19c8e357.pth INFO 2022-05-12 09:03:12,460 train_task.py: 429: Initializing model from: ../weights/resnet50-19c8e357.pth INFO 2022-05-12 09:03:12,460 util.py: 276: Attempting to load checkpoint from ../weights/resnet50-19c8e357.pth INFO 2022-05-12 09:03:12,465 train_task.py: 472: config.MODEL.FEATURE_EVAL_SETTINGS.FREEZE_TRUNK_ONLY=True, will freeze trunk... INFO 2022-05-12 09:03:12,465 base_ssl_model.py: 195: Freezing model trunk... INFO 2022-05-12 09:03:12,465 train_task.py: 472: config.MODEL.FEATURE_EVAL_SETTINGS.FREEZE_TRUNK_ONLY=True, will freeze trunk... INFO 2022-05-12 09:03:12,465 base_ssl_model.py: 195: Freezing model trunk... INFO 2022-05-12 09:03:12,466 train_task.py: 429: Initializing model from: ../weights/resnet50-19c8e357.pth INFO 2022-05-12 09:03:12,466 util.py: 240: Broadcasting checkpoint loaded from ../weights/resnet50-19c8e357.pth INFO 2022-05-12 09:03:12,466 train_task.py: 429: Initializing model from: ../weights/resnet50-19c8e357.pth INFO 2022-05-12 09:03:12,467 util.py: 240: Broadcasting checkpoint loaded from ../weights/resnet50-19c8e357.pth INFO 2022-05-12 09:03:12,472 train_task.py: 472: config.MODEL.FEATURE_EVAL_SETTINGS.FREEZE_TRUNK_ONLY=True, will freeze trunk... INFO 2022-05-12 09:03:12,472 base_ssl_model.py: 195: Freezing model trunk... INFO 2022-05-12 09:03:12,473 train_task.py: 429: Initializing model from: ../weights/resnet50-19c8e357.pth INFO 2022-05-12 09:03:12,473 util.py: 240: Broadcasting checkpoint loaded from ../weights/resnet50-19c8e357.pth INFO 2022-05-12 09:03:12,689 util.py: 281: Loaded checkpoint from ../weights/resnet50-19c8e357.pth INFO 2022-05-12 09:03:12,689 util.py: 240: Broadcasting checkpoint loaded from ../weights/resnet50-19c8e357.pth INFO 2022-05-12 09:03:16,755 train_task.py: 435: Checkpoint loaded: ../weights/resnet50-19c8e357.pth... INFO 2022-05-12 09:03:16,760 checkpoint.py: 885: Loaded: trunk.base_model._feature_blocks.conv1.weight of shape: torch.Size([64, 3, 7, 7]) from checkpoint

INFO 2022-05-12 09:03:16,841 checkpoint.py: 885: Loaded: trunk.base_model._feature_blocks.layer4.2.bn3.running_var of shape: torch.Size([2048]) from checkpoint INFO 2022-05-12 09:03:16,841 checkpoint.py: 851: Ignored layer: trunk.base_model._feature_blocks.layer4.2.bn3.num_batches_tracked INFO 2022-05-12 09:03:16,841 checkpoint.py: 894: Not found: heads.0.channel_bn.weight, not initialized INFO 2022-05-12 09:03:16,843 checkpoint.py: 894: Not found: heads.4.clf.clf.0.bias, not initialized INFO 2022-05-12 09:03:16,843 checkpoint.py: 901: Extra layers not loaded from checkpoint: ['trunk.base_model._feature_blocks.fc.weight', 'trunk.base_model._feature_blocks.fc.bias', 'trunk.base_model._feature_blocks.type'] INFO 2022-05-12 09:03:16,844 train_task.py: 656: Broadcast model BN buffers from primary on every forward pass INFO 2022-05-12 09:03:16,844 classification_task.py: 387: Synchronized Batch Normalization is disabled INFO 2022-05-12 09:03:16,970 train_task.py: 435: Checkpoint loaded: ../weights/resnet50-19c8e357.pth... INFO 2022-05-12 09:03:16,974 train_task.py: 435: Checkpoint loaded: ../weights/resnet50-19c8e357.pth... INFO 2022-05-12 09:03:16,986 train_task.py: 435: Checkpoint loaded: ../weights/resnet50-19c8e357.pth... INFO 2022-05-12 09:03:17,000 optimizer_helper.py: 293: Trainable params: 20, Non-Trainable params: 0, Trunk Regularized Parameters: 0, Trunk Unregularized Parameters 0, Head Regularized Parameters: 10, Head Unregularized Parameters: 10 Remaining Regularized Parameters: 0 Remaining Unregularized Parameters: 0 INFO 2022-05-12 09:03:17,072 train_task.py: 435: Checkpoint loaded: ../weights/resnet50-19c8e357.pth... INFO 2022-05-12 09:03:17,084 train_task.py: 435: Checkpoint loaded: ../weights/resnet50-19c8e357.pth... INFO 2022-05-12 09:03:17,085 train_task.py: 656: Broadcast model BN buffers from primary on every forward pass INFO 2022-05-12 09:03:17,085 train_task.py: 435: Checkpoint loaded: ../weights/resnet50-19c8e357.pth... INFO 2022-05-12 09:03:17,085 classification_task.py: 387: Synchronized Batch Normalization is disabled INFO 2022-05-12 09:03:17,091 train_task.py: 435: Checkpoint loaded: ../weights/resnet50-19c8e357.pth... 
INFO 2022-05-12 09:03:17,186 train_task.py: 656: Broadcast model BN buffers from primary on every forward pass
INFO 2022-05-12 09:03:17,186 classification_task.py: 387: Synchronized Batch Normalization is disabled
INFO 2022-05-12 09:03:17,202 train_task.py: 656: Broadcast model BN buffers from primary on every forward pass
INFO 2022-05-12 09:03:17,202 classification_task.py: 387: Synchronized Batch Normalization is disabled
INFO 2022-05-12 09:03:17,300 optimizer_helper.py: 293: Trainable params: 20, Non-Trainable params: 0, Trunk Regularized Parameters: 0, Trunk Unregularized Parameters 0, Head Regularized Parameters: 10, Head Unregularized Parameters: 10 Remaining Regularized Parameters: 0 Remaining Unregularized Parameters: 0
INFO 2022-05-12 09:03:17,333 optimizer_helper.py: 293: Trainable params: 20, Non-Trainable params: 0, Trunk Regularized Parameters: 0, Trunk Unregularized Parameters 0, Head Regularized Parameters: 10, Head Unregularized Parameters: 10 Remaining Regularized Parameters: 0 Remaining Unregularized Parameters: 0
INFO 2022-05-12 09:03:17,344 train_task.py: 656: Broadcast model BN buffers from primary on every forward pass
INFO 2022-05-12 09:03:17,345 classification_task.py: 387: Synchronized Batch Normalization is disabled
INFO 2022-05-12 09:03:17,362 train_task.py: 656: Broadcast model BN buffers from primary on every forward pass
INFO 2022-05-12 09:03:17,362 classification_task.py: 387: Synchronized Batch Normalization is disabled
INFO 2022-05-12 09:03:17,370 optimizer_helper.py: 293: Trainable params: 20, Non-Trainable params: 0, Trunk Regularized Parameters: 0, Trunk Unregularized Parameters 0, Head Regularized Parameters: 10, Head Unregularized Parameters: 10 Remaining Regularized Parameters: 0 Remaining Unregularized Parameters: 0
INFO 2022-05-12 09:03:17,378 train_task.py: 656: Broadcast model BN buffers from primary on every forward pass
INFO 2022-05-12 09:03:17,379 classification_task.py: 387: Synchronized Batch Normalization is disabled
INFO 2022-05-12 09:03:17,394 train_task.py: 656: Broadcast model BN buffers from primary on every forward pass
INFO 2022-05-12 09:03:17,395 classification_task.py: 387: Synchronized Batch Normalization is disabled
INFO 2022-05-12 09:03:17,527 optimizer_helper.py: 293: Trainable params: 20, Non-Trainable params: 0, Trunk Regularized Parameters: 0, Trunk Unregularized Parameters 0, Head Regularized Parameters: 10, Head Unregularized Parameters: 10 Remaining Regularized Parameters: 0 Remaining Unregularized Parameters: 0
INFO 2022-05-12 09:03:17,545 optimizer_helper.py: 293: Trainable params: 20, Non-Trainable params: 0, Trunk Regularized Parameters: 0, Trunk Unregularized Parameters 0, Head Regularized Parameters: 10, Head Unregularized Parameters: 10 Remaining Regularized Parameters: 0 Remaining Unregularized Parameters: 0
INFO 2022-05-12 09:03:17,592 optimizer_helper.py: 293: Trainable params: 20, Non-Trainable params: 0, Trunk Regularized Parameters: 0, Trunk Unregularized Parameters 0, Head Regularized Parameters: 10, Head Unregularized Parameters: 10 Remaining Regularized Parameters: 0 Remaining Unregularized Parameters: 0
INFO 2022-05-12 09:03:17,600 optimizer_helper.py: 293: Trainable params: 20, Non-Trainable params: 0, Trunk Regularized Parameters: 0, Trunk Unregularized Parameters 0, Head Regularized Parameters: 10, Head Unregularized Parameters: 10 Remaining Regularized Parameters: 0 Remaining Unregularized Parameters: 0

Traceback (most recent call last):
  File "tools/run_distributed_engines.py", line 57, in
    hydra_main(overrides=overrides)
  File "tools/run_distributed_engines.py", line 42, in hydra_main
    launch_distributed(
  File "/home/vica/anaconda3/envs/vissl/lib/python3.8/site-packages/vissl/utils/distributed_launcher.py", line 135, in launch_distributed
    torch.multiprocessing.spawn(
  File "/home/vica/anaconda3/envs/vissl/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/vica/anaconda3/envs/vissl/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/vica/anaconda3/envs/vissl/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/vica/anaconda3/envs/vissl/lib/python3.8/site-packages/vissl/data/dataset_catalog.py", line 83, in get
    info = VisslDatasetCatalog.__REGISTERED_DATASETS[name]
KeyError: 'imagenet1k_folder'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/vica/anaconda3/envs/vissl/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/vica/anaconda3/envs/vissl/lib/python3.8/site-packages/vissl/utils/distributed_launcher.py", line 192, in _distributed_worker
    run_engine(
  File "/home/vica/anaconda3/envs/vissl/lib/python3.8/site-packages/vissl/engines/engine_registry.py", line 86, in run_engine
    engine.run_engine(
  File "/home/vica/anaconda3/envs/vissl/lib/python3.8/site-packages/vissl/engines/train.py", line 39, in run_engine
    train_main(
  File "/home/vica/anaconda3/envs/vissl/lib/python3.8/site-packages/vissl/engines/train.py", line 130, in train_main
    trainer.train()
  File "/home/vica/anaconda3/envs/vissl/lib/python3.8/site-packages/vissl/trainer/trainer_main.py", line 162, in train
    self.task.prepare(pin_memory=self.cfg.DATA.PIN_MEMORY)
  File "/home/vica/anaconda3/envs/vissl/lib/python3.8/site-packages/vissl/trainer/train_task.py", line 740, in prepare
    self.datasets, self.data_and_label_keys = self.build_datasets(
  File "/home/vica/anaconda3/envs/vissl/lib/python3.8/site-packages/vissl/trainer/train_task.py", line 307, in build_datasets
    datasets[split.lower()] = build_dataset(
  File "/home/vica/anaconda3/envs/vissl/lib/python3.8/site-packages/vissl/data/__init__.py", line 69, in build_dataset
    return GenericSSLDataset(
  File "/home/vica/anaconda3/envs/vissl/lib/python3.8/site-packages/vissl/data/ssl_dataset.py", line 98, in __init__
    self._get_data_files(split)
  File "/home/vica/anaconda3/envs/vissl/lib/python3.8/site-packages/vissl/data/ssl_dataset.py", line 152, in _get_data_files
    self.data_paths, self.label_paths = dataset_catalog.get_data_files(
  File "/home/vica/anaconda3/envs/vissl/lib/python3.8/site-packages/vissl/data/dataset_catalog.py", line 293, in get_data_files
    label_data_info = VisslDatasetCatalog.get(data_names[idx])
  File "/home/vica/anaconda3/envs/vissl/lib/python3.8/site-packages/vissl/data/dataset_catalog.py", line 85, in get
    raise KeyError(
KeyError: "Dataset 'imagenet1k_folder' is not registered! Available datasets are: "

Expected behavior:

Run the benchmark correctly.

I have tried to run

    python -c "from vissl.data.dataset_catalog import VisslDatasetCatalog; print(VisslDatasetCatalog.list()); print(VisslDatasetCatalog.get('imagenet1k_folder'))"

and got

    ['airstore_imagenet', 'imagenet1k_folder', 'imagenet_a_filelist', 'imagenet_r_filelist', 'imagenette_160_folder', 'places205_folder', 'places365_folder', 'CIFAR10', 'CIFAR100', 'STL10', 'MNIST', 'SVHN', 'inaturalist2018_filelist', 'yfcc100m', 'aircrafts_folder', 'caltech101_folder', 'clevr_count_filelist', 'clevr_dist_filelist', 'dsprites_loc_folder', 'dsprites_orient_folder', 'dtd_folder', 'euro_sat_folder', 'food101_folder', 'gtsrb_folder', 'kitti_dist_folder', 'oxford_flowers_folder', 'oxford_pets_folder', 'pcam_folder', 'small_norb_azimuth_folder', 'small_norb_elevation_folder', 'stanford_cars_folder', 'sun397_filelist', 'ucf101_folder', 'kinetics700_frames_folder', 'coco_folder', 'imagenet1k-per01', 'imagenet1k-per10', 'google-imagenet1k-per01', 'google-imagenet1k-per10']
    {'train': ['/data/vica/fastdisk/ILSVRC2012/train', '/data/vica/fastdisk/ILSVRC2012/train'], 'val': ['/data/vica/fastdisk/ILSVRC2012/val', '/data/vica/fastdisk/ILSVRC2012/val']}

It seems that the `configs/config/dataset_catalog.json` is recognized correctly.

Environment:

Provide your environment information using the following command:

    -------------------  ------------------------------------------------------------------------------------
    sys.platform         linux
    Python               3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]
    numpy                1.21.5
    Pillow               9.0.1
    vissl                0.1.6 @/home/vica/anaconda3/envs/vissl/lib/python3.8/site-packages/vissl
    GPU available        True
    GPU 0,1              NVIDIA GeForce RTX 3090
    GPU 2,3,4,5,6,7,8,9  NVIDIA GeForce RTX 2080 Ti
    CUDA_HOME            /usr/local/cuda
    torchvision          0.9.1 @/home/vica/anaconda3/envs/vissl/lib/python3.8/site-packages/torchvision
    hydra                1.1.1 @/home/vica/anaconda3/envs/vissl/lib/python3.8/site-packages/hydra
    classy_vision        0.7.0.dev @/home/vica/anaconda3/envs/vissl/lib/python3.8/site-packages/classy_vision
    apex                 0.1 @/home/vica/anaconda3/envs/vissl/lib/python3.8/site-packages/apex
    PyTorch              1.8.1 @/home/vica/anaconda3/envs/vissl/lib/python3.8/site-packages/torch
    PyTorch debug build  False
    -------------------  ------------------------------------------------------------------------------------
    PyTorch built with:
    - GCC 7.3
    - C++ Version: 201402
    - Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
    - Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
    - OpenMP 201511 (a.k.a. OpenMP 4.5)
    - NNPACK is enabled
    - CPU capability usage: AVX2
    - CUDA Runtime 10.2
    - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
    - CuDNN 7.6.5
    - Magma 2.5.2
    - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

    CPU info:
    -------------------  -----------------------------------------
    Architecture         x86_64
    CPU op-mode(s)       32-bit, 64-bit
    Byte Order           Little Endian
    CPU(s)               56
    On-line CPU(s) list  0-55
    Thread(s) per core   2
    Core(s) per socket   14
    Socket(s)            2
    NUMA node(s)         2
    Vendor ID            GenuineIntel
    CPU family           6
    Model                79
    Model name           Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
    Stepping             1
    CPU MHz              1200.327
    CPU max MHz          3300.0000
    CPU min MHz          1200.0000
    BogoMIPS             4800.00
    Virtualization       VT-x
    L1d cache            32K
    L1i cache            32K
    L2 cache             256K
    L3 cache             35840K
    NUMA node0 CPU(s)    0-13,28-41
    NUMA node1 CPU(s)    14-27,42-55
    -------------------  -----------------------------------------
VicaYang commented 2 years ago

I moved run_distributed_engines.py out of the tools folder and it works now, but I am still wondering whether I can place dataset_catalog.json in some location where the prebuilt vissl will pick it up even when the other lookups fail.
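
As a side note, a minimal sketch of registering the paths programmatically, for anyone who would rather not depend on where the JSON file is found on disk. It assumes VisslDatasetCatalog exposes a register_data(name, data_dict) helper; only list() and get() are confirmed later in this thread, so treat the exact method name as an assumption:

    # Hypothetical sketch: register the ImageNet-1k folders in code instead of
    # relying on dataset_catalog.json being discovered on disk.
    # Assumes VisslDatasetCatalog.register_data(name, data_dict) exists.
    from vissl.data.dataset_catalog import VisslDatasetCatalog

    imagenet1k_paths = {
        "train": ["/data/vica/fastdisk/ILSVRC2012/train", "/data/vica/fastdisk/ILSVRC2012/train"],
        "val": ["/data/vica/fastdisk/ILSVRC2012/val", "/data/vica/fastdisk/ILSVRC2012/val"],
    }
    VisslDatasetCatalog.register_data("imagenet1k_folder", imagenet1k_paths)

    # Sanity check, same idea as the python -c command in the issue above.
    print(VisslDatasetCatalog.get("imagenet1k_folder"))

Note that this only registers the dataset in the current process; the workers spawned by run_distributed_engines.py re-import vissl, so the JSON catalog or the environment variable discussed below is likely the more robust option for distributed runs.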

QuentinDuval commented 2 years ago

Hi @VicaYang,

First of all, thanks a lot for using VISSL and reporting this!

I tried your example and got the same issue you had (actually, I am not sure I hit exactly the same case, as I do not fully understand where the dataset_catalog.json sits in the filesystem in the test case you reported - in my case, I created a configs/config/dataset_catalog.json file right next to the tools folder).

Here is a way to deal with this issue temporarily while I debug it further:

You can use the environment variable VISSL_DATASET_CATALOG_PATH to point to your own dataset catalog (the one holding the paths to imagenet1k in your case) like so:

VISSL_DATASET_CATALOG_PATH=configs/config/dataset_catalog.json  python tools/run_distributed_engines.py \
    config=benchmark/linear_image_classification/imagenet1k/eval_resnet_8gpu_transfer_in1k_linear \
    config.CHECKPOINT.DIR="..." \
    config.MODEL.WEIGHTS_INIT.PARAMS_FILE="..."

In my case, this solved the issue. Could you please try it and report what you got?

Thank you, Quentin
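
For context, a minimal sketch of how this kind of override typically behaves. This is not VISSL's actual implementation (the resolve_catalog_path helper below is hypothetical); it only illustrates a catalog pointed to by VISSL_DATASET_CATALOG_PATH taking precedence over the dataset_catalog.json bundled with the installed package:

    # Hypothetical illustration of an env-var override, not VISSL's code.
    import json
    import os

    def resolve_catalog_path(default_path: str) -> str:
        # Prefer the user-supplied catalog when the environment variable is set
        # and points to an existing file; otherwise fall back to the default.
        override = os.environ.get("VISSL_DATASET_CATALOG_PATH")
        if override and os.path.isfile(override):
            return override
        return default_path

    def load_catalog(default_path: str) -> dict:
        with open(resolve_catalog_path(default_path)) as f:
            return json.load(f)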

VicaYang commented 2 years ago

Thanks @QuentinDuval for your help. I paste the tree output below to help anyone else who runs into a similar issue.

.
├── configs
│   ├── config
│   │   ├── benchmark
│   │   ├── dataset_catalog.json
│   │   ├── debugging
│   │   ├── extract_cluster
│   │   ├── feature_extraction
│   │   ├── __init__.py
│   │   ├── model_zoo
│   │   ├── pretrain
│   │   └── test
│   ├── __init__.py
│   └── __pycache__
│       └── __init__.cpython-38.pyc
├── run_distributed_engines.py
└── tools
    ├── cluster_assignments_to_dataset.py
    ├── cluster_features_and_label.py
    ├── __init__.py
    ├── instance_retrieval_test.py
    ├── launch_benchmark_suite_scheduler_slurm.py
    ├── nearest_neighbor_test.py
    ├── object_detection_benchmark.py
    ├── perf_measurement
    │   ├── benchmark_data.py
    │   ├── benchmark_transforms.py
    │   ├── __init__.py
    │   └── README.md
    ├── run_distributed_engines.py
    ├── train_svm_low_shot.py
    └── train_svm.py

Under this folder structure, running `python tools/run_distributed_engines.py` cannot load the datasets correctly, while `python run_distributed_engines.py` and `VISSL_DATASET_CATALOG_PATH=configs/config/dataset_catalog.json python tools/run_distributed_engines.py` both can. I believe using VISSL_DATASET_CATALOG_PATH is the best choice.
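
For future runs, a sketch of how I would wire this up so the catalog lookup does not depend on the working directory: set the variable to an absolute path from Python before anything imports vissl.data. The path below is just my layout from the tree above, and the assumption that the variable must be set before vissl.data is imported is mine:

    # Illustrative: point VISSL at the user catalog via an absolute path before
    # vissl.data is imported (the catalog appears to be read at import time).
    # Adjust the path to your own checkout; this one matches the tree above.
    import os

    os.environ.setdefault(
        "VISSL_DATASET_CATALOG_PATH",
        "/data/vica/feature/exp/configs/config/dataset_catalog.json",
    )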