facebookresearch / vissl

VISSL is FAIR's library of extensible, modular and scalable components for SOTA Self-Supervised Learning with images.
https://vissl.ai
MIT License

How to run linear evaluation on VOC07? I am getting errors trying to run it. #433

Closed yxchng closed 3 years ago

yxchng commented 3 years ago

I want to run linear evaluation on VOC07 using this config: https://github.com/facebookresearch/vissl/blob/main/configs/config/benchmark/linear_image_classification/voc07/eval_alexnet_8gpu_transfer_voc07_svm.yaml. However, the run fails with the error shown in the logs below.

## Instructions To Reproduce the Issue:

  1. Run using the command:

    python3 run_distributed_engines.py \
    hydra.verbose=true \
    config=eval_resnet_8gpu_transfer_voc07_svm \
    config.CHECKPOINT.DIR="./checkpoints_voc" \
    config.MODEL.WEIGHTS_INIT.PARAMS_FILE="./new_model.pth.tar" \
    config.MODEL.WEIGHTS_INIT.APPEND_PREFIX="trunk._feature_blocks." \
    config.MODEL.WEIGHTS_INIT.STATE_DICT_KEY_NAME=""
  2. Full logs observed:

    
    Traceback (most recent call last):
      File "run_distributed_engines.py", line 194, in <module>
        hydra_main(overrides=overrides)
      File "run_distributed_engines.py", line 179, in hydra_main
        hook_generator=default_hook_generator,
      File "run_distributed_engines.py", line 112, in launch_distributed
        daemon=False,
      File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 199, in spawn
        return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
      File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
        while not context.join():
      File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 118, in join
        raise Exception(msg)
    Exception:

    -- Process 2 terminated with the following error:
    Traceback (most recent call last):
      File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
        fn(i, args)
      File "/data00/yarn/nmdata/usercache/zhoudongyan.daniel/appcache/application_1592202091440_0014/container_e08_1592202091440_0014_10_005042/vissl/run_distributed_engines.py", line 166, in _distributed_worker
        process_main(cfg, dist_run_id, local_rank=local_rank, node_id=node_id)
      File "/data00/yarn/nmdata/usercache/zhoudongyan.daniel/appcache/application_1592202091440_0014/container_e08_1592202091440_0014_10_005042/vissl/run_distributed_engines.py", line 159, in process_main
        hook_generator=hook_generator,
      File "/home/xxx/.local/lib/python3.7/site-packages/vissl/engines/train.py", line 102, in train_main
        trainer.train()
      File "/home/xxx/.local/lib/python3.7/site-packages/vissl/trainer/trainer_main.py", line 186, in train
        task = train_step_fn(task)
      File "/home/xxx/.local/lib/python3.7/site-packages/vissl/trainer/train_steps/standard_train_step.py", line 154, in standard_train_step
        local_loss = task.loss(model_output, target)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
        result = self.forward(input, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/loss.py", line 962, in forward
        ignore_index=self.ignore_index, reduction=self.reduction)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 2468, in cross_entropy
        return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 2264, in nll_loss
        ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
    RuntimeError: multi-target not supported at /pytorch/aten/src/THCUNN/generic/ClassNLLCriterion.cu:15


## Expected behavior:

Run without error

## Environment:

    sys.platform         linux
    Python               3.7.3 (default, Jul 25 2020, 13:03:44) [GCC 8.3.0]
    numpy                1.19.5
    Pillow               8.2.0
    vissl                0.1.5 @/home/xxx/.local/lib/python3.7/site-packages/vissl
    GPU available        True
    GPU 0,1,2,3,4,5,6,7  Tesla V100-SXM2-32GB
    CUDA_HOME            /usr/local/cuda
    torchvision          0.8.2 @/usr/local/lib/python3.7/dist-packages/torchvision
    hydra                1.0.7 @/home/xxx/.local/lib/python3.7/site-packages/hydra
    classy_vision        0.6.0.dev @/home/xxx/.local/lib/python3.7/site-packages/classy_vision
    tensorboard          1.15.0
    apex                 0.1 @/usr/local/lib/python3.7/dist-packages/apex
    cv2                  3.2.0
    PyTorch              1.7.1 @/usr/local/lib/python3.7/dist-packages/torch
    PyTorch debug build  False


PyTorch built with:

CPU info:


    Architecture         x86_64
    CPU op-mode(s)       32-bit, 64-bit
    Byte Order           Little Endian
    Address sizes        46 bits physical, 48 bits virtual
    CPU(s)               96
    On-line CPU(s) list  0-95
    Thread(s) per core   2
    Core(s) per socket   24
    Socket(s)            2
    NUMA node(s)         2
    Vendor ID            GenuineIntel
    CPU family           6
    Model                85
    Model name           Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
    Stepping             7
    CPU MHz              3099.992
    CPU max MHz          3900.0000
    CPU min MHz          1000.0000
    BogoMIPS             4800.00
    Virtualization       VT-x
    L1d cache            32K
    L1i cache            32K
    L2 cache             1024K
    L3 cache             36608K
    NUMA node0 CPU(s)    0-23,48-71
    NUMA node1 CPU(s)    24-47,72-95


prigoyal commented 3 years ago

Hi @yxchng, thank you for reaching out. For VOC07, please use https://github.com/facebookresearch/vissl/blob/main/tools/train_svm.py instead of run_distributed_engines.py. The workflow for this benchmark is documented at https://vissl.readthedocs.io/en/latest/flowcharts/svm_workflow.html. Hope this helps! :)
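
Some context on why the original command fails, as far as the traceback shows: `multi-target not supported` is raised when the cross-entropy loss receives a target with more than one label per sample, and VOC07 images can carry several object labels at once. An SVM-style evaluation typically treats each class as its own binary problem instead. The sketch below is plain Python and uses no VISSL API; `split_one_vs_rest` and the toy labels are hypothetical names made up for illustration, and the actual logic in `tools/train_svm.py` may differ.

```python
from typing import List


def split_one_vs_rest(targets: List[List[int]]) -> List[List[int]]:
    """Turn N multi-hot rows (one per image) into C binary label lists (one per class).

    Hypothetical helper for illustration only; not part of VISSL.
    """
    if not targets:
        return []
    num_classes = len(targets[0])
    if any(len(row) != num_classes for row in targets):
        raise ValueError("all target rows must have the same number of classes")
    # Column c of the target matrix becomes the binary labels for class c.
    return [[row[c] for row in targets] for c in range(num_classes)]


# Toy data (assumption): 3 images, 3 classes, several labels per image allowed.
targets = [
    [1, 0, 1],  # image 0 contains classes 0 and 2
    [0, 1, 0],  # image 1 contains class 1
    [1, 1, 0],  # image 2 contains classes 0 and 1
]
per_class = split_one_vs_rest(targets)
for class_id, labels in enumerate(per_class):
    print(f"class {class_id}: binary labels {labels}")
```

Each resulting list could then serve as the labels for one binary classifier trained on the extracted features, which sidesteps the single-label assumption behind cross-entropy.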

yxchng commented 3 years ago

OK. Thanks. It works now.