facebookresearch / vissl

VISSL is FAIR's library of extensible, modular and scalable components for SOTA Self-Supervised Learning with images.
https://vissl.ai
MIT License
3.24k stars, 330 forks

Facing RuntimeError: with nearest_neighbour_test.py #519

Closed DC95 closed 2 years ago

DC95 commented 2 years ago

Hi VISSL team!

Introduction -

I have used KNN from the VISSL package before; for example, see an issue I raised in the past (link).

My target -

The goal is the same as before: to get the K nearest neighbors of a test image.

Error -

For the last few days, I have been trying to use nearest_neighbor_test.py, but it throws this error -

RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL error in: ../torch/lib/c10d/../c10d/NCCLUtils.hpp:155, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.
Aborted

I have checked both the main branch and v0.1.6, with the same result.
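A note for readers who hit the same trace: CUDA errors such as a device-side assert are reported asynchronously, so the Python traceback can point at a later call than the one that failed. A common first step in general PyTorch debugging (not something suggested in this thread) is to rerun with CUDA_LAUNCH_BLOCKING=1. The sketch below only prepares such an environment; `build_debug_env` is a hypothetical helper name, not part of VISSL.

```python
import os


def build_debug_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled.

    `build_debug_env` is a hypothetical helper used for illustration only.
    """
    env = dict(os.environ if base_env is None else base_env)
    # With CUDA_LAUNCH_BLOCKING=1, kernel launches run synchronously, so the
    # traceback points at the operation that actually triggered the assert.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


if __name__ == "__main__":
    env = build_debug_env()
    print("CUDA_LAUNCH_BLOCKING =", env["CUDA_LAUNCH_BLOCKING"])
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when relaunching the failing command. Synchronous launches are slow, so this is meant for debugging runs only.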

Interesting observation -

Model training works fine; the error appears only when running KNN.

Additional information -

I work in an HPC environment

Environment:


sys.platform         linux
Python               3.8.5 (default, Sep 13 2021, 15:26:00) [GCC 10.3.0]
numpy                1.19.5
Pillow               7.0.0.post3
vissl                0.1.5 @/p/project/deepacf/kiste/DC/vissl_b/vissl
GPU available        True
GPU 0                NVIDIA A100-SXM4-40GB
CUDA_HOME            /p/software/juwelsbooster/stages/2020/software/CUDA/11.3
torchvision          0.9.0a0 @/p/software/juwelsbooster/stages/2020/software/torchvision/0.9.1-gcccoremkl-10.3.0-2021.2.0-Python-3.8.5/lib/python3.8/site-packages/torchvision-0.9.0a0-py3.8-linux-x86_64.egg/torchvision
hydra                1.0.7 @/p/project/deepacf/kiste/DC/vissl_booster/venv/lib/python3.8/site-packages/hydra
classy_vision        0.6.0.dev @/p/home/jusers/chatterjee1/juwels/.local/lib/python3.8/site-packages/classy_vision
tensorboard          2.5.0
apex                 0.1 @/p/home/jusers/chatterjee1/juwels/.local/lib/python3.8/site-packages/apex
cv2                  4.5.2
PyTorch              1.8.1 @/p/software/juwelsbooster/stages/2020/software/PyTorch/1.8.1-gcccoremkl-10.3.0-2021.2.0-Python-3.8.5/lib/python3.8/site-packages/torch
PyTorch debug build  False

PyTorch built with:

CPU info:

Architecture                      x86_64
CPU op-mode(s)                    32-bit, 64-bit
Byte Order                        Little Endian
Address sizes                     43 bits physical, 48 bits virtual
CPU(s)                            96
On-line CPU(s) list               0-95
Thread(s) per core                2
Core(s) per socket                24
Socket(s)                         2
NUMA node(s)                      8
Vendor ID                         AuthenticAMD
CPU family                        23
Model                             49
Model name                        AMD EPYC 7402 24-Core Processor
Stepping                          0
Frequency boost                   enabled
CPU MHz                           2800.000
CPU max MHz                       2800.0000
CPU min MHz                       1500.0000
BogoMIPS                          5599.90
Virtualization                    AMD-V
L1d cache                         1.5 MiB
L1i cache                         1.5 MiB
L2 cache                          24 MiB
L3 cache                          256 MiB
NUMA node0 CPU(s)                 0-5,48-53
NUMA node1 CPU(s)                 6-11,54-59
NUMA node2 CPU(s)                 12-17,60-65
NUMA node3 CPU(s)                 18-23,66-71
NUMA node4 CPU(s)                 24-29,72-77
NUMA node5 CPU(s)                 30-35,78-83
NUMA node6 CPU(s)                 36-41,84-89
NUMA node7 CPU(s)                 42-47,90-95
Vulnerability Itlb multihit       Not affected
Vulnerability L1tf                Not affected
Vulnerability Mds                 Not affected
Vulnerability Meltdown            Not affected
Vulnerability Spec store bypass   Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1          Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2          Mitigation; Full AMD retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds               Not affected
Vulnerability Tsx async abort     Not affected

Kindly help!!

Warm regards, DC

DC95 commented 2 years ago

I have cross-checked with both branches

vissl - 0.1.5 @/p/project/deepacf/kiste/DC/vissl_b/vissl and 0.1.6 @/p/project/deepacf/kiste/DC/vissl_b2/vissl

Attached are files for reference -

Log file from running KNN: log.txt
YAML config: link

Warm regards, DC

QuentinDuval commented 2 years ago

Hi @DC95,

Interesting. We may have broken something in nearest_neighbor_test.py (it is covered by tests, but tests are by no means a perfect way to catch all bugs).

Could you send me the complete command that you are using to run KNN? (to complete the configuration you already sent)

I will have a look.

Thank you, Quentin

DC95 commented 2 years ago

Hi @QuentinDuval, nice to hear from you again :)

For KNN -

Resource allocation command on the HPC system - run --pty --nodes=1 -A deepacf --gpus-per-node=1 --time=00:30:00 --cpu-bind=none /bin/bash

VISSL command - python run_distributed_engines.py config=pretrain/nearest_neighbour/dcv2_96x2_128x128_germany

QuentinDuval commented 2 years ago

Hi @DC95,

Nice to talk to you again as well :)

I will try to reproduce that and keep you posted.

And thanks again for the bug report, this is really super helpful for us 👍

QuentinDuval commented 2 years ago

Hi @DC95,

I could not find any issue with your configuration and could run the KNN properly. The issue I found was instead in the command line. You should use tools/nearest_neighbor_test.py instead of run_distributed_engines.py.

Like so:

python ~/project/ssl_scaling/tools/nearest_neighbor_test.py config=test/temp/dcv2_96x2_128x128_germany

Tell me if this works better now!
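The fix above comes down to picking the right entry point for the job. As an illustration only, a small launcher could make that choice explicit and fail early on a typo. In this sketch the `ENTRY_POINTS` mapping, its task labels and the `build_command` helper are hypothetical; only the two script names and the config path come from this thread.

```python
import shlex

# Script names as they appear in this thread; the task labels are hypothetical.
ENTRY_POINTS = {
    "train": "run_distributed_engines.py",
    "knn": "tools/nearest_neighbor_test.py",
}


def build_command(task, config):
    """Build the command line for a task, rejecting unknown task names."""
    if task not in ENTRY_POINTS:
        raise ValueError(
            f"unknown task {task!r}; expected one of {sorted(ENTRY_POINTS)}"
        )
    return ["python", ENTRY_POINTS[task], f"config={config}"]


if __name__ == "__main__":
    cmd = build_command("knn", "test/temp/dcv2_96x2_128x128_germany")
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run` from the VISSL checkout. Script paths are relative here and would need adjusting to the local layout.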

DC95 commented 2 years ago

Hi @QuentinDuval

I am really sorry for the confusion, and a bit ashamed of myself. How could I miss this minor detail?

I was looking everywhere for the answer. A big thanks to you. We should close this issue.

Regards, DC

QuentinDuval commented 2 years ago

Hi @DC95,

Don't worry, that happens to everyone :) To be fair, I found it quickly because it happened to me ^^

Closing the issue!

Thank you, Quentin