MadryLab / trak

A fast, effective data attribution method for neural networks in PyTorch
https://trak.csail.mit.edu/
MIT License

[cifar_quickstart.ipynb] _LinAlgError: linalg.inv: The diagonal element 1 is zero, the inversion could not be completed because the input matrix is singular. #24

Closed xszheng2020 closed 1 year ago

xszheng2020 commented 1 year ago

Hi @kristian-georgiev, thanks for your great work.

I tested the example cifar_quickstart.ipynb and everything goes well until the line traker.finalize_features(), which produces the above error.

Any idea?
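
For reference, the part of the notebook I'm running is roughly the following (paraphrased from the quickstart; names like model, loader_train, and ckpts come from earlier cells and are placeholders here):

from tqdm import tqdm
from trak import TRAKer

traker = TRAKer(model=model,
                task='image_classification',
                train_set_size=len(loader_train.dataset))

for model_id, ckpt in enumerate(tqdm(ckpts)):
    traker.load_checkpoint(ckpt, model_id=model_id)
    for batch in loader_train:
        batch = [x.cuda() for x in batch]
        traker.featurize(batch=batch, num_samples=batch[0].shape[0])

traker.finalize_features()  # <- the _LinAlgError above is raised here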

xszheng2020 commented 1 year ago

This issue can be avoided by changing the random projector from rademacher to normal:

for model_id, ckpt in enumerate(tqdm(ckpts)):
    traker.load_checkpoint(ckpt, model_id=model_id)
    # override the projection type before featurizing
    traker.projector.proj_type = ProjectionType.normal  # !!!

It seems that project_rademacher_32 projects the gradients to the zero vector! My GPU is a V100.
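
A quick way to check for the zero outputs (a minimal sketch; I'm assuming the projector object exposes grad_dim and a project(grads, model_id) method):

import torch

# feed random "gradients" through the currently loaded projector and inspect the output
fake_grads = torch.randn(32, traker.projector.grad_dim, device='cuda')
projected = traker.projector.project(fake_grads, model_id=0)
print(projected.abs().max())  # an all-zeros output here reproduces the problem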

kristian-georgiev commented 1 year ago

@xszheng2020 I am having trouble reproducing this. Can you please provide a conda environment from which the above behavior can be reproduced? In particular, we'll need the exact torch build, gcc version, and CUDA version. Thanks!

xszheng2020 commented 1 year ago

@kristian-georgiev I use a Docker image from NVIDIA, as follows:

docker run -it --gpus all --name xszheng-pytorch-2211 --ipc=host -v /raid/xszheng/codes:/opt/codes nvcr.io/nvidia/pytorch:22.11-py3 bash

This image's PyTorch is 1.13.0, which does not meet the requirement torch>=1.13, so I installed PyTorch 1.13.1 via pip install torch==1.13.1.

root@5c97effff7a4:/opt/codes# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
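
For completeness, torch's own view of the environment can be printed like this (a minimal check, assuming a single-GPU machine):

import torch

print(torch.__version__)              # 1.13.1 after the pip install above
print(torch.version.cuda)             # CUDA version this torch build was compiled against
print(torch.cuda.get_device_name(0))  # the GPU in use (a V100 here)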

kristian-georgiev commented 1 year ago

Thank you! Just to confirm, is tests/test_rademacher.py::test_odd the only one that fails?

While we try to reproduce the error, can you try using the stable version of PyTorch (2.0.0 as of today) with the same gcc and CUDA versions? I think this might resolve the problem.

xszheng2020 commented 1 year ago

@kristian-georgiev thanks! Indeed, in the beginning I used PyTorch 2.0 (nvcr.io/nvidia/pytorch:23.03-py3) and ran into even more errors, so I switched to another Docker image to downgrade PyTorch; after that, only the test_odd failure remained.

I also tried PyTorch 2.0 under nvcr.io/nvidia/pytorch:22.11-py3, and it still produced 15 failures:

===================================================================== short test summary info =====================================================================
FAILED tests/test_rademacher.py::test_shape[8-1024-512-0] - RuntimeError: CUDA error: too many resources requested for launch
FAILED tests/test_rademacher.py::test_shape[8-1024-512-1] - RuntimeError: CUDA error: too many resources requested for launch
FAILED tests/test_rademacher.py::test_shape[8-1024-1024-0] - RuntimeError: CUDA error: too many resources requested for launch
FAILED tests/test_rademacher.py::test_shape[8-1024-1024-1] - RuntimeError: CUDA error: too many resources requested for launch
FAILED tests/test_rademacher.py::test_shape[8-1024-2048-0] - RuntimeError: CUDA error: too many resources requested for launch
FAILED tests/test_rademacher.py::test_shape[8-1024-2048-1] - RuntimeError: CUDA error: too many resources requested for launch
FAILED tests/test_rademacher.py::test_shape[8-2048-512-0] - RuntimeError: CUDA error: too many resources requested for launch
FAILED tests/test_rademacher.py::test_shape[8-2048-512-1] - RuntimeError: CUDA error: too many resources requested for launch
FAILED tests/test_rademacher.py::test_shape[8-2048-1024-0] - RuntimeError: CUDA error: too many resources requested for launch
FAILED tests/test_rademacher.py::test_shape[8-2048-1024-1] - RuntimeError: CUDA error: too many resources requested for launch
FAILED tests/test_rademacher.py::test_shape[8-2048-2048-0] - RuntimeError: CUDA error: too many resources requested for launch
FAILED tests/test_rademacher.py::test_shape[8-2048-2048-1] - RuntimeError: CUDA error: too many resources requested for launch
FAILED tests/test_rademacher.py::test_running - RuntimeError: CUDA error: too many resources requested for launch
FAILED tests/test_rademacher.py::test_even - RuntimeError: CUDA error: too many resources requested for launch
FAILED tests/test_rademacher.py::test_odd - RuntimeError: CUDA error: too many resources requested for launch
================================================================= 15 failed, 4 warnings in 6.86s ==================================================================

xszheng2020 commented 1 year ago

Or could you please share a Docker image that is known to work?

xszheng2020 commented 1 year ago

Hi @kristian-georgiev, I used an A100 instead and the error was gone: 15 passed, 4 warnings in 10.35s.
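
(For anyone comparing setups, a quick way to confirm which card and compute capability torch actually sees, as a minimal sketch:)

import torch

props = torch.cuda.get_device_properties(0)
# prints the device name, compute capability (V100 is sm_70, A100 is sm_80), and SM count
print(props.name, f"sm_{props.major}{props.minor}", props.multi_processor_count)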

kristian-georgiev commented 1 year ago

Thank you @xszheng2020! I can reproduce the RuntimeError: CUDA error: too many resources requested for launch on V100s. This happens because the batch size of 32 is too large. I will resolve this in https://github.com/MadryLab/trak/issues/26 and link it here once it's done.

Regarding the all-zeros gradients produced by project_rademacher_32, my current hypothesis is that it comes from the interaction of TRAK with certain 1.13 versions of torch. I'll let @GuillaumeLeclerc investigate a bit more for potential bugs on our side, but if we continue to experience no problems with torch 2.0.0, we're going to require torch>=2.0.0 for trak 0.1.2 and above in order to avoid such silent errors.

kristian-georgiev commented 1 year ago

Resolved #26 in the 0.1.2 branch (ETA to merge it into main and update PyPI is about a week).

kristian-georgiev commented 1 year ago

Updated the fast_jl requirement from torch>=1.13 to torch>=2.0.0 in TRAK v0.1.2 (https://github.com/MadryLab/trak/pull/28). I believe this should resolve the original issue (all-zeros gradients from project_rademacher_32). Feel free to reopen this issue if you experience something similar.
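
If you want to double-check that your environment meets the new requirement before upgrading TRAK, something like this works (a minimal sketch; it just compares the base torch version against 2.0.0):

import torch
from packaging import version

# strip any local build suffix such as "+cu118" before comparing
base = torch.__version__.split("+")[0]
assert version.parse(base) >= version.parse("2.0.0"), \
    "trak>=0.1.2 requires torch>=2.0.0"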