Chris-Pedersen / LearnableWavelets

Learnable wavelet neural networks

Wavelets improperly generated in JupyterHub on rusty #21

Open Chris-Pedersen opened 2 years ago

Chris-Pedersen commented 2 years ago

Very odd: when running my notebooks, such as scattering_conv_playground.ipynb, on JupyterHub on rusty, the wavelets stored in scattering.psi have NaNs dotted around them, which obviously breaks all the convolutions. When running the exact same notebook on my desktop or laptop, I get no NaNs (same random seed, and I have verified that the wavelet parameters are the same). I have checked that the kymatio, python, numpy and torch versions are the same. I am running JupyterHub with a kernel constructed from my conda environment on rusty; when I rerun the same code block from the terminal in this environment, I also get no NaNs, so I don't think this is an environment issue.

So just to be clear:

  1. The filter parameters are definitely not causing it: the same filter params produce no NaNs in the terminal or on a different cluster.
  2. Running the same code in the same conda environment from the terminal produces no NaNs.
  3. The key package versions (e.g. numpy and torch) match across the setups, so they are not what's producing the NaNs.

A bit at a loss on this one at the moment. Perhaps something unusual is being done when Jupyter imports the conda environment kernel; maybe the best option is to contact SCC. @eickenberg have you ever encountered anything like this?
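
In the meantime, a quick way to locate the bad entries is to scan the generated filters for NaNs directly. A minimal sketch, assuming scattering.psi is a list of dicts whose tensor-valued entries are the filters (as in the printouts in the next comment):

import torch

def report_nans(psi):
    # Sketch: walk the filter bank and report any tensor containing NaNs.
    for j, filt in enumerate(psi):
        for key, value in filt.items():
            if torch.is_tensor(value) and torch.isnan(value).any():
                n_bad = int(torch.isnan(value).sum())
                print(f"psi[{j}][{key}]: {n_bad} NaN entries")

# e.g. report_nans(scatteringBase.scattering.psi)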

Chris-Pedersen commented 2 years ago

Just to add a bit more on this: when running https://github.com/Chris-Pedersen/Wavelets/blob/main/playground/scattering_conv_playground.ipynb on JupyterHub on rusty and printing the wavelet parameters, I get:

[tensor([4.3760, 1.7979, 1.4253, 3.4640, 4.5206, 2.6585, 6.1623, 4.3029, 3.0218,
         2.4637, 2.1563, 4.5808, 2.7556, 0.3750, 2.5010, 4.6370]),
 tensor([0.7154, 0.7468, 0.7129, 0.6561, 0.7132, 0.9467, 0.9721, 0.7509, 0.8120,
         0.5578, 0.6586, 0.7074, 0.9332, 0.6252, 0.7415, 0.9928]),
 tensor([4.3619, 4.5219, 3.0103, 4.8131, 4.5062, 4.4083, 3.9638, 3.8507, 4.1507,
         4.6248, 4.8696, 4.3449, 4.6075, 4.4783, 4.5408, 4.6153]),
 tensor([0.6825, 0.6755, 1.0316, 1.0318, 1.1344, 1.3494, 1.2245, 1.1110, 1.2224,
         0.8230, 0.8618, 0.7283, 0.7937, 1.1310, 0.5921, 0.9337])]

and printing scatteringBase.psi[0][0] gives

tensor([[[-1.2261e-08],
         [        nan],
         [-3.5012e-03],
         ...,
         [ 1.1419e-02],
         [ 6.5788e-03],
         [ 2.8291e-03]],

        [[-7.9257e-04],
         [        nan],
         [-3.9038e-03],
         ...,

Doing the same thing in the terminal, using the same conda environment, I get:

>>> scatteringBase.params_filters
[tensor([4.3760, 1.7979, 1.4253, 3.4640, 4.5206, 2.6585, 6.1623, 4.3029, 3.0218,
        2.4637, 2.1563, 4.5808, 2.7556, 0.3750, 2.5010, 4.6370]), tensor([0.7154, 0.7468, 0.7129, 0.6561, 0.7132, 0.9467, 0.9721, 0.7509, 0.8120,
        0.5578, 0.6586, 0.7074, 0.9332, 0.6252, 0.7415, 0.9928]), tensor([4.3619, 4.5219, 3.0103, 4.8131, 4.5062, 4.4083, 3.9638, 3.8507, 4.1507,
        4.6248, 4.8696, 4.3449, 4.6075, 4.4783, 4.5408, 4.6153]), tensor([0.6825, 0.6755, 1.0316, 1.0318, 1.1344, 1.3494, 1.2245, 1.1110, 1.2224,
        0.8230, 0.8618, 0.7283, 0.7937, 1.1310, 0.5921, 0.9337])]

>>> scatteringBase.scattering.psi[0][0]
tensor([[[-1.2644e-08],
         [-2.0639e-03],
         [-3.5013e-03],
         ...,
         [ 1.1419e-02],
         [ 6.5788e-03],
         [ 2.8291e-03]],

        [[-7.9257e-04],
         [-2.6254e-03],
         [-3.9038e-03],
         ...,

So somehow the filters that the code is producing are different, even though the parameters are the same.
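
One way to pin down where the two environments diverge would be to dump both the parameters and the generated filters to disk in each environment and compare them offline. A rough sketch (the file names are made up, and the attribute layout is taken from the printouts above):

import torch

# Run once per environment, changing the file name accordingly.
torch.save({"params": scatteringBase.params_filters,
            "psi00": scatteringBase.scattering.psi[0][0]},
           "filters_jupyterhub.pt")

# Once both files exist, compare them:
a = torch.load("filters_jupyterhub.pt")
b = torch.load("filters_terminal.pt")
print([torch.allclose(p, q) for p, q in zip(a["params"], b["params"])])
print(torch.isnan(a["psi00"]).any(), torch.isnan(b["psi00"]).any())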

Chris-Pedersen commented 2 years ago

OK, this is somehow fixed by the latest PR #26, so closing; perhaps it was some bug with the kernel.

Chris-Pedersen commented 2 years ago

Reopening this as it's becoming a bit of an obstacle to running some tests. It occurs when using randomly initialised filters together with the conda environment. Running the same script, first outside the conda environment and then inside it:

[cpedersen@workergpu115 scripts]$ python3 sn_debug.py
CUDA Available
/mnt/home/cpedersen/.local/lib/python3.6/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
tensor([[[ 8.8521e-10+0.0000e+00j,  6.7210e-05+1.2745e-11j,
           1.3836e-04+1.0620e-12j,  ...,
          -1.6191e-04+1.6484e-11j, -1.1581e-04+1.2596e-11j,
          -6.1489e-05+1.3602e-11j],
         [ 4.1143e-04-1.4684e-09j,  5.1359e-04-1.1640e-09j,
           6.1729e-04-9.0628e-10j,  ...,
           1.3937e-04-2.3496e-09j,  2.2221e-04-2.0354e-09j,
           3.1352e-04-1.7559e-09j],
         [ 1.0348e-03-2.9451e-09j,  1.1884e-03-2.9773e-09j,
           1.3399e-03-3.0218e-09j,  ...,
           5.9933e-04-2.6364e-09j,  7.3684e-04-2.7482e-09j,
           8.8304e-04-2.8732e-09j],
         ...,
         [-5.0493e-04-4.1458e-08j, -4.8694e-04-4.4042e-08j,
          -4.6242e-04-4.6318e-08j,  ...,
          -5.1632e-04-3.2544e-08j, -5.1970e-04-3.5603e-08j,
          -5.1593e-04-3.8625e-08j],
         [-4.1962e-04-4.5405e-08j, -3.9140e-04-4.8363e-08j,
          -3.5698e-04-5.0979e-08j,  ...,
          -4.6110e-04-3.5220e-08j, -4.5478e-04-3.8732e-08j,
          -4.4090e-04-4.2151e-08j],
         [-2.6202e-04-4.3522e-08j, -2.1824e-04-4.5917e-08j,
          -1.6907e-04-4.8030e-08j,  ...,
          -3.5093e-04-3.5202e-08j, -3.2904e-04-3.8081e-08j,
          -2.9923e-04-4.0869e-08j]],
(wavelet) [cpedersen@workergpu115 scripts]$ python3 sn_debug.py
CUDA Available
/mnt/home/cpedersen/miniconda3/envs/wavelet/lib/python3.9/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  /opt/conda/conda-bld/pytorch_1640811757271/work/aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
tensor([[[nan+nanj, nan+nanj, nan+nanj,  ..., nan+nanj, nan+nanj, nan+nanj],
         [nan+nanj, nan+nanj, nan+nanj,  ..., nan+nanj, nan+nanj, nan+nanj],
         [nan+nanj, nan+nanj, nan+nanj,  ..., nan+nanj, nan+nanj, nan+nanj],
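
The two warnings above also show that the runs resolve different python installations (the ~/.local python 3.6 vs the miniconda python 3.9), so a quick sanity check at the top of sn_debug.py is to print which interpreter and package versions are actually in use. A minimal sketch:

import sys
import numpy
import torch

# Report exactly which interpreter and packages this run resolves.
print("python:", sys.executable, sys.version.split()[0])
print("numpy :", numpy.__version__, numpy.__file__)
print("torch :", torch.__version__, torch.__file__)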

Chris-Pedersen commented 2 years ago

So this is due to the numpy version, and specifically to the random seed generation. I will fix this after the ICML submission; a workaround for now is to use numpy=1.19.5, whereas my conda environment was using 1.21.2. Leaving this open as a reminder to make the random initialisation code less strongly dependent on the numpy version; a possible direction is sketched below.
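
A possible direction (a sketch, not the repo's actual code; the four parameter groups and their ranges are placeholders) is to draw the initial filter parameters from torch's own seeded generator, so the values no longer depend on numpy's random module or its version:

import math
import torch

def random_filter_params(n_filters, seed=0):
    # Sketch: seeded torch generator instead of numpy's global random state.
    gen = torch.Generator().manual_seed(seed)
    orientations = 2.0 * math.pi * torch.rand(n_filters, generator=gen)
    xis = 0.5 + 0.5 * torch.rand(n_filters, generator=gen)
    sigmas = 3.0 + 2.0 * torch.rand(n_filters, generator=gen)
    slants = 0.5 + 1.0 * torch.rand(n_filters, generator=gen)
    return [orientations, xis, sigmas, slants]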