slicetca.grid_search returned nan

441YSK441 commented 3 weeks ago

Hi. Thank you for developing this great analysis tool.

When I run the code below in sliceTCA_notebook_1.ipynb, loss_grid is nan, whereas if I remove mask_train and mask_test, loss_grid returns values. Do you know how to solve this problem? I did not modify anything except sample_size.

# this will take a while to run as it fits 3*3*3*4 = 108 models (3*3*3*1 = 27 with sample_size=1)
loss_grid, seed_grid = slicetca.grid_search(reconstructed_noisy_tensor,
                                            min_ranks = [3, 0, 0],
                                            max_ranks = [5, 2, 2],
                                            sample_size=1,
                                            mask_train=train_mask,
                                            mask_test=test_mask,
                                            processes_grid=4,
                                            seed=1,
                                            min_std=10**-4,
                                            learning_rate=5*10**-3,
                                            max_iter=10**4,
                                            positive=True)
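The comment's arithmetic can be checked directly. This minimal sketch assumes, as the notebook comment implies, that grid_search fits one model per rank combination per sample, with each rank running from min_ranks to max_ranks inclusive; `n_grid_fits` is a hypothetical helper, not part of slicetca.

```python
from math import prod

def n_grid_fits(min_ranks, max_ranks, sample_size):
    # Number of rank values per slice type (inclusive range), e.g. 3..5 -> 3 values.
    per_dim = [hi - lo + 1 for lo, hi in zip(min_ranks, max_ranks)]
    # Assumption: one fit per rank combination per sample.
    return prod(per_dim) * sample_size

print(n_grid_fits([3, 0, 0], [5, 2, 2], 4))  # 108, the notebook default
print(n_grid_fits([3, 0, 0], [5, 2, 2], 1))  # 27, with sample_size=1
```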

One more request: could you share additional code that walks through the analysis pipeline in Figure 3 of the Pellegrino et al. paper?

arthur-pe commented 3 weeks ago

Hi, when I run the grid_search with a mask I don't get nans in loss_grid, so I'd need a bit more information to reproduce the bug:

Could you let us know your Python and PyTorch versions (or whether you are running the notebook on Colab)?
Can you share the arguments you are passing to slicetca.block_mask? One way I can see the loss being nan only when a mask is used is if your train or test mask is False for every entry.
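To illustrate why an all-False mask would produce nan: if the loss is a mean over the masked entries (an assumption; slicetca's internal loss may be computed differently), averaging over an empty selection yields NaN in PyTorch. `masked_mse` and `check_masks` are hypothetical helpers, not slicetca functions.

```python
import torch

def masked_mse(x, x_hat, mask):
    # Mean squared error over entries where mask is True.
    # Assumption: a stand-in for, not a copy of, slicetca's loss.
    diff = (x - x_hat)[mask]
    return (diff ** 2).mean()  # mean of an empty tensor is nan

def check_masks(mask_train, mask_test):
    # Report how many entries each mask keeps and flag empty masks.
    for name, m in (("train", mask_train), ("test", mask_test)):
        frac = m.float().mean().item()
        print(f"{name} mask: {int(m.sum())} True entries ({frac:.3f} of total)")
        if not m.any():
            print(f"  warning: {name} mask is False everywhere; its loss will be nan")

x = torch.randn(4, 5, 6)
x_hat = torch.zeros_like(x)
full = torch.ones_like(x, dtype=torch.bool)
empty = torch.zeros_like(x, dtype=torch.bool)

print(masked_mse(x, x_hat, full))   # finite value
print(masked_mse(x, x_hat, empty))  # tensor(nan)
check_masks(full, empty)
```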

The pipeline schematized in Fig. 3 is roughly what the notebook does.

441YSK441 commented 3 weeks ago

Thank you for the reply.

I used Python 3.10.12 and torch 2.3.0+cu121.
I did not change anything in the code except sample_size in slicetca.grid_search (from 4 to 1), to speed up the computation. On Colab, I get a correct loss_grid.
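Since the problem later appears to depend on the operating system (Windows vs. Mac), a report for this kind of issue might also include the OS and CUDA build. A minimal sketch that prints these using only standard Python and PyTorch attributes:

```python
import platform
import sys

import torch

# Environment details useful when reporting a platform-dependent bug.
print("OS:", platform.system(), platform.release())
print("Python:", sys.version.split()[0])
print("PyTorch:", torch.__version__)
# torch.version.cuda is None for CPU-only builds.
print("CUDA build:", torch.version.cuda, "| CUDA available:", torch.cuda.is_available())
```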

One thing I noticed: after running slicetca.grid_search in my environment, reconstructed_noisy_tensor becomes all zeros, and train_mask and test_mask become all False. (Before running slicetca.grid_search, these tensors look normal.)
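One way to confirm whether a call modifies its inputs in place is to snapshot them beforehand and compare afterwards. The helpers `snapshot` and `changed` below are hypothetical, and the demo simulates the reported zeroing with `zero_()` rather than calling slicetca. If the inputs do change, passing `.clone()` copies to grid_search, or setting processes_grid=1 to rule out multiprocessing, are untested workarounds worth trying (assumptions, not fixes confirmed in this thread).

```python
import torch

def snapshot(*tensors):
    # Detached copies taken before a call that might mutate its inputs.
    return [t.detach().clone() for t in tensors]

def changed(before, *tensors):
    # Positional indices of tensors that no longer match their snapshot.
    return [i for i, (b, t) in enumerate(zip(before, tensors)) if not torch.equal(b, t)]

# Stand-ins for reconstructed_noisy_tensor and a mask.
data = torch.randn(3, 4)
mask = torch.rand(3, 4) > 0.2
before = snapshot(data, mask)

data.zero_()  # simulate the in-place zeroing described above
print(changed(before, data, mask))  # [0]: only the data tensor changed
```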

Another thing: the numbers of True and False entries in train_mask and test_mask differ between my environment and Colab, as listed below. Do you have any idea what causes these problems?

My environment: train_mask has 6024557 True and 1209943 False entries; test_mask has 672370 True and 6562130 False entries.

Colab: train_mask has 6027016 True and 1207484 False entries; test_mask has 672298 True and 6562202 False entries.
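The counts above can be turned into totals and fractions with a few lines of arithmetic. The values are copied from the two reports; nothing is recomputed from the tensors themselves.

```python
# (True count, False count) per mask, as reported for each environment.
counts = {
    "local": {"train": (6024557, 1209943), "test": (672370, 6562130)},
    "colab": {"train": (6027016, 1207484), "test": (672298, 6562202)},
}

for env, masks in counts.items():
    for name, (n_true, n_false) in masks.items():
        total = n_true + n_false
        print(f"{env:5s} {name:5s} total={total} fraction True={n_true / total:.4f}")
```

Both environments total 7234500 entries, and in both the test mask keeps about 9.3% of them, so the small discrepancy is consistent with different random draws rather than a different mask configuration.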

arthur-pe commented 2 weeks ago

I tried running the example notebook with Python 3.10.12 and torch 2.3.0 but I still can't reproduce the issue.

Perhaps you could try running this to check that the issue is not specific to the notebook:

import torch
import slicetca

device = ('cuda' if torch.cuda.is_available() else 'cpu')
# Small random tensor with train/test block masks
T = torch.randn((10, 10, 10), device=device)
mask_train, mask_test = slicetca.block_mask(list(T.shape), [1, 0, 1], [1, 0, 0], fraction_test=0.1, device=device)
# Short grid search (max_iter=2) just to check whether loss_grid contains nan
loss_grid, seed_grid = slicetca.grid_search(T, mask_train=mask_train, mask_test=mask_test, min_ranks=[0, 0, 0], max_ranks=[1, 0, 1], max_iter=2)
print(loss_grid)

With this, I get a non-nan loss_grid, and test_mask is not modified. Note that you can check the proportion of masked entries with print(test_mask.float().mean()).

Regarding the number of masked entries, I believe this is just a difference in the RNG seeds.
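To illustrate the RNG explanation: a random mask is reproducible only when the generator state is the same. This sketch uses a hypothetical `random_mask` helper (not slicetca.block_mask, whose internals may differ) with explicitly seeded `torch.Generator` objects. PyTorch also does not guarantee identical random streams across platforms or between CPU and CUDA, even with the same seed.

```python
import torch

def random_mask(shape, fraction_test, generator=None):
    # Hypothetical stand-in: each entry is held out with probability fraction_test.
    return torch.rand(shape, generator=generator) < fraction_test

shape = (50, 40, 30)
a = random_mask(shape, 0.1, torch.Generator().manual_seed(0))
b = random_mask(shape, 0.1, torch.Generator().manual_seed(0))
c = random_mask(shape, 0.1, torch.Generator().manual_seed(1))

print(torch.equal(a, b))           # True: same seed, same mask
print(int(a.sum()), int(c.sum()))  # different seed: counts typically differ slightly
```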

441YSK441 commented 2 weeks ago

When I ran the given code once in my setup, loss_grid was nan. However, when I ran it a second time, I got a non-nan loss_grid. In addition, when I ran it on a Mac (I had been running the code on Windows), I got a non-nan loss_grid with every snippet.
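When results are intermittently nan, as here, it can help to check whether the entire loss_grid is nan (which would point to something like an empty mask) or only some entries. A minimal sketch with a hypothetical `nan_entries` helper, assuming loss_grid can be converted to a NumPy array (a CUDA tensor would first need `.cpu()`):

```python
import numpy as np

def nan_entries(loss_grid):
    # Index tuples of every nan entry in an array-like loss grid, whatever its shape.
    arr = np.asarray(loss_grid, dtype=float)
    return [tuple(int(i) for i in idx) for idx in np.argwhere(np.isnan(arr))]

# Synthetic grid with a single nan entry.
grid = np.ones((3, 3, 3, 1))
grid[0, 1, 2, 0] = np.nan
bad = nan_entries(grid)
print(bad)                    # [(0, 1, 2, 0)]
print(len(bad) == grid.size)  # False: only part of the grid is nan
```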

Sorry for the vague comments; these are just case reports. I have no idea what causes the problem, but I can get values on a Mac, so I will probably use the Mac to compute loss_grid. Thank you.

arthur-pe/slicetca, issue #1