agosztolai / MARBLE

Package for the data-driven representation of non-linear dynamics over manifolds based on a statistical distribution of local phase portrait features. Includes specific examples on dynamical systems and on synthetic and real neural datasets. https://agosztolai.github.io/MARBLE/
MIT License

Constructing dataset takes a long time #29

Open XiaoliangWang2001 opened 1 month ago

XiaoliangWang2001 commented 1 month ago

Hello, I am trying to use MARBLE for my data, which has dimensions $40000 \times 8$.

However, when I try to construct a dataset, it can take a few hours and eventually crashes the Jupyter kernel. I pass all my data as a single array, and have also tried tuning the spacing a bit (0.015/0.03) or splitting my data into multiple (2 or 3) arrays (passed as a list), but this does not seem to solve the issue. All other parameters are set to default values.

Do you have any insights into what might be wrong or suggestions about what I might try to get it working? Thanks!

agosztolai commented 1 month ago

Hi Xiaoliang,

thanks for letting me know about the issue. Could you please send me the code you are running, including the hyperparameters you pass into MARBLE? This would be necessary to identify the bottleneck.

Thanks!

XiaoliangWang2001 commented 1 month ago

Hi, thank you for your reply. My code is as follows:

import MARBLE
import numpy as np

train_set = train_set.reshape(-1, 8)  # flatten my data to shape (40000, 8)
diff = np.diff(train_set, axis=0)     # first differences as the vector field
x = [diff]                            # signal (list of arrays)
pos = [train_set[:-1]]                # positions (list of arrays)
data = MARBLE.construct_dataset(pos, x, spacing=0.015)

Also, I had to pre-assign gamma = None in compute_distribution_distances of geometry.py for a single array to be passed to construct_dataset. It seems that when the list length is 1, nl is 0, so the loop on line 231 is never entered. As a result, gamma does not have an assigned value when it is returned.
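
For reference, a minimal sketch of this failure mode and the workaround (hypothetical names; this is not the actual geometry.py code):

def pairwise_quantity(items):
    # Pre-assigning the result avoids an UnboundLocalError on return when
    # there are no pairs to iterate over (e.g. a single input array).
    gamma = None
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            gamma = abs(items[i] - items[j])  # stand-in for the pairwise computation
    return gamma

print(pairwise_quantity([1.0]))       # single array -> no pairs -> None
print(pairwise_quantity([1.0, 3.0]))  # two arrays -> 2.0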

agosztolai commented 1 month ago

I have fixed the dataset construction issue. The last step of construct_dataset() involves computing the eigendecomposition of the graph Laplacian matrix, which is used for the diffusion. In the paper, we worked with smaller datasets, where computing the whole spectrum was feasible. However, we did have the functionality to work with larger datasets; it was just not enabled. I have enabled it now.
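
To illustrate why this matters (a generic sketch, not MARBLE's internal code): computing only a few eigenpairs of a sparse graph Laplacian with scipy.sparse.linalg.eigsh scales much better than a full dense eigendecomposition.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

# Hypothetical sparse graph Laplacian over n data points.
n = 2000
A = sp.random(n, n, density=1e-3, format="csr", random_state=0)
A = A + A.T                                          # symmetrise the adjacency
L = sp.diags(np.asarray(A.sum(axis=1)).ravel()) - A  # combinatorial Laplacian

# Only the 5 smallest eigenpairs, analogous to number_of_eigenvectors=5.
evals, evecs = eigsh(L, k=5, which="SM")

# Computing the full spectrum instead scales as O(n^3) time and O(n^2) memory:
# evals_full, evecs_full = np.linalg.eigh(L.toarray())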

Please pull the latest repo, and see the following code:

import MARBLE
import numpy as np

train_set = np.random.uniform(size=(40000,8))
diff = np.diff(train_set, axis=0)
x = [diff]
pos = [train_set[:-1]]
data = MARBLE.construct_dataset(pos, x, spacing=0.01, number_of_eigenvectors=5)

par = { "epochs": 100,
        "hidden_channels": 100,
        "out_channels": 4,
        "batch_size": 32, 
        "diffusion": True,
      }

model = MARBLE.net(data, params=par)
model.fit(data)

The new addition is number_of_eigenvectors, which limits the number of eigenvectors computed. If you set it to None (the default), all eigenvectors are computed. Here I have set it to 5. It is hard to say what this number should be in general, but I suggest you keep increasing it until your results stabilise. You could also remove the spacing parameter at this point; I have set it just to show a full example.
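
One way to follow that suggestion is a simple sweep (a sketch; the evaluation step is a placeholder you would replace with your own criterion):

import numpy as np
import MARBLE

train_set = np.random.uniform(size=(40000, 8))  # stand-in for your data
x = [np.diff(train_set, axis=0)]
pos = [train_set[:-1]]

par = {"epochs": 100, "hidden_channels": 100, "out_channels": 4,
       "batch_size": 32, "diffusion": True}

# Increase the number of eigenvectors until the results stop changing.
for k in (5, 10, 20, 40):
    data = MARBLE.construct_dataset(pos, x, spacing=0.01, number_of_eigenvectors=k)
    model = MARBLE.net(data, params=par)
    model.fit(data)
    # ... compare the resulting embeddings across successive values of k here ...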

I will look into the bug with compute_distribution_distances later this week. By the sound of it, you have already figured out a workaround.

I hope this helps.

XiaoliangWang2001 commented 1 month ago

Thanks! I have tried 5 eigenvectors, which was very fast to compute (~1 min). I will try increasing the number of eigenvectors to see if I get better results.

agosztolai commented 1 month ago

I am glad that works. Unless your data is very simple, you will likely need more than 5 eigenvectors. It is often a better approach to increase the spacing instead, while keeping the number of eigenvectors high or using all of them (number_of_eigenvectors=None).
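
For example (the spacing value 0.05 below is illustrative; tune it to your data):

import numpy as np
import MARBLE

train_set = np.random.uniform(size=(40000, 8))  # stand-in for your data
x = [np.diff(train_set, axis=0)]
pos = [train_set[:-1]]

# Coarser subsampling via a larger spacing, while keeping all eigenvectors.
data = MARBLE.construct_dataset(pos, x, spacing=0.05, number_of_eigenvectors=None)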

Another approach could be to disable diffusion ("diffusion": False) when constructing the MARBLE model. In that case, you can set number_of_eigenvectors to anything (e.g., 1, to make it quick) because the eigenvectors will not be used during training. If your dataset is not very noisy, this often works. You can then always reintroduce diffusion, which typically yields better results but will likely need more eigenvectors.
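
A minimal sketch of that variant, reusing the parameters from the example above:

import numpy as np
import MARBLE

train_set = np.random.uniform(size=(40000, 8))  # stand-in for your data
x = [np.diff(train_set, axis=0)]
pos = [train_set[:-1]]

# With diffusion disabled, the Laplacian eigenvectors are not used during
# training, so number_of_eigenvectors can be kept minimal (e.g. 1).
data = MARBLE.construct_dataset(pos, x, spacing=0.01, number_of_eigenvectors=1)

par = {"epochs": 100,
       "hidden_channels": 100,
       "out_channels": 4,
       "batch_size": 32,
       "diffusion": False}

model = MARBLE.net(data, params=par)
model.fit(data)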

I hope this helps.