fdraxler / PyTorch-AutoNEB

PyTorch AutoNEB implementation to identify minimum energy paths, e.g. in neural network loss landscapes
https://arxiv.org/abs/1803.00885
MIT License

Which Pytorch and Python version? #2

Closed jafermarq closed 5 years ago

jafermarq commented 5 years ago

Hello,

Nice paper! I'm trying to run one of the smallest models you provide in the directory with all the .yaml files, but I'm encountering some issues. Could you let me know which PyTorch version (I suspect <0.4) and Python version (I guess >3.5) you used? Also, for the configuration cifar10-cnn12.yaml, how long is the script supposed to take? (I have a dual-Xeon CPU setup with a 1080 Ti.)

Thanks!

fdraxler commented 5 years ago

Hey,

thanks a lot for your interest! What issues occur? The code is tested with Python 3.6.5 and PyTorch 0.4.0.

I don't remember the exact timing, but I estimate that finding each minimum will take around 20 minutes, and connecting one pair of minima around double that time. Bear in mind that the shallow CNNs don't have very low barriers, so you should move to networks with at least two layers (these also have fewer parameters because of the pooling layer, which you will appreciate for saving disk space). I personally liked to play around with cifar10-resnet20.yaml.

I hope I could help you!

jafermarq commented 5 years ago

Thanks for the above.

There are some issues with functools.reduce() when it is used to get the total number of elements in a given tensor in ../models/__init__.py, lines 101 and 106. The issue arises because some of the buffer tensors have zero dimensions. For example:

    import torch

    t = torch.tensor(3)
    print(t.dim())    # prints 0
    print(t.shape)    # prints torch.Size([])

    # if instead you do:
    t = torch.tensor([3])
    print(t.dim())    # prints 1
    print(t.shape)    # prints torch.Size([1]), so functools.reduce() works fine

I fixed these issues by checking whether the dimension (not the shape) of the tensor is greater than zero. If not, I skip the reduce() line and set size=1.
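
The failure mode described above can be reproduced without PyTorch, since torch.Size is a subclass of tuple and a zero-dimensional tensor has the empty shape. The following is a minimal sketch, not the repository's code: it assumes the original lines call functools.reduce() on the shape without an initial value, and the helper name `count_elements` is hypothetical. Passing 1 as the initial value gives the same result as the manual `size=1` special case.

```python
import functools
import operator


def count_elements(shape):
    """Number of elements for a shape tuple; an empty shape (0-dim tensor) holds one value."""
    # The initial value 1 makes reduce() well-defined for an empty shape.
    return functools.reduce(operator.mul, shape, 1)


print(count_elements((3, 32, 32)))  # 3072
print(count_elements(()))           # 1

# Without an initial value, reduce() raises TypeError on an empty shape.
try:
    functools.reduce(operator.mul, ())
except TypeError as error:
    print("reduce() without an initial value fails:", error)
```

In PyTorch itself, `tensor.numel()` returns the element count directly and also yields 1 for a zero-dimensional tensor, so it may be a simpler replacement where a tensor (rather than a bare shape) is at hand.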

I spotted another place in the code where a tensor with zero dimensions causes a problem: line 121 in ../models/__init__.py. I avoided the crash in this line by only evaluating it if data.dim() != 0.

With all these changes, I'm not sure whether I've broken some of the algorithm's behavior. After running the algorithm for quite some time using the smallest CNN (cifar10-cnn12.yaml), it crashes when calling bar.update() in the function landscape_exploration in ../torch_autoneb/__init__.py.

jafermarq commented 5 years ago

The issues above happen with PyTorch 0.4.1; with PyTorch 0.4.0 the functools.reduce() issue doesn't happen.

fdraxler commented 5 years ago

Thanks for your detailed analysis! I guess that some buffer initialisation changed in 0.4.1, and that this causes the error where the size of the buffer changes from []. The whole script might break when reading/writing the coordinates from/to the model if the size changes during the execution. Since I cannot test your proposed changes right now, I will change the installation instructions to require 0.4.0 for the moment -.-

As for the update issue: Can you give me the full stack trace, please? You will probably not have this issue when you install the tqdm package. This also lets you see the estimated time to run the script.
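
For context, a common way to keep a progress bar optional is to fall back to a no-op object when tqdm is not installed. The sketch below is an assumption about how such a fallback could look, not the code in torch_autoneb; the stand-in class is hypothetical and only mimics the few tqdm members used here (`total`, `n`, `update()`, `close()`, iteration).

```python
try:
    from tqdm import tqdm  # real progress bar when the package is available
except ImportError:
    class tqdm:
        """Hypothetical no-op stand-in for the small part of tqdm used below."""

        def __init__(self, iterable=None, total=None, **kwargs):
            self.iterable = iterable
            self.total = total
            self.n = 0  # number of steps counted so far

        def __iter__(self):
            for item in self.iterable:
                self.n += 1
                yield item

        def update(self, n=1):
            self.n += n

        def close(self):
            pass


# The calling code is identical with or without tqdm installed.
bar = tqdm(total=3)
for _ in range(3):
    bar.update(1)
bar.close()
```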

jafermarq commented 5 years ago

Installing tqdm solved that issue and everything seems to be running as it should. Thank you for your help on this. Just as a minor side note, pyyaml should be added to requirements.txt.

jafermarq commented 5 years ago

Hello. Your comments above were helpful and the code runs perfectly. However, the generated graph.p seems to be empty when I open it with MATLAB. I ran an experiment using the smallest CIFAR-10 CNN, but with minima_count limited to 4 (instead of 10). This is the output log I get:

Connecting 4 <-> 3 based on disconnected.
Saddle loss between 4 and 3 is 1.069509506225586, included in MST.
Connecting 4 <-> 1 based on disconnected.
Saddle loss between 4 and 1 is 1.0885933637619019, included in MST.
Connecting 4 <-> 2 based on disconnected.
Saddle loss between 4 and 2 is 1.0375339984893799, included in MST.
Connecting 1 <-> 3 based on mst.
Saddle loss between 1 and 3 is 1.0494132041931152, included in MST.
Connecting 3 <-> 2 based on mst.
Saddle loss between 3 and 2 is 0.9639773964881897, included in MST.
Connecting 1 <-> 2 based on mst.
Saddle loss between 1 and 2 is 1.0375984907150269, included in MST.
Average loss in MST: 1.0130366285641987.
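
The average in the last log line can be checked by hand. The sketch below assumes that "Average loss in MST" is the plain mean of the saddle losses over the edges of the final minimum spanning tree; it rebuilds that tree from the six logged saddle losses with Kruskal's algorithm. The function name `minimum_spanning_tree` is hypothetical, and this is not the repository's implementation.

```python
# Saddle losses copied from the log above: (node_a, node_b, loss).
edges = [
    (4, 3, 1.069509506225586),
    (4, 1, 1.0885933637619019),
    (4, 2, 1.0375339984893799),
    (1, 3, 1.0494132041931152),
    (3, 2, 0.9639773964881897),
    (1, 2, 1.0375984907150269),
]


def minimum_spanning_tree(edges):
    """Kruskal's algorithm with a small union-find over node labels."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    tree = []
    # Visit edges from lowest to highest loss; keep those joining two components.
    for a, b, loss in sorted(edges, key=lambda edge: edge[2]):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            tree.append((a, b, loss))
    return tree


mst = minimum_spanning_tree(edges)
average_loss = sum(loss for _, _, loss in mst) / len(mst)
print(mst)
print(average_loss)
```

Under that assumption the tree consists of the edges 3-2, 4-2 and 1-2, and their mean loss agrees with the logged value of about 1.01304.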

The final graph.p is 6.19 GB. When I open it in MATLAB I obtain:

>> graph

ans = 

  graph with properties:

    Edges: [0×1 table]
    Nodes: [0×0 table]

I believe I'm doing something wrong here. What is stored in this file?

Thanks

fdraxler commented 5 years ago

The file is a Python pickle. It contains a networkx.MultiGraph whose nodes are the minima and whose edges are the paths. The Jupyter notebook Evaluate.ipynb shows you how to look at these graphs.
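
As a rough illustration of the answer above, the file can be opened from Python rather than MATLAB. The MATLAB output shown earlier is consistent with calling MATLAB's built-in `graph` constructor with no arguments, which returns an empty graph object, rather than with reading graph.p. The sketch below is a minimal example and does not reproduce Evaluate.ipynb: the helper names `load_graph` and `describe` are hypothetical, and it assumes networkx (and most likely PyTorch, for any stored tensors) is installed in the environment that unpickles the file.

```python
import pickle


def load_graph(path):
    """Unpickle the object stored at `path`. Only load pickles from a trusted source."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def describe(graph):
    """Return (node count, edge count) using standard networkx graph methods."""
    return graph.number_of_nodes(), graph.number_of_edges()


# Usage, assuming graph.p is in the working directory:
#   graph = load_graph("graph.p")
#   print(describe(graph))
#   print(list(graph.nodes))             # the minima
#   print(list(graph.edges(keys=True)))  # the paths; a MultiGraph allows parallel edges
```

Since the file is several gigabytes, loading it needs a comparable amount of free memory.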