BorgwardtLab / TOGL

Topological Graph Neural Networks (ICLR 2022)
https://openreview.net/pdf?id=oxxUMeFwEHd
BSD 3-Clause "New" or "Revised" License

Errors while installing #3

Closed by DavideBuffelli 2 years ago

DavideBuffelli commented 2 years ago

Hi, I am sorry in advance for the long post, but I am having a lot of issues installing this repository, and I would highly appreciate any help you can provide.

I am trying to install this code on a machine with Linux 20.04.3 and CUDA 11.1.

The first thing I tried was to follow exactly the instructions you provide, so I did:

conda create -n togl python=3.8.0
conda activate togl
poetry install 
poetry run install_deps_cu110
poetry run python topognn/train_model.py --model TopoGNN --dataset ENZYMES --max_epochs 10

but I get the following error:

File "topognn/train_model.py", line 12, in <module>
    import topognn.models as models
  File "/home/dbuf/togl/TOGL/topognn/models.py", line 10, in <module>
    from torch_geometric.nn import GCNConv, GINConv, global_mean_pool, global_add_pool
  File "/home/dbuf/anaconda3/envs/togl/lib/python3.8/site-packages/torch_geometric/__init__.py", line 2, in <module>
    import torch_geometric.nn
  File "/home/dbuf/anaconda3/envs/togl/lib/python3.8/site-packages/torch_geometric/nn/__init__.py", line 2, in <module>
    from .data_parallel import DataParallel
  File "/home/dbuf/anaconda3/envs/togl/lib/python3.8/site-packages/torch_geometric/nn/data_parallel.py", line 5, in <module>
    from torch_geometric.data import Batch
  File "/home/dbuf/anaconda3/envs/togl/lib/python3.8/site-packages/torch_geometric/data/__init__.py", line 1, in <module>
    from .data import Data
  File "/home/dbuf/anaconda3/envs/togl/lib/python3.8/site-packages/torch_geometric/data/data.py", line 8, in <module>
    from torch_sparse import coalesce, SparseTensor
  File "/home/dbuf/anaconda3/envs/togl/lib/python3.8/site-packages/torch_sparse/__init__.py", line 12, in <module>
    torch.ops.load_library(importlib.machinery.PathFinder().find_spec(
  File "/home/dbuf/anaconda3/envs/togl/lib/python3.8/site-packages/torch/_ops.py", line 104, in load_library
    ctypes.CDLL(path)
  File "/home/dbuf/anaconda3/envs/togl/lib/python3.8/ctypes/__init__.py", line 369, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /home/dbuf/anaconda3/envs/togl/lib/python3.8/site-packages/torch_sparse/_convert.so: undefined symbol: _ZN6caffe28TypeMeta21_typeMetaDataInstanceIdEEPKNS_6detail12TypeMetaDataEv

I thought the cause might be that the PyTorch Geometric libraries were installed for CUDA 11.0 rather than the CUDA 11.1 on my machine, so I modified the dep.py file accordingly and reinstalled everything, but I then got the following error:

Traceback (most recent call last):
  File "topognn/train_model.py", line 12, in <module>
    import topognn.models as models
  File "/home/dbuf/togl/TOGL/topognn/models.py", line 10, in <module>
    from torch_geometric.nn import GCNConv, GINConv, global_mean_pool, global_add_pool
  File "/home/dbuf/anaconda3/envs/togl/lib/python3.8/site-packages/torch_geometric/__init__.py", line 2, in <module>
    import torch_geometric.nn
  File "/home/dbuf/anaconda3/envs/togl/lib/python3.8/site-packages/torch_geometric/nn/__init__.py", line 2, in <module>
    from .data_parallel import DataParallel
  File "/home/dbuf/anaconda3/envs/togl/lib/python3.8/site-packages/torch_geometric/nn/data_parallel.py", line 5, in <module>
    from torch_geometric.data import Batch
  File "/home/dbuf/anaconda3/envs/togl/lib/python3.8/site-packages/torch_geometric/data/__init__.py", line 1, in <module>
    from .data import Data
  File "/home/dbuf/anaconda3/envs/togl/lib/python3.8/site-packages/torch_geometric/data/data.py", line 8, in <module>
    from torch_sparse import coalesce, SparseTensor
  File "/home/dbuf/anaconda3/envs/togl/lib/python3.8/site-packages/torch_sparse/__init__.py", line 15, in <module>
    torch.ops.load_library(importlib.machinery.PathFinder().find_spec(
  File "/home/dbuf/anaconda3/envs/togl/lib/python3.8/site-packages/torch/_ops.py", line 104, in load_library
    ctypes.CDLL(path)
  File "/home/dbuf/anaconda3/envs/togl/lib/python3.8/ctypes/__init__.py", line 369, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory

At this point I thought I should try installing without poetry, as it seems to be messing up the installation of PyTorch and PyG (especially the CUDA versions). I then did the following:

conda create -n togl4 python=3.7.1
conda activate togl4

conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
conda install pyg -c pyg

pip install pytorch-lightning wandb ogb dgl

cd repos/torch_persistent_homology
pip install .

When I run python topognn/train_model.py --model TopoGNN --dataset ENZYMES --max_epochs 10, the code finally starts, but as soon as it enters the training loop, I get the following error:

  File "/home/dbuf/anaconda3/envs/togl4/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 772, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/home/dbuf/anaconda3/envs/togl4/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 724, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/dbuf/anaconda3/envs/togl4/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 812, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/dbuf/anaconda3/envs/togl4/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1237, in _run
    results = self._run_stage() 
  File "/home/dbuf/anaconda3/envs/togl4/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1324, in _run_stage
    return self._run_train()
  File "/home/dbuf/anaconda3/envs/togl4/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1346, in _run_train
    self._run_sanity_check()
  File "/home/dbuf/anaconda3/envs/togl4/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1414, in _run_sanity_check
    val_loop.run()
  File "/home/dbuf/anaconda3/envs/togl4/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/home/dbuf/anaconda3/envs/togl4/lib/python3.7/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 153, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/home/dbuf/anaconda3/envs/togl4/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/home/dbuf/anaconda3/envs/togl4/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 127, in advance
    output = self._evaluation_step(**kwargs)
  File "/home/dbuf/anaconda3/envs/togl4/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 222, in _evaluation_step
    output = self.trainer._call_strategy_hook("validation_step", *kwargs.values())
  File "/home/dbuf/anaconda3/envs/togl4/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1766, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/dbuf/anaconda3/envs/togl4/lib/python3.7/site-packages/pytorch_lightning/strategies/strategy.py", line 344, in validation_step
    return self.model.validation_step(*args, **kwargs)
  File "/home/dbuf/togl/TOGL2/topognn/models.py", line 360, in validation_step
    y_hat = self(batch)
  File "/home/dbuf/anaconda3/envs/togl4/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dbuf/togl/TOGL2/topognn/models.py", line 324, in forward
    x, x_dim1 = self.topo1(x, data)
  File "/home/dbuf/anaconda3/envs/togl4/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dbuf/togl/TOGL2/topognn/models.py", line 202, in forward
    batch = remove_duplicate_edges(batch)
  File "/home/dbuf/togl/TOGL2/topognn/data_utils.py", line 74, in remove_duplicate_edges
    edge_slices = torch.tensor(batch.__slices__["edge_index"],device= device)
  File "/home/dbuf/anaconda3/envs/togl4/lib/python3.7/site-packages/torch_geometric/data/data.py", line 362, in __getattr__
    return getattr(self._store, key)
  File "/home/dbuf/anaconda3/envs/togl4/lib/python3.7/site-packages/torch_geometric/data/storage.py", line 53, in __getattr__
    f"'{self.__class__.__name__}' object has no attribute '{key}'")
AttributeError: 'GlobalStorage' object has no attribute '__slices__'

I thought the issue could be that the version of PyTorch Geometric installed this way is not the one in your poetry.lock file, so I repeated the same procedure but installed torch-geometric==1.6.3. However, I then get the error:

Traceback (most recent call last):
  File "train_model.py", line 13, in <module>
    import models as models
  File "/home/dbuf/togl/TOGL2/topognn/models.py", line 13, in <module>
    from torch_geometric.nn import GCNConv, GINConv, global_mean_pool, global_add_pool
  File "/home/dbuf/anaconda3/envs/togl4/lib/python3.7/site-packages/torch_geometric/__init__.py", line 2, in <module>
    import torch_geometric.nn
  File "/home/dbuf/anaconda3/envs/togl4/lib/python3.7/site-packages/torch_geometric/nn/__init__.py", line 2, in <module>
    from .data_parallel import DataParallel
  File "/home/dbuf/anaconda3/envs/togl4/lib/python3.7/site-packages/torch_geometric/nn/data_parallel.py", line 5, in <module>
    from torch_geometric.data import Batch
  File "/home/dbuf/anaconda3/envs/togl4/lib/python3.7/site-packages/torch_geometric/data/__init__.py", line 5, in <module>
    from .dataloader import DataLoader, DataListLoader, DenseDataLoader
  File "/home/dbuf/anaconda3/envs/togl4/lib/python3.7/site-packages/torch_geometric/data/dataloader.py", line 5, in <module>
    from torch._six import container_abcs, string_classes, int_classes
ImportError: cannot import name 'container_abcs' from 'torch._six' (/home/dbuf/anaconda3/envs/togl4/lib/python3.7/site-packages/torch/_six.py)

I then tried to install PyG from pip instead of from conda:

conda create -n togl5 python=3.7.1
conda activate togl5

pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html

pip install torch-scatter -f https://data.pyg.org/whl/torch-1.8.1+cu111.html
pip install torch-sparse -f https://data.pyg.org/whl/torch-1.8.1+cu111.html
pip install torch-cluster -f https://data.pyg.org/whl/torch-1.8.1+cu111.html
pip install torch-spline-conv -f https://data.pyg.org/whl/torch-1.8.1+cu111.html
pip install torch-geometric==1.6.3

pip install pytorch-lightning wandb ogb dgl

cd repos/torch_persistent_homology
pip install .

But I then get:

Traceback (most recent call last):
  File "train_model.py", line 13, in <module>
    import models as models
  File "/home/dbuf/togl/TOGL2/topognn/models.py", line 13, in <module>
    from torch_geometric.nn import GCNConv, GINConv, global_mean_pool, global_add_pool
  File "/home/dbuf/anaconda3/envs/togl5/lib/python3.7/site-packages/torch_geometric/__init__.py", line 2, in <module>
    import torch_geometric.nn
  File "/home/dbuf/anaconda3/envs/togl5/lib/python3.7/site-packages/torch_geometric/nn/__init__.py", line 2, in <module>
    from .data_parallel import DataParallel
  File "/home/dbuf/anaconda3/envs/togl5/lib/python3.7/site-packages/torch_geometric/nn/data_parallel.py", line 5, in <module>
    from torch_geometric.data import Batch
  File "/home/dbuf/anaconda3/envs/togl5/lib/python3.7/site-packages/torch_geometric/data/__init__.py", line 1, in <module>
    from .data import Data
  File "/home/dbuf/anaconda3/envs/togl5/lib/python3.7/site-packages/torch_geometric/data/data.py", line 8, in <module>
    from torch_sparse import coalesce, SparseTensor
  File "/home/dbuf/anaconda3/envs/togl5/lib/python3.7/site-packages/torch_sparse/__init__.py", line 41, in <module>
    from .tensor import SparseTensor  # noqa
  File "/home/dbuf/anaconda3/envs/togl5/lib/python3.7/site-packages/torch_sparse/tensor.py", line 13, in <module>
    class SparseTensor(object):
  File "/home/dbuf/anaconda3/envs/togl5/lib/python3.7/site-packages/torch/jit/_script.py", line 974, in script
    _compile_and_register_class(obj, _rcb, qualified_name)
  File "/home/dbuf/anaconda3/envs/togl5/lib/python3.7/site-packages/torch/jit/_script.py", line 67, in _compile_and_register_class
    torch._C._jit_script_class_compile(qualified_name, ast, defaults, rcb)
RuntimeError: 
Tried to access nonexistent attribute or method 'crow_indices' of type 'Tensor'.:
  File "/home/dbuf/anaconda3/envs/togl5/lib/python3.7/site-packages/torch_sparse/tensor.py", line 109
    def from_torch_sparse_csr_tensor(self, mat: torch.Tensor,
                                     has_value: bool = True):
        rowptr = mat.crow_indices()
                 ~~~~~~~~~~~~~~~~ <--- HERE
        col = mat.col_indices()

At this point I was desperate and then tried to just install the CPU version, first like this:

conda create -n togl_cpu python=3.7.1
conda activate togl_cpu 

poetry install 
poetry run install_deps_cpu 
poetry run python topognn/train_model.py --model TopoGNN --dataset ENZYMES --max_epochs 10

which leads to

Traceback (most recent call last):
  File "topognn/train_model.py", line 12, in <module>
    import topognn.models as models
  File "/home/dbuf/togl/TOGL_cpu/topognn/models.py", line 10, in <module>
    from torch_geometric.nn import GCNConv, GINConv, global_mean_pool, global_add_pool
  File "/home/dbuf/anaconda3/envs/togl_cpu/lib/python3.7/site-packages/torch_geometric/__init__.py", line 2, in <module>
    import torch_geometric.nn
  File "/home/dbuf/anaconda3/envs/togl_cpu/lib/python3.7/site-packages/torch_geometric/nn/__init__.py", line 2, in <module>
    from .data_parallel import DataParallel
  File "/home/dbuf/anaconda3/envs/togl_cpu/lib/python3.7/site-packages/torch_geometric/nn/data_parallel.py", line 5, in <module>
    from torch_geometric.data import Batch
  File "/home/dbuf/anaconda3/envs/togl_cpu/lib/python3.7/site-packages/torch_geometric/data/__init__.py", line 1, in <module>
    from .data import Data
  File "/home/dbuf/anaconda3/envs/togl_cpu/lib/python3.7/site-packages/torch_geometric/data/data.py", line 8, in <module>
    from torch_sparse import coalesce, SparseTensor
  File "/home/dbuf/anaconda3/envs/togl_cpu/lib/python3.7/site-packages/torch_sparse/__init__.py", line 13, in <module>
    library, [osp.dirname(__file__)]).origin)
  File "/home/dbuf/anaconda3/envs/togl_cpu/lib/python3.7/site-packages/torch/_ops.py", line 104, in load_library
    ctypes.CDLL(path)
  File "/home/dbuf/anaconda3/envs/togl_cpu/lib/python3.7/ctypes/__init__.py", line 356, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /home/dbuf/anaconda3/envs/togl_cpu/lib/python3.7/site-packages/torch_sparse/_convert.so: undefined symbol: _ZN6caffe28TypeMeta21_typeMetaDataInstanceIN3c107complexIfEEEEPKNS_6detail12TypeMetaDataEv

And finally like this:

conda create -n togl_cpu python=3.7.1
conda activate togl_cpu 

conda install pytorch==1.8.1 torchvision==0.9.1 torchaudio==0.8.1 cpuonly -c pytorch

pip install --no-index torch-scatter -f https://pytorch-geometric.com/whl/torch-1.8.1+cpu.html 
pip install --no-index torch-sparse -f https://pytorch-geometric.com/whl/torch-1.8.1+cpu.html       
pip install --no-index torch-cluster -f https://pytorch-geometric.com/whl/torch-1.8.1+cpu.html   
pip install --no-index torch-spline-conv -f https://pytorch-geometric.com/whl/torch-1.8.1+cpu.html 
pip install torch-geometric==1.6.3

pip install pytorch-lightning wandb ogb dgl

cd repos/torch_persistent_homology
pip install .

which leads to

Traceback (most recent call last):
  File "train_model.py", line 13, in <module>
    import models as models
  File "/home/dbuf/togl/TOGL2/topognn/models.py", line 13, in <module>
    from torch_geometric.nn import GCNConv, GINConv, global_mean_pool, global_add_pool
  File "/home/dbuf/anaconda3/envs/togl_cpu/lib/python3.7/site-packages/torch_geometric/__init__.py", line 2, in <module>
    import torch_geometric.nn
  File "/home/dbuf/anaconda3/envs/togl_cpu/lib/python3.7/site-packages/torch_geometric/nn/__init__.py", line 2, in <module>
    from .data_parallel import DataParallel
  File "/home/dbuf/anaconda3/envs/togl_cpu/lib/python3.7/site-packages/torch_geometric/nn/data_parallel.py", line 5, in <module>
    from torch_geometric.data import Batch
  File "/home/dbuf/anaconda3/envs/togl_cpu/lib/python3.7/site-packages/torch_geometric/data/__init__.py", line 1, in <module>
    from .data import Data
  File "/home/dbuf/anaconda3/envs/togl_cpu/lib/python3.7/site-packages/torch_geometric/data/data.py", line 8, in <module>
    from torch_sparse import coalesce, SparseTensor
  File "/home/dbuf/anaconda3/envs/togl_cpu/lib/python3.7/site-packages/torch_sparse/__init__.py", line 19, in <module>
    torch.ops.load_library(spec.origin)
  File "/home/dbuf/anaconda3/envs/togl_cpu/lib/python3.7/site-packages/torch/_ops.py", line 104, in load_library
    ctypes.CDLL(path)
  File "/home/dbuf/anaconda3/envs/togl_cpu/lib/python3.7/ctypes/__init__.py", line 356, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libc10_cuda.so: cannot open shared object file: No such file or directory

Am I doing something wrong, or have you encountered some of these issues before?

Pseudomanifold commented 2 years ago

Dear Davide,

Thanks for this report. I am sorry you are running into these issues. I cannot comment on all instances individually, but I think that Python should be at least version 3.8 for all this to work. A few of the instances that you tried out appear to me to be code incompatibilities (torch/torch-geometric etc.).
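Since several of the failures above look like version mismatches between torch and the PyG extension packages, it can help to list what is actually installed in the active environment before reinstalling anything. The following is a minimal sketch and not part of the TOGL repository: the package list is an assumption taken from the install commands in this thread, the function names are made up for illustration, and importlib.metadata requires Python 3.8 or newer.

```python
import platform
from importlib import metadata  # in the standard library from Python 3.8 onwards

# Distribution names taken from the install commands in this thread (an assumption,
# not an official list of TOGL requirements).
PACKAGES = (
    "torch",
    "torch-geometric",
    "torch-scatter",
    "torch-sparse",
    "torch-cluster",
    "torch-spline-conv",
    "pytorch-lightning",
)


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def report():
    """Build a short plain-text summary of the Python and package versions."""
    lines = [f"python {platform.python_version()}"]
    for name, version in installed_versions().items():
        lines.append(f"{name} {version or 'not installed'}")
    return "\n".join(lines)


print(report())
```

Running this inside each conda or poetry environment makes it easier to see whether, for example, torch-sparse was installed alongside a different torch release than the one it was built for.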

Your first install script seems to go in the right direction! The error that you are getting indicates that torch-sparse was not installed with proper linker dependencies. What I would try first here is the following:

You could also replace the torch and torch-geometric installation here by the following:

$ pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu111
$ pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-1.8.1+cu111.html

Unless I am mistaken, this should be better aligned with your CUDA installation!
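Note that the first command above does not pin a torch version, whereas the wheel index in the second command is specific to torch 1.8.1 built for CUDA 11.1. One way to keep the two in step is to derive both commands from the same version and CUDA tag. The sketch below only builds the command strings and does not run them. The helper names are hypothetical, the torch_stable.html index is the one used earlier in this thread, and whether this particular pairing works with TOGL is not confirmed here.

```python
def pyg_wheel_index(torch_version, cuda_tag):
    """Return the PyG wheel index URL for a torch version and CUDA tag.

    The URL pattern is copied from the commands in this thread, for example
    https://data.pyg.org/whl/torch-1.8.1+cu111.html
    """
    return f"https://data.pyg.org/whl/torch-{torch_version}+{cuda_tag}.html"


def install_commands(torch_version="1.8.1", cuda_tag="cu111"):
    """Build pip commands that pin torch and point PyG at the matching wheels."""
    index = pyg_wheel_index(torch_version, cuda_tag)
    return [
        # Pin torch explicitly so that pip does not pick a newer release.
        f"pip install torch=={torch_version}+{cuda_tag} "
        "-f https://download.pytorch.org/whl/torch_stable.html",
        # Install the compiled PyG extensions from the index for that exact build.
        "pip install torch-scatter torch-sparse torch-cluster torch-spline-conv "
        f"-f {index}",
    ]


for command in install_commands():
    print(command)
```

The point of the design is simply that the torch version appears once, so the torch wheel and the extension wheels cannot drift apart.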

Please let me know whether this works!

DavideBuffelli commented 2 years ago

Thank you very much for your help. I was able to align torch and torch geometric, but the error then moved to the torch_persistent_homology code:

Traceback (most recent call last):
  File "topognn/train_model.py", line 12, in <module>
    import topognn.models as models
  File "/mnt/TOGL_og/topognn/models.py", line 15, in <module>
    from topognn.layers import GCNLayer, GINLayer, GATLayer, SimpleSetTopoLayer, fake_persistence_computation#, EdgeDropout
  File "/mnt/TOGL_og/topognn/layers.py", line 8, in <module>
    from torch_persistent_homology.persistent_homology_cpu import compute_persistence_homology_batched_mt
ImportError: /root/.cache/pypoetry/virtualenvs/topognn-6MuqOE4b-py3.8/lib/python3.8/site-packages/torch_persistent_homology/persistent_homology_cpu.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZNK2at10TensorBase8data_ptrIdEEPT_v

Is there any specific option I should pass during the installation of the torch_persistent_homology repository? I tried removing it and reinstalling, but it didn't help.

Thank you again for your help.

DavideBuffelli commented 2 years ago

Hi, I was finally able to get the code to work after discarding poetry, and updating the code to work with the latest version of PyTorch and PyG. There is, however, one bug that needs to be fixed: in https://github.com/BorgwardtLab/TOGL/blob/2b47e647ec722dee936f3c7be4c1b74e437d2f07/topognn/models.py#L309 topo1 is an instance of TopologyLayer, whose forward pass returns a tuple of three elements, not two. Would it be correct to do the following?

x, x_dim1, _ = self.topo1(x, data)
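To illustrate the unpacking problem in isolation, here is a small sketch that uses a hypothetical stand-in function instead of the real TopologyLayer. It only assumes what is stated above, namely that the forward pass returns three values. What the third value contains is not described in this thread, so it is a placeholder here.

```python
def fake_topology_layer(x, data):
    """Hypothetical stand-in that, like the layer described above, returns three values."""
    x_dim1 = [2.0 * value for value in x]
    extra = None  # placeholder: the meaning of the third output is not stated in this thread
    return x, x_dim1, extra


x = [1.0, 2.0]

# Unpacking into two names fails because the returned tuple has three elements.
try:
    x_out, x_dim1 = fake_topology_layer(x, data=None)
except ValueError as exc:
    two_name_error = str(exc)

# Discarding the third element with "_" mirrors the proposed fix.
x_out, x_dim1, _ = fake_topology_layer(x, data=None)
```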

Pseudomanifold commented 2 years ago

So the torch_persistent_homology issues are best rectified by rebuilding locally. I'm afraid that all these linker errors indicate potential issues with the build process. I am sorry that I cannot be more specific here, but I lack the insight into your system.
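After rebuilding, a quick way to check whether the compiled modules load at all is to import them one by one and record the error instead of stopping at the first failure. This is a minimal sketch. The module names come from the tracebacks in this thread, while the helper name is made up for illustration. It reports linker problems but does not fix them.

```python
import importlib


def try_import(module_name):
    """Try to import a module and return (ok, error_message)."""
    try:
        importlib.import_module(module_name)
    except Exception as exc:  # broad on purpose: this is a diagnostic, not library code
        # The tracebacks in this thread show ImportError, OSError and RuntimeError.
        return False, f"{type(exc).__name__}: {exc}"
    return True, ""


if __name__ == "__main__":
    # Module names taken from the tracebacks in this thread.
    for name in (
        "torch",
        "torch_sparse",
        "torch_geometric",
        "torch_persistent_homology.persistent_homology_cpu",
    ):
        ok, error = try_import(name)
        print(f"{name}: {'ok' if ok else error}")
```

If torch imports cleanly but an extension module reports an undefined symbol, that suggests the extension was compiled against a different torch build than the one currently installed.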

Concerning the bug that you are mentioning, the workaround is correct. Is this raised from the current large GCN model? If this works, let's close the issue. If not, let's open a new one to track the bug.

dgm2 commented 2 years ago

> Hi, I was finally able to get the code to work after discarding poetry, and updating the code to work with the latest version of PyTorch and PyG.

Hi @DavideBuffelli, how did you solve the following error?

AttributeError: 'GlobalStorage' object has no attribute '__slices__'

In other words, how should something like batch.__slices__['x'] be replaced, given that this attribute no longer exists in newer versions of PyG?

Many thanks

DavideBuffelli commented 2 years ago

Hi @dgm2, in newer versions of PyG the corresponding attribute is batch._slice_dict['x']
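For code that has to run against both the older and the newer PyG releases mentioned in this thread, one option is a small accessor that tries both attribute names. This is a sketch under assumptions: the helper name is hypothetical, and the example uses SimpleNamespace objects as stand-ins for real Batch objects so that it runs without PyG installed.

```python
from types import SimpleNamespace


def get_slice_dict(batch):
    """Return the per-attribute slice dictionary of a batch.

    Tries the newer attribute name first ("_slice_dict") and then
    falls back to the older one ("__slices__").
    """
    for attr in ("_slice_dict", "__slices__"):
        slices = getattr(batch, attr, None)
        if slices is not None:
            return slices
    raise AttributeError("batch has neither '_slice_dict' nor '__slices__'")


# Stand-ins for real Batch objects, used only to exercise the helper.
new_style = SimpleNamespace(_slice_dict={"x": [0, 3, 7]})
old_style = SimpleNamespace(**{"__slices__": {"x": [0, 3, 7]}})

print(get_slice_dict(new_style)["x"])
print(get_slice_dict(old_style)["x"])
```

With such a helper, a line like batch.__slices__["edge_index"] could be written as get_slice_dict(batch)["edge_index"] instead.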

Pseudomanifold commented 2 years ago

Can you mention the versions you are using here? We might want to update this at some point.

DavideBuffelli commented 2 years ago

Yes, I tried it on PyG 2.0.4