OSError: dlopen after running "poetry run python topognn/train_model.py --model TopoGNN --dataset DD --max_epochs 10"

BorgwardtLab / TOGL

Topological Graph Neural Networks (ICLR 2022)

https://openreview.net/pdf?id=oxxUMeFwEHd

BSD 3-Clause "New" or "Revised" License

OSError: dlopen after running "poetry run python topognn/train_model.py --model TopoGNN --dataset DD --max_epochs 10" #12

Closed cinziabandiziol closed 1 year ago

cinziabandiziol commented 1 year ago

Hello,

I'm trying to install the whole TOGL repository, including pyper and torch_persistent_homology. Everything seems to run correctly, apart from some warnings after installing the 4 dependencies of torch-geometric. But when I run the command poetry run python topognn/train_model.py --model TopoGNN --dataset DD --max_epochs 10 on my Mac, an error occurs:

OSError: dlopen(/Users/user/Library/Caches/pypoetry/virtualenvs/topognn-PYWNzDgl-py3.8/lib/python3.8/site-packages/torch_sparse/_convert.so, 6): Symbol not found: __ZN2at5emptyEN3c108ArrayRefIxEERKNS0_13TensorOptionsENS0_8optionalINS0_12MemoryFormatEEE

Referenced from: /Users/user/Library/Caches/pypoetry/virtualenvs/topognn-PYWNzDgl-py3.8/lib/python3.8/site-packages/torch_sparse/_convert.so

Expected in: /Users/user/Library/Caches/pypoetry/virtualenvs/topognn-PYWNzDgl-py3.8/lib/python3.8/site-packages/torch/lib/libtorch_cpu.dylib

in /Users/user/Library/Caches/pypoetry/virtualenvs/topognn-PYWNzDgl-py3.8/lib/python3.8/site-packages/torch_sparse/_convert.so

For reference, I have Python 3.8.10 installed and I'm using a MacBook Air (M1, 2020) with an Apple M1 chip and 8 GB of RAM.

I have also tried to install the current repository on Linux with a suitable Docker setup, following the suggestions in the conversation titled "Undefined Symbol _ZN3c106detail19maybe_wrap_dim_slowEllb", but the code still doesn't work and stops with the following error:

Screenshot at 2022-11-13 12-53-40

Regardless of the OS involved, the issue seems to be related to some .so file that cannot be opened or found. I have no idea what causes this problem or how to solve it. Could you please give me any suggestions for handling this issue? I found your paper really interesting, and it's a pity to have these problems running the code.

Thanks in advance.

Pseudomanifold commented 1 year ago

Sorry for your troubles here! This seems like an issue with CUDA. Could you try installing everything without GPU support, just to see where the problem is coming from? (In any case, this seems to be raised by a component from pytorch-geometric over which we don't necessarily have full control. I hope we can get to the bottom of this.)
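One way to confirm whether the PyTorch in the active environment is a CPU-only build is sketched below. It relies on torch.__version__ and torch.version.cuda, which is None for builds without CUDA support; the helper name describe_torch_build is hypothetical and not part of TOGL or PyTorch.

```python
def describe_torch_build(torch_module):
    """Summarise whether a torch-like module looks like a CPU-only build."""
    # torch.version.cuda is None when PyTorch was built without CUDA support.
    cuda_version = getattr(torch_module.version, "cuda", None)
    return {
        "torch_version": torch_module.__version__,
        "cuda_version": cuda_version,
        "cpu_only": cuda_version is None,
    }


if __name__ == "__main__":
    try:
        import torch
    except (ImportError, OSError) as exc:
        # OSError covers a torch whose shared libraries fail to load.
        print(f"torch could not be imported: {exc}")
    else:
        print(describe_torch_build(torch))
```

Run this with the same interpreter that poetry uses (for example via poetry run python), so that it inspects the environment in which the error occurs.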

cinziabandiziol commented 1 year ago

By "could you try installing everything without GPU support...", do you mean running the command poetry run install_deps_cpu? If so, I've already done that on my Mac and the error is the same. Otherwise, please give me further information so that I can run suitable tests.

Pseudomanifold commented 1 year ago

No, that's what I meant. In this case, I would suggest trying to update torch_sparse in the respective virtual environment. Can you check whether the version was built from source or not?
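As a rough aid for this check, the sketch below reads the installed versions and wheel tags from package metadata without importing the packages, which matters here because importing torch_sparse is exactly what fails. The helper name installed_build_info is hypothetical, and the distribution names in the example call are assumptions about how the packages are registered. The wheel tags only show which platform the installed binary targets; they do not prove whether it was built from source.

```python
from importlib import metadata  # in the standard library from Python 3.8


def installed_build_info(package_names):
    """Return version and wheel tags for each package, without importing it."""
    info = {}
    for name in package_names:
        try:
            dist = metadata.distribution(name)
        except metadata.PackageNotFoundError:
            info[name] = None  # not installed in this environment
            continue
        # The WHEEL metadata file lists the platform tags of the installed build.
        wheel_text = dist.read_text("WHEEL") or ""
        tags = [
            line.split(":", 1)[1].strip()
            for line in wheel_text.splitlines()
            if line.startswith("Tag:")
        ]
        info[name] = {"version": dist.version, "wheel_tags": tags}
    return info


if __name__ == "__main__":
    # Assumed distribution names; adjust to match the environment.
    names = ["torch", "torch-sparse", "torch-scatter", "torch-geometric"]
    for name, details in installed_build_info(names).items():
        print(name, details)
```

If the reported torch version differs from the one the torch_sparse binary was compiled against, that mismatch may explain a "Symbol not found" error like the one above.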

cinziabandiziol commented 1 year ago

After several tests, I have finally installed pytorch, torch-geometric and the other dependencies. I think the problem was related to how the libraries were installed: you must install the CPU-only builds of these libraries, and you have to declare this explicitly during installation. When I import these libraries in a dedicated environment, everything works. When I put it all together with TOGL and run a script containing from torch_persistent_homology.persistent_homology_cpu import compute_persistence_homology_batched_mt, there is a problem. I am probably making some mistakes with poetry, so I have some questions to clear up my doubts.

I created a conda environment with Python 3.8 and, after activating it, I ran the command poetry install 3 times, once in each of the 3 directories pyper-master, torch_persistent_homology and TOGL-main. Is that right?
For pyper-master the installation goes fine, but for the torch_persistent_homology package some errors occur because some CUDA-related libraries cannot be installed. I'm not interested in using CUDA and the GPU, only the CPU. I don't understand how or where to modify the .toml file in torch_persistent_homology in order to disable the installation of the nvidia packages. Can you help me with this as well? Even if I delete the existing .lock file, the same issue occurs.
There is also a setup.py file. Should I use it during installation?
Finally, I think the main problem is related to the file persistent_homology_cpu.cpp, from which the file train.py tries to import a function. As far as I know, you can import a "cpp file" in a Python one only if it has been compiled. I compile it after running poetry install in the torch_persistent_homology directory. Is that correct?
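To see whether a compiled extension for persistent_homology_cpu is actually present in the environment, a minimal sketch along these lines could help. It only locates the module file and checks that it has a compiled-extension suffix (such as .so); it does not load the extension. The helper name find_compiled_extension is hypothetical, and the module path is taken from the import statement quoted above.

```python
import importlib.machinery
import importlib.util


def find_compiled_extension(module_name):
    """Return the path of a compiled extension module, or None if absent."""
    try:
        # For a dotted name, this imports the parent package first.
        spec = importlib.util.find_spec(module_name)
    except (ImportError, OSError):
        # The parent package is missing or failed to load.
        return None
    if spec is None or spec.origin is None:
        return None
    # Compiled extensions end in a platform-specific suffix such as ".so".
    if spec.origin.endswith(tuple(importlib.machinery.EXTENSION_SUFFIXES)):
        return spec.origin
    return None


if __name__ == "__main__":
    name = "torch_persistent_homology.persistent_homology_cpu"
    path = find_compiled_extension(name)
    print(path if path else f"No compiled extension found for {name}")
```

A result of None suggests the C++ source was never built into the environment; a path means the binary exists, and any remaining import error would then come from loading or linking it.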

I'm sorry, these are indeed a lot of questions, but it's better to understand the process in order to run all the scripts successfully. Thanks in advance.

Pseudomanifold commented 1 year ago

Can you post the specific error messages that you get here?

cinziabandiziol commented 1 year ago

Now the error issue screen is

Pseudomanifold commented 1 year ago

This looks like a linker problem. Can you try installing this module manually from the sources (i.e. cloning the module and executing setup.py in the respective conda environment)? Very sorry for the hassle!

cinziabandiziol commented 1 year ago

I'm really sorry for the delay in answering. I followed your suggestions, but without any good results. I have since found other code and finally ran my tests. I have not fixed the issue yet, but I'm now working on another topic, so I will stop running tests with the code from this repository. Thanks again for your support.

Pseudomanifold commented 1 year ago

Sorry to hear that! All the best for your tests; let me know what other code we could integrate here.