MDIL-SNU / SevenNet

SevenNet - a graph neural network interatomic potential package supporting efficient multi-GPU parallel molecular dynamics simulations.
https://pubs.acs.org/doi/10.1021/acs.jctc.4c00190
GNU General Public License v3.0

LAMMPS-SevenNet build error #51

Closed: turbosonics closed this issue 4 months ago

turbosonics commented 4 months ago

Hello,

On our cluster, we have to use a venv for any PyTorch project.

Following the instructions, I was able to install SevenNet with PyTorch 2.3.0, PyTorch Geometric, and PyTorch Scatter in a virtual environment. SevenNet works fine there, and the test training results look good.

But the problem is LAMMPS-SevenNet.

First, I ran into an MKL library problem during configuration, so I tried two approaches: loading the intel module, or running pip install mkl-include in the virtual environment and then specifying the MKL directory. Configuration succeeded in both cases. For mkl-include, I followed the lammps-nequip installation instructions to include the MKL directory. However, the same crash occurs either way.

Whether I compile LAMMPS with SevenNet in the same venv as SevenNet or in a separate one, the LAMMPS-SevenNet build fails. More precisely, configuration with cmake works, but the failure occurs during the build with make -j 4.

For the virtual environment of LAMMPS-SevenNet, I installed the same pytorch 2.3.0 that I used for SevenNet (pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu118 because we are using cuda 11.8).

The modules I load for LAMMPS-SevenNet are: module load gcc/11.2.0 git cuda11.8 openmpi/4.1.1-gcc-milan-a100 cudnn/8.1.1.33-11.2-gcc-milan-a100 cmake (and intel/2022.3). This OpenMPI build is CUDA-aware.

I can't paste the entire build log here because it is too long, but it contains error messages like:

error: #error C++17 or later compatible compiler is required to use PyTorch.
    4 | #error C++17 or later compatible compiler is required to use PyTorch.
      |  ^~~~~

along with many other messages that reference pair_e3gnn_parallel.h:24, comm_brick.cpp:36, torch/nn.h:5, torch/all.h:16, torch/torch.h:3, and torch/nn/init.h:99:17.

How can I get past this build failure? Could you also share the modules you used for LAMMPS-SevenNet?

YutackPark commented 4 months ago

Sorry for the inconvenience. From the error messages, it seems that LibTorch (which we need for the torch C++ interface) requires C++17. You may try the following:

  1. In {path_to_lammps_source}/cmake/CMakeLists.txt, find the line below (the number could be either 11 or 14, depending on whether the LAMMPS source has been patched with patch_lammps.sh):

     set(CMAKE_CXX_STANDARD 14)

     and change the number to 17:

     set(CMAKE_CXX_STANDARD 17)

     Then start from a new, empty build folder.

  2. Restart with a lower version of PyTorch.

I'm going to check which versions of PyTorch require C++17. If you want to be safe, try 1.12.1, which is the version I used when writing the paper.
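Option 1 above is a one-line change, so it can also be scripted when rebuilding often. The following is a minimal Python sketch, not part of SevenNet or LAMMPS: the helper names raise_cxx_standard and patch_file are hypothetical, and it assumes the standard is set with a literal set(CMAKE_CXX_STANDARD 11) or set(CMAKE_CXX_STANDARD 14) line as described above.

```python
import re
from pathlib import Path

# Matches "set(CMAKE_CXX_STANDARD 11)" or "... 14)", tolerating extra spaces.
# Other values (for example 17 or 20) are deliberately left untouched.
_OLD_STANDARD = re.compile(r"set\(\s*CMAKE_CXX_STANDARD\s+(?:11|14)\s*\)")


def raise_cxx_standard(cmake_text: str) -> str:
    """Return cmake_text with C++11/C++14 standard settings changed to C++17."""
    return _OLD_STANDARD.sub("set(CMAKE_CXX_STANDARD 17)", cmake_text)


def patch_file(path: str) -> bool:
    """Rewrite the file at `path` in place; return True if it was changed."""
    target = Path(path)
    original = target.read_text()
    patched = raise_cxx_standard(original)
    if patched != original:
        target.write_text(patched)
    return patched != original
```

For example, patch_file("{path_to_lammps_source}/cmake/CMakeLists.txt"), with the placeholder replaced by the real path, would rewrite the file in place. Configuration should then be rerun in a new, empty build folder.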

Thanks for the report. Once I confirm which PyTorch versions work, I'll update the docs accordingly.

P.S. You don't need to install torchvision or torchaudio to use SevenNet.

turbosonics commented 4 months ago

> Sorry for the inconvenience. From the error messages, it seems that LibTorch (which we need for the torch C++ interface) requires C++17. You may try the following:

Wait, but the LAMMPS-SevenNet instructions say we need the same version of PyTorch that was used for SevenNet (2.3.0 in my case), not LibTorch. LibTorch is the C++ distribution of PyTorch, and I don't think I used it for SevenNet (or LAMMPS-SevenNet).

In that case, should I install SevenNet with PyTorch but use LibTorch for LAMMPS-SevenNet? Could you clarify this, both here and in the official instructions?

Also, I'm not sure there is a PyTorch 1.12.1 build compatible with CUDA 11.8; at least none is listed at https://pytorch.org/get-started/previous-versions/. Given that SevenNet is based on NequIP, I think SevenNet may have the same limitation as NequIP, but this is not documented in the SevenNet and LAMMPS-SevenNet instructions at all. Please clarify the recommended PyTorch (and/or LibTorch) and CUDA versions for SevenNet and LAMMPS-SevenNet. We only have CUDA 11.8 in our local environment, so if it is like NequIP, we would need to build CUDA 11.2, 11.3, or 11.6 here.

And thanks for the torchvision/torchaudio comment. I think more detailed instructions would be helpful for both SevenNet and LAMMPS-SevenNet.

YutackPark commented 4 months ago

You don't need to worry about LibTorch as it is already included in PyTorch.

I have noticed that the documentation on the official websites is somewhat lacking. However, we can confirm that all the relevant *.h files and *.so files are present in {path_to_torch}/include and {path_to_torch}/lib. SevenNet-LAMMPS uses these files for its compilation, and these are what I referred to as LibTorch in my previous comment.
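To check for these files on a given system, a small script can locate the installed torch package and inspect its include and lib directories. This is only a sketch under assumptions: find_package_root and libtorch_layout are names made up for illustration, and the check simply looks for a torch.h header and *.so files, as described above.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def find_package_root(name: str = "torch") -> Optional[Path]:
    """Return the install directory of a package without importing it, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    # spec.origin points at <package>/__init__.py for a regular package.
    return Path(spec.origin).parent


def libtorch_layout(package_root: Path) -> dict:
    """Summarise the C++ headers and shared libraries under a torch install."""
    include_dir = package_root / "include"
    lib_dir = package_root / "lib"
    has_header = include_dir.is_dir() and any(include_dir.rglob("torch.h"))
    shared_libs = sorted(p.name for p in lib_dir.glob("*.so*")) if lib_dir.is_dir() else []
    return {
        "include_dir": include_dir,
        "lib_dir": lib_dir,
        "has_torch_header": has_header,
        "shared_libs": shared_libs,
    }


if __name__ == "__main__":
    root = find_package_root("torch")
    if root is None:
        print("torch is not installed in this environment")
    else:
        for key, value in libtorch_layout(root).items():
            print(f"{key}: {value}")
```

PyTorch also exposes torch.utils.cmake_prefix_path, which gives the directory that is typically passed to CMake as CMAKE_PREFIX_PATH so that the bundled LibTorch is found.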

Regarding the original issue, I was able to compile successfully using the first solution: setting set(CMAKE_CXX_STANDARD 17) in {path_to_lammps_dir}/cmake/CMakeLists.txt. Although my environment is not exactly the same as yours, the solution is worth trying, since I hit the same error you reported before applying it.

Here are my compilation settings (CUDA=12.1.0 + PyTorch=2.3.1):

(tmp) [parkyutack@odin ~]$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0
(tmp) [parkyutack@odin ~]$ g++ --version
g++ (GCC) 12.1.1 20220507 (Red Hat 12.1.1-1)
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
(tmp) [parkyutack@odin ~]$ python -c 'import torch; print(torch.__version__)'
2.3.1
(tmp) [parkyutack@odin ~]$ cmake --version
cmake version 3.22.2
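When comparing build environments like this, the same information can be gathered with one script. The sketch below is illustrative only: tool_version, torch_version, and build_report are hypothetical helper names, and it assumes nvcc, g++, and cmake are found on PATH.

```python
import shutil
import subprocess
from typing import Optional


def tool_version(command: str, keyword: Optional[str] = None) -> str:
    """Return one line of `<command> --version`, or a note if the tool is missing.

    If `keyword` is given, the first line containing it is preferred;
    otherwise the first line of output is returned.
    """
    if shutil.which(command) is None:
        return "not found"
    try:
        result = subprocess.run(
            [command, "--version"],
            capture_output=True, text=True, check=False, timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return "could not run"
    lines = [ln.strip() for ln in (result.stdout or result.stderr).splitlines() if ln.strip()]
    if not lines:
        return "no output"
    if keyword is not None:
        for line in lines:
            if keyword in line:
                return line
    return lines[0]


def torch_version() -> str:
    """Return the PyTorch version and the CUDA version it was built against."""
    try:
        import torch
    except ImportError:
        return "not installed"
    # torch.version.cuda is None for CPU-only builds.
    return f"{torch.__version__} (CUDA {torch.version.cuda})"


def build_report() -> dict:
    """Collect the versions that matter for a LAMMPS + LibTorch build."""
    return {
        "nvcc": tool_version("nvcc", keyword="release"),
        "g++": tool_version("g++"),
        "cmake": tool_version("cmake"),
        "torch": torch_version(),
    }


if __name__ == "__main__":
    for name, version in build_report().items():
        print(f"{name}: {version}")
```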

> Plus, I'm not sure if there are pytorch 1.12.1 compatible with Cuda 11.8, at least it doesn't exist according to https://pytorch.org/get-started/previous-versions/

You might want to try using torch/1.12.1 with cuda/11.6 from the site. Using a newer version of CUDA is usually not an issue as mentioned here. Nevertheless, I recommend trying the aforementioned solution first to avoid reinstalling torch with a different version.

I agree that the current documentation is incomplete. We plan to update the documents as soon as possible, with the PyPI release.

> Given that SevenNet is based on NequIP, I think SevenNet may have same limitation with NequIP, but those parts are not clearly documented in the SevenNet & LAMMPS-SevenNet instruction at all.

Could you clarify what you mean by "same limitation with NequIP"? SevenNet is 'technically' a different package as I re-wrote the NequIP architecture from scratch. If you are referring to the torch+cuda version combinations, I agree that thoroughly tested versions should be documented to assist users.

turbosonics commented 4 months ago

> Regarding the original issue, I was able to compile successfully using the first solution: configuring set(CMAKE_CXX_STANDARD 17) in {path_to_lammps_dir}/cmake/CMakeLists.txt.

Thanks.

I also compiled LAMMPS-SevenNet the same way (changing set(CMAKE_CXX_STANDARD 11) to set(CMAKE_CXX_STANDARD 17)) with CUDA 11.8 and PyTorch 2.3.0, in the same virtual environment where I built SevenNet. I ran the serial and parallel example simulations, and they worked without errors.

For LAMMPS-SevenNet on our cluster, I installed the MKL headers via pip install mkl-include; in that case, the MKL location has to be passed to cmake with -D MKL_INCLUDE_DIR. This is the same as for NequIP, so I followed their instructions and it worked. I think it would be good to add this to the documentation, just in case.
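As a rough illustration of the -D MKL_INCLUDE_DIR step, the sketch below searches a virtual environment for mkl.h and builds the matching cmake option. It assumes that pip install mkl-include places the headers somewhere under the environment's include directory, which may differ between systems; the helper names find_mkl_include and mkl_cmake_flag are hypothetical.

```python
import sys
from pathlib import Path
from typing import Optional


def find_mkl_include(prefix: str = sys.prefix) -> Optional[Path]:
    """Return the directory under <prefix>/include that contains mkl.h, or None."""
    include_root = Path(prefix) / "include"
    if not include_root.is_dir():
        return None
    # Sort so the result is deterministic if several copies exist.
    for header in sorted(include_root.rglob("mkl.h")):
        return header.parent
    return None


def mkl_cmake_flag(prefix: str = sys.prefix) -> Optional[str]:
    """Return the cmake option for the MKL headers, or None if they are missing."""
    include_dir = find_mkl_include(prefix)
    if include_dir is None:
        return None
    return f"-D MKL_INCLUDE_DIR={include_dir}"


if __name__ == "__main__":
    flag = mkl_cmake_flag()
    print(flag if flag else "mkl.h not found under this environment's include directory")
```

Run inside the activated venv, the script prints an option that can be appended to the cmake configuration command.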

> Could you clarify what you mean by "same limitation with NequIP"?

It's not important, nothing to worry about. It is about the PyTorch version: NequIP currently recommends only 1.11.* or 1.13.*, not 1.12.*, and they note that some users have reported problems with PyTorch 2+. I thought SevenNet might have the same or a similar PyTorch version limit. But PyTorch 2.3.0 with CUDA 11.8 works, so we are good.