TinkerTools / tinker

Tinker: Software Tools for Molecular Design
https://dasher.wustl.edu/tinker/

intel compilers and SIGSEGV, segmentation fault #52

Closed mandar5335 closed 4 years ago

mandar5335 commented 4 years ago

Hi, I have installed Tinker-OpenMM using CUDA 9.2, the Intel compilers, and the recent Tinker 8.7 + Tinker-OpenMM (from GitHub). The installation followed Lee-Ping's instructions and completed without errors.

But whenever I run a job, I receive the following error:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
longjmp causes uninitialized stack frame : /home/m/mandar/pfs/softwares/new_tinker_openmm_gpu/gcc/tinker/bin/dynamic_omm terminated

Is there any variable I should set to avoid this error? Thanks in advance.

Regards, Mandar Kulkarni

leucinw commented 4 years ago

forrtl: severe (174): SIGSEGV, segmentation fault occurred
longjmp causes uninitialized stack frame : /home/m/mandar/pfs/softwares/new_tinker_openmm_gpu/gcc/tinker/bin/dynamic_omm terminated

Any other message below these lines? For example, traceback info?

mandar5335 commented 4 years ago

Hi, thanks for the reply. Please see the detailed error below. This is everything that was printed as output:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
longjmp causes uninitialized stack frame : /home/m/mandar/pfs/softwares/new_tinker_openmm_gpu/gcc/tinker/bin/dynamic_omm terminated
forrtl: severe (174): SIGSEGV, segmentation fault occurred
======= Backtrace: =========
forrtl: severe (174): SIGSEGV, segmentation fault occurred
srun: error: b-cn1105: task 0: Exited with exit code 174

leucinw commented 4 years ago

Thanks. I have never compiled OpenMM with Tinker 8.7. I will try it and see whether I get the same error. BTW, you said

have installed Tinker-OpenMM using CUDA 9.2, the Intel compilers, and the recent Tinker 8.7 + Tinker-OpenMM

But your path suggests that you may be using gcc:

/home/m/mandar/pfs/softwares/new_tinker_openmm_gpu/gcc/tinker/bin/dynamic_omm

Is there any mismatch in your compilation?

mandar5335 commented 4 years ago

Hi, that path name is a misnomer; sorry for the confusion. I initially planned to install with gcc but then installed using icc. This installation is on a cluster. I purged all modules first and then loaded the Intel compiler module, so I am sure this installation uses icc/ifort.
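One way to double-check which Fortran runtime an executable was actually linked against is to inspect its shared-library dependencies. The following is a minimal Python sketch, not part of Tinker; the function names are hypothetical, and it assumes a dynamically linked Linux binary (a statically linked runtime will not show up in `ldd` output). It relies on the Intel Fortran runtime appearing as `libifcore`/`libifport` and the GNU one as `libgfortran`.

```python
import subprocess

# Library name fragments that identify each compiler family's Fortran runtime.
RUNTIME_MARKERS = {
    "intel": ("libifcore", "libifport"),
    "gnu": ("libgfortran",),
}


def classify_fortran_runtime(ldd_text: str) -> set:
    """Return the compiler families whose runtime libraries appear in ldd output."""
    found = set()
    for family, markers in RUNTIME_MARKERS.items():
        if any(marker in ldd_text for marker in markers):
            found.add(family)
    return found


def ldd_output(binary_path: str) -> str:
    """Run `ldd` on a binary and return its text output (Linux only)."""
    result = subprocess.run(
        ["ldd", binary_path], capture_output=True, text=True, check=False
    )
    return result.stdout
```

For example, `classify_fortran_runtime(ldd_output("/path/to/dynamic_omm"))` returning both families at once would hint at a mixed build.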

jayponder commented 4 years ago

We use the version of Tinker 8.7 (and thus the Tinker-OpenMM interface code in openmm/ommstuf.cpp) currently on GitHub, the version of OpenMM currently on GitHub as Tinker-OpenMM, CUDA 9.2 and the gcc/gfortran compilers. This combination works for us on both Linux and on MacOS.

mandar5335 commented 4 years ago

@jayponder thanks for the suggestion. I will install Tinker/OpenMM with gcc/gfortran and check whether the error persists. I am using the Tinker-OpenMM available on GitHub. However, the GitHub version of Tinker 8.7 does not contain the "fftw" folder, so I downloaded Tinker-8.7.1 from https://dasher.wustl.edu/tinker/ Do these versions differ? If so, is it okay to copy the "fftw" folder from Tinker-8.7.1 into the GitHub version of Tinker 8.7?

Thanks, Mandar Kulkarni

mandar5335 commented 4 years ago

Hi everyone, I have tried to install Tinker with the GNU compilers (gcc and gfortran version 6.40, CUDA 9.1). First, I compiled fftw, which was successful. Then I hit an error during the "make" command.

TINKERDIR is correctly set. TINKERDIR =/home/m/mandar/pfs/softwares/gcccuda_tinker_openmm_gpu/source/tinker

My Makefile options are:
F77 = gfortran
F77FLAGS = -c
OPTFLAGS = -Ofast -msse3 -fopenmp
LIBDIR = -L. -L$(TINKER_LIBDIR)/linux
LIBS =
LIBFLAGS = -crusv
RANLIB = ranlib
LINKFLAGS = $(OPTFLAGS) -static-libgcc
RENAME = rename_bin

error:
/root/gcc-4.9.2/src/gcc-4.9.2/libgfortran/runtime/main.c:175: error: undefined reference to 'secure_getenv'
/root/gcc-4.9.2/src/gcc-4.9.2/libgfortran/io/unix.c:1208: error: undefined reference to '__secure_getenv'
collect2: error: ld returned 1 exit status
strip: 'crystal.x': No such file
Makefile:792: recipe for target 'crystal.x' failed
make: [crystal.x] Error 1
make: Waiting for unfinished jobs....
/root/gcc-4.9.2/src/gcc-4.9.2/libgfortran/runtime/main.c:175: error: undefined reference to 'secure_getenv'
/root/gcc-4.9.2/src/gcc-4.9.2/libgfortran/io/unix.c:1208: error: undefined reference to '__secure_getenv'
collect2: error: ld returned 1 exit status
strip: 'document.x': No such file
Makefile:792: recipe for target 'document.x' failed
make: *** [document.x] Error 1

Any suggestions will be really helpful. Thanks in advance.

mandar5335 commented 4 years ago

Please ignore the above error. I have compiled the Tinker-OpenMM combination successfully.

First, I cloned Tinker from TinkerTools, then copied the "fftw" folder from Tinker-8.7.1.tar.gz into that clone and followed Lee-Ping Wang's instructions for the GCC compiler.

The "fftw" folder is missing from the GitHub repository. Would it be possible to add it? That would avoid confusion for future users.

However, I am facing a new error after job submission as follows:

Default OpenMM Plugin Directory : /home/m/mandar/pfs/softwares/gcccuda_tinker_openmm_gpu/tinkeropenmm_exec/plugins

terminate called after throwing an instance of 'OpenMM::OpenMMException' what(): There is no registered Platform called "CUDA"

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

0 0x14a53b2b84af in ???
1 0x14a53b2b8428 in ???
2 0x14a53b2ba029 in ???
3 0x14a53df8fd9c in _ZN9__gnu_cxx27__verbose_terminate_handlerEv
    at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95
4 0x14a53df8dd65 in _ZN10__cxxabiv111__terminateEPFvvE
    at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:47
5 0x14a53df8ddb0 in _ZSt9terminatev
    at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:57
6 0x14a53df8dfc7 in __cxa_throw
    at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:87
7 0x14a53daaa8a1 in ???
8 0x14a53dbcb9fd in ???
9 0x409f32 in ???
10 0x412d09 in ???
11 0x408f56 in ???
12 0x40842c in ???
13 0x14a53b2a382f in ???
14 0x4084b8 in ???
15 0xffffffffffffffff in ???

/var/spool/slurmd/job7742669/slurm_script: line 37: 148973 Aborted (core dumped) /home/m/mandar/pfs/softwares/gcccuda_tinker_openmm_gpu/source/Tinker/bin/dynamic_omm test_rUU 100 2.0 0.5 2 300.0 5 > dimer.log

Thanks again, Mandar Kulkarni

pren commented 4 years ago

Please try the compile/library/link.make files in https://github.com/TinkerTools/Tinker/tree/release/linux/gfortran

hongxiahao91 commented 4 years ago

I faced the same problem when I compiled the code with the Intel compilers! Is there any clue or solution to this?

jayponder commented 4 years ago

FFTW 3.3.8 has now been added to the Tinker distribution on GitHub. This is not our code, it is a very commonly used Fourier transform package from MIT. But we believe/hope that it is OK to directly package it with Tinker. See the 0README file in the top-level /fftw directory for Tinker specific instructions for building the FFTW libraries needed by Tinker.

jayponder commented 4 years ago

I am still unsure what the problem mandar5335 is reporting above is due to. He says he is using GNU gcc/gfortran 6.40, but from the error messages it seems 4.9.2 is really being used. Since the error appears to be coming from the GNU gcc installation itself, I suspect it could be from some issue with the gcc/gfortran setup.
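The version mismatch noted above can be read straight from the linker messages, since the libgfortran source paths in them contain a GCC version. A minimal Python sketch of that check follows; the function names are hypothetical, and it assumes the embedded path follows the `gcc-X.Y.Z/libgfortran` pattern seen in this thread. A hit only suggests which libgfortran build was picked up at link time; it does not by itself explain why.

```python
import re

# Matches paths such as /root/gcc-4.9.2/src/gcc-4.9.2/libgfortran/runtime/main.c
GCC_PATH_VERSION = re.compile(r"gcc-(\d+\.\d+\.\d+)/libgfortran")


def libgfortran_versions(linker_output: str) -> set:
    """Collect GCC versions embedded in libgfortran source paths."""
    return set(GCC_PATH_VERSION.findall(linker_output))


def unexpected_versions(linker_output: str, expected: str) -> set:
    """Return embedded versions that differ from the compiler you meant to use."""
    return libgfortran_versions(linker_output) - {expected}
```

Applied to the error text in this thread with an expected version of 7.3.0, `unexpected_versions` reports 4.9.2, which points toward an older libgfortran being found on the library search path.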

In a later comment, mandar5335 reports an error of "There is no registered Platform called CUDA" at runtime. This is almost certainly due to the fact that CUDA is not installed correctly on the machine, or (more likely) that the machine is not recognizing the GPU card itself. Please first check that your machine sees the GPU.
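Both possibilities can be checked quickly before rerunning a job. The sketch below is a hedged Python illustration with hypothetical function names: it lists GPUs via `nvidia-smi -L` and looks for CUDA-named files in an OpenMM plugin directory. It assumes `nvidia-smi` is on the PATH when a driver is installed and that the CUDA plugin's file name contains "CUDA" (for example `libOpenMMCUDA.so`).

```python
import os
import shutil
import subprocess


def visible_gpus() -> list:
    """Return the GPU lines printed by `nvidia-smi -L`, or [] if none are seen."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        return []
    return [line for line in result.stdout.splitlines() if line.startswith("GPU ")]


def cuda_plugin_files(plugin_dir: str) -> list:
    """List files in plugin_dir whose names suggest an OpenMM CUDA plugin."""
    if not os.path.isdir(plugin_dir):
        return []
    return sorted(name for name in os.listdir(plugin_dir) if "CUDA" in name)
```

An empty result from `visible_gpus()` on the compute node suggests a driver or hardware visibility problem, while an empty result from `cuda_plugin_files()` on the plugin directory printed at startup suggests OpenMM was built without its CUDA platform.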

jayponder commented 4 years ago

Recently, hongxiahao91 reports the "same problem" when using the Intel compilers. Which problem? (as there are several different problems in this thread...) Please provide more details of exactly the problem you are having.

Also, note that we do not recommend using the Intel compilers for building Tinker-OpenMM. While the Intel compilers do produce faster Tinker executables for CPUs (due mostly to a better implementation of OpenMP parallelization), there is no advantage for Tinker-OpenMM. Since all the intensive calculation is done on the GPU, and is really done by OpenMM and hence under CUDA, using the Intel compiler will not produce faster Tinker-OpenMM executables. And it may (?) be the case that if you build Tinker-OpenMM with Intel compilers, you will also have to build OpenMM itself with the same compilers. I would recommend that you just use a recent version of GNU gcc/gfortran for everything.

mandar5335 commented 4 years ago

@jayponder Professor Ponder, thanks a lot for your comments and suggestions on the problem.

I have tried installing the Tinker 8.7.2 and OpenMM combination again. The same error still persists, and I have raised an issue with our HPC management to make sure it is not a gcc/gfortran setup issue. I am waiting for their response.

Below are my observations from re-installing Tinker-OpenMM.

  1. CUDA 9.2 with gcc 7.3.0 is available on our cluster. I loaded these modules and then at the final stage, I am facing the same error again:

"/root/gcc-4.9.2/src/gcc-4.9.2/libgfortran/runtime/main.c:175: error: undefined reference to '__secure_getenv'"

  2. Next, I tried the CUDA-10.1.243 + GCC/8.3.0 module on the cluster. I experienced the above error in the first step while installing Tinker-8.7.2 itself.

I will update once I receive any response from cluster management team.

Thanks again, Mandar Kulkarni

mandar5335 commented 4 years ago

Hello, I have successfully installed the Tinker-OpenMM combination with help from HPC management. It was not clear what caused problems earlier, but now I have a working executable dynamic_omm.

But I am facing another issue. I am benchmarking the DHFR system right now. When I run on 2 nodes (28 procs per node, K80 GPUs, 2 GPUs on each node), I get a speed of 1.0417 ns/day.

Performance:  ns/day               1.0417
               Wall Time            8.2940
               Steps                   100
               Updates                   1
               Time Step            1.0000
               Atoms                 23558
               Threads                  56

and when I try on 224 processors, the speed is essentially the same.

Performance:  ns/day               1.0733
               Wall Time            8.0500
               Steps                   100
               Updates                   1
               Time Step            1.0000
               Atoms                 23558
               Threads                 224
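The reported ns/day values follow directly from the step count, time step, and wall time. A small Python sketch of the arithmetic, assuming the Time Step is in femtoseconds and the Wall Time in seconds (the units are not printed in the output, but these assumptions reproduce both figures):

```python
SECONDS_PER_DAY = 86_400
FS_PER_NS = 1_000_000


def ns_per_day(steps: int, time_step_fs: float, wall_time_s: float) -> float:
    """Simulated nanoseconds per day of wall-clock time."""
    simulated_ns = steps * time_step_fs / FS_PER_NS
    return simulated_ns * SECONDS_PER_DAY / wall_time_s


# The two runs reported above: 56 threads and 224 threads.
print(round(ns_per_day(100, 1.0, 8.2940), 4))
print(round(ns_per_day(100, 1.0, 8.0500), 4))
```

Both runs simulate the same 0.1 ps, so the only difference is the roughly 0.24 s change in wall time.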

I have added the line export CUDA_VISIBLE_DEVICES=0,1,2,3 to the job script. Could you please suggest how I can improve the simulation speed when using multiple GPU nodes?

Thanks again, Mandar Kulkarni

swails commented 4 years ago

Last I knew, the Amoeba kernels in OpenMM didn't support parallel execution, which means this behavior is to be expected.

Basically one GPU does all the work while the others wait for it to finish. If you want to maximize throughput on multiple GPUs, run a separate simulation for each GPU. The aggregate sampling will be maximally increased with that approach.
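The one-simulation-per-GPU approach can be sketched in Python as follows. This is only an illustration with hypothetical helper names: it restricts each process to a single device through `CUDA_VISIBLE_DEVICES` and assumes each command refers to its own input files so that the runs do not overwrite each other's output.

```python
import os
import subprocess


def per_gpu_jobs(commands, gpu_ids, base_env=None):
    """Pair each command with an environment that exposes exactly one GPU."""
    env0 = dict(os.environ if base_env is None else base_env)
    jobs = []
    for command, gpu in zip(commands, gpu_ids):
        env = dict(env0)
        env["CUDA_VISIBLE_DEVICES"] = str(gpu)  # this process sees only this GPU
        jobs.append((list(command), env))
    return jobs


def launch(jobs, log_prefix="run"):
    """Start all jobs concurrently, wait for them, and return their exit codes."""
    running = []
    for index, (command, env) in enumerate(jobs):
        log = open(f"{log_prefix}_{index}.log", "w")
        proc = subprocess.Popen(command, env=env, stdout=log, stderr=subprocess.STDOUT)
        running.append((proc, log))
    codes = []
    for proc, log in running:
        codes.append(proc.wait())
        log.close()
    return codes
```

For instance, two copies of the `dynamic_omm` command line shown earlier, each with a different (hypothetical) input base name, could be passed as `commands` with `gpu_ids=[0, 1]` on a two-GPU node.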

mandar5335 commented 4 years ago

@swails Thanks for your reply. So that means I can just run on a single node with a GPU.