charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.

Running NAMD built with gcc on Rocky 8 against Charm++ versions >= v7.0.0 built on CentOS 7.x causes seg fault #3680

Open davidhardy opened 1 year ago

davidhardy commented 1 year ago

I can successfully run NAMD built with gcc on Rocky 8 against Charm++ versions <= 6.10.2 built on CentOS 7.x or 6.x. However, with Charm++ versions >= 7.0.0 NAMD segfaults at startup. The crash appears to be in LBDatabase. Here is the gdb output from run followed by bt:

dhardy@athine> gdb ./namd2
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-18.el8
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./namd2...done.
(gdb) run
Starting program: /Projects/dhardy/namd.master/Linux-x86_64-g++/namd2 
Enabling Intel C/C++ 19.0.0.117 20180804 for CentOS 8+
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
Charm++> No provisioning arguments specified. Running with a single PE.
         Use +auto-provision to fully subscribe resources or +p1 to silence this message.
Charm++: standalone mode (not using charmrun)
Charm++> Running in Multicore mode: 1 threads (PEs)
Charm++> Using recursive bisection (scheme 3) for topology aware partitions
Converse/Charm++ Commit ID: v7.1.0-devel-189-g0504518
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 hosts (1 sockets x 32 cores x 2 PUs = 64-way SMP)
Charm++> cpu topology info is gathered in 0.006 seconds.
CharmLB> Load balancing instrumentation for communication is off.
Info: NAMD 2.15alpha2 for Linux-x86_64-multicore
Info: 
Info: Please visit http://www.ks.uiuc.edu/Research/namd/
Info: for updates, documentation, and support information.
Info: 
Info: Please cite Phillips et al., J. Chem. Phys. 153:044130 (2020) doi:10.1063/5.0014475
Info: in all publications reporting results obtained with NAMD.
Info: 
Info: Based on Charm++/Converse 70000 for multicore-linux-x86_64
Info: Built Fri Dec 16 12:53:40 CST 2022 by dhardy on athine.ks.uiuc.edu
Info: 1 NAMD  2.15alpha2  Linux-x86_64-multicore  1    athine.ks.uiuc.edu  dhardy
Info: Running on 1 processors, 1 nodes, 1 physical nodes.
Info: CPU topology information available.
Info: Charm++/Converse parallel runtime startup completed at 0.127844 s
CkLoopLib is used in SMP with simple dynamic scheduling (converse-level notification)

Program received signal SIGSEGV, Segmentation fault.
0x0000000000e277cf in LBDatabase::RegisterOM(LDOMid, void*, LDCallbacks) ()
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-189.5.el8_6.x86_64 keyutils-libs-1.5.10-9.el8.x86_64 krb5-libs-1.18.2-14.el8.x86_64 libcom_err-1.45.6-4.el8.x86_64 libgcc-8.5.0-10.1.el8_6.x86_64 libnsl2-1.2.0-2.20180605git4a062cf.el8.x86_64 libselinux-2.9-5.el8.x86_64 libstdc++-8.5.0-10.1.el8_6.x86_64 libtirpc-1.1.4-6.el8.x86_64 nss_nis-3.0-8.el8.x86_64 openssl-libs-1.1.1k-7.el8_6.x86_64 pcre2-10.32-3.el8_6.x86_64 zlib-1.2.11-18.el8_5.x86_64
(gdb) bt
#0  0x0000000000e277cf in LBDatabase::RegisterOM(LDOMid, void*, LDCallbacks) ()
#1  0x0000000000a92290 in LBManager::RegisterOM (this=<optimized out>, cb=..., 
    userptr=0x0, userID=...)
    at /Projects/dhardy/test_namd/test_charm_builds/charm-main/multicore-linux-x86_64/include/LBManager.h:266
#2  LdbCoordinator::LdbCoordinator (this=0x166c940, __in_chrg=<optimized out>, 
    __vtt_parm=<optimized out>) at src/LdbCoordinator.C:157
#3  0x0000000000a9241c in CkIndex_LdbCoordinator::_call_LdbCoordinator_void (
    impl_msg=0x1550ee0, impl_obj_void=<optimized out>)
    at /Projects/dhardy/test_namd/test_charm_builds/charm-main/multicore-linux-x86_64/include/charm++.h:259
#4  0x0000000000daf09c in CkDeliverMessageFree ()
#5  0x0000000000daf51c in CkCreateLocalGroup ()
#6  0x0000000000daf6db in _createGroup(_ckGroupID, envelope*) ()
#7  0x0000000000daf748 in CkCreateGroup ()
#8  0x0000000000a9590d in CProxy_LdbCoordinator::ckNew (
    impl_e_opts=<optimized out>) at inc/LdbCoordinator.decl.h:140
#9  0x00000000006c624f in master_init (argc=<optimized out>, 
    argv=<optimized out>) at src/BackEnd.C:225
#10 0x000000000064a8af in main (argc=<optimized out>, argv=0x7fffffffd3b8)
    at src/mainfunc.C:49

A secondary issue is that I was unable to build NAMD against Charm++ 7.0.0 built using "--enable-error-checking". Attempting this gave me the following link error:

/Projects/dhardy/test_namd/test_charm_builds/charm-7_0_0/multicore-linux-x86_64/lib/libck.a(debug-charm.o): In function `hostInfo(void*, void*, CpdListItemsRequest*)':
debug-charm.C:(.text+0x59c): undefined reference to `get_myaddress'
collect2

rbuch commented 1 year ago

I'm not sure whether this is what you're doing, but if you're trying to combine parts built with Charm++ < 7.0.0 and parts built with Charm++ >= 7.0.0, that is expected not to work. The load balancing (LB) framework changed between the 6.x and 7.x versions in a way that makes them fundamentally incompatible.

Is that what you're attempting to do here? If not, can you elucidate a bit?
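
One quick way to check for a 6.x/7.x mix is to compare the two version strings in the startup banner. The following is a minimal Python sketch, not an official tool. It assumes that the "Converse/Charm++ Commit ID" line comes from the linked Charm++ runtime, that the "Based on Charm++/Converse" line reflects the Charm++ headers NAMD was compiled against, and that the integer on that line is encoded as major*10000 + minor*100 + patch (so 70000 would be 7.0.0 and 61002 would be 6.10.2). The function names are hypothetical.

```python
import re

# Patterns for the two banner lines seen in the gdb session above.
COMMIT_RE = re.compile(r"Converse/Charm\+\+ Commit ID: v(\d+)\.(\d+)\.(\d+)")
BASED_ON_RE = re.compile(r"Based on Charm\+\+/Converse (\d+)")


def decode_charm_version(number):
    """Split an integer such as 70000 into (major, minor, patch).

    Assumption: the encoding is major*10000 + minor*100 + patch.
    """
    return (number // 10000, (number // 100) % 100, number % 100)


def versions_from_log(log_text):
    """Return (runtime, headers) as version tuples, or None where a line is missing."""
    commit = COMMIT_RE.search(log_text)
    based = BASED_ON_RE.search(log_text)
    runtime = tuple(int(g) for g in commit.groups()) if commit else None
    headers = decode_charm_version(int(based.group(1))) if based else None
    return runtime, headers


def straddles_lb_break(runtime, headers):
    """True when one version is older than 7.0.0 and the other is 7.0.0 or newer."""
    return (runtime[0] >= 7) != (headers[0] >= 7)
```

Feeding the banner from the gdb session into versions_from_log and then straddles_lb_break gives a first hint about whether headers and runtime sit on opposite sides of the 7.0.0 boundary. It cannot detect stale object files or libraries left over from an earlier build.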