daimh / sge

Some Grid Engine/Son of Grid Engine/Sun Grid Engine

qmaster crash when $SGE_ARCH is wrong #31

Closed: eddiewang927 closed this issue 7 months ago

eddiewang927 commented 7 months ago

Hi, I have two grid engines in my HPC cluster: Open Grid Scheduler 2011.11p1 (OGS) and Some Grid Engine (SOGE).

Steps to reproduce the qmaster crash:

  1. When I source the OGS profile first and then source the SOGE profile, $SGE_ARCH ends up as linux-x64 (the OGS value).
  2. Running qrsh or qhost in the SOGE environment then crashes the SOGE qmaster; details are shown below.

[root@soge qmaster]# systemctl status sgemaster
● sgemaster.service - Grid Engine qmaster
   Loaded: loaded (/etc/systemd/system/sgemaster.service; enabled; vendor preset: disabled)
   Active: failed (Result: signal) since Tue 2024-02-06 10:07:46 CST; 1min 52s ago
  Process: 19579 ExecStart=/nvttool/sge/soge/default/common/sgemaster (code=exited, status=0/SUCCESS)
 Main PID: 19642 (code=killed, signal=ABRT)

Feb 06 10:07:19 soge systemd[1]: Starting Grid Engine qmaster...
Feb 06 10:07:19 soge sgemaster[19579]: Starting Grid Engine qmaster
Feb 06 10:07:19 soge systemd[1]: Started Grid Engine qmaster.
Feb 06 10:07:46 soge systemd[1]: sgemaster.service: main process exited, code=killed, status=6/ABRT
Feb 06 10:07:46 soge systemd[1]: Unit sgemaster.service entered failed state.
Feb 06 10:07:46 soge systemd[1]: sgemaster.service failed.

[root@soge qmaster]# ps ax| grep qmaster
19642 ?        Sl     0:00 /nvttool/sge/soge/bin/lx-amd64/sge_qmaster
19684 pts/1    S+     0:00 grep --color=auto qmaster
[root@soge qmaster]# strace -F -f -p 19642

[pid 19660] futex(0xaee470, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 19660] futex(0xaee42c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 122, {tv_sec=1707185267, tv_nsec=655154000}, 0xffffffff <unfinished ...>
[pid 19648] <... poll resumed>)         = 0 (Timeout)
[pid 19648] poll([{fd=3, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}], 2, 1000) = 1 ([{fd=3, revents=POLLIN}])
[pid 19648] accept(3, {sa_family=AF_INET, sin_port=htons(53338), sin_addr=inet_addr("172.26.20.72")}, [16]) = 5
[pid 19648] fcntl(5, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
[pid 19648] setsockopt(5, SOL_TCP, TCP_NODELAY, [1], 4) = 0
[pid 19648] poll([{fd=3, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=5, events=POLLIN|POLLPRI}], 3, 1000) = 1 ([{fd=5, revents=POLLIN}])
[pid 19648] read(5, "<gmsh><dl>180</dl></gm", 22) = 22
[pid 19648] read(5, "s", 1)             = 1
[pid 19648] read(5, "h", 1)             = 1
[pid 19648] read(5, ">", 1)             = 1
[pid 19648] read(5, "<cm version=\"0.4\"><df>bin</df><c"..., 180) = 180
[pid 19648] futex(0xab6f54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0xab6f50, FUTEX_OP_SET<<28|0<<12|FUTEX_OP_CMP_GT<<24|0x1 <unfinished ...>
[pid 19649] <... futex resumed>)        = 0
[pid 19648] <... futex resumed>)        = 1
[pid 19649] futex(0xab6ef0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 19648] futex(0xab6ef0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 19649] <... futex resumed>)        = 0
[pid 19648] <... futex resumed>)        = 1
[pid 19649] futex(0xab6ef0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 19648] poll([{fd=3, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=5, events=POLLIN|POLLPRI}], 3, 1000 <unfinished ...>
[pid 19649] poll([{fd=5, events=POLLOUT}], 1, 5) = 1 ([{fd=5, revents=POLLOUT}])
[pid 19649] write(5, "<gmsh><dl>166</dl></gmsh><crm ve"..., 191) = 191
[pid 19649] futex(0xab6f54, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 69, {tv_sec=1707185267, tv_nsec=825959000}, 0xffffffff <unfinished ...>
[pid 19648] <... poll resumed>)         = 1 ([{fd=5, revents=POLLIN}])
[pid 19648] read(5, "<gmsh><dl>99</dl></gms", 22) = 22
[pid 19648] read(5, "h", 1)             = 1
[pid 19648] read(5, ">", 1)             = 1
[pid 19648] read(5, "<mih version=\"0.1\"><mid>1</mid><"..., 99) = 99
[pid 19648] read(5, "\0\0\0\0\20\2\0\0\0\0\0\1\0\0\0\3\20\0 \370\0\0\0\0\0\0\0\0\0\0\0\1"..., 420) = 420
[pid 19648] futex(0xab62f4, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0xab6290, 150 <unfinished ...>
[pid 19662] <... futex resumed>)        = 0
[pid 19648] <... futex resumed>)        = 2
[pid 19662] futex(0xab6290, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 19648] futex(0xab6290, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 19662] <... futex resumed>)        = -1 EAGAIN (Resource temporarily unavailable)
[pid 19648] <... futex resumed>)        = 1
[pid 19661] <... futex resumed>)        = 0
[pid 19662] futex(0xab6290, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 19661] futex(0xab6290, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 19648] poll([{fd=3, events=POLLIN|POLLPRI}, {fd=5, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}], 3, 1000 <unfinished ...>
[pid 19662] <... futex resumed>)        = 0
[pid 19661] <... futex resumed>)        = -1 EAGAIN (Resource temporarily unavailable)
[pid 19662] write(2, "denied: client (Client/qhost/1) "..., 95) = 95
[pid 19661] futex(0xab6290, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 19662] open("messages", O_WRONLY|O_CREAT|O_APPEND, 0666 <unfinished ...>
[pid 19661] <... futex resumed>)        = 0
[pid 19661] futex(0xab62f4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 151, {tv_sec=1707185267, tv_nsec=830196000}, 0xffffffff <unfinished ...>
[pid 19662] <... open resumed>)         = 6
[pid 19662] write(6, "02/06/2024 10:07:46|listen|rdsog"..., 133) = 133
[pid 19662] close(6)                    = 0
[pid 19662] futex(0xaee42c, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0xaee470, 124) = 2
[pid 19659] <... futex resumed>)        = 0
[pid 19662] futex(0xab62f4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 152, {tv_sec=1707185267, tv_nsec=832364000}, 0xffffffff <unfinished ...>
[pid 19659] futex(0xaee470, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 19660] <... futex resumed>)        = 0
[pid 19659] <... futex resumed>)        = 1
[pid 19660] futex(0xaee470, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 19659] mprotect(0x7f9a98021000, 942080, PROT_READ|PROT_WRITE <unfinished ...>
[pid 19660] <... futex resumed>)        = 0
[pid 19659] <... mprotect resumed>)     = 0
[pid 19660] futex(0xaee42c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 125, {tv_sec=1707185267, tv_nsec=833174000}, 0xffffffff <unfinished ...>
[pid 19659] rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
[pid 19659] tgkill(19642, 19659, SIGABRT) = 0
[pid 19659] --- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=19642, si_uid=0} ---
[pid 19660] <... futex resumed>)        = ?
[pid 19663] <... restart_syscall resumed>) = ?
[pid 19662] <... futex resumed>)        = ?
[pid 19661] <... futex resumed>)        = ?
[pid 19658] <... restart_syscall resumed>) = ?
[pid 19657] <... futex resumed>)        = ?
[pid 19656] <... rt_sigtimedwait resumed> <unfinished ...>) = ?
[pid 19649] <... futex resumed>)        = ?
[pid 19648] <... poll resumed> <unfinished ...>) = ?
[pid 19647] <... futex resumed>)        = ?
[pid 19645] <... futex resumed>)        = ?
[pid 19642] <... futex resumed>)        = ?
[pid 19660] +++ killed by SIGABRT +++
[pid 19663] +++ killed by SIGABRT +++
[pid 19662] +++ killed by SIGABRT +++
[pid 19661] +++ killed by SIGABRT +++
[pid 19658] +++ killed by SIGABRT +++
[pid 19657] +++ killed by SIGABRT +++
[pid 19656] +++ killed by SIGABRT +++
[pid 19649] +++ killed by SIGABRT +++
[pid 19648] +++ killed by SIGABRT +++
[pid 19647] +++ killed by SIGABRT +++
[pid 19645] +++ killed by SIGABRT +++
[pid 19659] +++ killed by SIGABRT +++
+++ killed by SIGABRT +++

  3. After I fix $SGE_ARCH to lx-amd64, qrsh and qhost work fine.

Please fix the qmaster crash issue, thanks.
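As a stopgap on the client side, the check below is a minimal sketch of a guard that refuses to proceed when $SGE_ARCH does not match a binary directory under the SGE root. The function name `check_sge_arch` is hypothetical, and the assumption that a usable architecture always has an executable `bin/<arch>/qhost` comes from the paths shown in this report, not from anything SOGE guarantees.

```shell
#!/bin/sh
# Hypothetical guard, not part of SOGE or OGS.
# $1: SGE root directory, $2: architecture string (e.g. lx-amd64).
# Returns 0 only if an executable qhost exists for that architecture.
check_sge_arch() {
    sge_root=$1
    sge_arch=$2
    if [ -x "$sge_root/bin/$sge_arch/qhost" ]; then
        return 0
    fi
    echo "SGE_ARCH=$sge_arch has no qhost under $sge_root/bin" >&2
    return 1
}

# Usage sketch, e.g. in a wrapper script placed before qhost/qrsh:
#   check_sge_arch "$SGE_ROOT" "$SGE_ARCH" || exit 1
```

This only catches a mismatched $SGE_ARCH in the client environment; it does not stop someone from calling an OGS binary by its full path.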

eddiewang927 commented 7 months ago

I tried using the ge2011.11p1 binaries against the SOGE qmaster, and qmaster crashed.

Client

SOGE binary

{Client}/tmp> which qhost
/sge/soge/bin/linux-x64/qhost

OGS binary

{Client}/tmp/> /sge/ge2011.11p1/bin/linux-x64/qhost
error: commlib error: got read error (closing "soge/qmaster/1")
error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host "soge": got send error

Server

master message

02/06/2024 13:26:20|listen|soge|W|denied: client (Client/qhost/1) uses old GDI version 6.2u5 while qmaster uses newer version r.
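To find which hosts are still running old binaries, the qmaster messages file can be scanned for these "denied" lines. The sketch below assumes the pipe-separated layout seen in the line above (timestamp, thread, host, level, text); the function name `list_denied_clients` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: list clients rejected for using an old GDI version.
# $1: path to the qmaster messages file (pipe-separated fields, as in the
#     log line quoted in this thread).
list_denied_clients() {
    awk -F'|' '
        $5 ~ /^denied: client .* uses old GDI version/ {
            # Field 5 reads: denied: client (Client/qhost/1) uses old GDI version 6.2u5 ...
            split($5, w, " ")
            client = w[3]              # "(Client/qhost/1)"
            gsub(/[()]/, "", client)   # strip the parentheses
            print $1, client, w[8]     # timestamp, client id, client GDI version
        }
    ' "$1"
}

# Usage sketch (the file is named "messages" in the qmaster spool directory,
# per the strace output earlier in this issue):
#   list_denied_clients messages
```

Each output line carries the timestamp, the client identifier, and the GDI version the client reported, which should be enough to track down the misconfigured hosts.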

qmaster status

[root@soge qmaster]# systemctl status sgemaster

● sgemaster.service - Grid Engine qmaster
   Loaded: loaded (/etc/systemd/system/sgemaster.service; enabled; vendor preset: disabled)
   Active: failed (Result: signal) since Tue 2024-02-06 13:26:20 CST; 13min ago
  Process: 3971 ExecStart=/sge/soge/default/common/sgemaster (code=exited, status=0/SUCCESS)
 Main PID: 4033 (code=killed, signal=ABRT)

Feb 06 13:18:32 soge systemd[1]: Starting Grid Engine qmaster...
Feb 06 13:18:32 soge sgemaster[3971]: Starting Grid Engine qmaster
Feb 06 13:18:32 soge systemd[1]: Started Grid Engine qmaster.
Feb 06 13:26:20 soge systemd[1]: sgemaster.service: main process exited, code=killed, status=6/ABRT
Feb 06 13:26:20 soge systemd[1]: Unit sgemaster.service entered failed state.
Feb 06 13:26:20 soge systemd[1]: sgemaster.service failed.

eddiewang927 commented 7 months ago

core dump

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/nvttool/sge/soge/bin/lx-amd64/sge_qmaster'.
Program terminated with signal 6, Aborted.
#0  0x00007f50d795e387 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
55    return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
Missing separate debuginfos, use: debuginfo-install libcom_err-1.42.9-19.el7.x86_64 libselinux-2.5-15.el7.x86_64 libtirpc-0.2.4-0.16.el7.x86_64 openssl-libs-1.0.2k-25.el7_9.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-21.el7_9.x86_64
(gdb) bt
#0  0x00007f50d795e387 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007f50d795fa78 in __GI_abort () at abort.c:90
#2  0x00000000004eba53 in unknownType (str=str@entry=0x6170ee "lCompare") at /opt/sge-master/source/libs/cull/cull_multitype.c:123
#3  0x00000000004f5944 in lCompare (ep=ep@entry=0x267c680, cp=<optimized out>) at /opt/sge-master/source/libs/cull/cull_where.c:1388
#4  0x00000000004f56dc in lCompare (ep=ep@entry=0x267c680, cp=<optimized out>) at /opt/sge-master/source/libs/cull/cull_where.c:1459
#5  0x00000000004e6fab in lSelectElemDPack (slp=slp@entry=0x267c680, cp=cp@entry=0x7f50a8003980, dp=dp@entry=0x7f50b0004120, enp=enp@entry=0x7f50a8003c30, isHash=isHash@entry=false, pb=pb@entry=0x0, elements=elements@entry=0x0)
    at /opt/sge-master/source/libs/cull/cull_db.c:679
#6  0x00000000004e70dd in lSelectDPack (name=name@entry=0x63baf9 "", slp=slp@entry=0x267f470, cp=cp@entry=0x7f50a8003980, dp=<optimized out>, enp=enp@entry=0x7f50a8003c30, isHash=isHash@entry=false, pb=pb@entry=0x0, elements=elements@entry=0x0)
    at /opt/sge-master/source/libs/cull/cull_db.c:921
#7  0x00000000004e771b in lSelectHashPack (name=name@entry=0x63baf9 "", slp=slp@entry=0x267f470, cp=cp@entry=0x7f50a8003980, enp=enp@entry=0x7f50a8003c30, isHash=isHash@entry=false, pb=pb@entry=0x0)
    at /opt/sge-master/source/libs/cull/cull_db.c:800
#8  0x000000000042e14b in sge_get_configuration (condition=0x7f50a8003980, enumeration=0x7f50a8003c30) at /opt/sge-master/source/daemons/qmaster/configuration_qmaster.c:726
#9  0x0000000000443dbd in sge_c_gdi_get (ao=ao@entry=0x874d58 <gdi_object+728>, packet=packet@entry=0x7f50a80039f0, task=task@entry=0x7f50a8003930, monitor=0x7f50beffcd80) at /opt/sge-master/source/daemons/qmaster/sge_c_gdi.c:367
#10 0x0000000000446993 in sge_c_gdi (ctx=0x7f50b0006520, packet=0x7f50a80039f0, task=task@entry=0x7f50a8003930, answer_list=answer_list@entry=0x7f50a8003948, monitor=monitor@entry=0x7f50beffcd80)
    at /opt/sge-master/source/daemons/qmaster/sge_c_gdi.c:286
#11 0x00000000004a37f1 in sge_worker_main (arg=0x26b74f0) at /opt/sge-master/source/daemons/qmaster/sge_thread_worker.c:304
#12 0x00007f50d7cfdea5 in start_thread (arg=0x7f50beffd700) at pthread_create.c:307
#13 0x00007f50d7a26b0d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

daimh commented 7 months ago

Is there any special reason to mix both SGEs? They won't be binary-code compatible.

eddiewang927 commented 7 months ago

> Is there any special reason to mix both SGEs? They won't be binary-code compatible.

Because I want to migrate from OGS to SOGE, there will be a period when both are in use. If someone mistakenly runs an OGS binary against SOGE, the SOGE qmaster crashes, whereas the opposite does not happen.
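During the overlap, one way to reduce accidental mixing is to clean the old installation out of the shell environment before sourcing the new profile. The following is a rough sketch, not something either project ships: `strip_path_prefix` is a hypothetical helper, and the paths in the usage comment are the ones that appear in this thread.

```shell
#!/bin/sh
# Hypothetical helper: drop every entry of a colon-separated list that is
# the given directory or lives underneath it.
# $1: colon-separated list (e.g. "$PATH"), $2: directory prefix to remove.
strip_path_prefix() {
    result=
    old_ifs=$IFS
    IFS=:
    for entry in $1; do
        case $entry in
            "$2"|"$2"/*) ;;                          # entry belongs to the old install: skip
            *) result=${result:+$result:}$entry ;;   # keep everything else, in order
        esac
    done
    IFS=$old_ifs
    printf '%s\n' "$result"
}

# Usage sketch (paths are assumptions based on this report):
#   PATH=$(strip_path_prefix "$PATH" /sge/ge2011.11p1)
#   unset SGE_ARCH SGE_ROOT SGE_CELL
#   . /sge/soge/default/common/settings.sh
```

This keeps a stale OGS $SGE_ARCH or PATH entry from leaking into a SOGE session, but it cannot protect the qmaster from a client that invokes the old binaries directly.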

daimh commented 7 months ago

I think the best way is to have a dedicated time for the migration. It is challenging to make both SGEs work with each other.