Closed eddiewang927 closed 8 months ago
I ran the ge2011.11p1 (OGS) binaries against the SOGE qmaster, and the qmaster crashed.
SOGE binary
{Clinet}/tmp> which qhost
/sge/soge/bin/linux-x64/qhost
OGS binary
{Clinet}/tmp/> /sge/ge2011.11p1/bin/linux-x64/qhost
error: commlib error: got read error (closing "soge/qmaster/1")
error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host "soge": got send error
qmaster message
02/06/2024 13:26:20|listen|soge|W|denied: client (Client/qhost/1) uses old GDI version 6.2u5 while qmaster uses newer version r.
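During an overlap period it can help to scan the qmaster messages file for this kind of denial, to see which hosts still run old client binaries. The following is a minimal sketch, not part of Grid Engine: the function name `parse_denied` is hypothetical, and the pipe-separated field layout (timestamp, thread, host, level, message) is an assumption based only on the single line shown above.

```python
import re

# Pattern for the message part of the denial line shown above.
# Assumption: the client is always written as (host/program/id).
DENIED = re.compile(
    r"denied: client \((?P<host>[^/]+)/(?P<prog>[^/]+)/(?P<id>\d+)\) "
    r"uses old GDI version (?P<version>\S+)"
)


def parse_denied(line):
    """Return details of a GDI version denial, or None if the line is something else."""
    # Assumed layout: timestamp|thread|host|level|message
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    timestamp, _thread, _qmaster_host, _level, message = parts
    match = DENIED.search(message)
    if match is None:
        return None
    return {
        "timestamp": timestamp,
        "client_host": match["host"],
        "program": match["prog"],
        "gdi_version": match["version"],
    }


if __name__ == "__main__":
    sample = (
        "02/06/2024 13:26:20|listen|soge|W|denied: client (Client/qhost/1) "
        "uses old GDI version 6.2u5 while qmaster uses newer version r."
    )
    print(parse_denied(sample))
```

Feeding every line of the messages file through `parse_denied` and collecting the non-`None` results gives a list of hosts and programs that still need to be switched to the SOGE binaries.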
qmaster status
[root@soge qmaster]# systemctl status sgemaster
● sgemaster.service - Grid Engine qmaster
   Loaded: loaded (/etc/systemd/system/sgemaster.service; enabled; vendor preset: disabled)
   Active: failed (Result: signal) since Tue 2024-02-06 13:26:20 CST; 13min ago
  Process: 3971 ExecStart=/sge/soge/default/common/sgemaster (code=exited, status=0/SUCCESS)
 Main PID: 4033 (code=killed, signal=ABRT)

Feb 06 13:18:32 soge systemd[1]: Starting Grid Engine qmaster...
Feb 06 13:18:32 soge sgemaster[3971]: Starting Grid Engine qmaster
Feb 06 13:18:32 soge systemd[1]: Started Grid Engine qmaster.
Feb 06 13:26:20 soge systemd[1]: sgemaster.service: main process exited, code=killed, status=6/ABRT
Feb 06 13:26:20 soge systemd[1]: Unit sgemaster.service entered failed state.
Feb 06 13:26:20 soge systemd[1]: sgemaster.service failed.
core dump
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/nvttool/sge/soge/bin/lx-amd64/sge_qmaster'.
Program terminated with signal 6, Aborted.
#0 0x00007f50d795e387 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
55 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
Missing separate debuginfos, use: debuginfo-install libcom_err-1.42.9-19.el7.x86_64 libselinux-2.5-15.el7.x86_64 libtirpc-0.2.4-0.16.el7.x86_64 openssl-libs-1.0.2k-25.el7_9.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-21.el7_9.x86_64
(gdb) bt
#0 0x00007f50d795e387 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1 0x00007f50d795fa78 in __GI_abort () at abort.c:90
#2 0x00000000004eba53 in unknownType (str=str@entry=0x6170ee "lCompare") at /opt/sge-master/source/libs/cull/cull_multitype.c:123
#3 0x00000000004f5944 in lCompare (ep=ep@entry=0x267c680, cp=<optimized out>) at /opt/sge-master/source/libs/cull/cull_where.c:1388
#4 0x00000000004f56dc in lCompare (ep=ep@entry=0x267c680, cp=<optimized out>) at /opt/sge-master/source/libs/cull/cull_where.c:1459
#5 0x00000000004e6fab in lSelectElemDPack (slp=slp@entry=0x267c680, cp=cp@entry=0x7f50a8003980, dp=dp@entry=0x7f50b0004120, enp=enp@entry=0x7f50a8003c30, isHash=isHash@entry=false, pb=pb@entry=0x0, elements=elements@entry=0x0)
at /opt/sge-master/source/libs/cull/cull_db.c:679
#6 0x00000000004e70dd in lSelectDPack (name=name@entry=0x63baf9 "", slp=slp@entry=0x267f470, cp=cp@entry=0x7f50a8003980, dp=<optimized out>, enp=enp@entry=0x7f50a8003c30, isHash=isHash@entry=false, pb=pb@entry=0x0, elements=elements@entry=0x0)
at /opt/sge-master/source/libs/cull/cull_db.c:921
#7 0x00000000004e771b in lSelectHashPack (name=name@entry=0x63baf9 "", slp=slp@entry=0x267f470, cp=cp@entry=0x7f50a8003980, enp=enp@entry=0x7f50a8003c30, isHash=isHash@entry=false, pb=pb@entry=0x0)
at /opt/sge-master/source/libs/cull/cull_db.c:800
#8 0x000000000042e14b in sge_get_configuration (condition=0x7f50a8003980, enumeration=0x7f50a8003c30) at /opt/sge-master/source/daemons/qmaster/configuration_qmaster.c:726
#9 0x0000000000443dbd in sge_c_gdi_get (ao=ao@entry=0x874d58 <gdi_object+728>, packet=packet@entry=0x7f50a80039f0, task=task@entry=0x7f50a8003930, monitor=0x7f50beffcd80) at /opt/sge-master/source/daemons/qmaster/sge_c_gdi.c:367
#10 0x0000000000446993 in sge_c_gdi (ctx=0x7f50b0006520, packet=0x7f50a80039f0, task=task@entry=0x7f50a8003930, answer_list=answer_list@entry=0x7f50a8003948, monitor=monitor@entry=0x7f50beffcd80)
at /opt/sge-master/source/daemons/qmaster/sge_c_gdi.c:286
#11 0x00000000004a37f1 in sge_worker_main (arg=0x26b74f0) at /opt/sge-master/source/daemons/qmaster/sge_thread_worker.c:304
#12 0x00007f50d7cfdea5 in start_thread (arg=0x7f50beffd700) at pthread_create.c:307
#13 0x00007f50d7a26b0d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
Is there any special reason to mix both SGEs? They won't be binary-code compatible.
Because I am migrating from OGS to SOGE, there will be a period when both are in use. If someone mistakenly runs an OGS binary against SOGE, the SOGE qmaster crashes; the reverse does not happen.
I think the best approach is to schedule a dedicated window for the migration. Making the two SGEs work with each other is challenging.
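If an overlap cannot be avoided, one possible safeguard is a thin wrapper that refuses to run a client binary from the wrong installation. The sketch below only shows the path check; the function name `belongs_to_root` is hypothetical, the paths are the ones quoted earlier in this thread, and it assumes the two installations live in separate directory trees. It compares paths textually and does not resolve symlinks.

```python
from pathlib import PurePosixPath


def belongs_to_root(binary, sge_root):
    """Return True if `binary` lives somewhere underneath `sge_root`.

    Purely textual comparison: symlinks are not resolved.
    """
    return PurePosixPath(sge_root) in PurePosixPath(binary).parents


if __name__ == "__main__":
    soge_root = "/sge/soge"  # SOGE installation root, taken from the paths above
    for path in (
        "/sge/soge/bin/linux-x64/qhost",
        "/sge/ge2011.11p1/bin/linux-x64/qhost",
    ):
        verdict = "ok" if belongs_to_root(path, soge_root) else "wrong installation"
        print(path, "->", verdict)
```

A wrapper built on this check would stop before contacting the qmaster when the resolved `qhost`, `qsub`, etc. is not under the expected root.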
Hi, I have two grids in my HPC environment: Open Grid Scheduler 2011.11p1 (OGS) and Some Grid Engine (SOGE).
QMASTER crash test step by step
Submit host terminal msg
qmaster host
Please fix the qmaster crash issue. Thanks.