sge_qmaster segfault at sge_create_orders

yangliping commented 2 years ago

Hi,

I really appreciate your work maintaining and improving SGE. Would you mind helping me with a segfault in our system?

Here's where it segfaults:

sge_create_orders (or_list=or_list@entry=0x7fffb8024390, type=type@entry=3, job=job@entry=0x7fffb8007e70, ja_task=0x7fffb8009040,
    granted=granted@entry=0x0, update_execd=update_execd@entry=false) at ../libs/sched/sge_orders.c:286
286              lSetPosDouble(tempElem, order_ja_pos->JAT_tix_pos,     lGetPosDouble(ja_task,ja_pos->JAT_tix_pos));

And the full backtrace info:

#0  sge_create_orders (or_list=or_list@entry=0x7fffb8024390, type=type@entry=3, job=job@entry=0x7fffb8007e70, ja_task=0x7fffb8009040,
    granted=granted@entry=0x0, update_execd=update_execd@entry=false) at ../libs/sched/sge_orders.c:286
        tlist = 0x7fffb8025800
        tempElem = 0x7fffb80258e0
        tix2Desc = {{nm = 141150, mt = 2097155, ht = 0x0}, {nm = 141169, mt = 2097154, ht = 0x0}, {nm = 141170, mt = 2097154, ht = 0x0}, {
            nm = 141171, mt = 2097154, ht = 0x0}, {nm = 141172, mt = 2097154, ht = 0x0}, {nm = 141173, mt = 2097154, ht = 0x0}, {nm = 141181,
            mt = 2097154, ht = 0x0}, {nm = 141182, mt = 2097154, ht = 0x0}, {nm = -1, mt = 2097152, ht = 0x0}}
        ja_pos = 0x0
        order_ja_pos = 0x20
        tixDesc = {{nm = 141150, mt = 2097155, ht = 0x0}, {nm = 141169, mt = 2097154, ht = 0x0}, {nm = 141170, mt = 2097154, ht = 0x0}, {nm = 141171,
            mt = 2097154, ht = 0x0}, {nm = 141172, mt = 2097154, ht = 0x0}, {nm = 141173, mt = 2097154, ht = 0x0}, {nm = 141181, mt = 2097154,
            ht = 0x0}, {nm = 141182, mt = 2097154, ht = 0x0}, {nm = 141157, mt = 2097161, ht = 0x0}, {nm = -1, mt = 2097152, ht = 0x0}}
        jobDesc = {{nm = 127, mt = 2097154, ht = 0x0}, {nm = 126, mt = 2097154, ht = 0x0}, {nm = 125, mt = 2097154, ht = 0x0}, {nm = 128,
            mt = 2097154, ht = 0x0}, {nm = 129, mt = 2097154, ht = 0x0}, {nm = 130, mt = 2097154, ht = 0x0}, {nm = 114, mt = 2097161, ht = 0x0}, {
            nm = -1, mt = 2097152, ht = 0x0}}
        job_pos = 0x40
        jlist = 0x7fffb8025730
        order_pos = 0x0
        order_job_pos = 0x5c
        jep = 0x7fffb8024660
        ql = 0x0
        gel = <optimized out>
        ep = 0x7fffb80245c0
        ep2 = <optimized out>
        qslots = <optimized out>
        SGE_FUNC = "sge_create_orders"
#2  0x000000000043a40c in scheduler_method (evc=<optimized out>, answer_list=answer_list@entry=0x7fffcbffeb88, lists=lists@entry=0x7fffcbffecc0,
    order=order@entry=0x7fffcbffeb80) at ../daemons/qmaster/sge_sched_thread.c:288
        orders = {configOrderList = 0x0, pendingOrderList = 0x7fffb8024390, jobStartOrderList = 0x0, sentOrderList = 0x0, numberSendOrders = 2,
          numberSendPackages = 1}
        splitted_job_lists = {0x7fffcbffe720, 0x7fffcbffe708, 0x7fffcbffe728, 0x7fffcbffe710, 0x7fffcbffe6f8, 0x7fffcbffe740, 0x7fffcbffe738,
          0x7fffcbffe700, 0x7fffcbffe730, 0x7fffcbffe718, 0x7fffcbffe748, 0x7fffcbffe750}
        waiting_due_to_pedecessor_list = 0x0
        waiting_due_to_time_list = 0x0
        pending_excluded_list = 0x0
        suspended_list = 0x0
        finished_list = 0x0
        pending_list = 0x7fffb8008d80
        pending_excludedlist = 0x0
        running_list = 0x0
        error_list = 0x0
        hold_list = 0x0
        not_started_list = 0x0
        deferred_list = 0x0
        prof_job_count = 1
        global_mes_count = 0
        job_mes_count = 0
        i = <optimized out>
        SGE_FUNC = "scheduler_method"

You can get the full debug output from sge_debug.txt.

Thanks in advance.
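
A note on reading the locals in frame #0 of the backtrace above: order_pos is 0x0, and ja_pos = 0x0, order_ja_pos = 0x20, job_pos = 0x40 and order_job_pos = 0x5c are far too small to be heap addresses. They look like member offsets taken from a NULL base pointer, which would make line 286 a read through a NULL-derived pointer. The sketch below checks that arithmetic. The struct layout is an assumption, not copied from the SGE source: it uses eight 32-bit position slots per ja_task block and seven per job block (matching the entry counts of tix2Desc and jobDesc), and the helper order_pos_offsets() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, chosen only to match the field counts visible in
 * the backtrace (8 entries in tix2Desc, 7 in jobDesc). */
typedef struct { int32_t pos[8]; } ja_task_pos_t;
typedef struct { int32_t pos[7]; } job_pos_t;

typedef struct {
   ja_task_pos_t ja_task;        /* what ja_pos would point at        */
   ja_task_pos_t order_ja_task;  /* what order_ja_pos would point at  */
   job_pos_t     job;            /* what job_pos would point at       */
   job_pos_t     order_job;      /* what order_job_pos would point at */
} order_pos_t;

/* Fills out[0..3] with the byte offsets of the four members. With a NULL
 * base pointer, the member addresses a debugger prints are exactly these
 * offsets. */
void order_pos_offsets(size_t out[4])
{
   assert(out != NULL);
   out[0] = offsetof(order_pos_t, ja_task);
   out[1] = offsetof(order_pos_t, order_ja_task);
   out[2] = offsetof(order_pos_t, job);
   out[3] = offsetof(order_pos_t, order_job);
}
```

Under this assumed layout the four offsets come out as 0x0, 0x20, 0x40 and 0x5c, the same values gdb printed, which supports the reading that order_pos was never allocated before sge_create_orders used it.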

mperov commented 2 years ago

Hello! What is your operating system? Did you run only sge_qmaster, or did you run sge_execd too?

mperov commented 2 years ago

@yangliping I found zlib-1.2.7-18.el7.x86_64 in your log file, so you are using a Red Hat based OS. Which distribution is it?

yangliping commented 2 years ago

Hi Maksim,

Thank you for your reply. Why do you think it's related to the OS? Here's the OS and kernel information.

# cat /etc/redhat-release
CentOS Linux release 7.8.2003 (Core)

# uname -r
3.10.0-1127.el7.x86_64

I run only sge_qmaster in the pre-production system, but I now run both sge_qmaster and sge_execd in my test environment. It happens in both systems. Everything runs normally until a job is submitted to the scheduler; then sge_qmaster crashes and won't start up anymore. If I remove all job files in the /opt/gridengine/default/spool/qmaster/jobs folder, sge_qmaster can start up again.

mperov commented 2 years ago

@yangliping I asked about the OS because more information can help solve the problem. How did you install SGE? Did you build it from source and then install it on your machine? Could you attach the output of this: qacct -j "*" | grep qname | wc -l?

yangliping commented 2 years ago

Actually, I'm running Son of Grid Engine v8.1.10. My colleague built the RPM using the spec file in the source code and installed it from that RPM.

No job finishes normally and there aren't any accounting files.

$ qacct -j "*" | grep qname | wc -l
/opt/gridengine/default/common/accounting: No such file or directory
0

mperov commented 2 years ago

Okay. What does qhost show?

yangliping commented 2 years ago

Both qhost and qconf work fine. Here's what they show in my test environment.

[test@yt-test-1-1 ~]$ qhost
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
yt-test-1-1             lx-amd64        2    4    4    4  0.01   15.5G    1.1G  512.0M     0.0

[test@yt-test-1-1 ~]$ qconf -ssconf
algorithm                         default
schedule_interval                 0:0:15
maxujobs                          0
queue_sort_method                 load
job_load_adjustments              np_load_avg=0.50
load_adjustment_decay_time        0:7:30
load_formula                      np_load_avg
schedd_job_info                   false
flush_submit_sec                  0
flush_finish_sec                  0
params                            none
reprioritize_interval             0:0:0
halftime                          168
usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
compensation_factor               5.000000
weight_user                       0.250000
weight_project                    0.250000
weight_department                 0.250000
weight_job                        0.250000
weight_tickets_functional         0
weight_tickets_share              0
share_override_tickets            TRUE
share_functional_shares           TRUE
max_functional_jobs_to_schedule   200
report_pjob_tickets               TRUE
max_pending_tasks_per_job         50
halflife_decay_list               none
policy_hierarchy                  OFS
weight_ticket                     0.010000
weight_waiting_time               0.000000
weight_deadline                   3600000.000000
weight_urgency                    0.100000
weight_priority                   1.000000
max_reservation                   0
default_duration                  INFINITY

yangliping commented 2 years ago

I changed the code so that sge_create_cull_order_pos() always allocates an order_pos_t object instead of returning early and leaving the pointer NULL. It looks OK now. Thanks.

diff --git a/source/libs/sgeobj/sge_order.c b/source/libs/sgeobj/sge_order.c
index 6c3f0b3aa..7172bb5af 100644
--- a/source/libs/sgeobj/sge_order.c
+++ b/source/libs/sgeobj/sge_order.c
@@ -94,8 +94,6 @@ sge_create_cull_order_pos(order_pos_t **cull_order_pos, lListElem *jep, lListEle

    if (*cull_order_pos != NULL) {
       sge_free(&cull_order_pos);
-   } else {
-      return;                   /* Fixme: Is that right?  */
    }

    *cull_order_pos = sge_malloc(sizeof(order_pos_t));
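
The effect of the removed else branch can be reproduced in isolation. The following is a minimal sketch, not the SGE code: order_pos_t is a stand-in with a single field, the function names create_pos_before() and create_pos_after() are hypothetical, and plain free()/calloc() replace sge_free()/sge_malloc(). It only illustrates the control flow in the diff above: with the early return, a first call that starts from a NULL pointer never reaches the allocation.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in type for illustration only; the real order_pos_t is larger. */
typedef struct { int tix_pos; } order_pos_t;

/* Control flow before the patch: if *slot is NULL we return early, so a
 * first-time caller keeps its NULL pointer. */
void create_pos_before(order_pos_t **slot)
{
   assert(slot != NULL);
   if (*slot != NULL) {
      free(*slot);
      *slot = NULL;
   } else {
      return;   /* the branch removed by the patch */
   }
   *slot = calloc(1, sizeof(order_pos_t));
}

/* Control flow after the patch: free any old object, then always allocate. */
void create_pos_after(order_pos_t **slot)
{
   assert(slot != NULL);
   if (*slot != NULL) {
      free(*slot);
      *slot = NULL;
   }
   *slot = calloc(1, sizeof(order_pos_t));
}
```

Calling create_pos_before() on a pointer initialized to NULL leaves it NULL, while create_pos_after() returns with an allocated object on both the first and any later call. That matches the symptom in the backtrace, where order_pos was still 0x0 when sge_create_orders dereferenced it.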

mperov commented 2 years ago

@yangliping which version of SGE do you use? Which commit did you build from? In this repo those lines have already been removed: https://github.com/daimh/sge/blob/master/source/libs/sgeobj/sge_order.c#L97

yangliping commented 2 years ago

I use a mirror repository of Son of Grid Engine on gitlab.com.

https://gitlab.com/loveshack/sge/-/blob/master/source/libs/sgeobj/sge_order.c#L97

mperov commented 2 years ago

@yangliping oh, of course, if you ask about SGE here you should use the version from this repo :) I was glad to help! See you!

mperov commented 2 years ago

@daimh could you close this issue? This project doesn't have this problem.

yangliping commented 2 years ago

I'm sorry I forgot to mention that. I should definitely switch to the version from this repo as soon as possible. We just happened to hit this issue in our system, and I needed to fix it rather than switch to another version immediately. We've used that version for a long time, and it worked well until we suddenly ran into this issue. Thanks again for maintaining the project. Best regards.

daimh / sge

sge_qmaster segfault at sge_create_orders #16