Closed: yangliping closed this issue 2 years ago
Hello! What is your operating system? Did you run only sge_qmaster, or did you run sge_execd too?
@yangliping I found zlib-1.2.7-18.el7.x86_64 in your logfile, so you are using a Red Hat-based OS. Which distribution is it?
Hi Maksim,
Thank you for your reply. Why do you think it's related to the OS? Here's the OS and kernel information.
# cat /etc/redhat-release
CentOS Linux release 7.8.2003 (Core)
# uname -r
3.10.0-1127.el7.x86_64
I run only sge_qmaster in the pre-production system, but I now run both sge_qmaster and sge_execd in my test environment. It happens in both systems. Everything runs normally until a job is submitted to the scheduler; then sge_qmaster crashes and won't start up anymore. If I remove all job files in the /opt/gridengine/default/spool/qmaster/jobs folder, sge_qmaster can start up again.
@yangliping I asked about the OS because more information can help solve the problem.
How did you install SGE?
Did you build it from source and then install it on your machine?
Could you attach the output of this: qacct -j "*" | grep qname | wc -l
Actually, I'm running Son of Grid Engine v8.1.10. My colleague built the RPM using the spec file in the source code and installed it from that RPM.
No job has finished normally, so there is no accounting file.
$ qacct -j "*" | grep qname | wc -l
/opt/gridengine/default/common/accounting: No such file or directory
0
Okay.
What does qhost show?
Both qhost and qconf look good. Here's what they show in my test environment.
[test@yt-test-1-1 ~]$ qhost
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
yt-test-1-1             lx-amd64        2    4    4    4  0.01   15.5G    1.1G  512.0M     0.0
[test@yt-test-1-1 ~]$ qconf -ssconf
algorithm default
schedule_interval 0:0:15
maxujobs 0
queue_sort_method load
job_load_adjustments np_load_avg=0.50
load_adjustment_decay_time 0:7:30
load_formula np_load_avg
schedd_job_info false
flush_submit_sec 0
flush_finish_sec 0
params none
reprioritize_interval 0:0:0
halftime 168
usage_weight_list cpu=1.000000,mem=0.000000,io=0.000000
compensation_factor 5.000000
weight_user 0.250000
weight_project 0.250000
weight_department 0.250000
weight_job 0.250000
weight_tickets_functional 0
weight_tickets_share 0
share_override_tickets TRUE
share_functional_shares TRUE
max_functional_jobs_to_schedule 200
report_pjob_tickets TRUE
max_pending_tasks_per_job 50
halflife_decay_list none
policy_hierarchy OFS
weight_ticket 0.010000
weight_waiting_time 0.000000
weight_deadline 3600000.000000
weight_urgency 0.100000
weight_priority 1.000000
max_reservation 0
default_duration INFINITY
I changed the code so that sge_create_cull_order_pos() always produces an order_pos_t object instead of returning early and leaving it NULL. It looks OK now. Thanks.
diff --git a/source/libs/sgeobj/sge_order.c b/source/libs/sgeobj/sge_order.c
index 6c3f0b3aa..7172bb5af 100644
--- a/source/libs/sgeobj/sge_order.c
+++ b/source/libs/sgeobj/sge_order.c
@@ -94,8 +94,6 @@ sge_create_cull_order_pos(order_pos_t **cull_order_pos, lListElem *jep, lListEle
    if (*cull_order_pos != NULL) {
       sge_free(&cull_order_pos);
-   } else {
-      return; /* Fixme: Is that right? */
    }
    *cull_order_pos = sge_malloc(sizeof(order_pos_t));
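For anyone reading along, here is a minimal standalone sketch of why that removed else branch matters. It is not the SGE code: demo_pos_t, create_pos_buggy(), create_pos_fixed(), and read_job_pos() are hypothetical stand-ins, and plain malloc/free replace sge_malloc/sge_free. The assumption is that the caller passes in a NULL pointer the first time and reads the result afterwards.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for order_pos_t; the real struct has other fields. */
typedef struct {
    int job_pos;
} demo_pos_t;

/* Mirrors the old control flow: when *pos is NULL the function returns
 * early, so nothing is ever allocated and the caller keeps a NULL pointer. */
static void create_pos_buggy(demo_pos_t **pos)
{
    if (*pos != NULL) {
        free(*pos);
        *pos = NULL;
    } else {
        return;
    }
    *pos = malloc(sizeof(demo_pos_t));
    if (*pos != NULL) {
        (*pos)->job_pos = 0;
    }
}

/* Mirrors the patched control flow: free any old object, then always
 * allocate a fresh one. */
static void create_pos_fixed(demo_pos_t **pos)
{
    if (*pos != NULL) {
        free(*pos);
        *pos = NULL;
    }
    *pos = malloc(sizeof(demo_pos_t));
    if (*pos != NULL) {
        (*pos)->job_pos = 0;
    }
}

/* Caller-side read. Without the assert, a NULL pos would be dereferenced
 * here, which is the kind of access that ends in a segfault. */
static int read_job_pos(const demo_pos_t *pos)
{
    assert(pos != NULL);
    return pos->job_pos;
}
```

Starting from a NULL pointer, create_pos_buggy() leaves it NULL, so any later read such as read_job_pos() would fail, while create_pos_fixed() hands back a usable object on the first and on every later call.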
@yangliping which version of SGE do you use? Which commit did you compile from? In this repo these lines have already been removed: https://github.com/daimh/sge/blob/master/source/libs/sgeobj/sge_order.c#L97
I use a mirror repository of Son of Grid Engine on gitlab.com.
https://gitlab.com/loveshack/sge/-/blob/master/source/libs/sgeobj/sge_order.c#L97
@yangliping oh, of course. If you ask about SGE here, you should use the version from this repo :) Glad I could help! See you!
@daimh could you close this issue? This project doesn't have this problem.
I'm sorry, I forgot to mention that. I should definitely switch to the version from this repo as soon as possible. We happened to hit this issue in our system, and I needed to fix it rather than switch to another version immediately. We've used that version for a long time, and it worked well until we suddenly encountered this issue. Thanks again for maintaining the project. Best regards.
Hi,
I really appreciate your work on maintaining and improving SGE. Would you mind helping me with a segfault issue in our system?
Here's where it segfaults:
And the full backtrace:
You can get the full debug output from sge_debug.txt.
Thanks in advance.