Closed: cgorgulla closed this issue 6 years ago
Update: Further analysis indicated that the problem is not nailgun or java, but it is not completely resolved yet. I will provide an update later.
Further investigation seems to indicate that the cause of the problems is indeed the Java/nailgun server. I run it on single compute nodes of a compute cluster, where one has to specify in advance how much memory one wants to use on that node. After a few hours, my jobs (which also run nailgun) are terminated automatically by the node because the requested memory was exceeded, and shortly before that happens the nailgun server seems to have crashed.
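For context, the jobs are submitted with a fixed memory budget, roughly along these lines (a hypothetical Slurm submission line; the real job script is of course more involved):
# --mem requests the per-node memory budget that the cgroup limit shown later in dmesg
# corresponds to (8 GB for these test jobs); the script name is just a placeholder
sbatch --mem=8G my-workflow-job.sh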
The first indication of a problem is the following message (probably from the ng client):
recv: Inappropriate ioctl for device
Afterwards, the calls from the ng client show:
connect: Connection refused
Command exited with non-zero status 230
Regarding the memory: this test job had 8 GB available, and the largest part was consumed by the JVM. I had restricted the JVM heap size to 2 GB (with the -Xmx option), so there should have been plenty of surplus memory available at all times. However, because my job/processes were suddenly terminated by the node due to out-of-memory issues, there must have been a spike in memory usage, most likely by the JVM/nailgun server. I monitored the memory with top every 0.1 seconds and stored it in a log file, but there is no spike visible there: the JVM stays a bit below 2 GB (and nothing else uses much memory). The memory spike might therefore have happened extremely fast. I have also logged the GC activities of the JVM; the analyzed file can be found here: http://gceasy.io/diamondgc-report.jsp?oTxnId_value=a60e7d24-48cd-40da-bcc4-b69b21cd41cb
Here are the analyzed GC files of another test run: http://gceasy.io/diamondgc-report.jsp?oTxnId_value=eac8ed93-324a-4a5b-bda0-b1b4edd64f31
As one can see, sharp memory spikes can be observed in that one. My guess is that this is what causes the memory problems.
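For completeness, the GC log referenced above can be produced with a launch line roughly like the following (only a sketch; the file path is a placeholder, -Xlog is the JDK 9+ unified-logging syntax, and on JDK 8 one would use -Xloggc:<file> -XX:+PrintGCDetails instead):
java -Xmx2G \
     -Xlog:gc*:file=/path/to/gc.log \
     com.martiansoftware.nailgun.NGServer localhost:${NG_PORT} &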
I also logged other JVM warnings, but there is nothing in the log files.
Do you think this problem is related to the nailgun server or rather the JVM itself? And do you have suggestions on how to prevent these memory spikes which cause the jobs to crash?
I have tried OpenJDK-10 instead of version 11, same problem.
We have not tested Nailgun on anything above Java 8, but I can hardly imagine the problem is in the nailgun/JVM pair. What makes you think that Nailgun is the cause of a memory spike? Do you have the option to run your app on Java 8?
I would suggest starting Java with the options to save a heap dump and a thread dump on crash, and going from there.
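For example, something along these lines (the paths are placeholders and the exact behaviour varies a bit between JDK versions; note that these hooks only fire on a Java-level OutOfMemoryError or JVM crash, not when the kernel's OOM killer sends a SIGKILL):
java -Xmx2G \
     -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/dumps \
     -XX:ErrorFile=/path/to/dumps/hs_err_%p.log \
     -XX:OnOutOfMemoryError="kill -3 %p" \
     com.martiansoftware.nailgun.NGServer localhost:${NG_PORT} &
# kill -3 (SIGQUIT) makes the JVM print a thread dump to its stdout/stderr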
Thank you very much for your answer.
I am not sure what the problem is, but it seems to me that the memory spike comes from Java and/or nailgun, because the server crashes shortly before my jobs terminate with the out-of-memory error, and because sudden spikes can be observed in one of the garbage-collection log files (see above).
It should be no problem to run my jobs again with Java 8 (after all, there is a binary distribution which can be installed locally). I have tried the heap-dump option before, but I did not obtain any dumps during the crashes. Maybe it works with Java 8, and I will now also try the thread dump as you suggested.
Java 8 did not help, same problem (out of memory - OOM errors). Also, I did not obtain any crash dumps for the heap or the threads.
I have now logged the memory usage of the test job with the ps command every 0.01 seconds.
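Roughly, the polling loop looks like this (a sketch; the output file and the exact column selection are assumptions):
# poll ps every 0.01 s and append the top memory consumers to a log file
while true; do
    ps -eo pid,%mem,rss,vsz,comm,user --sort=-rss | head -n 30 >> memory.log
    sleep 0.01
done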
Here is a snapshot of how it looks at the beginning of my jobs during normal runtime (several hours before the crash):
PID %MEM RSS VSZ COMMAND USER
115787 1.3 1831184 8749864 java cgorgulla
115606 0.0 5488 253120 srun cgorgulla
39386 0.0 4996 128632 bash cgorgulla
86844 0.0 4048 132432 obabel cgorgulla
164001 0.0 3824 128624 bash cgorgulla
86798 0.0 3804 128632 bash cgorgulla
39381 0.0 3176 183008 sshd cgorgulla
86808 0.0 2652 168544 top cgorgulla
103033 0.0 1904 113416 bash cgorgulla
86854 0.0 1792 161336 ps cgorgulla
115935 0.0 1692 10292 one-step.sh cgorgulla
115942 0.0 1692 10292 one-step.sh cgorgulla
115621 0.0 1688 9676 one-step.sh cgorgulla
115901 0.0 1688 10292 one-step.sh cgorgulla
115972 0.0 1688 10200 one-step.sh cgorgulla
115984 0.0 1688 10204 one-step.sh cgorgulla
115893 0.0 1684 10200 one-step.sh cgorgulla
115913 0.0 1684 10200 one-step.sh cgorgulla
115968 0.0 1684 10200 one-step.sh cgorgulla
86835 0.0 1496 10292 one-step.sh cgorgulla
115608 0.0 860 48364 srun cgorgulla
62163 0.0 724 6448 ng cgorgulla
74090 0.0 724 6448 ng cgorgulla
81691 0.0 720 6448 ng cgorgulla
82537 0.0 720 6448 ng cgorgulla
85912 0.0 720 6448 ng cgorgulla
66274 0.0 716 6448 ng cgorgulla
74088 0.0 684 8604 timeout cgorgulla
85908 0.0 684 8604 timeout cgorgulla
And here is the last snapshot where everything is still normal, shortly before the crash of the job:
PID %MEM RSS VSZ COMMAND USER
139443 1.2 1651768 8622936 java cgorgulla
123601 0.1 163352 523068 obabel cgorgulla
139355 0.0 5488 253120 srun cgorgulla
125747 0.0 5252 365088 obabel cgorgulla
125757 0.0 5252 365088 obabel cgorgulla
125388 0.0 5224 365088 obabel cgorgulla
125824 0.0 3764 267736 obabel cgorgulla
21062 0.0 3724 128628 bash cgorgulla
125778 0.0 3628 267736 obabel cgorgulla
127018 0.0 1904 113416 bash cgorgulla
126127 0.0 1812 161476 ps cgorgulla
139526 0.0 1740 10288 one-step.sh cgorgulla
139508 0.0 1736 10252 one-step.sh cgorgulla
139473 0.0 1728 10256 one-step.sh cgorgulla
139477 0.0 1728 10252 one-step.sh cgorgulla
139558 0.0 1724 10252 one-step.sh cgorgulla
139585 0.0 1724 10252 one-step.sh cgorgulla
139539 0.0 1704 10292 one-step.sh cgorgulla
139370 0.0 1688 9676 one-step.sh cgorgulla
139485 0.0 1688 10200 one-step.sh cgorgulla
125742 0.0 1544 10252 one-step.sh cgorgulla
125752 0.0 1532 10252 one-step.sh cgorgulla
125772 0.0 1532 10256 one-step.sh cgorgulla
125819 0.0 1532 10252 one-step.sh cgorgulla
125363 0.0 1508 10292 one-step.sh cgorgulla
123586 0.0 1496 10200 one-step.sh cgorgulla
139357 0.0 860 48364 srun cgorgulla
104975 0.0 724 6448 ng cgorgulla
91240 0.0 720 6448 ng cgorgulla
One snapshot later, the java/nailgun server has suddenly disappeared from the ps output:
PID %MEM RSS VSZ COMMAND USER
123601 0.1 163340 523068 obabel cgorgulla
139355 0.0 5488 253120 srun cgorgulla
125747 0.0 5236 365088 obabel cgorgulla
125757 0.0 5236 365088 obabel cgorgulla
125388 0.0 5208 365088 obabel cgorgulla
21062 0.0 3724 128628 bash cgorgulla
125824 0.0 3724 267736 obabel cgorgulla
125778 0.0 3608 267736 obabel cgorgulla
127018 0.0 1904 113416 bash cgorgulla
126131 0.0 1812 161476 ps cgorgulla
139526 0.0 1740 10288 one-step.sh cgorgulla
139508 0.0 1736 10252 one-step.sh cgorgulla
139473 0.0 1728 10256 one-step.sh cgorgulla
139477 0.0 1728 10252 one-step.sh cgorgulla
139558 0.0 1724 10252 one-step.sh cgorgulla
139585 0.0 1724 10252 one-step.sh cgorgulla
139539 0.0 1704 10292 one-step.sh cgorgulla
139370 0.0 1688 9676 one-step.sh cgorgulla
139485 0.0 1688 10200 one-step.sh cgorgulla
125742 0.0 1544 10252 one-step.sh cgorgulla
125752 0.0 1532 10252 one-step.sh cgorgulla
125772 0.0 1532 10256 one-step.sh cgorgulla
125819 0.0 1532 10252 one-step.sh cgorgulla
125363 0.0 1508 10292 one-step.sh cgorgulla
123586 0.0 1496 10200 one-step.sh cgorgulla
139357 0.0 860 48364 srun cgorgulla
104975 0.0 724 6448 ng cgorgulla
91240 0.0 720 6448 ng cgorgulla
125755 0.0 688 13296 sed cgorgulla
A few seconds later, my job/workflow is completely terminated by the node/batch system due to out of memory (OOM). My jobs had 8 GB of total memory available. The maximum memory usage my processes reach according to the ps log is 1.4 % for the java process; since ps reports %MEM relative to the node's physical RAM, that is less than 2 GB.
What would you recommend to do next?
This is interesting: the dmesg part where the OOM problem occurred:
[Thu Nov 1 03:40:17 2018] obabel invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
[Thu Nov 1 03:40:17 2018] obabel cpuset=step_0 mems_allowed=0-1
[Thu Nov 1 03:40:17 2018] CPU: 29 PID: 123601 Comm: obabel Tainted: G OE ------------ T 3.10.0-693.21.1.el7.x86_64 #1
[Thu Nov 1 03:40:17 2018] Hardware name: Dell Inc. PowerEdge M630/0R10KJ, BIOS 2.5.4 08/17/2017
[Thu Nov 1 03:40:17 2018] Call Trace:
[Thu Nov 1 03:40:17 2018] [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
[Thu Nov 1 03:40:17 2018] [<ffffffff816a9b90>] dump_header+0x90/0x229
[Thu Nov 1 03:40:17 2018] [<ffffffff810c7c82>] ? default_wake_function+0x12/0x20
[Thu Nov 1 03:40:17 2018] [<ffffffff8118a3d6>] ? find_lock_task_mm+0x56/0xc0
[Thu Nov 1 03:40:17 2018] [<ffffffff811f5fb8>] ? try_get_mem_cgroup_from_mm+0x28/0x60
[Thu Nov 1 03:40:17 2018] [<ffffffff8118a884>] oom_kill_process+0x254/0x3d0
[Thu Nov 1 03:40:17 2018] [<ffffffff811f9cd6>] mem_cgroup_oom_synchronize+0x546/0x570
[Thu Nov 1 03:40:17 2018] [<ffffffff811f9150>] ? mem_cgroup_charge_common+0xc0/0xc0
[Thu Nov 1 03:40:17 2018] [<ffffffff8118b114>] pagefault_out_of_memory+0x14/0x90
[Thu Nov 1 03:40:17 2018] [<ffffffff816a7f2e>] mm_fault_error+0x68/0x12b
[Thu Nov 1 03:40:17 2018] [<ffffffff816bb741>] __do_page_fault+0x391/0x450
[Thu Nov 1 03:40:17 2018] [<ffffffff816bb835>] do_page_fault+0x35/0x90
[Thu Nov 1 03:40:17 2018] [<ffffffff816b7768>] page_fault+0x28/0x30
[Thu Nov 1 03:40:17 2018] Task in /slurm/uid_11122/job_57832937/step_0 killed as a result of limit of /slurm/uid_11122/job_57832937
[Thu Nov 1 03:40:17 2018] memory: usage 8388608kB, limit 8388608kB, failcnt 363061
[Thu Nov 1 03:40:17 2018] memory+swap: usage 8388608kB, limit 16777216kB, failcnt 0
[Thu Nov 1 03:40:17 2018] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[Thu Nov 1 03:40:17 2018] Memory cgroup stats for /slurm/uid_11122/job_57832937: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[Thu Nov 1 03:40:17 2018] Memory cgroup stats for /slurm/uid_11122/job_57832937/step_extern: cache:152KB rss:3944KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:3936KB inactive_file:76KB active_file:76KB unevictable:0KB
[Thu Nov 1 03:40:17 2018] Memory cgroup stats for /slurm/uid_11122/job_57832937/step_batch: cache:0KB rss:4760KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:4760KB inactive_file:0KB active_file:0KB unevictable:0KB
[Thu Nov 1 03:40:17 2018] Memory cgroup stats for /slurm/uid_11122/job_57832937/step_0: cache:6554284KB rss:1825468KB rss_huge:401408KB mapped_file:13556KB swap:0KB inactive_anon:439516KB active_anon:7937116KB inactive_file:1500KB active_file:1476KB unevictable:0KB
[Thu Nov 1 03:40:17 2018] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[Thu Nov 1 03:40:17 2018] [127018] 11122 127018 28354 476 12 0 0 bash
[Thu Nov 1 03:40:17 2018] [139355] 11122 139355 63280 1372 33 0 0 srun
[Thu Nov 1 03:40:17 2018] [139357] 11122 139357 12091 215 25 0 0 srun
[Thu Nov 1 03:40:17 2018] [139370] 11122 139370 2419 422 10 0 0 one-step.sh
[Thu Nov 1 03:40:17 2018] [139443] 11122 139443 2155734 412939 953 0 0 java
[Thu Nov 1 03:40:17 2018] [139473] 11122 139473 2564 432 10 0 0 one-step.sh
[Thu Nov 1 03:40:17 2018] [139477] 11122 139477 2563 432 10 0 0 one-step.sh
[Thu Nov 1 03:40:17 2018] [139485] 11122 139485 2550 422 10 0 0 one-step.sh
[Thu Nov 1 03:40:17 2018] [139508] 11122 139508 2563 434 10 0 0 one-step.sh
[Thu Nov 1 03:40:17 2018] [139526] 11122 139526 2572 435 10 0 0 one-step.sh
[Thu Nov 1 03:40:17 2018] [139539] 11122 139539 2573 426 10 0 0 one-step.sh
[Thu Nov 1 03:40:17 2018] [139558] 11122 139558 2563 431 10 0 0 one-step.sh
[Thu Nov 1 03:40:17 2018] [139585] 11122 139585 2563 431 10 0 0 one-step.sh
[Thu Nov 1 03:40:17 2018] [21062] 11122 21062 32157 931 14 0 0 bash
[Thu Nov 1 03:40:17 2018] [91238] 11122 91238 2151 170 10 0 0 timeout
[Thu Nov 1 03:40:17 2018] [91239] 11122 91239 1081 88 8 0 0 time_bin
[Thu Nov 1 03:40:17 2018] [91240] 11122 91240 1612 180 9 0 0 ng
[Thu Nov 1 03:40:17 2018] [104964] 11122 104964 2151 171 10 0 0 timeout
[Thu Nov 1 03:40:17 2018] [104969] 11122 104969 1081 88 8 0 0 time_bin
[Thu Nov 1 03:40:17 2018] [104975] 11122 104975 1612 181 8 0 0 ng
[Thu Nov 1 03:40:17 2018] [123586] 11122 123586 2550 374 10 0 0 one-step.sh
[Thu Nov 1 03:40:17 2018] [123592] 11122 123592 2151 171 10 0 0 timeout
[Thu Nov 1 03:40:17 2018] [123593] 11122 123593 3325 171 12 0 0 sed
[Thu Nov 1 03:40:17 2018] [123596] 11122 123596 1081 89 8 0 0 time_bin
[Thu Nov 1 03:40:17 2018] [123601] 11122 123601 130767 40835 261 0 0 obabel
[Thu Nov 1 03:40:17 2018] [125363] 11122 125363 2573 377 10 0 0 one-step.sh
[Thu Nov 1 03:40:17 2018] [125369] 11122 125369 2151 171 10 0 0 timeout
[Thu Nov 1 03:40:17 2018] [125372] 11122 125372 1089 81 8 0 0 uniq
[Thu Nov 1 03:40:17 2018] [125373] 11122 125373 3324 171 11 0 0 sed
[Thu Nov 1 03:40:17 2018] [125380] 11122 125380 1081 88 8 0 0 time_bin
[Thu Nov 1 03:40:17 2018] [125388] 11122 125388 91272 1302 179 0 0 obabel
[Thu Nov 1 03:40:17 2018] [125742] 11122 125742 2563 386 10 0 0 one-step.sh
[Thu Nov 1 03:40:17 2018] [125743] 11122 125743 2151 170 10 0 0 timeout
[Thu Nov 1 03:40:17 2018] [125744] 11122 125744 1089 81 8 0 0 uniq
[Thu Nov 1 03:40:17 2018] [125745] 11122 125745 3324 171 12 0 0 sed
[Thu Nov 1 03:40:17 2018] [125746] 11122 125746 1081 87 8 0 0 time_bin
[Thu Nov 1 03:40:17 2018] [125747] 11122 125747 91272 1309 180 0 0 obabel
[Thu Nov 1 03:40:17 2018] [125752] 11122 125752 2563 383 10 0 0 one-step.sh
[Thu Nov 1 03:40:17 2018] [125753] 11122 125753 2151 170 9 0 0 timeout
[Thu Nov 1 03:40:17 2018] [125754] 11122 125754 1089 82 9 0 0 uniq
[Thu Nov 1 03:40:17 2018] [125755] 11122 125755 3324 172 11 0 0 sed
[Thu Nov 1 03:40:17 2018] [125756] 11122 125756 1081 88 8 0 0 time_bin
[Thu Nov 1 03:40:17 2018] [125757] 11122 125757 91272 1309 179 0 0 obabel
[Thu Nov 1 03:40:17 2018] [125772] 11122 125772 2564 383 10 0 0 one-step.sh
[Thu Nov 1 03:40:17 2018] [125773] 11122 125773 2151 170 10 0 0 timeout
[Thu Nov 1 03:40:17 2018] [125774] 11122 125774 1088 86 7 0 0 uniq
[Thu Nov 1 03:40:17 2018] [125775] 11122 125775 3324 172 12 0 0 sed
[Thu Nov 1 03:40:17 2018] [125776] 11122 125776 1081 88 8 0 0 time_bin
[Thu Nov 1 03:40:17 2018] [125778] 11122 125778 66934 902 131 0 0 obabel
[Thu Nov 1 03:40:17 2018] [125819] 11122 125819 2563 383 10 0 0 one-step.sh
[Thu Nov 1 03:40:17 2018] [125820] 11122 125820 2151 171 11 0 0 timeout
[Thu Nov 1 03:40:17 2018] [125821] 11122 125821 1088 87 8 0 0 uniq
[Thu Nov 1 03:40:17 2018] [125822] 11122 125822 3324 172 10 0 0 sed
[Thu Nov 1 03:40:17 2018] [125823] 11122 125823 1081 88 8 0 0 time_bin
[Thu Nov 1 03:40:17 2018] [125824] 11122 125824 66934 931 132 0 0 obabel
[Thu Nov 1 03:40:17 2018] [126131] 11122 126131 40335 445 33 0 0 ps
[Thu Nov 1 03:40:17 2018] [126132] 11122 126132 26980 166 10 0 0 head
[Thu Nov 1 03:40:17 2018] [126133] 11122 126133 26990 153 10 0 0 column
[Thu Nov 1 03:40:17 2018] Memory cgroup out of memory: Kill process 125649 (NGSession 36387) score 197 or sacrifice child
[Thu Nov 1 03:40:17 2018] Killed process 139443 (java) total-vm:8622936kB, anon-rss:1637312kB, file-rss:960kB, shmem-rss:13484kB
I'm still trying to make sense of this, but it seems that obabel caused the problem, which then led to the termination of the java/nailgun server, even though obabel was confined in memory with ulimit...
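A rough back-of-the-envelope reading of the cgroup stats above (only an interpretation, not a definitive diagnosis): the memory cgroup limit counts page cache as well as process RSS, and for step_0 those two together essentially fill the 8 GB limit even though no single process had a large RSS:
# cache and rss of step_0, taken from the dmesg lines above (values in kB)
echo $(( (6554284 + 1825468) / 1024 ))   # ~8183 MB, i.e. essentially the 8388608 kB limit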
I think the topic can be closed, since it seems the nailgun server was a victim rather than the cause of the problem.
In a workflow which runs for days, nailgun is used as a standalone server and is accessed frequently via the ng client (including parallel/simultaneous usage). It all works perfectly well, but at some point the ng server just seems to crash silently (or the JVM, and thus also the ng server). There is no error message or any indication of what the problem is. How can I find out the cause?
When I start the ng server, I limit the Java heap size (to prevent Java from using too much memory, which would cause the batch-system jobs to crash):
java -Xmx5G com.martiansoftware.nailgun.NGServer localhost:${NG_PORT} &
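The ng clients in the workflow then connect to that same port. Assuming the stock ng client behaviour (if I read its options correctly, the port can be selected via the NAILGUN_PORT environment variable or the --nailgun-port option), a call looks roughly like this, with the class name and arguments being placeholders:
NAILGUN_PORT=${NG_PORT} ng my.workflow.Main input.smi output.smi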
As far as I can tell, the memory limit is not the problem, because sometimes the ng server crashes while Java has used much less memory than the limit. But this is not completely certain, because the memory monitoring is based on polling and could therefore miss a sudden spike in memory usage.
The JDK which I use is openjdk-11_linux-x64_bin.
Any help is appreciated.