facebookarchive / nailgun

Nailgun is a client, protocol, and server for running Java programs from the command line without incurring the JVM startup overhead.
https://github.com/facebook/nailgun

Nailgun server crashes #155

Closed cgorgulla closed 6 years ago

cgorgulla commented 6 years ago

In a workflow that runs for days, Nailgun is used as a standalone server and is called frequently via the ng client (including parallel/simultaneous calls). Everything works perfectly well, but at some point the ng server just seems to crash silently (or the JVM crashes and takes the ng server with it). There is no error message or any other indication of what the problem is. How can I find out the cause?

When I start the ng server, I limit the Java heap size (to prevent Java from using too much memory, which would cause the batch-system jobs to crash):

java -Xmx5G com.martiansoftware.nailgun.NGServer localhost:${NG_PORT} &
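The workflow then calls the running server via the ng client, roughly like this (the class name and file arguments are only placeholders for the actual entry point used in the workflow):

NAILGUN_PORT=${NG_PORT} ng some.package.MainClass input_file output_file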

As far as I can tell, the memory limit is not the problem, because the ng server sometimes crashes while Java is using far less memory than the limit allows. But I cannot be completely sure, because the memory monitoring is based on polling and could therefore miss a sudden spike in memory usage.

The JDK I use is openjdk-11_linux-x64_bin.

Any help is appreciated.

cgorgulla commented 6 years ago

Update: Further analysis indicated that the problem lies neither in Nailgun nor in Java, but it is not completely resolved yet. I will provide an update later.

cgorgulla commented 6 years ago

Further investigation seems to indicate that the cause of the problems is indeed the Java/Nailgun server after all. I run it on single compute nodes of a compute cluster, where one has to specify in advance how much memory a job may use on the node. After a few hours, my jobs (which also run Nailgun) are terminated automatically by the node because the requested memory was exceeded, and shortly before that happens, the Nailgun server seems to crash.

The first indication of a problem is the following message (probably from the ng client):

recv: Inappropriate ioctl for device

Afterwards, calls from the ng client show:

connect: Connection refused
Command exited with non-zero status 230

Regarding the memory, this test job had 8 GB available. The largest part was consumed by the JVM, whose heap size I had restricted to 2 GB (with the -Xmx option), so there should have been plenty of surplus memory available at all times. However, since my job/processes were suddenly terminated by the node due to out-of-memory issues, there must have been a spike in memory usage, most likely caused by the JVM/Nailgun server.

I monitored the memory with top every 0.1 seconds and stored the output in a log file. There is no spike visible in that log: the JVM stays a bit below 2 GB of memory (and nothing else uses much memory). Therefore, the memory spike might have happened extremely fast.

I have also logged the GC activity of the JVM; the analyzed file can be found here: http://gceasy.io/diamondgc-report.jsp?oTxnId_value=a60e7d24-48cd-40da-bcc4-b69b21cd41cb And here are the analyzed GC files of another test run: http://gceasy.io/diamondgc-report.jsp?oTxnId_value=eac8ed93-324a-4a5b-bda0-b1b4edd64f31 In the latter, sharp memory spikes can be observed. My guess is that this is what causes the memory problems.
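For reference, the GC logging and the top-based memory polling are set up roughly as follows (a sketch; the log file names are placeholders, and the -Xlog syntax is the unified-logging one of Java 11):

# ng server with a 2 GB heap and a GC log (Java 11 unified logging)
java -Xmx2G -Xlog:gc*:file=gc.log com.martiansoftware.nailgun.NGServer localhost:${NG_PORT} &

# poll the memory usage of all own processes every 0.1 s
while true; do
    top -b -n 1 -u ${USER} >> top-memory.log
    sleep 0.1
done &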

I also logged other JVM warnings, but there is nothing in the log files.

Do you think this problem is related to the nailgun server or rather the JVM itself? And do you have suggestions on how to prevent these memory spikes which cause the jobs to crash?

cgorgulla commented 6 years ago

I have tried OpenJDK-10 instead of version 11, same problem.

sbalabanov-zz commented 6 years ago

We have not tested Nailgun on anything above Java 8, but I can hardly imagine the problem is in the Nailgun/JVM pair. What makes you think that Nailgun is the cause of the memory spike? Do you have the option of running your app on Java 8?

I would suggest starting java with the options to save a heap dump and a thread dump on crash, and going from there.
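Something along these lines (the dump path is only a placeholder):

java -Xmx2G \
     -XX:+HeapDumpOnOutOfMemoryError \
     -XX:HeapDumpPath=/path/to/dumps \
     -XX:ErrorFile=/path/to/dumps/hs_err_pid%p.log \
     com.martiansoftware.nailgun.NGServer localhost:${NG_PORT} &

(The heap dump is written when the JVM itself throws an OutOfMemoryError; the error file is written when the JVM crashes.)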

cgorgulla commented 6 years ago

Thank you very much for your answer.

I am not sure what the problem is, but it seems to me that the memory spike comes from Java and/or Nailgun, because the server crashes shortly before my jobs terminate with the out-of-memory error, and because sudden spikes can be observed in one of the garbage-collection log files (see above).

It should be no problem to run my jobs again with Java 8 (after all, there is a binary distribution which can be installed locally). I have tried the heap-dump option before, but I did not obtain any dumps during the crashes. Maybe it works with Java 8, and I will now also try the thread dump as you suggested.
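For the thread dumps my plan is to periodically run jstack against the running server, roughly like this (the interval and file names are only placeholders):

NG_PID=$(pgrep -f com.martiansoftware.nailgun.NGServer)
while kill -0 ${NG_PID} 2>/dev/null; do
    jstack ${NG_PID} > thread-dump-$(date +%s).txt
    sleep 60
done &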

cgorgulla commented 6 years ago

Java 8 did not help; it is the same problem (out-of-memory/OOM errors). Also, I did not obtain any crash dumps, neither for the heap nor for the threads.

I have now logged the memory usage of the test job with the ps command every 0.01 seconds.
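The polling loop looks roughly like this:

while true; do
    ps -u ${USER} -o pid,%mem,rss,vsz,comm,user --sort=-rss >> ps-memory.log
    sleep 0.01
done &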

Here is a snapshot of how it looks at the beginning of my jobs during normal runtime (several hours before the crash):

PID     %MEM  RSS      VSZ      COMMAND      USER
115787  1.3   1831184  8749864  java         cgorgulla
115606  0.0   5488     253120   srun         cgorgulla
39386   0.0   4996     128632   bash         cgorgulla
86844   0.0   4048     132432   obabel       cgorgulla
164001  0.0   3824     128624   bash         cgorgulla
86798   0.0   3804     128632   bash         cgorgulla
39381   0.0   3176     183008   sshd         cgorgulla
86808   0.0   2652     168544   top          cgorgulla
103033  0.0   1904     113416   bash         cgorgulla
86854   0.0   1792     161336   ps           cgorgulla
115935  0.0   1692     10292    one-step.sh  cgorgulla
115942  0.0   1692     10292    one-step.sh  cgorgulla
115621  0.0   1688     9676     one-step.sh  cgorgulla
115901  0.0   1688     10292    one-step.sh  cgorgulla
115972  0.0   1688     10200    one-step.sh  cgorgulla
115984  0.0   1688     10204    one-step.sh  cgorgulla
115893  0.0   1684     10200    one-step.sh  cgorgulla
115913  0.0   1684     10200    one-step.sh  cgorgulla
115968  0.0   1684     10200    one-step.sh  cgorgulla
86835   0.0   1496     10292    one-step.sh  cgorgulla
115608  0.0   860      48364    srun         cgorgulla
62163   0.0   724      6448     ng           cgorgulla
74090   0.0   724      6448     ng           cgorgulla
81691   0.0   720      6448     ng           cgorgulla
82537   0.0   720      6448     ng           cgorgulla
85912   0.0   720      6448     ng           cgorgulla
66274   0.0   716      6448     ng           cgorgulla
74088   0.0   684      8604     timeout      cgorgulla
85908   0.0   684      8604     timeout      cgorgulla

And here is the last snapshot where everything is still normal, shortly before the crash of the job:

PID     %MEM  RSS      VSZ      COMMAND      USER
139443  1.2   1651768  8622936  java         cgorgulla
123601  0.1   163352   523068   obabel       cgorgulla
139355  0.0   5488     253120   srun         cgorgulla
125747  0.0   5252     365088   obabel       cgorgulla
125757  0.0   5252     365088   obabel       cgorgulla
125388  0.0   5224     365088   obabel       cgorgulla
125824  0.0   3764     267736   obabel       cgorgulla
21062   0.0   3724     128628   bash         cgorgulla
125778  0.0   3628     267736   obabel       cgorgulla
127018  0.0   1904     113416   bash         cgorgulla
126127  0.0   1812     161476   ps           cgorgulla
139526  0.0   1740     10288    one-step.sh  cgorgulla
139508  0.0   1736     10252    one-step.sh  cgorgulla
139473  0.0   1728     10256    one-step.sh  cgorgulla
139477  0.0   1728     10252    one-step.sh  cgorgulla
139558  0.0   1724     10252    one-step.sh  cgorgulla
139585  0.0   1724     10252    one-step.sh  cgorgulla
139539  0.0   1704     10292    one-step.sh  cgorgulla
139370  0.0   1688     9676     one-step.sh  cgorgulla
139485  0.0   1688     10200    one-step.sh  cgorgulla
125742  0.0   1544     10252    one-step.sh  cgorgulla
125752  0.0   1532     10252    one-step.sh  cgorgulla
125772  0.0   1532     10256    one-step.sh  cgorgulla
125819  0.0   1532     10252    one-step.sh  cgorgulla
125363  0.0   1508     10292    one-step.sh  cgorgulla
123586  0.0   1496     10200    one-step.sh  cgorgulla
139357  0.0   860      48364    srun         cgorgulla
104975  0.0   724      6448     ng           cgorgulla
91240   0.0   720      6448     ng           cgorgulla

One snapshot later, the Java/Nailgun server has suddenly disappeared from the ps output:

PID     %MEM  RSS     VSZ     COMMAND      USER
123601  0.1   163340  523068  obabel       cgorgulla
139355  0.0   5488    253120  srun         cgorgulla
125747  0.0   5236    365088  obabel       cgorgulla
125757  0.0   5236    365088  obabel       cgorgulla
125388  0.0   5208    365088  obabel       cgorgulla
21062   0.0   3724    128628  bash         cgorgulla
125824  0.0   3724    267736  obabel       cgorgulla
125778  0.0   3608    267736  obabel       cgorgulla
127018  0.0   1904    113416  bash         cgorgulla
126131  0.0   1812    161476  ps           cgorgulla
139526  0.0   1740    10288   one-step.sh  cgorgulla
139508  0.0   1736    10252   one-step.sh  cgorgulla
139473  0.0   1728    10256   one-step.sh  cgorgulla
139477  0.0   1728    10252   one-step.sh  cgorgulla
139558  0.0   1724    10252   one-step.sh  cgorgulla
139585  0.0   1724    10252   one-step.sh  cgorgulla
139539  0.0   1704    10292   one-step.sh  cgorgulla
139370  0.0   1688    9676    one-step.sh  cgorgulla
139485  0.0   1688    10200   one-step.sh  cgorgulla
125742  0.0   1544    10252   one-step.sh  cgorgulla
125752  0.0   1532    10252   one-step.sh  cgorgulla
125772  0.0   1532    10256   one-step.sh  cgorgulla
125819  0.0   1532    10252   one-step.sh  cgorgulla
125363  0.0   1508    10292   one-step.sh  cgorgulla
123586  0.0   1496    10200   one-step.sh  cgorgulla
139357  0.0   860     48364   srun         cgorgulla
104975  0.0   724     6448    ng           cgorgulla
91240   0.0   720     6448    ng           cgorgulla
125755  0.0   688     13296   sed          cgorgulla

A few seconds later, my job/workflow is terminated completely by the node/batch system due to out of memory (OOM). My jobs had 8 GB of total memory available. The maximum memory my jobs use according to the ps log is 1.4 % of the node's total memory, i.e. less than 2 GB.

What would you recommend doing next?

cgorgulla commented 6 years ago

This is interesting: here is the part of the dmesg output where the OOM problem occurred:

[Thu Nov  1 03:40:17 2018] obabel invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
[Thu Nov  1 03:40:17 2018] obabel cpuset=step_0 mems_allowed=0-1
[Thu Nov  1 03:40:17 2018] CPU: 29 PID: 123601 Comm: obabel Tainted: G           OE  ------------ T 3.10.0-693.21.1.el7.x86_64 #1
[Thu Nov  1 03:40:17 2018] Hardware name: Dell Inc. PowerEdge M630/0R10KJ, BIOS 2.5.4 08/17/2017
[Thu Nov  1 03:40:17 2018] Call Trace:
[Thu Nov  1 03:40:17 2018]  [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
[Thu Nov  1 03:40:17 2018]  [<ffffffff816a9b90>] dump_header+0x90/0x229
[Thu Nov  1 03:40:17 2018]  [<ffffffff810c7c82>] ? default_wake_function+0x12/0x20
[Thu Nov  1 03:40:17 2018]  [<ffffffff8118a3d6>] ? find_lock_task_mm+0x56/0xc0
[Thu Nov  1 03:40:17 2018]  [<ffffffff811f5fb8>] ? try_get_mem_cgroup_from_mm+0x28/0x60
[Thu Nov  1 03:40:17 2018]  [<ffffffff8118a884>] oom_kill_process+0x254/0x3d0
[Thu Nov  1 03:40:17 2018]  [<ffffffff811f9cd6>] mem_cgroup_oom_synchronize+0x546/0x570
[Thu Nov  1 03:40:17 2018]  [<ffffffff811f9150>] ? mem_cgroup_charge_common+0xc0/0xc0
[Thu Nov  1 03:40:17 2018]  [<ffffffff8118b114>] pagefault_out_of_memory+0x14/0x90
[Thu Nov  1 03:40:17 2018]  [<ffffffff816a7f2e>] mm_fault_error+0x68/0x12b
[Thu Nov  1 03:40:17 2018]  [<ffffffff816bb741>] __do_page_fault+0x391/0x450
[Thu Nov  1 03:40:17 2018]  [<ffffffff816bb835>] do_page_fault+0x35/0x90
[Thu Nov  1 03:40:17 2018]  [<ffffffff816b7768>] page_fault+0x28/0x30
[Thu Nov  1 03:40:17 2018] Task in /slurm/uid_11122/job_57832937/step_0 killed as a result of limit of /slurm/uid_11122/job_57832937
[Thu Nov  1 03:40:17 2018] memory: usage 8388608kB, limit 8388608kB, failcnt 363061
[Thu Nov  1 03:40:17 2018] memory+swap: usage 8388608kB, limit 16777216kB, failcnt 0
[Thu Nov  1 03:40:17 2018] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[Thu Nov  1 03:40:17 2018] Memory cgroup stats for /slurm/uid_11122/job_57832937: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[Thu Nov  1 03:40:17 2018] Memory cgroup stats for /slurm/uid_11122/job_57832937/step_extern: cache:152KB rss:3944KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:3936KB inactive_file:76KB active_file:76KB unevictable:0KB
[Thu Nov  1 03:40:17 2018] Memory cgroup stats for /slurm/uid_11122/job_57832937/step_batch: cache:0KB rss:4760KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:4760KB inactive_file:0KB active_file:0KB unevictable:0KB
[Thu Nov  1 03:40:17 2018] Memory cgroup stats for /slurm/uid_11122/job_57832937/step_0: cache:6554284KB rss:1825468KB rss_huge:401408KB mapped_file:13556KB swap:0KB inactive_anon:439516KB active_anon:7937116KB inactive_file:1500KB active_file:1476KB unevictable:0KB
[Thu Nov  1 03:40:17 2018] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[Thu Nov  1 03:40:17 2018] [127018] 11122 127018    28354      476      12        0             0 bash
[Thu Nov  1 03:40:17 2018] [139355] 11122 139355    63280     1372      33        0             0 srun
[Thu Nov  1 03:40:17 2018] [139357] 11122 139357    12091      215      25        0             0 srun
[Thu Nov  1 03:40:17 2018] [139370] 11122 139370     2419      422      10        0             0 one-step.sh
[Thu Nov  1 03:40:17 2018] [139443] 11122 139443  2155734   412939     953        0             0 java
[Thu Nov  1 03:40:17 2018] [139473] 11122 139473     2564      432      10        0             0 one-step.sh
[Thu Nov  1 03:40:17 2018] [139477] 11122 139477     2563      432      10        0             0 one-step.sh
[Thu Nov  1 03:40:17 2018] [139485] 11122 139485     2550      422      10        0             0 one-step.sh
[Thu Nov  1 03:40:17 2018] [139508] 11122 139508     2563      434      10        0             0 one-step.sh
[Thu Nov  1 03:40:17 2018] [139526] 11122 139526     2572      435      10        0             0 one-step.sh
[Thu Nov  1 03:40:17 2018] [139539] 11122 139539     2573      426      10        0             0 one-step.sh
[Thu Nov  1 03:40:17 2018] [139558] 11122 139558     2563      431      10        0             0 one-step.sh
[Thu Nov  1 03:40:17 2018] [139585] 11122 139585     2563      431      10        0             0 one-step.sh
[Thu Nov  1 03:40:17 2018] [21062] 11122 21062    32157      931      14        0             0 bash
[Thu Nov  1 03:40:17 2018] [91238] 11122 91238     2151      170      10        0             0 timeout
[Thu Nov  1 03:40:17 2018] [91239] 11122 91239     1081       88       8        0             0 time_bin
[Thu Nov  1 03:40:17 2018] [91240] 11122 91240     1612      180       9        0             0 ng
[Thu Nov  1 03:40:17 2018] [104964] 11122 104964     2151      171      10        0             0 timeout
[Thu Nov  1 03:40:17 2018] [104969] 11122 104969     1081       88       8        0             0 time_bin
[Thu Nov  1 03:40:17 2018] [104975] 11122 104975     1612      181       8        0             0 ng
[Thu Nov  1 03:40:17 2018] [123586] 11122 123586     2550      374      10        0             0 one-step.sh
[Thu Nov  1 03:40:17 2018] [123592] 11122 123592     2151      171      10        0             0 timeout
[Thu Nov  1 03:40:17 2018] [123593] 11122 123593     3325      171      12        0             0 sed
[Thu Nov  1 03:40:17 2018] [123596] 11122 123596     1081       89       8        0             0 time_bin
[Thu Nov  1 03:40:17 2018] [123601] 11122 123601   130767    40835     261        0             0 obabel
[Thu Nov  1 03:40:17 2018] [125363] 11122 125363     2573      377      10        0             0 one-step.sh
[Thu Nov  1 03:40:17 2018] [125369] 11122 125369     2151      171      10        0             0 timeout
[Thu Nov  1 03:40:17 2018] [125372] 11122 125372     1089       81       8        0             0 uniq
[Thu Nov  1 03:40:17 2018] [125373] 11122 125373     3324      171      11        0             0 sed
[Thu Nov  1 03:40:17 2018] [125380] 11122 125380     1081       88       8        0             0 time_bin
[Thu Nov  1 03:40:17 2018] [125388] 11122 125388    91272     1302     179        0             0 obabel
[Thu Nov  1 03:40:17 2018] [125742] 11122 125742     2563      386      10        0             0 one-step.sh
[Thu Nov  1 03:40:17 2018] [125743] 11122 125743     2151      170      10        0             0 timeout
[Thu Nov  1 03:40:17 2018] [125744] 11122 125744     1089       81       8        0             0 uniq
[Thu Nov  1 03:40:17 2018] [125745] 11122 125745     3324      171      12        0             0 sed
[Thu Nov  1 03:40:17 2018] [125746] 11122 125746     1081       87       8        0             0 time_bin
[Thu Nov  1 03:40:17 2018] [125747] 11122 125747    91272     1309     180        0             0 obabel
[Thu Nov  1 03:40:17 2018] [125752] 11122 125752     2563      383      10        0             0 one-step.sh
[Thu Nov  1 03:40:17 2018] [125753] 11122 125753     2151      170       9        0             0 timeout
[Thu Nov  1 03:40:17 2018] [125754] 11122 125754     1089       82       9        0             0 uniq
[Thu Nov  1 03:40:17 2018] [125755] 11122 125755     3324      172      11        0             0 sed
[Thu Nov  1 03:40:17 2018] [125756] 11122 125756     1081       88       8        0             0 time_bin
[Thu Nov  1 03:40:17 2018] [125757] 11122 125757    91272     1309     179        0             0 obabel
[Thu Nov  1 03:40:17 2018] [125772] 11122 125772     2564      383      10        0             0 one-step.sh
[Thu Nov  1 03:40:17 2018] [125773] 11122 125773     2151      170      10        0             0 timeout
[Thu Nov  1 03:40:17 2018] [125774] 11122 125774     1088       86       7        0             0 uniq
[Thu Nov  1 03:40:17 2018] [125775] 11122 125775     3324      172      12        0             0 sed
[Thu Nov  1 03:40:17 2018] [125776] 11122 125776     1081       88       8        0             0 time_bin
[Thu Nov  1 03:40:17 2018] [125778] 11122 125778    66934      902     131        0             0 obabel
[Thu Nov  1 03:40:17 2018] [125819] 11122 125819     2563      383      10        0             0 one-step.sh
[Thu Nov  1 03:40:17 2018] [125820] 11122 125820     2151      171      11        0             0 timeout
[Thu Nov  1 03:40:17 2018] [125821] 11122 125821     1088       87       8        0             0 uniq
[Thu Nov  1 03:40:17 2018] [125822] 11122 125822     3324      172      10        0             0 sed
[Thu Nov  1 03:40:17 2018] [125823] 11122 125823     1081       88       8        0             0 time_bin
[Thu Nov  1 03:40:17 2018] [125824] 11122 125824    66934      931     132        0             0 obabel
[Thu Nov  1 03:40:17 2018] [126131] 11122 126131    40335      445      33        0             0 ps
[Thu Nov  1 03:40:17 2018] [126132] 11122 126132    26980      166      10        0             0 head
[Thu Nov  1 03:40:17 2018] [126133] 11122 126133    26990      153      10        0             0 column
[Thu Nov  1 03:40:17 2018] Memory cgroup out of memory: Kill process 125649 (NGSession 36387) score 197 or sacrifice child
[Thu Nov  1 03:40:17 2018] Killed process 139443 (java) total-vm:8622936kB, anon-rss:1637312kB, file-rss:960kB, shmem-rss:13484kB

I'm still trying to make sense of this, but it seems that obabel caused the problem, which then led to the termination of the Java/Nailgun server, even though obabel was confined in memory with ulimit...
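For reference, the obabel calls in the workflow are memory-confined roughly like this (the limit value and file names are only examples):

# limit the virtual memory of obabel to ~2 GB (ulimit -v takes KiB); the subshell keeps the limit local
( ulimit -v 2000000; obabel input.smi -O output.pdb )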

cgorgulla commented 6 years ago

I think this issue can be closed, since the Nailgun server seems to have been a victim rather than the cause of the problem.