d2iq-archive / mesos-deb-packaging

Mesos package for Debian, Ubuntu, CentOS, RHEL, and Fedora

mesos-slave service sometimes does not startup #76

Open ryoichitaniguchi opened 8 years ago

ryoichitaniguchi commented 8 years ago

Hi experts,

I currently run 10 mesos-slave nodes on Ubuntu Trusty using the latest deb package (version 0.28.0-2.0.16).

I installed and configured the "mesos" package with an Ansible role (https://github.com/AnsibleShipyard/ansible-mesos), but it uses the default configuration from this repository (mesosphere/mesos-deb-packaging).

I noticed that after a scheduled OS reboot, a few of the nodes (2-3 of the 10) sometimes fail to launch mesos-slave with the error below.

dmesg:

[  138.321970] init: mesos-slave main process (2811) killed by ABRT signal
[  138.321977] init: mesos-slave main process ended, respawning
[  138.341067] init: mesos-slave main process (2835) killed by ABRT signal
[  138.341075] init: mesos-slave main process ended, respawning
[  138.359238] init: mesos-slave main process (2851) killed by ABRT signal
[  138.359254] init: mesos-slave main process ended, respawning
[  138.377498] init: mesos-slave main process (2867) killed by ABRT signal
[  138.377507] init: mesos-slave main process ended, respawning
[  138.395897] init: mesos-slave main process (2883) killed by ABRT signal
[  138.395906] init: mesos-slave main process ended, respawning
[  138.414475] init: mesos-slave main process (2899) killed by ABRT signal
[  138.414483] init: mesos-slave main process ended, respawning
[  138.432855] init: mesos-slave main process (2915) killed by ABRT signal
[  138.432863] init: mesos-slave main process ended, respawning
[  138.451119] init: mesos-slave main process (2932) killed by ABRT signal
[  138.451127] init: mesos-slave main process ended, respawning
[  138.469644] init: mesos-slave main process (2948) killed by ABRT signal
[  138.469652] init: mesos-slave main process ended, respawning
[  138.488002] init: mesos-slave main process (2964) killed by ABRT signal
[  138.488010] init: mesos-slave main process ended, respawning
[  138.506841] init: mesos-slave main process (2980) killed by ABRT signal
[  138.506849] init: mesos-slave respawning too fast, stopped
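
The dmesg excerpt above shows how quickly the respawn loop burns through its budget. As far as I know, Upstart's default respawn limit is 10 respawns within 5 seconds, which matches the log: 10 respawns, then a stop on the next abort. A minimal Python sketch that counts the aborts and the time they span is below; the regular expression and the `summarize` helper are illustrative names of mine, not part of any Mesos or Upstart tooling.

```python
import re

# "killed by ABRT" lines copied from the dmesg excerpt above.
DMESG = """\
[  138.321970] init: mesos-slave main process (2811) killed by ABRT signal
[  138.341067] init: mesos-slave main process (2835) killed by ABRT signal
[  138.359238] init: mesos-slave main process (2851) killed by ABRT signal
[  138.377498] init: mesos-slave main process (2867) killed by ABRT signal
[  138.395897] init: mesos-slave main process (2883) killed by ABRT signal
[  138.414475] init: mesos-slave main process (2899) killed by ABRT signal
[  138.432855] init: mesos-slave main process (2915) killed by ABRT signal
[  138.451119] init: mesos-slave main process (2932) killed by ABRT signal
[  138.469644] init: mesos-slave main process (2948) killed by ABRT signal
[  138.488002] init: mesos-slave main process (2964) killed by ABRT signal
[  138.506841] init: mesos-slave main process (2980) killed by ABRT signal
[  138.506849] init: mesos-slave respawning too fast, stopped
"""

# Captures the kernel timestamp of every "killed by ABRT signal" line.
KILL = re.compile(
    r"\[\s*(\d+\.\d+)\] init: \S+ main process \(\d+\) killed by ABRT signal"
)


def summarize(log):
    """Return (number of ABRT kills, seconds between first and last kill)."""
    stamps = [float(m.group(1)) for m in KILL.finditer(log)]
    if not stamps:
        return 0, 0.0
    return len(stamps), stamps[-1] - stamps[0]


kills, span = summarize(DMESG)
print(f"{kills} aborts in {span:.3f} s")  # prints "11 aborts in 0.185 s"
```

All retries are used up in under 0.2 s, so respawning alone gives the network interface no real time to come up.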

All of the slaves that hit this issue printed the syslog output below.

Corresponding code: https://github.com/apache/mesos/blob/845fa6abdc163676cde225e2dc72cee9e3e964f5/3rdparty/libprocess/src/process.cpp#L889

My guess is that bind() failed with EADDRNOTAVAIL (errno=99), i.e. the network interface was not yet ready to be bound when mesos-slave started:

syslog:

Apr  4 01:26:15 jptolx10221 mesos-slave[2822]: WARNING: Logging before InitGoogleLogging() is written to STDERR
Apr  4 01:26:15 jptolx10221 mesos-slave[2822]: F0404 01:26:15.485184  2822 process.cpp:889] Failed to initialize: Failed to bind on 10.XX.XX.XX:5051: Cannot assign requested address: Cannot assign requested address [99]
Apr  4 01:26:15 jptolx10221 mesos-slave[2822]: *** Check failure stack trace: ***
・・・
Apr  4 01:26:15 jptolx10221 kernel: [  137.778244] init: mesos-slave main process (2973) killed by ABRT signal
Apr  4 01:26:15 jptolx10221 kernel: [  137.778253] init: mesos-slave main process ended, respawning
Apr  4 01:26:15 jptolx10221 mesos-slave[2989]: WARNING: Logging before InitGoogleLogging() is written to STDERR
Apr  4 01:26:15 jptolx10221 mesos-slave[2989]: F0404 01:26:15.667261  2989 process.cpp:889] Failed to initialize: Failed to bind on 10.XX.XX.XX:5051: Cannot assign requested address: Cannot assign requested address [99]
Apr  4 01:26:15 jptolx10221 mesos-slave[2989]: *** Check failure stack trace: ***
Apr  4 01:26:15 jptolx10221 kernel: [  137.796619] init: mesos-slave main process (2989) killed by ABRT signal
Apr  4 01:26:15 jptolx10221 kernel: [  137.796628] init: mesos-slave respawning too fast, stopped
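
The "Cannot assign requested address [99]" message in the syslog above is what bind() reports when the requested IP is not (yet) configured on any local interface. A small Python sketch that reproduces the same errno outside of Mesos is below. It assumes a Linux host without `net.ipv4.ip_nonlocal_bind` enabled; `bind_errno` is a hypothetical helper name, and 192.0.2.1 is a documentation-only address (TEST-NET-1) standing in for the masked 10.XX.XX.XX.

```python
import errno
import os
import socket


def bind_errno(host, port):
    """Try to bind a TCP socket to (host, port).

    Returns None if bind() succeeds, otherwise the errno it failed with.
    The socket is closed again in both cases.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            return exc.errno
    return None


if __name__ == "__main__":
    # 192.0.2.1 should not be assigned to any interface on a normal host,
    # which mimics an interface address that is not up yet after boot.
    err = bind_errno("192.0.2.1", 5051)
    if err is None:
        print("bind succeeded")
    else:
        print(f"bind failed: {errno.errorcode.get(err, err)} ({os.strerror(err)})")
```

If this prints EADDRNOTAVAIL for an unassigned address but succeeds once the address is configured, that would be consistent with the race suspected in this issue.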

This can be recovered manually by running "service mesos-slave start", but is there a way to avoid it? I would appreciate it if someone could fix the Upstart script.

Regards

ryoichitaniguchi commented 8 years ago

For the error below, I internally uploaded 18934f5e4f4d07137f7aeaa1f137e22985b1350c, which delays launching the mesos-* services on startup. I hope someone can kindly review it.

Apr 4 01:26:15 jptolx10221 mesos-slave[2989]: F0404 01:26:15.667261 2989 process.cpp:889] Failed to initialize: Failed to bind on 10.XX.XX.XX:5051: Cannot assign requested address: Cannot assign requested address [99]
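
One way to delay the launch without a fixed sleep is to poll until the address can be bound. The Python sketch below illustrates that idea only; it is not the content of the commit mentioned above, and `bind_errno`, `wait_until_bindable`, and the timeout values are hypothetical names and defaults chosen for illustration.

```python
import errno
import socket
import time


def bind_errno(host, port):
    """Return None if a TCP socket can bind to (host, port), else the errno."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            return exc.errno
    return None


def wait_until_bindable(host, port, timeout=30.0, interval=0.5,
                        probe=bind_errno, sleep=time.sleep):
    """Poll until the address is bindable or the timeout elapses.

    Keeps retrying only while the probe reports EADDRNOTAVAIL, the error
    seen in the syslog. Returns True once the probe succeeds, and False on
    timeout or on any other errno (for example EADDRINUSE), since waiting
    would not fix those.
    """
    deadline = time.monotonic() + timeout
    while True:
        err = probe(host, port)
        if err != errno.EADDRNOTAVAIL:
            return err is None
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

A check like this could, for instance, be called from a pre-start step in the Upstart job with the slave's configured IP and port 5051, so that the main process only starts once the probe succeeds. The probe socket is closed before the function returns, so it should not hold the port when mesos-slave starts.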