ganglia / monitor-core

Ganglia Monitoring core
BSD 3-Clause "New" or "Revised" License

System D /run/systemd/private #266

Closed Arlion closed 7 years ago

Arlion commented 8 years ago

After extensive troubleshooting I am at an impasse, and I am hoping there is enough information here to identify the problem.

Symptoms: During startup, gmond attempts to start but fails. I later discovered that the service pauses for about 3 minutes while it waits for a port to open, and then pauses for another 2 minutes.

Once system startup has completed, starting the service succeeds.

Troubleshooting:

/lib/systemd/system/gmond.service
[Unit]
Description=Ganglia Monitoring Daemon
After=multi-user.target

[Service]
Type=notify
ExecStart=/usr/sbin/gmond
Environment=SYSTEMD_LOG_LEVEL=debug
Requires=dbus.service  ## added to ensure dbus service was up before gmond started.
[Install]
WantedBy=multi-user.target

This did not produce any additional logs (there were still none).
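A unit file like the one above can be sanity-checked mechanically. The following is a minimal sketch, not part of Ganglia or systemd; `lint_unit`, `UNIT_ONLY_KEYS`, and `SAMPLE_UNIT` are hypothetical names chosen for illustration. It flags two things: dependency options such as Requires= placed outside [Unit] (systemd documents them as [Unit] options), and a unit ordered After= the same target it names in WantedBy=, which deserves a second look because a target is normally ordered after the units it wants. Whether either point explains the delay reported here is not established by this thread.

```python
import configparser

# Dependency/ordering options that systemd documents for the [Unit] section
# (a partial list, sufficient for this illustration).
UNIT_ONLY_KEYS = {"Requires", "Wants", "After", "Before"}

# The unit from the report, with the inline annotation removed.
SAMPLE_UNIT = """\
[Unit]
Description=Ganglia Monitoring Daemon
After=multi-user.target

[Service]
Type=notify
ExecStart=/usr/sbin/gmond
Environment=SYSTEMD_LOG_LEVEL=debug
Requires=dbus.service
[Install]
WantedBy=multi-user.target
"""


def lint_unit(text):
    """Return a list of warning strings for a unit file given as text."""
    parser = configparser.ConfigParser(
        strict=False,        # tolerate repeated keys, which unit files allow
        interpolation=None,  # '%' specifiers are not Python interpolation
        delimiters=("=",),
    )
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(text)

    warnings = []

    # Check 1: [Unit]-only options that appear in another section.
    for section in parser.sections():
        if section == "Unit":
            continue
        for key in parser[section]:
            if key in UNIT_ONLY_KEYS:
                warnings.append(
                    f"{key}= appears in [{section}] but is a [Unit] option"
                )

    # Check 2: ordered after the same target that pulls the unit in.
    after = parser.get("Unit", "After", fallback="").split()
    wanted_by = parser.get("Install", "WantedBy", fallback="").split()
    for target in sorted(set(after) & set(wanted_by)):
        warnings.append(f"ordered After={target} but also WantedBy={target}")

    return warnings


if __name__ == "__main__":
    for warning in lint_unit(SAMPLE_UNIT):
        print(warning)
```

The parser is Python's generic INI reader, so it only approximates systemd's own parsing (for example, it does not handle line continuations the way systemd does).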

I finally wrote a script to attach strace to the process at startup. The full output is here: http://paste.fedoraproject.org/423402/47325873/

Here are a few excerpts:

10:07:19 connect(3, {sa_family=AF_LOCAL, sun_path="/run/systemd/private"}, 22) = 0
10:07:19 getsockopt(3, SOL_SOCKET, SO_PEERCRED, {pid=1, uid=0, gid=0}, [12]) = 0
10:07:19 getsockopt(3, SOL_SOCKET, SO_PEERSEC, 0x7f8947fef810, 0x7ffd40deba50) = -1 ENOPROTOOPT (Protocol not available)
10:07:19 fstat(3, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
10:07:19 recvmsg(3, 0x7ffd40dea8a0, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
10:07:19 ppoll([{fd=3, events=POLLIN}], 1, {24, 999975000}, NULL, 8) = 1 ([{fd=3, revents=POLLIN}], left {24, 999930711})
10:07:19 recvmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"l\2\1\1\10\0\0\0\6\0\0\0\17\0\0\0\5\1u\0\3\0\0\0", 24}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=1, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 24
10:07:19 recvmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"\10\1g\0\1v\0\0\1b\0\0\0\0\0\0", 16}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=1, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 16
10:07:19 recvmsg(3, 0x7ffd40dea950, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
10:07:19 ppoll([{fd=3, events=POLLIN}], 1, NULL, NULL, 8) = 1 ([{fd=3, revents=POLLIN}])
10:10:01 recvmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"l\4\1\1K\0\0\0\7\0\0\0p\0\0\0\1\1o\0\31\0\0\0", 24}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=1, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 24
10:10:01 recvmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"/org/freedesktop/systemd1\0\0\0\0\0\0\0"..., 179}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=1, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 179
10:10:01 recvmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"l\4\1\1@\0\0\0\10\0\0\0q\0\0\0\1\1o\0\31\0\0\0", 24}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=1, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 24

Finally the service continues and then pauses again for another two minutes. The link above contains the complete, unedited logs.
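The pauses can be located automatically by comparing the timestamps of consecutive trace lines. The sketch below assumes each line starts with an HH:MM:SS timestamp, as in the excerpts above (the format produced by, for example, `strace -t`). The names `find_pauses` and `SAMPLE` and the 60-second threshold are assumptions made for illustration, and the sample lines are abbreviated copies of the excerpt, not new data.

```python
import re
from datetime import datetime

# Matches a leading HH:MM:SS timestamp followed by whitespace.
TIMESTAMP = re.compile(r"^(\d{2}:\d{2}:\d{2})\s")

# Abbreviated lines from the excerpt; "..." stands for omitted arguments.
SAMPLE = [
    "10:07:19 recvmsg(3, ...) = -1 EAGAIN (Resource temporarily unavailable)",
    "10:07:19 ppoll([{fd=3, events=POLLIN}], 1, NULL, NULL, 8) = 1 (...)",
    "10:10:01 recvmsg(3, ...) = 24",
]


def find_pauses(lines, threshold_seconds=60):
    """Return (start, end, seconds) for each gap of at least threshold_seconds
    between consecutive timestamped lines."""
    pauses = []
    previous = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # skip lines without a timestamp
        current = datetime.strptime(match.group(1), "%H:%M:%S")
        if previous is not None:
            gap = (current - previous).total_seconds()
            if gap < 0:
                gap += 24 * 3600  # the trace crossed midnight
            if gap >= threshold_seconds:
                pauses.append(
                    (previous.strftime("%H:%M:%S"), match.group(1), int(gap))
                )
        previous = current
    return pauses


if __name__ == "__main__":
    for start, end, seconds in find_pauses(SAMPLE):
        print(f"{start} -> {end}: {seconds} s")
```

On the sample this reports a single gap of 162 seconds between 10:07:19 and 10:10:01, which corresponds to the `ppoll` call with a NULL timeout that blocks until the next message arrives on the socket.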

Server details:
CentOS 7.2
Fully up to date
ganglia.x86_64                   3.7.2-2.el7                         @epel/7    
ganglia-gmond.x86_64             3.7.2-2.el7                         @epel/7    
ganglia-gmond-python.x86_64      3.7.2-2.el7                         @epel/7    
systemd.x86_64                   219-19.el7_2.12                     @updates/7 
systemd-libs.x86_64              219-19.el7_2.12                     @updates/7 
systemd-sysv.x86_64              219-19.el7_2.12                     @updates/7 
dbus.x86_64                      1:1.6.12-14.el7_2                   @updates/7 
dbus-glib.x86_64                 0.100-7.el7                         @anaconda/7
dbus-libs.x86_64                 1:1.6.12-14.el7_2                   @updates/7 
dbus-python.x86_64               1.1.1-9.el7                         @anaconda/7
ls -al /run/systemd/private
srwxrwxrwx 1 root root 0 Sep  7 13:49 /run/systemd/private

Thank you for your time.

vvuksan commented 8 years ago

I am wondering whether

After=multi-user.target

should be changed to

After=network-online.target

Arlion commented 7 years ago

Pull request #282 has been created to address this.

Arlion commented 7 years ago

Pull Request #282 has been merged. Closing issue.