hwang2014 opened this issue 6 years ago.
I'm working with @hwang2014 on this; as a temporary measure, we auto-remediate the issue with the following script:
#!/bin/bash
# Probe gearmand's admin port. timeout(1) exits with status 124 if
# `gearadmin --status` hangs for more than 30 seconds.
timeout 30 gearadmin --status
if [ $? -eq 124 ]; then
    TIMESTAMP=$(date)
    service gearmand restart
    echo "$TIMESTAMP: restarting gearmand due to 124 status" >> /tmp/gearmand_status.log
fi
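A cron entry to run it every minute could look like this (the install path is just an example, not part of the original report):

# Hypothetical /etc/cron.d/gearmand-watchdog entry.
* * * * * root /usr/local/bin/gearmand-watchdog.sh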
However, this is not desirable. The issue seems to be present only when gearmand is dealing with clients that are geographically far apart, with a latency of ~40-50ms between them. We do not see it when gearmand and the clients are located in the same datacenter. While 40-50ms shouldn't be a concern, it's a datapoint worth sharing.
We're running gearmand on CentOS 6.6.
Hi! This is definitely a huge problem if you're seeing it. I want to bring up a few things:
1) CentOS 6 is really old, released 6 years ago. I understand, it's hard to roll out new platforms, but CentOS 7 has been out for 3 years. So I'm hesitant to spend a ton of time chasing any bugs that are related to CentOS 6 versions of software.
2) What version of libevent are you using?
3) Can you send logs from your gearmand instances? The higher the logging level, the better.
4) Do you have workers or clients that log their gearman interactions? Do you see a lot of failed connections from them?
Thanks for your patience and for using Gearman!
Hi, the libevent RPM is libevent-1.4.13-4.el6.x86_64
Fairly frequently, we get the following errors in the gearmand log:
ERROR 2017-12-16 00:31:18.000000 [ 4 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:31:46.000000 [ 4 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:32:01.000000 [ 4 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:32:05.000000 [ 1 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:32:15.000000 [ 1 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:32:21.000000 [ 2 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:32:44.000000 [ 2 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:33:02.000000 [ 3 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:33:27.000000 [ 2 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:33:32.000000 [ 2 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:33:44.000000 [ 4 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:34:00.000000 [ 2 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:34:01.000000 [ 4 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:34:08.000000 [ 3 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:34:34.000000 [ 1 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:35:02.000000 [ 2 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
libgearman-server/io.cc:218 is in _connection_read(). It might indicate that the client or worker is closing its connection to gearmand when gearmand doesn't expect it to. Maybe the problem lies in the implementation of the clients or workers? What language did you develop them in?
I'm not sure if it will help, but you can start gearmand with --verbose DEBUG or maybe --verbose TRACE (I forget which is higher) to get more detailed logs.
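For example (the log file path here is just an illustration):

gearmand --verbose DEBUG --log-file=/var/log/gearmand.log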
I'm seeing an issue strikingly similar to this. We've attempted increasing the number of threads and have reduced the logging level. Has anyone come across any potential fixes?
I'm seeing an issue strikingly similar to this.
What platform/distribution/kernel? What version of libevent?
I guess we eliminated an effect of the issue in #102 but not the cause.
If you can avoid using gearadmin --status, you should, and increasing threads will not help: gearmand locks all of its internal structures to produce the status report. I've long thought we should add statsd gauge support to gearmand so that it can fire off status updates to an external system instead of encouraging people to use the admin protocol.
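In the meantime, a rough sketch of that idea is an external poller that queries the status at a low rate and forwards the counts as statsd gauges. The statsd address and metric names below are assumptions, and this still touches the admin protocol, just less often:

#!/bin/bash
# Sketch only: forward gearmand queue depths to a statsd daemon assumed
# to listen on localhost:8125 (UDP). `gearadmin --status` prints
# "function total running available_workers" lines, terminated by a
# line containing a single ".".
timeout 10 gearadmin --status | while read -r func queued running workers; do
    [ "$func" = "." ] && break    # end-of-listing marker
    echo "gearman.${func}.queued:${queued}|g" > /dev/udp/localhost/8125
done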
That aside, this seems to be related to connection handling. I don't have the resources to debug it, but if I had to guess, I'd say that clients and workers are disappearing or closing their connections violently in a way that is confusing libevent.
For anyone experiencing this effect, please send your exact libevent and OS/kernel versions, and if possible which client libraries you use, to help us track down the root cause. Thanks.
I'd say that clients and workers are disappearing or closing their connections violently in a way that is confusing libevent.
Based on my experience, I can confirm this assumption. I don't think it depends on a particular environment, because the issue has existed for a long time. [1] Unfortunately, I couldn't reproduce this behavior of gearmand with my own client/worker implementations.
[1] https://groups.google.com/forum/#!topic/gearman/nyvLh0ZhmvA
We're running GearmanD in a Docker container in host network mode. We're using a phusion base image, which effectively means we're running on an Ubuntu 16.04 LTS base. We install libevent-2.0-5 as part of the image build.
Here's an example of the Dockerfile:
FROM phusion/baseimage:0.9.20

ARG GEARMAN_REPO=https://github.com/gearman/gearmand
ARG GEARMAN_VERSION=1.1.17

# Install GearmanD Dependencies
RUN apt-get update -qq && \
    apt-get install -y curl gcc make g++ libboost-all-dev gperf libevent-dev \
        uuid-dev libmysqld-dev libpq-dev libsqlite3-dev libhiredis-dev \
        unzip && \
    curl -L --silent -o /tmp/cyassl.zip http://www.yassl.com/cyassl-3.2.0.zip && \
    cd /tmp && \
    unzip -qq cyassl.zip && \
    cd cyassl* && \
    ./configure \
        --prefix=/usr && \
    make && \
    make install && \
    mkdir /tmp/gearman && \
    curl -L --insecure --silent ${GEARMAN_REPO}/releases/download/${GEARMAN_VERSION}/gearmand-${GEARMAN_VERSION}.tar.gz \
        | tar -xz --strip 1 -C /tmp/gearman && \
    cd /tmp/gearman && \
    ./configure \
        --prefix=/usr \
        --sysconfdir=/etc \
        --localstatedir=/var \
        --enable-hiredis && \
    make && \
    make install && \
    adduser --system --no-create-home --disabled-login --disabled-password --group gearman && \
    apt-get purge -y gcc make gperf libevent-dev uuid-dev \
        libmysqld-dev libpq-dev libsqlite3-dev libhiredis-dev unzip && \
    apt-get autoremove -y && \
    apt-get install -y telnet libmysqlclient20 libhiredis0.13 libevent-2.0-5 libpq5 && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

EXPOSE 4730

ENTRYPOINT ["gearmand"]
You can run it with the following docker-compose setup:
version: '2.1'
services:
  gearmand:
    build:
      context: .
      dockerfile: Dockerfile.gearmand
    command:
      - --listen=0.0.0.0
      - --round-robin
      - --verbose=DEBUG
      - --log-file=/dev/stdout
      - --port=4730
      - --queue-type=MySQL
      - --mysql-host=mysql
      - --mysql-port=3306
      - --mysql-user=root
      - --mysql-password=password
      - --mysql-db=gearman
      - --mysql-table=gearman_1
    links:
      - mysql
    ports:
      - "4730:4730"
    depends_on:
      mysql:
        condition: service_healthy
  mysql:
    image: mysql:5.6
    healthcheck:
      test: ["CMD", "mysqladmin", "ping", "-h", "localhost"]
      interval: 20s
      timeout: 20s
      retries: 3
    environment:
      - MYSQL_ROOT_PASSWORD=password
      - MYSQL_DATABASE=gearman
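With both files in the same directory, something like this should bring the stack up (a sketch, using the service names defined above):

docker-compose up --build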
Using --threads=0 solves the issue for me. I had tons of CLOSE_WAIT connections after a couple of minutes and couldn't gracefully stop gearmand, but with threads disabled it has been running quite smoothly so far.
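Concretely, that means starting the daemon with threading disabled; reusing the flags from the compose file above, a sketch:

gearmand --threads=0 --listen=0.0.0.0 --port=4730 --verbose=DEBUG --log-file=/dev/stdout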
I had this problem. I decided to set a time of 30s for TCP connections, and that solved it for me:

echo 30 > /proc/sys/net/ipv4/tcp_tw_recycle
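To make that survive a reboot, a sysctl.d sketch (note that tcp_tw_recycle is documented as a boolean, so any nonzero value enables it, and the knob was removed entirely in Linux 4.12):

# Persist the setting (the file name is arbitrary).
echo "net.ipv4.tcp_tw_recycle = 1" > /etc/sysctl.d/90-gearman.conf
sysctl --system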
That sounds like a good solution, @maxwbot. Just to be clear, what platform/distribution were you seeing this problem on?
Linux: CentOS 7 / Nagios
We are using gearmand 1.1.21.0 in a Docker container based on the image docker.io/library/python:3.11.4.
libev-libevent-dev/oldstable 1:4.33-1 all
libevent-2.1-7/oldstable,now 2.1.12-stable-1 amd64 [installed,automatic]
libevent-core-2.1-7/oldstable,now 2.1.12-stable-1 amd64 [installed,automatic]
gearmand is started with "gearmand -t 10 --job-retries 1 --verbose DEBUG -p 4730".
ERROR 2024-02-26 00:01:26.998647 [ 11 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:221
ERROR 2024-02-26 00:14:33.430740 [ 13 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:221
ERROR 2024-02-26 00:14:33.430775 [ 14 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:221
Does anybody have similar issues and a good solution?
@huamxu wrote:
Does anybody have similar issues and a good solution?
Could you try the suggestion in the following comment?
https://github.com/gearman/gearmand/issues/150#issuecomment-858792403
You might have to do that both on the Docker host and inside the Docker container.
If that doesn't solve the problem, try changing "-t 10" to "-t 0" in your gearmand arguments. It's not an ideal solution, but it worked for the person in https://github.com/gearman/gearmand/issues/150#issuecomment-539870970.
You could also try removing "--job-retries 1" from your gearmand arguments.
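Putting those two changes together, the invocation quoted above would become (a sketch derived from your arguments):

gearmand -t 0 --verbose DEBUG -p 4730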
Thanks, @esabol, I will give it a try.
@esabol ERROR 2024-02-26 00:14:33.430775 [ 14 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:221
Do you or anybody else know what this "connection timeout" is? How do I configure the timeout? Thanks very much.
@huamxu : That's not the problem. Did you try both of the suggestions I (and others before me) made above?
Yes, we have tried both, but the problem still exists. The user scenario is: 1) the Jenkins Gearman plugin acts as the Gearman workers; 2) multiple Jenkins masters with the same configuration connect to the Gearman server.
gearmand didn't receive anything after the job finished; according to the Gearman plugin log, it sent the WORK_COMPLETE packet.
How long did that job take to execute? Do you have any firewalls or such between gearmand and the Jenkins plugin?
The gearman plugin for Jenkins is just a worker. Gearman wasn't really designed with multi-minute jobs in mind, but the Jenkins plugin makes it work OK. Still, once the worker gets the JOB_ASSIGN packet, no communication goes back and forth until the WORK_COMPLETE. So if your jobs are really long and there's a NAT gateway or a stateful firewall at play, those have timeouts, and they can be exceeded. That's the first thing I'd check. You can keep the connection alive with TCP keepalives, but the default time before one is sent is really long, 2 hours:
$ cat /proc/sys/net/ipv4/tcp_keepalive_time
7200
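Those kernel defaults can be lowered; a sketch (this only affects sockets that enable SO_KEEPALIVE, and inside containers it applies per network namespace):

# Send the first keepalive probe after 10 idle minutes instead of 2 hours.
sysctl -w net.ipv4.tcp_keepalive_time=600
# Then probe every 60 seconds, giving up after 5 unanswered probes.
sysctl -w net.ipv4.tcp_keepalive_intvl=60
sysctl -w net.ipv4.tcp_keepalive_probes=5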
I think you have a few options:
Anyway, the original problem, where threading seems to be causing CLOSE_WAIT buildup, is not what you seem to have.
@SpamapS Thanks very much. That's good info for local investigation. Jenkins jobs sometimes run for about 2 hours. Do you know how to extend the timeouts in the 2nd option?
I will try the 3rd option, reducing tcp_keepalive_time, and I will also check the Gearman Jenkins plugin.
@SpamapS @esabol Thanks for your good support. After enabling keepalive, the issue hasn't happened again for a long time ^_^
@pythonerdog wrote:
After enabling keepalive, the issue hasn't happened again for a long time ^_^
That's good to know. Thanks!
In my use case, several thousand clients talk to two gearmand servers (v1.1.17). We periodically run "gearadmin --status" on the servers to check job status. We observe, almost every day, that "gearadmin --status" gets stuck. When we check connection stats, there are many CLOSE-WAIT connections:
ss -tan | awk '{print $1}' | sort | uniq -c
 207364 CLOSE-WAIT
  17738 ESTAB
     64 LAST-ACK
     14 LISTEN
      1 State
      4 TIME-WAIT
The temporary workaround is to restart gearmand to recover.
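A hedged diagnostic for this state (CLOSE-WAIT means the peer closed the connection but the local process never called close(), so the sockets sit with gearmand until it exits):

# Count CLOSE-WAIT sockets held by gearmand (-p needs root).
ss -tanp state close-wait | grep -c gearmand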