gearman / gearmand

http://gearman.org/

gearmand failed to gracefully close connections under certain conditions #150

Open hwang2014 opened 6 years ago

hwang2014 commented 6 years ago

In my use case, several thousand clients talk to two gearmand servers (v1.1.17). We periodically run "gearadmin --status" on the servers to check job status. Sometimes (reproducible almost every day), "gearadmin --status" gets stuck. When we check connection stats, we see a large number of CLOSE-WAIT connections:

ss -tan | awk '{print $1}' | sort | uniq -c
 207364 CLOSE-WAIT
  17738 ESTAB
     64 LAST-ACK
     14 LISTEN
      1 State
      4 TIME-WAIT
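
If it helps with debugging, a rough sketch of grouping the CLOSE-WAIT sockets by remote peer to see which clients are leaking connections (column positions can vary between ss versions):

# $4 is the peer address column when ss is given a state filter
ss -tan state close-wait \
    | awk 'NR > 1 { sub(/:[^:]*$/, "", $4); print $4 }' \
    | sort | uniq -c | sort -rn | head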

The temporary workaround is to restart gearmand to recover.

evanrich commented 6 years ago

I'm working with @hwang2014 on this. For now, we auto-remediate the issue with the following script:

#!/bin/bash
# Restart gearmand if "gearadmin --status" hangs for more than 30 seconds.
# timeout(1) exits with status 124 when the command it runs times out.
timeout 30 gearadmin --status
if [ $? -eq 124 ]
then
    TIMESTAMP=$(date)
    service gearmand restart
    echo "$TIMESTAMP: restarting gearmand due to 124 status" >> /tmp/gearmand_status.log
fi
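
A hypothetical crontab entry to run this check every five minutes (the script path is illustrative):

*/5 * * * * /usr/local/bin/check_gearmand_status.sh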

However, this is not desirable. The issue only seems to be present when gearmand is dealing with clients that are geographically distant, with a latency of ~40-50 ms between them. We do not see it when gearmand and its clients are located in the same datacenter. While 40-50 ms shouldn't be a concern, it's a data point worth sharing.

We're running gearmand on CentOS 6.6.

SpamapS commented 6 years ago

Hi! This is definitely a huge problem if you're seeing it. I want to bring up a few things:

1) CentOS 6 is really old, released 6 years ago. I understand, it's hard to roll out new platforms, but CentOS 7 has been out for 3 years. So I'm hesitant to spend a ton of time chasing any bugs that are related to CentOS 6 versions of software.

2) What version of libevent are you using?

3) Can you send logs from your gearmand servers? The higher the logging level, the better.

4) Do you have workers or clients that log their gearman interactions? Do you see a lot of failed connections from them?

Thanks for your patience and for using Gearman!

hwang2014 commented 6 years ago

Hi, the libevent RPM is libevent-1.4.13-4.el6.x86_64

Fairly frequently, we get the following errors in the gearmand log:

ERROR 2017-12-16 00:31:18.000000 [ 4 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:31:46.000000 [ 4 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:32:01.000000 [ 4 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:32:05.000000 [ 1 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:32:15.000000 [ 1 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:32:21.000000 [ 2 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:32:44.000000 [ 2 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:33:02.000000 [ 3 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:33:27.000000 [ 2 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:33:32.000000 [ 2 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:33:44.000000 [ 4 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:34:00.000000 [ 2 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:34:01.000000 [ 4 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:34:08.000000 [ 3 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:34:34.000000 [ 1 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:35:02.000000 [ 2 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218

esabol commented 6 years ago

libgearman-server/io.cc:218 is in _connection_read(). It might indicate that the client or worker is closing the connection when gearmand doesn't expect it to. Maybe the problem is in the implementation of the clients or workers? What language did you develop them in?

I'm not sure if it will help or not, but you can start gearmand with --verbose DEBUG or maybe --verbose TRACE (I forget which is higher) to get more detailed logs.
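
For example (a sketch; the log file path is illustrative):

gearmand --port 4730 --verbose DEBUG --log-file /var/log/gearmand-debug.log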

Strainy commented 6 years ago

I'm seeing an issue strikingly similar to this. We've attempted increasing the number of threads and have reduced the logging level. Has anyone come across any potential fixes?

esabol commented 6 years ago

> I'm seeing an issue strikingly similar to this.

What platform/distribution/kernel? What version of libevent?

p-alik commented 6 years ago

I guess we eliminated an effect of the issue in #102 but not the cause.

SpamapS commented 6 years ago

If you can avoid using gearadmin status, you should. Increasing threads will not help, because producing the status report locks all of the internal structures. I've long thought we should add statsd gauge support to gearmand so that it can fire off status updates to an external system and not encourage people to use the admin protocol.

That said, this seems to be related to connection handling. I don't have the resources to debug it, but if I had to guess, I'd say that clients and workers are disappearing or closing their connections violently in a way that is confusing libevent.

For anyone experiencing this effect, please send your exact libevent and OS/kernel versions and your client libraries if possible, to help us track down the root cause. Thanks.
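
For anyone gathering that information, a sketch of the commands (the rpm/dpkg variants depend on your distribution):

gearmand --version
uname -r
ldd "$(command -v gearmand)" | grep -i libevent  # which libevent gearmand actually links against
rpm -q libevent          # RHEL/CentOS
dpkg -l | grep libevent  # Debian/Ubuntu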

p-alik commented 6 years ago

> I'd say that clients and workers are disappearing or closing their connections violently in a way that is confusing libevent.

Based on my experience, I can confirm this assumption.

I don't think it depends on a particular environment, because the issue has existed for a long time. [1]

Unfortunately, I couldn't reproduce this gearmand behavior with my client/worker implementations.

[1] https://groups.google.com/forum/#!topic/gearman/nyvLh0ZhmvA

Strainy commented 6 years ago

We're running gearmand in a Docker container in host network mode. We're using a phusion base image, which effectively means an Ubuntu 16.04 LTS base. We've installed libevent-2.0-5 as part of the image build.

Here's an example of the Dockerfile:

FROM phusion/baseimage:0.9.20

ARG GEARMAN_REPO=https://github.com/gearman/gearmand
ARG GEARMAN_VERSION=1.1.17

# Install GearmanD Dependencies
RUN apt-get update -qq && \
    apt-get install -y curl gcc make g++ libboost-all-dev gperf libevent-dev \
        uuid-dev libmysqld-dev libpq-dev libsqlite3-dev libhiredis-dev \
        unzip && \
    curl -L --silent -o /tmp/cyassl.zip http://www.yassl.com/cyassl-3.2.0.zip && \
    cd /tmp && \
    unzip -qq cyassl.zip && \
    cd cyassl* && \
    ./configure \
        --prefix=/usr && \
    make && \
    make install && \
    mkdir /tmp/gearman && \
    curl -L --insecure --silent ${GEARMAN_REPO}/releases/download/${GEARMAN_VERSION}/gearmand-${GEARMAN_VERSION}.tar.gz \
        | tar -xz --strip 1 -C /tmp/gearman && \
    cd /tmp/gearman && \
    ./configure \
        --prefix=/usr \
        --sysconfdir=/etc \
        --localstatedir=/var \
        --enable-hiredis && \
    make && \
    make install && \
    adduser --system --no-create-home --disabled-login --disabled-password --group gearman && \
    apt-get purge -y gcc make gperf libevent-dev uuid-dev \
        libmysqld-dev libpq-dev libsqlite3-dev libhiredis-dev unzip && \
    apt-get autoremove -y && \
    apt-get install -y telnet libmysqlclient20 libhiredis0.13 libevent-2.0-5 libpq5 && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

EXPOSE 4730

ENTRYPOINT ["gearmand"]

You can run it with the following docker-compose setup:

version: '2.1'

services:

  gearmand:
    build:
      context: .
      dockerfile: Dockerfile.gearmand
    command:
      - --listen=0.0.0.0
      - --round-robin
      - --verbose=DEBUG
      - --log-file=/dev/stdout
      - --port=4730
      - --queue-type=MySQL
      - --mysql-host=mysql
      - --mysql-port=3306
      - --mysql-user=root
      - --mysql-password=password
      - --mysql-db=gearman
      - --mysql-table=gearman_1
    links:
      - mysql
    ports:
      - "4730:4730"
    depends_on:
      mysql:
        condition: service_healthy

  mysql:
    image: mysql:5.6
    healthcheck:
      test: ["CMD", "mysqladmin" ,"ping", "-h", "localhost"]
      interval: 20s
      timeout: 20s
      retries: 3
    environment:
      - MYSQL_ROOT_PASSWORD=password
      - MYSQL_DATABASE=gearman
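
Assuming both files are in the current directory, a hypothetical way to bring the stack up and tail the gearmand logs:

docker-compose up --build -d
docker-compose logs -f gearmand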

sni commented 4 years ago

Using --threads=0 solves the issue for me. I had tons of CLOSE_WAIT connections after a couple of minutes and couldn't gracefully stop gearmand, but with threads disabled it has been running quite smoothly so far.

maxwbot commented 3 years ago

I had this problem. I decided to set a time of 30s for TCP connections, which solved it for me:

echo 30 > /proc/sys/net/ipv4/tcp_tw_recycle
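
For anyone trying the same thing, a sketch of persisting the setting across reboots (note that tcp_tw_recycle only exists on older kernels; it was removed in Linux 4.12):

echo 'net.ipv4.tcp_tw_recycle = 30' >> /etc/sysctl.conf
sysctl -p  # reload settings from /etc/sysctl.conf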

esabol commented 3 years ago

That sounds like a good solution, @maxwbot. Just to be clear, what platform/distribution were you seeing this problem on?

maxwbot commented 3 years ago

> That sounds like a good solution, @maxwbot. Just to be clear, what platform/distribution were you seeing this problem on?

Linux: CentOS 7 / Nagios

huamxu commented 6 months ago

We are using gearmand 1.1.21.0 in a Docker container based on the image docker.io/library/python:3.11.4.

libev-libevent-dev/oldstable 1:4.33-1 all
libevent-2.1-7/oldstable,now 2.1.12-stable-1 amd64 [installed,automatic]
libevent-core-2.1-7/oldstable,now 2.1.12-stable-1 amd64 [installed,automatic]

Gearmand is started with "gearmand -t 10 --job-retries 1 --verbose DEBUG -p 4730".

ERROR 2024-02-26 00:01:26.998647 [ 11 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:221
ERROR 2024-02-26 00:14:33.430740 [ 13 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:221
ERROR 2024-02-26 00:14:33.430775 [ 14 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:221

Has anybody had similar issues, and is there any good solution?

esabol commented 6 months ago

@huamxu wrote:

> Has anybody had similar issues, and is there any good solution?

Could you try the suggestion in the following comment?

https://github.com/gearman/gearmand/issues/150#issuecomment-858792403

You might have to do that both on the Docker host and inside the Docker container.

If that doesn't solve the problem, try changing "-t 10" to "-t 0" in your gearmand arguments. It's not an ideal solution, but it worked for the person in https://github.com/gearman/gearmand/issues/150#issuecomment-539870970.

You could also try removing "--job-retries 1" from your gearmand arguments.
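
Concretely, the combined suggestion would look something like this (a sketch; the other flags are copied from your original invocation):

gearmand -t 0 --verbose DEBUG -p 4730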

huamxu commented 6 months ago

Thanks @esabol, I will give it a try.

huamxu commented 6 months ago

@esabol

ERROR 2024-02-26 00:14:33.430775 [ 14 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:221

Do you or anybody else know what this "Connection timed out" error means? How can the timeout be configured? Thanks very much.

esabol commented 6 months ago

@huamxu : That's not the problem. Did you try both of the suggestions I (and others before me) made above?

huamxu commented 6 months ago

Yes, we have tried, but the problem still exists. The user scenario is: (1) the Jenkins Gearman plugin acts as the Gearman workers; (2) multiple Jenkins masters with the same configuration connect to the Gearman server.

Gearmand didn't receive anything after the job finished; according to the Gearman plugin log, it sent the WORK_COMPLETE packet.
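
One way to check whether the WORK_COMPLETE bytes actually reach the server (a sketch, assuming tcpdump is available on the gearmand host):

# capture traffic on the gearman port around the time a job finishes
tcpdump -i any -n -w /tmp/gearman.pcap port 4730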

SpamapS commented 5 months ago

How long did that job take to execute? Do you have any firewalls or such between gearmand and the Jenkins plugin?

The gearman plugin for Jenkins is just a worker. Gearman wasn't really designed with multi-minute jobs in mind, but the Jenkins plugin makes it work OK. Still, once the worker gets the JOB_ASSIGN packet, no communication goes back and forth until the WORK_COMPLETE. So if your jobs are really long and there's a NAT gateway or stateful firewall at play, those have timeouts, and they can be exceeded. That's the first thing I'd check. You can keep the connection alive with TCP keepalives, but the default time before one is sent is really long, 2 hours:

$ cat /proc/sys/net/ipv4/tcp_keepalive_time
7200

I think you have a few options:

  1. The gearman plugin is really old and has security vulnerabilities. Figure out a different way to run a pool of Jenkins workers. It might even be built in now? I haven't looked in a while.
  2. Extend the timeouts that are closing the connections
  3. You can reduce tcp_keepalive_time to something shorter, like 600 seconds, and that will make the kernel send keepalives that will keep a NAT/firewall happy (a sketch follows this list).
  4. Update the gearman jenkins plugin to send WORK_STATUS or WORK_DATA packets periodically.
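
A sketch of option 3, assuming a Linux host (the sysctl.d path is the conventional place for persistent settings):

# send the first keepalive probe after 10 minutes of idle instead of 2 hours
sysctl -w net.ipv4.tcp_keepalive_time=600
# persist the setting across reboots
echo 'net.ipv4.tcp_keepalive_time = 600' > /etc/sysctl.d/99-tcp-keepalive.conf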

Anyway, the original problem, where threading seems to be causing CLOSE_WAIT buildup, is not what you seem to have.

huamxu commented 5 months ago

@SpamapS Thanks very much. That's good info for local investigation. Jenkins jobs sometimes run for about 2 hours. Do you know how to extend the timeouts in the 2nd option?

>   2. Extend the timeouts that are closing the connections

I will try the 3rd option to reduce tcp_keepalive_time and also check the Gearman Jenkins plugin.

pythonerdog commented 4 months ago

@SpamapS @esabol Thanks for your good support. After enabling keepalive, the issue hasn't happened again for a long time ^_^

esabol commented 4 months ago

@pythonerdog wrote:

> After enabling keepalive, the issue hasn't happened again for a long time ^_^

That's good to know. Thanks!