abdulqudoos / java-gearman-service

Automatically exported from code.google.com/p/java-gearman-service
1 stars 0 forks source link

Connection lost, fail to reconnect #36

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Use java-gearman-service 0.6.5
2. Create one Gearman instance with workers for different functions (maybe set 
maximum concurrency)
3. Use a remote (not local) Gearman Server

What is the expected output? What do you see instead?

Expected would be, that it will stay connected to the job server, but loses 
connection after some time.

What version of the product are you using? On what operating system?

java-gearman-service 0.6.5
gearmand 0.35
Debian Squeeze

Please provide any additional information below.

Example source:

package gearmantest;

import org.gearman.Gearman;
import org.gearman.GearmanFunction;
import org.gearman.GearmanFunctionCallback;
import org.gearman.GearmanWorker;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Bug {
    public static void main(String[] args) throws Exception {
        Gearman gearman = Gearman.createGearman();

        ExecutorService executor = Executors.newFixedThreadPool(3);
        WorkerThread thread1 = new WorkerThread("function1", gearman, new WorkerObject());
        executor.execute(thread1);
        WorkerThread thread2 = new WorkerThread("function2", gearman, new WorkerObject());
        executor.execute(thread2);
        WorkerThread thread3 = new WorkerThread("function3", gearman, new WorkerObject());
        executor.execute(thread3);

    }

}

final class WorkerObject implements GearmanFunction {
    @Override
    public byte[] work(String s, byte[] bytes, GearmanFunctionCallback gearmanFunctionCallback) throws Exception {
        // do some stuff.   for example sleep 10 seconds.
        System.out.println("New Job: sleep for 10 sec.");
        Thread.sleep(10000);
        return new byte[0];
    }
}

final class WorkerThread implements Runnable {
    private GearmanWorker worker;
    public WorkerThread(String functionName, Gearman gearman, WorkerObject workerObject) {
        worker = gearman.createGearmanWorker();
        worker.addServer(gearman.createGearmanServer("ip",1234));
        worker.addFunction(functionName, workerObject);
        worker.setMaximumConcurrency(5);
        worker.setReconnectPeriod(5, TimeUnit.MINUTES);
    }

    @Override
    public void run() {
        while(true);
    }
}

this should not take any new actions after some time (couple of hours)

Original issue reported on code.google.com by stiwanch...@googlemail.com on 13 Aug 2012 at 7:32

GoogleCodeExporter commented 8 years ago
additional info:
log file doesn't say anything about disconnection. maybe the worker just 
doesn't accept new jobs anymore ?!

Original comment by stiwanch...@googlemail.com on 13 Aug 2012 at 7:41

GoogleCodeExporter commented 8 years ago
Hi stiwanchinazki,

I'm looking into this. I'll keep you updated.

Thanks,
isaiah.v

Original comment by isaiah.v...@gmail.com on 13 Aug 2012 at 7:41

GoogleCodeExporter commented 8 years ago
So far, I cannot reproduce this. Could you attach your client too?

Original comment by isaiah.v...@gmail.com on 14 Aug 2012 at 2:25

GoogleCodeExporter commented 8 years ago
Hi Isaiah,

thanks for your effort.

I created two pastebins.

Client: http://pastebin.com/kF3HvQUP
Worker: http://pastebin.com/JHfTjK11

Both processes are still alive when gearman doesn't do anything anymore (i can 
notice this, because a lifesign is written to a mysql database)

-Stiwan

Original comment by stiwanch...@googlemail.com on 15 Aug 2012 at 8:48

GoogleCodeExporter commented 8 years ago
Try to set the maximum concurrency (worker) to a very low value, like 2 or 3.

i have the feeling, that after that number of worked jobs, the worker doesn't 
accept new jobs.

Original comment by stiwanch...@googlemail.com on 15 Aug 2012 at 10:42

GoogleCodeExporter commented 8 years ago
in my production environment, i start about 10 workers (10 functions with 1-10 
max concurrency). in the beginning it says:

130 [gearman-7] INFO gearman - [remoteip:4730] : Connected
130 [gearman-9] INFO gearman - [remoteip:4730] : Connected
130 [gearman-8] INFO gearman - [remoteip:4730] : Connected
130 [gearman-5] INFO gearman - [remoteip:4730] : Connected
130 [gearman-6] INFO gearman - [remoteip:4730] : Connected
130 [gearman-3] INFO gearman - [remoteip:4730] : Connected
130 [gearman-1] INFO gearman - [remoteip:4730] : Connected
130 [gearman-4] INFO gearman - [remoteip:4730] : Connected
132 [gearman-10] INFO gearman - [remoteip:4730] : Connected
132 [gearman-11] INFO gearman - [remoteip:4730] : Connected
132 [gearman-9] INFO gearman - [remoteip:4730] : OUT : CAN_DO
132 [gearman-8] INFO gearman - [remoteip:4730] : OUT : CAN_DO
132 [gearman-5] INFO gearman - [remoteip:4730] : OUT : CAN_DO
132 [gearman-6] INFO gearman - [remoteip:4730] : OUT : CAN_DO
132 [gearman-1] INFO gearman - [remoteip:4730] : OUT : CAN_DO
132 [gearman-4] INFO gearman - [remoteip:4730] : OUT : CAN_DO
132 [gearman-3] INFO gearman - [remoteip:4730] : OUT : CAN_DO
132 [gearman-10] INFO gearman - [remoteip:4730] : OUT : CAN_DO
132 [gearman-7] INFO gearman - [remoteip:4730] : OUT : CAN_DO
134 [gearman-11] INFO gearman - [remoteip:4730] : OUT : CAN_DO
134 [gearman-12] INFO gearman - [remoteip:4730] : Connected
134 [gearman-12] INFO gearman - [remoteip:4730] : OUT : CAN_DO
135 [gearman-4] INFO gearman - [remoteip:4730] : OUT : GRAB_JOB
135 [gearman-13] INFO gearman - [remoteip:4730] : OUT : GRAB_JOB
135 [gearman-15] INFO gearman - [remoteip:4730] : OUT : GRAB_JOB
136 [gearman-3] INFO gearman - [remoteip:4730] : OUT : GRAB_JOB
136 [gearman-4] INFO gearman - [remoteip:4730] : OUT : GRAB_JOB
136 [gearman-14] INFO gearman - [remoteip:4730] : OUT : GRAB_JOB
138 [gearman-4] INFO gearman - [remoteip:4730] : Connected
138 [gearman-9] INFO gearman - [remoteip:4730] : OUT : GRAB_JOB
138 [gearman-4] INFO gearman - [remoteip:4730] : OUT : CAN_DO
138 [gearman-8] INFO gearman - [remoteip:4730] : OUT : GRAB_JOB
138 [gearman-6] INFO gearman - [remoteip:4730] : OUT : GRAB_JOB
138 [gearman-14] INFO gearman - [remoteip:4730] : Connected
138 [gearman-13] INFO gearman - [remoteip:4730] : OUT : GRAB_JOB
139 [gearman-14] INFO gearman - [remoteip:4730] : OUT : CAN_DO
139 [gearman-8] INFO gearman - [remoteip:4730] : OUT : GRAB_JOB
141 [gearman-8] INFO gearman - [remoteip:4730] : Connected
141 [gearman-8] INFO gearman - [remoteip:4730] : OUT : CAN_DO
142 [gearman-14] INFO gearman - [remoteip:4730] : OUT : GRAB_JOB
156 [gearman-14] INFO gearman - [remoteip:4730] : IN : JOB_ASSIGN
158 [gearman-14] INFO gearman - [remoteip:4730] : IN : JOB_ASSIGN
160 [gearman-14] INFO gearman - [remoteip:4730] : IN : NO_JOB
160 [gearman-14] INFO gearman - [remoteip:4730] : OUT : PRE_SLEEP
161 [gearman-14] INFO gearman - [remoteip:4730] : IN : JOB_ASSIGN
162 [gearman-14] INFO gearman - [remoteip:4730] : IN : NO_JOB
162 [gearman-14] INFO gearman - [remoteip:4730] : OUT : PRE_SLEEP
168 [gearman-5] INFO gearman - [remoteip:4730] : IN : JOB_ASSIGN
173 [gearman-5] INFO gearman - [remoteip:4730] : IN : JOB_ASSIGN
175 [gearman-5] INFO gearman - [remoteip:4730] : IN : NO_JOB
175 [gearman-5] INFO gearman - [remoteip:4730] : OUT : PRE_SLEEP
178 [gearman-5] INFO gearman - [remoteip:4730] : IN : NO_JOB
178 [gearman-5] INFO gearman - [remoteip:4730] : OUT : PRE_SLEEP
179 [gearman-5] INFO gearman - [remoteip:4730] : IN : JOB_ASSIGN
179 [gearman-7] INFO gearman - [remoteip:4730] : OUT : GRAB_JOB

and in the end it's down to two:

19340115 [gearman-619] INFO gearman - [remoteip:4730] : OUT : GRAB_JOB
19340115 [gearman-614] INFO gearman - [remoteip:4730] : OUT : GRAB_JOB
19340128 [gearman-614] INFO gearman - [remoteip:4730] : IN : NO_JOB
19340128 [gearman-614] INFO gearman - [remoteip:4730] : OUT : PRE_SLEEP
19340129 [gearman-614] INFO gearman - [remoteip:4730] : IN : NO_JOB
19340129 [gearman-614] INFO gearman - [remoteip:4730] : OUT : PRE_SLEEP

VisualVM shows that there are 5 threads alive (in the beginning it were 10) and 
the number of those threads is varying between 2 and 5.

Original comment by stiwanch...@googlemail.com on 15 Aug 2012 at 2:41

GoogleCodeExporter commented 8 years ago
Based on the log... it looks like the worker is still asking for work, but the 
server is not giving it. Periodically, are seeing the workers sending 
"GRAB_JOB" packets? and getting a "NO_JOB" response? If so... I'm thinking the 
worker is working correctly. If that's the case, I'd make sure the server's 
queue was not empty, and the client was still successfully submitting jobs. Use 
the GearmanJobReturn returned from the call, 
client.submitBackgroundJob(functionName,new byte[0]).

Given there is a maximum of concurrency of 5 per worker and 3 workers, I would 
expect 15+ threads if all the workers are kept busy.

Original comment by isaiah.v...@gmail.com on 16 Aug 2012 at 2:54

GoogleCodeExporter commented 8 years ago
Isaiah,

yes, it's still periodically sending GRAB_JOB.
How can i check the server's queue?
What should i do with the GearmanJobReturn?
I see there are two methods: poll and isEOF. ´´

Currently, i'm just submitting and doing nothing with the GearmanJobReturn.

-Stiwan

Original comment by stiwanch...@googlemail.com on 16 Aug 2012 at 4:34

GoogleCodeExporter commented 8 years ago
Isaiah,

when restarting the client, the worker still doesn't accept new jobs. seems to 
be on the worker's end. it seems to me, that the worker doesn't mark the jobs 
as finished and doesn't accept new ones:

in production, my client submits one function at a time (let's call the 
function foo). the worker has max concurrency 10 for foo. after foo is 
finished, the client submits the next foo job. after 10 jobs, the worker 
doesn't accept new jobs.

My Client shows this when submitting a Background job (at the time, when the 
job doesn't reach the worker)

19:02:26,896 [main] INFO  - Submitted Job: PgTX2zwr
133794 [gearman-20] INFO gearman - [remoteip:4730] : OUT : SUBMIT_JOB_BG
133822 [gearman-20] INFO gearman - [remoteip:4730] : IN : JOB_CREATED
19:02:26,966 [JobStatusPrint-PgTX2zwr] INFO  - Event: GEARMAN_SUBMIT_SUCCESS

after Gearman_Submit_Success it is instantly EOF, which should be right.

I'm currently checking, if long jobs (which take about 60+ minutes) could cause 
this behaviour. i tested it in production with very long jobs, and now i'm down 
to using 1-5 minutes, which is now working fine. will get back to you about that

-Stiwan

Original comment by stiwanch...@googlemail.com on 16 Aug 2012 at 6:27

GoogleCodeExporter commented 8 years ago
Hi Stiwan,

Is v0.6.5 the first version you have used? Or did you upgrade from an earlier 
version? Because between v0.6.4 and v0.6.5, I added in a check to see if a 
disconnect occurred while working on a job. If the check fails, the complete 
packet is not sent.

I'm thinking gearmand is closing the connection due to a timeout.

In versions lower than 0.6.5, if a disconnect occurred and the connection was 
re-established while the GearmanFunction was executing a job, the worker would 
still tell the server that it completed the job. I found that to be an 
undesirable behavior, so I "fixed" it.

On the server side, when a disconnect occurs, it re-assigns the job. So the job 
may go to more than one worker (or given concurrency is greater than 1, it may 
be the same worker). When the WORK_COMPLETE packet comes in, the server does 
not care if it doesn't come from the assigned worker. The server will complete 
the job. If more than one worker attempts to complete a job, it just takes the 
first one.

I think what might be happening now it, gearmand is disconnecting from all of 
the workers and reestablishing connections, the worker sees that it had 
disconnected before it finished the job, and does not sent the WORK_COMPLETE 
packet.

Does that make since. If we find this to be the case, the next step would be to 
figure out a way to make sure the worker is not disconnecting from the server.

Original comment by isaiah.v...@gmail.com on 16 Aug 2012 at 7:28

GoogleCodeExporter commented 8 years ago
Yes, i also think, that this behaviour is caused by a timeout.
I could observe this behaviour:
1. "short" jobs - about 1-5 min - are completed successfully and their workers 
accept new jobs
2. "long" jobs - about 20+ min - are completed successfully but their workers 
DONT accept new jobs.

I tested this twice about 4 hours. I conclude that it must have something to do 
with a timeout. I don't know if this is caused by gearmand or 
java-gearman-service.
I saw that the PHP binding has a settimeout function for worker: 
php.net/manual/de/gearmanworker.settimeout.php

Maybe it would work better if i sent a signal from the worker every once in 
awhile?
I'm not sure if WORK_COMPLETE is sent, according to my log it is sent in some 
cases, have to dig deeper into this.

I'm using 0.6.5 but was using 0.6.2 before - i don't think their are any 
remains in my project - i don't understand this comment.

I'll check the WORK_COMPLETE behaviour, will post the result shortly.

Original comment by stiwanch...@googlemail.com on 17 Aug 2012 at 6:00

GoogleCodeExporter commented 8 years ago
Short addition to the observed behaviour:

"long" jobs (with 20+ min) - accept only maxconcurrency-1 new jobs.
so if you had a worker with maxconcurrency 10 and it shows this behaviour, the 
worker automatically has a max concurrenvy of 9.
this explains why workers with concurrency 1 stop working after long jobs (1-1 
= 0 :D)

Original comment by stiwanch...@googlemail.com on 17 Aug 2012 at 6:06

GoogleCodeExporter commented 8 years ago
[deleted comment]
GoogleCodeExporter commented 8 years ago
OK i checked the long (20+ min) jobs.
It seems that they actually send WORK_COMPLETE

2040609 [MYJOB-ZYqxDpTQ] INFO gearman - [remoteip:4730] : OUT : WORK_COMPLETE

please notice, that this worker now doesn't accept new jobs.
the short jobs also send WORK_COMPLETE, but accept new jobs.

Original comment by stiwanch...@googlemail.com on 17 Aug 2012 at 7:01

GoogleCodeExporter commented 8 years ago
aaaah breakthrough :-)
i see now what is causing all this!

09:23:02,530 [gearman-37] WARN  - Lost Gearman Connection: FUNCTION2 - 
UNEXPECTED_DISCONNECT (remoteip:4730)

i log this event when lostConnection in the GearmanLostConnectionPolicy is 
called for the worker.
Notice that this is another worker than the original worker (i have a setup 
where worker1 and worker2 alternate)

so there must be some kind of disconnect. my lost connection policy looks like 
this:

this.gearmanWorker.setLostConnectionPolicy(new GearmanLostConnectionPolicy() {
                @Override
                public GearmanLostConnectionAction lostConnection(GearmanServer gearmanServer, GearmanLostConnectionGrounds gearmanLostConnectionGrounds) {
                    logger.warn("Lost Gearman Connection: " + worker.getJobType() + " - " + gearmanLostConnectionGrounds + " (" + gearmanServer.getHostName() + ":" + gearmanServer.getPort() + ")");
                    return GearmanLostConnectionAction.RECONNECT;
                }

                @Override
                public void shutdownServer(GearmanServer gearmanServer) {
                    logger.warn("Shutdown Gearman Server: " + worker.getJobType() + " (" + gearmanServer.getHostName() + ":" + gearmanServer.getPort() + ")");
                }
            });

-Stiwan
and

this.gearmanWorker.setReconnectPeriod(5, TimeUnit.MINUTES);

-Stiwan

Original comment by stiwanch...@googlemail.com on 17 Aug 2012 at 7:29

GoogleCodeExporter commented 8 years ago
it seems that, if i call "sendData(new byte[0])", there's no disconnect to the 
jobserver.

so, that's my temporary solution, but i hope we find the solution to this issue.

Original comment by stiwanch...@googlemail.com on 18 Aug 2012 at 10:32

GoogleCodeExporter commented 8 years ago
Hi Stiwan,

From a user standpoint, you should not need to ping the server in order to keep 
the connection alive. The worker should do this automatically. There are 
"ECHO_REQ" and "ECHO_RES" packets in the gearman 
protocol(http://gearman.org/index.php?id=protocol). I think I'm going to use 
this to ping the server if communication has not occurred some period of time.

Thanks,
isaiah.v

Original comment by isaiah.v...@gmail.com on 18 Aug 2012 at 4:17

GoogleCodeExporter commented 8 years ago
Hi Stiwan,

As a side note, this has nothing to do with the issue at hand, but you might be 
able to save a few cpu cycles by getting rid of the three threads created 
created when setting up the workers.

If you want to setup 3 workers, you could do this:

public class BugWorker {
    public static void main(String[] args) throws Exception {
        Gearman gearman = Gearman.createGearman();

        GearmanWorker[] workers = new GearmanWorker[3];
        workers[0] = createWorker("function1", gearman, new WorkerObject());
        workers[1] = createWorker("function2", gearman, new WorkerObject());
        workers[2] = createWorker("function3", gearman, new WorkerObject());
    }

    public static GearmanWorker createWorker(String functionName, Gearman gearman, WorkerObject workerObject) {
        GearmanWorker worker = gearman.createGearmanWorker();
        worker.addServer(gearman.createGearmanServer("ip",1234));
        worker.addFunction(functionName, workerObject);
        worker.setMaximumConcurrency(5);
        worker.setReconnectPeriod(5, TimeUnit.MINUTES);

        return worker;
    }
}

final class WorkerObject implements GearmanFunction {
    @Override
    public byte[] work(String s, byte[] bytes, GearmanFunctionCallback gearmanFunctionCallback) throws Exception {
        // do some stuff.   for example sleep 10 seconds.
        Thread.sleep(10000);
        return new byte[0];
    }
}

The will remove the statement, "while(true);"  You may even want to use only 
one worker with all three function registered to it. If so, try this:

public class BugWorker {
    public static void main(String[] args) throws Exception {
        Gearman gearman = Gearman.createGearman();

        GearmanWorker worker = gearman.createGearmanWorker();
        worker.addServer(gearman.createGearmanServer("ip",1234));

        worker.addFunction("function1", new WorkerObject());
        worker.addFunction("function2", new WorkerObject());
        worker.addFunction("function3", new WorkerObject());

        worker.setMaximumConcurrency(15);
        worker.setReconnectPeriod(5, TimeUnit.MINUTES);
    }
}

final class WorkerObject implements GearmanFunction {
    @Override
    public byte[] work(String s, byte[] bytes, GearmanFunctionCallback gearmanFunctionCallback) throws Exception {
        // do some stuff.   for example sleep 10 seconds.
        Thread.sleep(10000);
        return new byte[0];
    }
}

Original comment by isaiah.v...@gmail.com on 18 Aug 2012 at 4:47

GoogleCodeExporter commented 8 years ago
Hi Isaiah,

thanks for your help regarding the improvement of my threading.

It's great to hear that i could help a little and that this issue will be fixed 
:-)

-Stiwan

Original comment by stiwanch...@googlemail.com on 18 Aug 2012 at 5:49

GoogleCodeExporter commented 8 years ago
Hi Stiwan,

I really appreciate the feedback. If you find any more issues or have any 
questions, please let me know.

I'll keep you updated on the progress of this issue.

thanks,
isaiah.v

Original comment by isaiah.v...@gmail.com on 18 Aug 2012 at 7:44

GoogleCodeExporter commented 8 years ago
Fix committed to trunk. This fix will be in version 0.6.6. This should be 
released in the next week.

Original comment by isaiah.v...@gmail.com on 22 Sep 2012 at 10:26

GoogleCodeExporter commented 8 years ago

Original comment by isaiah.v...@gmail.com on 22 Sep 2012 at 10:27