Closed GoogleCodeExporter closed 8 years ago
additional info:
log file doesn't say anything about disconnection. maybe the worker just
doesn't accept new jobs anymore ?!
Original comment by stiwanch...@googlemail.com
on 13 Aug 2012 at 7:41
Hi stiwanchinazki,
I'm looking into this. I'll keep you updated.
Thanks,
isaiah.v
Original comment by isaiah.v...@gmail.com
on 13 Aug 2012 at 7:41
So far, I cannot reproduce this. Could you attach your client too?
Original comment by isaiah.v...@gmail.com
on 14 Aug 2012 at 2:25
Hi Isaiah,
thanks for your effort.
I created two pastebins.
Client: http://pastebin.com/kF3HvQUP
Worker: http://pastebin.com/JHfTjK11
Both processes are still alive when gearman doesn't do anything anymore (i can
notice this, because a lifesign is written to a mysql database)
-Stiwan
Original comment by stiwanch...@googlemail.com
on 15 Aug 2012 at 8:48
Try to set the maximum concurrency (worker) to a very low value, like 2 or 3.
i have the feeling, that after that number of worked jobs, the worker doesn't
accept new jobs.
Original comment by stiwanch...@googlemail.com
on 15 Aug 2012 at 10:42
in my production environment, i start about 10 workers (10 functions with 1-10
max concurrency). in the beginning it says:
130 [gearman-7] INFO gearman - [remoteip:4730] : Connected
130 [gearman-9] INFO gearman - [remoteip:4730] : Connected
130 [gearman-8] INFO gearman - [remoteip:4730] : Connected
130 [gearman-5] INFO gearman - [remoteip:4730] : Connected
130 [gearman-6] INFO gearman - [remoteip:4730] : Connected
130 [gearman-3] INFO gearman - [remoteip:4730] : Connected
130 [gearman-1] INFO gearman - [remoteip:4730] : Connected
130 [gearman-4] INFO gearman - [remoteip:4730] : Connected
132 [gearman-10] INFO gearman - [remoteip:4730] : Connected
132 [gearman-11] INFO gearman - [remoteip:4730] : Connected
132 [gearman-9] INFO gearman - [remoteip:4730] : OUT : CAN_DO
132 [gearman-8] INFO gearman - [remoteip:4730] : OUT : CAN_DO
132 [gearman-5] INFO gearman - [remoteip:4730] : OUT : CAN_DO
132 [gearman-6] INFO gearman - [remoteip:4730] : OUT : CAN_DO
132 [gearman-1] INFO gearman - [remoteip:4730] : OUT : CAN_DO
132 [gearman-4] INFO gearman - [remoteip:4730] : OUT : CAN_DO
132 [gearman-3] INFO gearman - [remoteip:4730] : OUT : CAN_DO
132 [gearman-10] INFO gearman - [remoteip:4730] : OUT : CAN_DO
132 [gearman-7] INFO gearman - [remoteip:4730] : OUT : CAN_DO
134 [gearman-11] INFO gearman - [remoteip:4730] : OUT : CAN_DO
134 [gearman-12] INFO gearman - [remoteip:4730] : Connected
134 [gearman-12] INFO gearman - [remoteip:4730] : OUT : CAN_DO
135 [gearman-4] INFO gearman - [remoteip:4730] : OUT : GRAB_JOB
135 [gearman-13] INFO gearman - [remoteip:4730] : OUT : GRAB_JOB
135 [gearman-15] INFO gearman - [remoteip:4730] : OUT : GRAB_JOB
136 [gearman-3] INFO gearman - [remoteip:4730] : OUT : GRAB_JOB
136 [gearman-4] INFO gearman - [remoteip:4730] : OUT : GRAB_JOB
136 [gearman-14] INFO gearman - [remoteip:4730] : OUT : GRAB_JOB
138 [gearman-4] INFO gearman - [remoteip:4730] : Connected
138 [gearman-9] INFO gearman - [remoteip:4730] : OUT : GRAB_JOB
138 [gearman-4] INFO gearman - [remoteip:4730] : OUT : CAN_DO
138 [gearman-8] INFO gearman - [remoteip:4730] : OUT : GRAB_JOB
138 [gearman-6] INFO gearman - [remoteip:4730] : OUT : GRAB_JOB
138 [gearman-14] INFO gearman - [remoteip:4730] : Connected
138 [gearman-13] INFO gearman - [remoteip:4730] : OUT : GRAB_JOB
139 [gearman-14] INFO gearman - [remoteip:4730] : OUT : CAN_DO
139 [gearman-8] INFO gearman - [remoteip:4730] : OUT : GRAB_JOB
141 [gearman-8] INFO gearman - [remoteip:4730] : Connected
141 [gearman-8] INFO gearman - [remoteip:4730] : OUT : CAN_DO
142 [gearman-14] INFO gearman - [remoteip:4730] : OUT : GRAB_JOB
156 [gearman-14] INFO gearman - [remoteip:4730] : IN : JOB_ASSIGN
158 [gearman-14] INFO gearman - [remoteip:4730] : IN : JOB_ASSIGN
160 [gearman-14] INFO gearman - [remoteip:4730] : IN : NO_JOB
160 [gearman-14] INFO gearman - [remoteip:4730] : OUT : PRE_SLEEP
161 [gearman-14] INFO gearman - [remoteip:4730] : IN : JOB_ASSIGN
162 [gearman-14] INFO gearman - [remoteip:4730] : IN : NO_JOB
162 [gearman-14] INFO gearman - [remoteip:4730] : OUT : PRE_SLEEP
168 [gearman-5] INFO gearman - [remoteip:4730] : IN : JOB_ASSIGN
173 [gearman-5] INFO gearman - [remoteip:4730] : IN : JOB_ASSIGN
175 [gearman-5] INFO gearman - [remoteip:4730] : IN : NO_JOB
175 [gearman-5] INFO gearman - [remoteip:4730] : OUT : PRE_SLEEP
178 [gearman-5] INFO gearman - [remoteip:4730] : IN : NO_JOB
178 [gearman-5] INFO gearman - [remoteip:4730] : OUT : PRE_SLEEP
179 [gearman-5] INFO gearman - [remoteip:4730] : IN : JOB_ASSIGN
179 [gearman-7] INFO gearman - [remoteip:4730] : OUT : GRAB_JOB
and in the end it's down to two:
19340115 [gearman-619] INFO gearman - [remoteip:4730] : OUT : GRAB_JOB
19340115 [gearman-614] INFO gearman - [remoteip:4730] : OUT : GRAB_JOB
19340128 [gearman-614] INFO gearman - [remoteip:4730] : IN : NO_JOB
19340128 [gearman-614] INFO gearman - [remoteip:4730] : OUT : PRE_SLEEP
19340129 [gearman-614] INFO gearman - [remoteip:4730] : IN : NO_JOB
19340129 [gearman-614] INFO gearman - [remoteip:4730] : OUT : PRE_SLEEP
VisualVM shows that there are 5 threads alive (in the beginning it were 10) and
the number of those threads is varying between 2 and 5.
Original comment by stiwanch...@googlemail.com
on 15 Aug 2012 at 2:41
Based on the log... it looks like the worker is still asking for work, but the
server is not giving it. Periodically, are seeing the workers sending
"GRAB_JOB" packets? and getting a "NO_JOB" response? If so... I'm thinking the
worker is working correctly. If that's the case, I'd make sure the server's
queue was not empty, and the client was still successfully submitting jobs. Use
the GearmanJobReturn returned from the call,
client.submitBackgroundJob(functionName,new byte[0]).
Given there is a maximum of concurrency of 5 per worker and 3 workers, I would
expect 15+ threads if all the workers are kept busy.
Original comment by isaiah.v...@gmail.com
on 16 Aug 2012 at 2:54
Isaiah,
yes, it's still periodically sending GRAB_JOB.
How can i check the server's queue?
What should i do with the GearmanJobReturn?
I see there are two methods: poll and isEOF. ´´
Currently, i'm just submitting and doing nothing with the GearmanJobReturn.
-Stiwan
Original comment by stiwanch...@googlemail.com
on 16 Aug 2012 at 4:34
Isaiah,
when restarting the client, the worker still doesn't accept new jobs. seems to
be on the worker's end. it seems to me, that the worker doesn't mark the jobs
as finished and doesn't accept new ones:
in production, my client submits one function at a time (let's call the
function foo). the worker has max concurrency 10 for foo. after foo is
finished, the client submits the next foo job. after 10 jobs, the worker
doesn't accept new jobs.
My Client shows this when submitting a Background job (at the time, when the
job doesn't reach the worker)
19:02:26,896 [main] INFO - Submitted Job: PgTX2zwr
133794 [gearman-20] INFO gearman - [remoteip:4730] : OUT : SUBMIT_JOB_BG
133822 [gearman-20] INFO gearman - [remoteip:4730] : IN : JOB_CREATED
19:02:26,966 [JobStatusPrint-PgTX2zwr] INFO - Event: GEARMAN_SUBMIT_SUCCESS
after Gearman_Submit_Success it is instantly EOF, which should be right.
I'm currently checking, if long jobs (which take about 60+ minutes) could cause
this behaviour. i tested it in production with very long jobs, and now i'm down
to using 1-5 minutes, which is now working fine. will get back to you about that
-Stiwan
Original comment by stiwanch...@googlemail.com
on 16 Aug 2012 at 6:27
Hi Stiwan,
Is v0.6.5 the first version you have used? Or did you upgrade from an earlier
version? Because between v0.6.4 and v0.6.5, I added in a check to see if a
disconnect occurred while working on a job. If the check fails, the complete
packet is not sent.
I'm thinking gearmand is closing the connection due to a timeout.
In versions lower than 0.6.5, if a disconnect occurred and the connection was
re-established while the GearmanFunction was executing a job, the worker would
still tell the server that it completed the job. I found that to be an
undesirable behavior, so I "fixed" it.
On the server side, when a disconnect occurs, it re-assigns the job. So the job
may go to more than one worker (or given concurrency is greater than 1, it may
be the same worker). When the WORK_COMPLETE packet comes in, the server does
not care if it doesn't come from the assigned worker. The server will complete
the job. If more than one worker attempts to complete a job, it just takes the
first one.
I think what might be happening now it, gearmand is disconnecting from all of
the workers and reestablishing connections, the worker sees that it had
disconnected before it finished the job, and does not sent the WORK_COMPLETE
packet.
Does that make since. If we find this to be the case, the next step would be to
figure out a way to make sure the worker is not disconnecting from the server.
Original comment by isaiah.v...@gmail.com
on 16 Aug 2012 at 7:28
Yes, i also think, that this behaviour is caused by a timeout.
I could observe this behaviour:
1. "short" jobs - about 1-5 min - are completed successfully and their workers
accept new jobs
2. "long" jobs - about 20+ min - are completed successfully but their workers
DONT accept new jobs.
I tested this twice about 4 hours. I conclude that it must have something to do
with a timeout. I don't know if this is caused by gearmand or
java-gearman-service.
I saw that the PHP binding has a settimeout function for worker:
php.net/manual/de/gearmanworker.settimeout.php
Maybe it would work better if i sent a signal from the worker every once in
awhile?
I'm not sure if WORK_COMPLETE is sent, according to my log it is sent in some
cases, have to dig deeper into this.
I'm using 0.6.5 but was using 0.6.2 before - i don't think their are any
remains in my project - i don't understand this comment.
I'll check the WORK_COMPLETE behaviour, will post the result shortly.
Original comment by stiwanch...@googlemail.com
on 17 Aug 2012 at 6:00
Short addition to the observed behaviour:
"long" jobs (with 20+ min) - accept only maxconcurrency-1 new jobs.
so if you had a worker with maxconcurrency 10 and it shows this behaviour, the
worker automatically has a max concurrenvy of 9.
this explains why workers with concurrency 1 stop working after long jobs (1-1
= 0 :D)
Original comment by stiwanch...@googlemail.com
on 17 Aug 2012 at 6:06
[deleted comment]
OK i checked the long (20+ min) jobs.
It seems that they actually send WORK_COMPLETE
2040609 [MYJOB-ZYqxDpTQ] INFO gearman - [remoteip:4730] : OUT : WORK_COMPLETE
please notice, that this worker now doesn't accept new jobs.
the short jobs also send WORK_COMPLETE, but accept new jobs.
Original comment by stiwanch...@googlemail.com
on 17 Aug 2012 at 7:01
aaaah breakthrough :-)
i see now what is causing all this!
09:23:02,530 [gearman-37] WARN - Lost Gearman Connection: FUNCTION2 -
UNEXPECTED_DISCONNECT (remoteip:4730)
i log this event when lostConnection in the GearmanLostConnectionPolicy is
called for the worker.
Notice that this is another worker than the original worker (i have a setup
where worker1 and worker2 alternate)
so there must be some kind of disconnect. my lost connection policy looks like
this:
this.gearmanWorker.setLostConnectionPolicy(new GearmanLostConnectionPolicy() {
@Override
public GearmanLostConnectionAction lostConnection(GearmanServer gearmanServer, GearmanLostConnectionGrounds gearmanLostConnectionGrounds) {
logger.warn("Lost Gearman Connection: " + worker.getJobType() + " - " + gearmanLostConnectionGrounds + " (" + gearmanServer.getHostName() + ":" + gearmanServer.getPort() + ")");
return GearmanLostConnectionAction.RECONNECT;
}
@Override
public void shutdownServer(GearmanServer gearmanServer) {
logger.warn("Shutdown Gearman Server: " + worker.getJobType() + " (" + gearmanServer.getHostName() + ":" + gearmanServer.getPort() + ")");
}
});
-Stiwan
and
this.gearmanWorker.setReconnectPeriod(5, TimeUnit.MINUTES);
-Stiwan
Original comment by stiwanch...@googlemail.com
on 17 Aug 2012 at 7:29
it seems that, if i call "sendData(new byte[0])", there's no disconnect to the
jobserver.
so, that's my temporary solution, but i hope we find the solution to this issue.
Original comment by stiwanch...@googlemail.com
on 18 Aug 2012 at 10:32
Hi Stiwan,
From a user standpoint, you should not need to ping the server in order to keep
the connection alive. The worker should do this automatically. There are
"ECHO_REQ" and "ECHO_RES" packets in the gearman
protocol(http://gearman.org/index.php?id=protocol). I think I'm going to use
this to ping the server if communication has not occurred some period of time.
Thanks,
isaiah.v
Original comment by isaiah.v...@gmail.com
on 18 Aug 2012 at 4:17
Hi Stiwan,
As a side note, this has nothing to do with the issue at hand, but you might be
able to save a few cpu cycles by getting rid of the three threads created
created when setting up the workers.
If you want to setup 3 workers, you could do this:
public class BugWorker {
public static void main(String[] args) throws Exception {
Gearman gearman = Gearman.createGearman();
GearmanWorker[] workers = new GearmanWorker[3];
workers[0] = createWorker("function1", gearman, new WorkerObject());
workers[1] = createWorker("function2", gearman, new WorkerObject());
workers[2] = createWorker("function3", gearman, new WorkerObject());
}
public static GearmanWorker createWorker(String functionName, Gearman gearman, WorkerObject workerObject) {
GearmanWorker worker = gearman.createGearmanWorker();
worker.addServer(gearman.createGearmanServer("ip",1234));
worker.addFunction(functionName, workerObject);
worker.setMaximumConcurrency(5);
worker.setReconnectPeriod(5, TimeUnit.MINUTES);
return worker;
}
}
final class WorkerObject implements GearmanFunction {
@Override
public byte[] work(String s, byte[] bytes, GearmanFunctionCallback gearmanFunctionCallback) throws Exception {
// do some stuff. for example sleep 10 seconds.
Thread.sleep(10000);
return new byte[0];
}
}
The will remove the statement, "while(true);" You may even want to use only
one worker with all three function registered to it. If so, try this:
public class BugWorker {
public static void main(String[] args) throws Exception {
Gearman gearman = Gearman.createGearman();
GearmanWorker worker = gearman.createGearmanWorker();
worker.addServer(gearman.createGearmanServer("ip",1234));
worker.addFunction("function1", new WorkerObject());
worker.addFunction("function2", new WorkerObject());
worker.addFunction("function3", new WorkerObject());
worker.setMaximumConcurrency(15);
worker.setReconnectPeriod(5, TimeUnit.MINUTES);
}
}
final class WorkerObject implements GearmanFunction {
@Override
public byte[] work(String s, byte[] bytes, GearmanFunctionCallback gearmanFunctionCallback) throws Exception {
// do some stuff. for example sleep 10 seconds.
Thread.sleep(10000);
return new byte[0];
}
}
Original comment by isaiah.v...@gmail.com
on 18 Aug 2012 at 4:47
Hi Isaiah,
thanks for your help regarding the improvement of my threading.
It's great to hear that i could help a little and that this issue will be fixed
:-)
-Stiwan
Original comment by stiwanch...@googlemail.com
on 18 Aug 2012 at 5:49
Hi Stiwan,
I really appreciate the feedback. If you find any more issues or have any
questions, please let me know.
I'll keep you updated on the progress of this issue.
thanks,
isaiah.v
Original comment by isaiah.v...@gmail.com
on 18 Aug 2012 at 7:44
Fix committed to trunk. This fix will be in version 0.6.6. This should be
released in the next week.
Original comment by isaiah.v...@gmail.com
on 22 Sep 2012 at 10:26
Original comment by isaiah.v...@gmail.com
on 22 Sep 2012 at 10:27
Original issue reported on code.google.com by
stiwanch...@googlemail.com
on 13 Aug 2012 at 7:32