djacobs / PyAPNs

Python library for interacting with the Apple Push Notification service (APNs)
http://pypi.python.org/pypi/apns/
MIT License
1.22k stars · 376 forks

Fix worker threads blocking termination of main thread/process #91

Closed xilvar closed 8 years ago

xilvar commented 9 years ago

Mark worker threads as daemons so they don't block termination of the process/main thread indefinitely.
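The change is a one-liner in spirit. A minimal standalone sketch (this is not PyAPNs' actual worker code; the worker function is a stand-in):

```python
import threading
import time

def error_response_worker():
    # Stand-in for the library's error-response read loop.
    while True:
        time.sleep(1)

t = threading.Thread(target=error_response_worker)
t.daemon = True  # without this, the thread keeps the process alive forever
t.start()

print("main thread exiting")  # the process now terminates despite the loop
```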

jimhorng commented 9 years ago

Hi @xilvar, I'm the committer of the enhanced-message error-response worker, and I was also thinking about closing the error-response handler worker when the main thread terminates. But the error-response handler worker is responsible for resending notifications when there's an error, which means that even when the main thread has finished sending notifications and terminated, the worker might still be busy resending. So it might not be proper to terminate it in that case.

It can also be closed explicitly via apns_enhanced.gateway_server.close_read_thread().

However, I also found a bug where the error-response handler worker might run forever, so I opened another PR, #78, to address it, but it's still under review. Maybe you can help push the review process along? Any thoughts?

xilvar commented 9 years ago

@jimhorng Hm. He doesn't seem to be paying much attention at the moment. I've had some thoughts on the enhanced messaging approach now that I've played around with it, because I'm working on integrating it into a very high-scale system I work on.

The threading approach works well as long as the threads don't misbehave, but without the daemonize bit, any exception in the Python process (including in the thread that handles the inbound error responses) will both crash the process and freeze it indefinitely until killed. This seems fragile for production use, because most production uses of APNs are driven by some form of job-queueing system where code really should not be able to freeze. It's also somewhat unorthodox to have secondary threads running inside a job queue.

I was thinking of taking your code and trying to figure out whether I could implement a reasonable API using Python's select module to switch back and forth between sending messages and handling inbound errors. That would allow a job queue to run on a single thread as usual and follow a clear 'do one unit of work per queue loop' model. The code for plain dev sending of messages might get slightly more complicated, but probably only by an extra while loop or so.

jimhorng commented 9 years ago

@xilvar Thanks for sharing your thoughts. I agree that the additional thread increases the burden of handling it properly in the main thread. I had thought about adopting a single thread that keeps switching "select" between write-ready and read-ready, but that approach has its own challenges.

To prevent hanging in the secondary thread, this PR focuses on auto-closing it when it has been idle for 30 seconds, which hopefully addresses most scenarios.
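A rough sketch of that idle-timeout idea (hypothetical code, not the actual PR; `read_loop` and the constants are invented for illustration):

```python
import select
import socket
import time

IDLE_TIMEOUT = 30  # seconds, matching the auto-close interval described above

def read_loop(sock, idle_timeout=IDLE_TIMEOUT):
    """Wait for error-response bytes, but give up and close the
    socket after `idle_timeout` seconds with no activity."""
    last_activity = time.time()
    while True:
        readable, _, _ = select.select([sock], [], [], 1.0)
        if readable:
            data = sock.recv(4096)
            if not data:      # peer closed the connection
                break
            last_activity = time.time()
            # ... a real worker would parse the error-response frame here ...
        elif time.time() - last_activity > idle_timeout:
            break             # idle too long: auto-close
    sock.close()
```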

On the other hand, we use a Celery worker (a kind of job-queue process) to run PyAPNs; once the worker process is closed, a job fails, or a job runs too long, it starts over with a new process, so the secondary thread gets cleaned up as well. May I ask what scenario might cause your jobs to 'freeze'?

xilvar commented 9 years ago

@jimhorng In our particular case, stuck jobs are considered a critical-level issue that developers should never create. It's an undesirable scenario because it alerts operations, which leads to alarm fatigue. It's a bit annoying in dev as well, since a Ctrl-C of a direct invocation of the job doesn't kill it if the thread is stuck.

I haven't actually tried the select approach yet, but in theory, if the pyOpenSSL file descriptors behave nicely (like real sockets), it should be possible to use select.select as the entire reactor. select.select takes three lists of file descriptors in one invocation (similar to the C select), so you can pass the same socket as both a 'read' socket and a 'write' socket. In theory we can then simply iterate over the returned lists of ready-to-read and ready-to-write descriptors and do the next batch of reading and then writing.
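As a sanity check of those semantics, a tiny self-contained demo, with a local socketpair standing in for the APNs connection:

```python
import select
import socket

a, b = socket.socketpair()
b.sendall(b"error-response bytes")  # simulate inbound data from APNs

# Pass the same socket in both the read and the write list;
# select reports which operations are ready on it.
readable, writable, _ = select.select([a], [a], [], 1.0)

if a in readable:
    print("ready to read:", a.recv(4096))
if a in writable:
    print("ready to write")
```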

All theoretical; I'm going to try to assemble something that basically works, to make sure select.select even behaves in this scenario.

Another thought I had regarding the multi-threaded approach: it might be good to have the user of the framework explicitly create the extra thread object themselves and pass it in, so that at least they know there's a thread being spawned that they need to take care of.

jimhorng commented 9 years ago

@xilvar As you can see, the enhanced mode already uses a non-blocking socket with select (https://github.com/djacobs/PyAPNs/blob/master/apns.py#L540), so feel free to try out a better way of handling the APNs behavior. Just to mention some scenarios that I think are not easy to resolve, in addition to the two from my previous post:

  • What's the proper timeout for "select"? If the wait-for-write-ready timeout is too long, APNs might close the whole connection before you can read the error-response packet.
  • Also, setting a timeout for wait-write and wait-read and switching between them will seriously hurt performance. Maybe a non-blocking framework like Tornado is one proper solution, but that requires rewriting the socket API calls and rewriting the client in that framework's non-blocking style.
  • If there's only one thread, and it's busy resending previous messages while also doing the "send" task, it will block until all the resending work is finished. The worst case is something like: send 1000 messages with a bad token -> get the 1st message's error response -> resend messages 2–1000 -> get the 2nd message's error response -> resend messages 3–1000 -> etc.

xilvar commented 9 years ago

@jimhorng Ah, I see what you mean. I guess a lot of the difficulty is tied up in the standard Python 2 problem of not having particularly good support for asynchronous I/O on one thread unless you use a (large) async library, e.g. Twisted, Tornado, gevent, etc.

I think it may actually still be possible to do it in one thread, because at the socket level you don't actually have to send the APNs message in one single 'send' call. We can send it byte-wise, in a bunch of separate chunks. We can also use a non-blocking send, assuming the OpenSSL pseudo-sockets cooperate, so you never end up blocked in a send. The last part is to always prioritize readable bytes over writable bytes, which works fine in this particular case because the error backflow is both much lower-bandwidth and more important than the outbound traffic.

So basically, we'd end up looping and blocking on one select for either readable or writable. If there's readable data, we handle it. If the socket is writable as well, we use non-blocking sends to send bytes in chunks, and between chunks we loop back to that same select. We'd also have to handle the return values from send properly, since we're using the non-blocking variant. Definitely pretty complicated, though.
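A hypothetical sketch of that loop (all names invented; a local socketpair stands in for the APNs SSL connection, and no real error-frame parsing is done):

```python
import errno
import select
import socket

CHUNK = 512  # send in small chunks so reads get serviced between them

def reactor(sock, outbound):
    # One select() covers both readability and writability;
    # reads are prioritized over writes, as argued above.
    sock.setblocking(False)
    pending = b"".join(outbound)   # bytes still to be written
    received = []                  # inbound (error-response) bytes
    while pending:
        readable, writable, _ = select.select([sock], [sock], [], 1.0)
        if readable:
            data = sock.recv(4096)
            if not data:
                break              # peer closed; a real client would reconnect
            received.append(data)  # a real client would parse error frames here
        elif writable:
            try:
                sent = sock.send(pending[:CHUNK])  # non-blocking, chunked send
                pending = pending[sent:]
            except OSError as e:
                if e.errno not in (errno.EAGAIN, errno.EWOULDBLOCK):
                    raise
    return b"".join(received)
```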

The non-blocking send avoids the issue of one thread being stuck sending large amounts of data. The select could, in theory, even have an infinite timeout.

We shouldn't need to switch back and forth very frequently between write and read, because select will tell us what's ready, and the chunking means we'll get to it fairly soon.

Spending too much dedicated time resending certainly could be a problem. I don't think Apple actually intends for a single IP address to be able to increase send performance just by using multiple threads anyway, though. For significantly increased performance on one machine, I might need to assign multiple IPs/adapters to it? I'm not sure about that, because I've never really done it that way (we usually just use more, smaller job-queue servers).

Anyway, with the thread approach, I do think it might be a good idea to indicate more clearly to the consumer of the library that there's another thread 'doing something' which needs to be handled properly. If you make the library consumer at least hand in a thread object, or call a function to 'start' the secondary thread, it becomes much more evident where their problems are coming from when something goes wrong.

ExplodingCabbage commented 8 years ago

There's a lot of interesting conversation here that should probably be addressed at some point, but I think you two reached agreement that the actual code change provided here is probably not what we want, so I'm going to close this.

For comparison I think it may be interesting to look at what Node.js does. If I run:

setTimeout(function () {console.log('foo')}, 10000)

then the process blocks for 10 seconds and then logs 'foo', unless I kill it first with kill or CTRL+C. If I write

setTimeout(function () {console.log('foo')}, 10000)
throw 'bar'

then the process dies immediately as soon as 'bar' is thrown.

Node's behaviour seems right; I'm not immediately sure how to translate it from JavaScript's queue-based async model to Python's thread-based async model, nor whether it would even be possible to implement that behaviour in Python. I need to think more on this.
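For what it's worth, the closest Python translation of the first Node snippet I can think of uses threading.Timer (a sketch, not a proposed fix):

```python
import threading

def log_foo():
    print('foo')

# Rough analogue of setTimeout(..., 10000): the (non-daemon) timer
# thread keeps the process alive for ~10 seconds, just like Node.
t = threading.Timer(10.0, log_foo)
t.start()

# Unlike Node, though, an uncaught exception raised here would kill
# only the main thread; the timer would still fire and the process
# would wait out the full 10 seconds. Setting t.daemon = True before
# start() gives the Node-like immediate exit, at the cost of 'foo'
# never being printed.
t.cancel()  # cancelled here only so this snippet itself exits promptly
```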