Python core dump or memcached/redis I/O error can cause other-process `apply_async` to loop for `thundering_herd_timeout`

PolicyStat / jobtastic

Make your user-responsive long-running Celery jobs totally awesomer.

http://policystat.github.com/jobtastic/

MIT License

644 stars 61 forks source link

Python core dump or memcached/redis I/O error can cause other-process `apply_async` to loop for `thundering_herd_timeout` #66

Open kylegibson opened 8 years ago

kylegibson commented 8 years ago

See here: https://github.com/PolicyStat/jobtastic/blob/master/jobtastic/cache/base.py#L91

The timeout doesn't mean that add will block and wait. It just means that if the key can be set, it will be set with that timeout. So if the key already exists, this results in a busy loop.

winhamwr commented 8 years ago

I think this is actually #9. I'm going to write up a potential solution to this problem on that issue, based on our discussion.

winhamwr commented 8 years ago

This is actually different from #9. I was wrong. Because of the way we do lock.acquire in both memcached and redis with no timeout possibility, if for some reason the lock.release doesn't happen in another thread/process, the lock aquisition will wait forever.

I think that's probably low impact, since it would require the python code to have core dumped or the call to lock.release() or self.cache.delete to have failed because of network or memcached/redis problems. In that case, the other processes waiting on the lock would be in an infinite loop.

winhamwr commented 8 years ago

Probably the behavior we want is for those other processes to fail very quickly if they're stuck waiting for a lock. thundering_herd_timeout is about how long we should wait for a queued task to complete, which would depend on how busy your workers are and how long the task takes to run.

This timeout is more about how long you would expect it to take for a process to run through the apply_async method, which does no work other than checking the cache and possibly queueing a task. My 20-minute reporting/analytics/backup task should fail in a matter of seconds waiting on the task to actually be queued. It shouldn't wait in a blocking (probably web) process.

winhamwr commented 8 years ago

I created #68 to document how I think a feature might look to do simultaneous execution prevention.