A few weeks ago, and again yesterday, a few imports failed; a while later, all of them started failing.
It turns out we have Django Q set to a 10-minute timeout, which isn't long enough for some imports, so they get killed before they finish. When a task is killed, it has no chance to clean up its working files, and the default number of retries for a failed task (which is what we're using) is 60. So once a task fails, it retries every 10 minutes, each time leaving another copy of its working files stranded in a /tmp/ directory, until the disk fills up and even small imports stop working.
So we need to raise the timeout and decrease the number of retries.
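As a starting point for discussion, here is a rough sketch of what the settings change might look like. The numbers and the cluster name are placeholders, not measurements, and any settings we already have (broker, workers, and so on) are left out. As far as I can tell from the Django Q docs, `timeout` and `retry` are both in seconds, `retry` should be larger than `timeout` so a task isn't handed out again while it's still running, and `max_attempts` is the setting that caps how many times a task is tried. That makes it worth double-checking which setting the "60" above actually refers to in the version we have installed.

```python
# settings.py (sketch only; values are placeholders until we know real import times)
IMPORT_TIMEOUT_SECONDS = 60 * 60  # hypothetical: one hour

Q_CLUSTER = {
    "name": "imports",  # hypothetical cluster name
    # Seconds a worker may spend on one task before it is killed.
    "timeout": IMPORT_TIMEOUT_SECONDS,
    # Seconds before an unfinished task is offered to the cluster again.
    # Kept above the timeout so a running task is not started twice.
    "retry": IMPORT_TIMEOUT_SECONDS + 300,
    # Upper bound on attempts per task (assumed available in our version).
    "max_attempts": 2,
}
```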
Questions to answer:
How long do imports actually take?
How long should the timeout be? Ideally long enough that it never triggers for a job that's still plodding along, but not so long that a stalled job sits there for ages before it's considered dead.
How many retries should we do? It's not clear what might cause transient errors, so it's possible that a retry will never succeed. Then again, obviously there are unknowns, so it's probably worth doing at least one or two.
Is there a way to clean up the files from jobs that were killed by the task runner? Probably the answer would only be "yes" if it sends some sort of soft termination signal that we could catch and do something with. If the process just gets unceremoniously killed, then cleanup would have to be handled separately.
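To make the cleanup question concrete, here is a minimal sketch of both options. It assumes (unverified) that the task runner stops a timed-out worker with SIGTERM, that the import runs in the main thread of its worker process, and that working directories are created with a known prefix. The names `working_dir`, `sweep_stale`, `Terminated`, and `WORKDIR_PREFIX` are made up for illustration and are not part of our code or of Django Q.

```python
import shutil
import signal
import tempfile
import time
from contextlib import contextmanager
from pathlib import Path

WORKDIR_PREFIX = "import-work-"  # hypothetical naming convention


class Terminated(Exception):
    """Raised inside the task when SIGTERM arrives."""


def _raise_terminated(signum, frame):
    # Turn the signal into an exception so `finally` blocks get to run.
    raise Terminated(f"received signal {signum}")


@contextmanager
def working_dir(base=None):
    """Yield a working directory that is removed on exit, including on SIGTERM."""
    previous = signal.signal(signal.SIGTERM, _raise_terminated)
    path = None
    try:
        path = Path(tempfile.mkdtemp(prefix=WORKDIR_PREFIX, dir=base))
        yield path
    finally:
        if path is not None:
            shutil.rmtree(path, ignore_errors=True)
        # Restore whatever handler was installed before we started.
        signal.signal(
            signal.SIGTERM,
            previous if previous is not None else signal.SIG_DFL,
        )


def sweep_stale(base, max_age_seconds):
    """Remove leftover working directories older than max_age_seconds."""
    cutoff = time.time() - max_age_seconds
    removed = []
    for entry in Path(base).glob(WORKDIR_PREFIX + "*"):
        if entry.is_dir() and entry.stat().st_mtime < cutoff:
            shutil.rmtree(entry, ignore_errors=True)
            removed.append(entry)
    return removed
```

If the process turns out to be killed with SIGKILL instead, the handler never runs and only something like `sweep_stale` helps, for example called at the start of each import or from a scheduled job, with `max_age_seconds` set comfortably above the task timeout so it never deletes the files of an import that is still running.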