azavea / pfb-network-connectivity

PFB Bicycle Network Connectivity

Increase timeout and decrease retries for Django Q tasks; also fix exception handling #842

Closed: KlaasH closed this 3 years ago

KlaasH commented 3 years ago

Overview

Big imports are timing out and getting killed by Django Q. They also leave their working files behind when that happens, so when they get retried dozens of times, the disk ends up full.

To stop all that from happening, this increases the timeout (from 10 minutes to 3 hours) and limits the number of retries to 1. (Side note: I can see why "no retries" might not be a good default for a task-processing system and any particular number would seem arbitrary, but "keep retrying until the end of time" still doesn't seem like the best default behavior to me.)
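For context, Django Q is configured through the `Q_CLUSTER` dict in Django settings. A rough sketch of the kind of change described here (the key names are from the Django Q docs, but the `name` and `orm` entries and the exact values are illustrative, not the project's actual configuration):

```python
# settings.py -- sketch only; values are illustrative, not this PR's literal diff.
Q_CLUSTER = {
    "name": "pfb",            # hypothetical cluster name
    "orm": "default",         # hypothetical broker choice (Django ORM)
    "timeout": 3 * 60 * 60,   # kill a running task after 3 hours (was 10 minutes)
    "max_attempts": 1,        # run each task once; don't retry on failure
}
```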

I don't know what the upper limit is for how long an import could take. I ran one of the imports that failed in production--Oslo, Norway, using this boundary and the analysis results at https://production-pfb-storage-us-east-1.s3.amazonaws.com/international_results_2021/europe/osloNO.zip--and it took just under 20 minutes (see screenshot).

Killing a job that's still actually running would be a bummer, while waiting a bit too long to mark a hung import as failed doesn't seem like a big problem, so I went with a 3-hour timeout, which seems well outside the range of what even very large imports should take.

Exception .message error

While looking through the logs to figure out what was going wrong here, I also saw some crashes like the one in the attached traceback screenshot.

The underlying failure is that it can't open the zip file, but in our code handling that exception we were trying to use the .message attribute of the Exception, which went away in Python 3 (see https://www.python.org/dev/peps/pep-0352/). So this also fixes the places where we were doing that.
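The fix amounts to not reaching for the removed attribute. A minimal sketch of the pattern (the function name, paths, and exception class here are illustrative, not the project's actual code):

```python
import logging
import zipfile

logger = logging.getLogger(__name__)

def extract_analysis_results(zip_path, dest_dir):
    """Hypothetical helper illustrating the fix, not the actual code being patched."""
    try:
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(dest_dir)
    except zipfile.BadZipFile as e:
        # Python 2 exceptions had a .message attribute; it was removed in
        # Python 3 (PEP 352). Use str(e) (or e.args) to get the text instead.
        logger.error("Could not open zip file %s: %s", zip_path, str(e))
        raise
```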

Resolves #841

Testing Instructions

Checklist

KlaasH commented 3 years ago

It turns out max_attempts isn't going to do anything here. I was looking at https://django-q.readthedocs.io/en/stable/configure.html#max-attempts, but we're running version 1.0.2 of Django Q. For some reason the Read the Docs page has lots of really old versions but nothing between 1.0.0 and stable. Looking in the repo, it turns out the v1.0.2 version of that documentation page doesn't include max_attempts.

So I think all the retrying was due to the weird interaction between their retry parameter and timeout. It's explained in the retry section, but basically timeout is how long a task can run before it's killed and retry is how long before it's resubmitted to the queue. So if retry is shorter than timeout, a bunch of copies can get added to the queue while the first is still running. Then once the first finally fails and one of the retry tasks starts, I think new copies of that get added in the same way. So the cycle only stops, at least in our situation, when the jobs start failing within the retry time limit because the disk is full.
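To make the interaction concrete: with the v1.0.2 default of retry=60, any task that runs longer than a minute gets resubmitted while the original is still running. A sketch of a configuration that avoids the loop by keeping retry comfortably above timeout (the numbers are illustrative, not this PR's exact settings):

```python
# settings.py -- illustrative values; the point is the relationship, not the numbers.
Q_CLUSTER = {
    # A running task is killed after `timeout` seconds.
    "timeout": 3 * 60 * 60,
    # An unacknowledged task is resubmitted to the queue after `retry` seconds,
    # so keep this longer than `timeout` to avoid piling up duplicate copies.
    "retry": 3 * 60 * 60 + 300,
}
```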

Another relevant and interesting difference between v1.0.2 and the current version (v1.3.6) is that retry used to default to 60 seconds but now defaults to None. On the current version, just setting timeout and keeping the default for retry, as we did here, wouldn't cause this type of cycle.

I think we should probably upgrade Django Q sometime soonish, since it seems to be actively developed and there are probably rough edges they're smoothing out over time. But it doesn't make sense to do that in this PR, since what's here fixes the immediate problem.