Closed KlaasH closed 3 years ago
It turns out `max_attempts` isn't going to do anything here. I was looking at https://django-q.readthedocs.io/en/stable/configure.html#max-attempts, but we're running version 1.0.2 of Django Q. For some reason the Read the Docs page has lots of really old versions but nothing between 1.0.0 and stable. Looking in the repo, it turns out the v1.0.2 version of that documentation page doesn't include `max_attempts`.
So I think all the retrying was due to the weird interaction between their `retry` parameter and `timeout`. It's explained in the `retry` section, but basically `timeout` is the time before a task is killed and `retry` is the time before it's resubmitted to the queue. So if the latter is shorter than the former, a bunch of copies can get added to the queue while the first is still running. Then once the first finally fails and one of the retry tasks starts, I think new copies of that get added in the same way. So the cycle only stops, at least in our situation, when the jobs start failing within the `retry` time limit because the disk is full.
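To make the arithmetic concrete, here's a small sketch (not Django Q's actual scheduler, just an illustration of the interaction described above) of how many duplicate submissions pile up when `retry` is shorter than the task's lifetime:

```python
def duplicate_submissions(task_runtime, retry, timeout):
    """Count how many times the broker resubmits a task while the
    original is still running. The original gets killed at `timeout`
    seconds (or finishes at `task_runtime`, whichever comes first),
    but a resubmission fires every `retry` seconds until then.
    Illustrative model only, not Django Q internals."""
    killed_at = min(task_runtime, timeout)
    copies = 0
    t = retry
    while t < killed_at:
        copies += 1
        t += retry
    return copies

# Old v1.0.2 defaults: retry=60. With timeout=600 (10 minutes) and an
# import that runs the full 10 minutes, 9 duplicates get queued before
# the original is even killed -- and each duplicate repeats the cycle.
print(duplicate_submissions(task_runtime=600, retry=60, timeout=600))  # → 9

# If retry is comfortably longer than timeout, no duplicates appear.
print(duplicate_submissions(task_runtime=600, retry=3600, timeout=600))  # → 0
```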
Another relevant and interesting difference between v1.0.2 and the current version (v1.3.6) is that `retry` used to default to 60 but now defaults to `None`. So now just setting `timeout` and keeping the default for `retry`, as we did, wouldn't cause this type of cycle.
I think we should probably upgrade Django Q some time soonish, since it seems like it's being actively developed and there might be some rough edges that they're reducing over time. But it probably doesn't make sense to do that with this PR, since what's here fixes the immediate problem.
Overview
Big imports are timing out and getting killed by Django Q. They also leave their working files behind when that happens, so when they get retried dozens of times, the disk ends up full.
To stop all that from happening, this increases the timeout (from 10 minutes to 3 hours) and sets the number of retries at only 1. (Side note: I can see where "no retries" might not be a good default for a task processing system and any particular number would seem arbitrary, but "keep retrying until the end of time" still doesn't seem like the best choice of default behavior to me.)
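As a sketch, the settings change described above might look something like this in the Django `settings.py` (`timeout` and `retry` are documented Django Q options; the exact keys and values used in this PR's diff are an assumption, and the cluster name is hypothetical):

```python
# settings.py -- illustrative sketch, not the exact diff from this PR.
# The key point is keeping `retry` larger than `timeout`, so a slow task
# is never resubmitted to the queue while the original is still running.
Q_CLUSTER = {
    "name": "pfb",                # hypothetical cluster name
    "timeout": 3 * 60 * 60,       # kill a task after 3 hours (was 10 minutes)
    "retry": 3 * 60 * 60 + 60,    # only resubmit after the task has been killed
}
```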
I don't know what the upper limit is for how long an import could take. I ran one of the imports that failed in production (Oslo, Norway, using this boundary and the analysis results at https://production-pfb-storage-us-east-1.s3.amazonaws.com/international_results_2021/europe/osloNO.zip) and it took just under 20 minutes.
Since killing a job that's still actually running would be a bummer, and waiting too long to mark an import unsuccessful if it hangs doesn't seem like that big of a problem, I went with 3 hours as a timeout, since that seems well outside the range of what we would expect even very large imports to take.
Exception.message error
While looking through the logs to figure out what was going wrong here, I also saw some crashes like this:
The reason it's failing in the first place is that it can't open the zip file, but then in our code handling that exception, we were trying to use the `.message` attribute of the Exception, which went away in Python 3 (see https://www.python.org/dev/peps/pep-0352/). So this fixes the places where we're doing that.
Resolves #841
Testing Instructions
Start the VM and run `scripts/server` inside it to start the app. The `scripts/server` command should show logs, but you can also do `vagrant ssh -c "cd /vagrant && docker-compose logs -f django-q"` to see them.
Try a local import using a URL that isn't actually a zip of analysis results (e.g. https://www.azavea.com/, which will load but definitely isn't a zip file of analysis results). It should raise an exception and show the stack trace in the logs, but the second error should be a `LocalAnalysisFetchException`, not an `AttributeError`.
Checklist