Closed cmc333333 closed 8 years ago
Hrm. Something specifically to check: I had an issue with my s3 config, so the worker barfed:
[2016-05-27 17:13:14,242: ERROR/MainProcess] Task regulations.tasks.submit_comment[6d864d01-d371-49ad-80ac-32fab87d6be1] raised unexpected: ParamValidationError("Parameter validation failed:\nInvalid type for parameter Bucket, value: None, type: <type 'NoneType'>, valid types: <type 'basestring'>",)
Traceback (most recent call last):
  File "/home/vagrant/.virtualenvs/nc/local/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/vagrant/.virtualenvs/nc/local/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/vagrant/regulations-site/regulations/tasks.py", line 51, in submit_comment
    pdf_url = cache_pdf(comment_pdf, metadata_url)
  File "/home/vagrant/regulations-site/regulations/tasks.py", line 126, in cache_pdf
    url = SignedUrl.generate()
  File "/home/vagrant/regulations-site/regulations/tasks.py", line 198, in generate
    Params=params,
  File "/home/vagrant/.virtualenvs/nc/local/lib/python2.7/site-packages/botocore/signers.py", line 487, in generate_presigned_url
    params, operation_model)
  File "/home/vagrant/.virtualenvs/nc/local/lib/python2.7/site-packages/botocore/validate.py", line 269, in serialize_to_request
    raise ParamValidationError(report=report.generate_report())
ParamValidationError: Parameter validation failed:
Invalid type for parameter Bucket, value: None, type: <type 'NoneType'>, valid types: <type 'basestring'>
(might be a scroll error in the above, my terminal was wonky)
Unfortunately, restarting the worker doesn't lead it to try again
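For what it's worth, the traceback above only blows up deep inside botocore, after the task has already started work. A minimal sketch of failing fast with a clearer message before any signing call is made. `require_setting`, `presign_params`, and the `ATTACHMENT_BUCKET` name are hypothetical and not taken from regulations-site; this is written for Python 3 (the worker above runs 2.7, where the check would be against `basestring`):

```python
def require_setting(name, value):
    """Return value if it is a non-empty string, otherwise raise a clear error."""
    if not isinstance(value, str) or not value.strip():
        raise ValueError(
            "Setting %s must be a non-empty string, got %r" % (name, value))
    return value


def presign_params(bucket, key):
    """Build the Params dict for a presigned-URL call, validating inputs first.

    Hypothetical helper: the real code passes its own params to
    generate_presigned_url; this only shows where a check could sit.
    """
    return {
        "Bucket": require_setting("ATTACHMENT_BUCKET", bucket),
        "Key": require_setting("key", key),
    }
```

With a check like this, a missing bucket would surface as a `ValueError` naming the setting rather than as a `ParamValidationError` from the signer. It would not change the retry behaviour described above.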
Sharing short convo on the topic for transparency:
@cmc333333:
@vrajmohan @jmcarp: sorry to distract, but I don't think we have any automated mechanism to know when either 1) the worker process isn't started or has died, or 2) the worker process retried 3 times (a "success" state in all our metrics).
I vaguely recall (but need to confirm via testing) that 3) worker explosions won't lose data and that 4) worker explosions would trigger some sort of notification in new relic.
I think an appropriate course of action is:
- Account for 1) through manual inspection
- Poke around the newrelic library to see if we can trigger a notification for 2)
- Confirm via testing 3) and 4)
Does that sound right?
@vrajmohan:
That sounds right. 2) can be detected from the logs and from the presence of data [via manual inspection of ...] regulations_failedcommentsubmission.
Unfortunately, our current architecture won't notify us of most failures (though they will be saved). We'll need to manually check -- a pain, but we're out of time.
We need to be sure that our system notifies us when there's a submission failure and the worker gives up. We'll need to manually follow up to submit.
For this issue to be closed, we should have a plan and have tested it.
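To make the "worker gives up" case concrete, here is a rough sketch of what a notify-on-give-up wrapper could look like. It is standalone Python rather than Celery's own retry mechanism, and every name in it (`run_with_retries`, `record_failure`, the `failed` list) is hypothetical; `failed` stands in for whatever failure table or alerting hook we end up using:

```python
import functools
import logging

logger = logging.getLogger("submission")

failed = []  # stand-in for a failure table or an alerting hook


def record_failure(exc, args, kwargs):
    """Record a submission that was abandoned after all retries."""
    failed.append((repr(exc), args, kwargs))


def run_with_retries(func, max_retries=3, on_give_up=None):
    """Wrap func so it is attempted up to max_retries times.

    If every attempt raises, on_give_up(exc, args, kwargs) is called once and
    the last exception is re-raised, so giving up is never a silent "success".
    """
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        last_exc = None
        for attempt in range(1, max_retries + 1):
            try:
                return func(*args, **kwargs)
            except Exception as exc:  # deliberately broad catch
                last_exc = exc
                logger.warning(
                    "attempt %d/%d failed: %r", attempt, max_retries, exc)
        if on_give_up is not None:
            on_give_up(last_exc, args, kwargs)
        raise last_exc

    return wrapper
```

The point of re-raising after calling `on_give_up` is that the final failure stays visible to whatever is monitoring the worker, instead of being swallowed once the retries run out.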
Todos:
- Verify visibility timeout, that if a worker crashes, it will be retried in an hour
- Implement general purpose try-catch for exceptions
- CELERY_ACKS_LATE = True global setting; add it only to the relevant tasks
- Poke around the newrelic library to see if we can trigger a notification
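For the visibility timeout and acks-late items, a sketch of what the settings might look like, using the old-style uppercase Celery setting names that the todo already uses. This assumes a broker that supports a visibility timeout (for example Redis or SQS), which the thread does not confirm, and it reads the todo as making late acknowledgement opt-in per task; `VISIBILITY_TIMEOUT_SECONDS` is a name made up for this example:

```python
# Settings fragment (hypothetical values, not copied from the project).

# An unacknowledged message is redelivered to another worker after this
# many seconds; one hour matches the todo item above.
VISIBILITY_TIMEOUT_SECONDS = 60 * 60

BROKER_TRANSPORT_OPTIONS = {"visibility_timeout": VISIBILITY_TIMEOUT_SECONDS}

# Left at Celery's default so that late acknowledgement is opt-in per task.
CELERY_ACKS_LATE = False

# Per-task opt-in would then live in the tasks module (not executed here,
# since it needs a configured Celery app):
#
#     @app.task(acks_late=True)
#     def submit_comment(...):
#         ...
```

With `acks_late=True` a task's message is acknowledged only after the task finishes, so a worker that crashes mid-task leaves the message unacknowledged and it becomes eligible for redelivery once the visibility timeout passes. That is the behaviour the first todo item asks us to verify.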