18F / epa-notice

Web interface for viewing and commenting on proposed changes to federal regulations

Alerts on worker failure #286

Closed cmc333333 closed 8 years ago

cmc333333 commented 8 years ago

We need to be sure that our system notifies us when a submission fails and the worker gives up. In those cases, we'll need to follow up manually to submit the comment.

For this issue to be closed, we should have a plan and have tested it.

Todos:

cmc333333 commented 8 years ago

Hrm. Something specific to check: I had an issue with my S3 config, so the worker barfed:

[2016-05-27 17:13:14,242: ERROR/MainProcess] Task regulations.tasks.submit_comment[6d864d01-d371-49ad-80ac-32fab87d6be1] raised unexpected: ParamValidationError("Parameter validation failed:\nInvalid type for parameter Bucket, value: None, type: <type 'NoneType'>, valid types: <type 'basestring'>",)
Traceback (most recent call last):
  File "/home/vagrant/.virtualenvs/nc/local/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/vagrant/.virtualenvs/nc/local/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/vagrant/regulations-site/regulations/tasks.py", line 51, in submit_comment
    pdf_url = cache_pdf(comment_pdf, metadata_url)
  File "/home/vagrant/regulations-site/regulations/tasks.py", line 126, in cache_pdf
    url = SignedUrl.generate()
  File "/home/vagrant/regulations-site/regulations/tasks.py", line 198, in generate
    Params=params,
  File "/home/vagrant/.virtualenvs/nc/local/lib/python2.7/site-packages/botocore/signers.py", line 487, in generate_presigned_url
    params, operation_model)
  File "/home/vagrant/.virtualenvs/nc/local/lib/python2.7/site-packages/botocore/validate.py", line 269, in serialize_to_request
    raise ParamValidationError(report=report.generate_report())
ParamValidationError: Parameter validation failed:
Invalid type for parameter Bucket, value: None, type: <type 'NoneType'>, valid types: <type 'basestring'>

(might be a scroll error in the above, my terminal was wonky)

Unfortunately, restarting the worker doesn't lead it to try again.

cmc333333 commented 8 years ago

Sharing short convo on the topic for transparency:

@cmc333333:

@vrajmohan @jmcarp: sorry to distract, but I don't think we have any automated mechanism to know either 1) when the worker process hasn't started or has died, or 2) when the worker process has retried 3 times and given up (a "success" state in all our metrics).

I vaguely recall (but need to confirm via testing) that 3) worker explosions won't lose data and that 4) worker explosions would trigger some sort of notification in New Relic.

I think an appropriate course of action is:

  • Account for 1) through manual inspection
  • Poke around the newrelic library to see if we can trigger a notification for 2)
  • Confirm via testing 3) and 4)

Does that sound right?

@vrajmohan:

That sounds right. 2) can be detected from the logs and from the presence of data [via manual inspection of ...] regulations_failedcommentsubmission.
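
As a rough illustration of the "detected from the logs" part, here is a minimal sketch that scans worker log lines for the "raised unexpected" error shape shown in the traceback earlier in this thread. The names `FAILURE_RE` and `find_failed_tasks` are hypothetical and not part of the project, and the pattern is an assumption based on that single sample line, so other Celery versions or log formats may need a different one.

```python
import re

# Assumed line shape, taken from the one sample in this thread:
# [<timestamp>: ERROR/<process>] Task <name>[<task id>] raised unexpected: <error>
FAILURE_RE = re.compile(
    r"^\[(?P<timestamp>[^\]]+?): ERROR/[\w-]+\] "
    r"Task (?P<task>[\w.]+)\[(?P<task_id>[0-9a-f-]+)\] "
    r"raised unexpected: (?P<error>.*)$"
)


def find_failed_tasks(log_lines):
    """Return one dict per matching failure line in an iterable of log lines."""
    failures = []
    for line in log_lines:
        match = FAILURE_RE.match(line.rstrip("\n"))
        if match:
            # Keys: timestamp, task, task_id, error
            failures.append(match.groupdict())
    return failures
```

Passing an open log file (or any iterable of lines) to `find_failed_tasks` would give a list to check by hand. This only covers the log side; the saved rows mentioned above would still need their own check.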

cmc333333 commented 8 years ago

Unfortunately, our current architecture won't notify us of most failures (though they will be saved). We'll need to manually check -- a pain, but we're out of time.
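
For anyone picking this up later, the behavior we wanted (an alert once the worker gives up) can be sketched in plain Python. This is not how the project or Celery implements retries; `run_with_retries` and its `on_give_up` callback are hypothetical names used only to show the idea of hooking a notification onto the final failure.

```python
def run_with_retries(task, max_attempts, on_give_up):
    """Call task() up to max_attempts times.

    Returns the task's result on the first success. If every attempt
    raises, calls on_give_up(last_exception) once and returns None.
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            return task()
        except Exception as exc:  # broad on purpose: any failure counts
            last_error = exc
    # All attempts failed: this is the point where an alert would fire.
    on_give_up(last_error)
    return None
```

In a sketch like this, `on_give_up` could send an email or record a metric, so that a give-up stops looking like a "success" in the metrics.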