datahubio / datahub-v2-pm

Project management (issues only)

[Epic] Sending Emails when flow is broken #284

Closed zelima closed 5 years ago

zelima commented 5 years ago

As a publisher, I want to be notified when something is broken in an automated dataset, so that I can take a look and fix it.

Acceptance Criteria

Tasks

Analysis

Question: What do we really need here? What is the main user story?

Getting the Emails

Q: How should we add users to the list/segments? A: Both status.io and statuspage.io have a method for adding an email address to the subscribers list for incidents.

Q: Who exactly should get the emails? A: I assume this service is not for everybody, so we should have some kind of flag in the database indicating who is subscribed. Alternatively, we may create the list of subscribed users manually and assign it to the incidents.

Q: Will this be automated, or manual for now? A: It really depends on who we want to send emails to. If we want to define the recipients manually, then we should create them manually, but both APIs allow us to create the subscribers list via the API. A: We will, though, need to update the DB structure to flag that a user is subscribed to failure notifications. Or, for now, we can use the certified flag and, if it is true, send the notification to that user as well.
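
As a rough illustration of the flag-based option, here is a sketch that prefers a dedicated subscription flag and falls back to the certified flag until one exists. The `User` shape and the field names `certified` and `notify_on_failure` are assumptions made for this example, not the actual DB schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class User:
    # Hypothetical shape of a user row; the real column names may differ.
    email: str
    certified: bool = False
    notify_on_failure: Optional[bool] = None  # None means the dedicated flag is not set yet


def wants_failure_email(user: User) -> bool:
    """Prefer the dedicated flag; fall back to `certified` when it is unset."""
    if user.notify_on_failure is not None:
        return user.notify_on_failure
    return user.certified


def failure_recipients(users: List[User]) -> List[str]:
    """Return the emails of all users who should hear about a broken flow."""
    return [u.email for u in users if wants_failure_email(u)]
```

With this shape, a certified user can still opt out later by setting the dedicated flag to false.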

What info should users get?

Q: What exactly does something failed mean? A:

Technical implementation

WIP

I think we can add a new method to datahub-email and use it in flowmanager (specstore).

...
# Notify the publisher when the flow failed or reported any errors
if flow_status == STATE_FAILED or len(errors) > 0:
  emails.send_incident(user.email)
...
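
A minimal, self-contained sketch of how that check might be wired up. All names here are hypothetical: `STATE_FAILED`, `should_notify`, `notify_on_failure`, and the `send_incident` callable stand in for whatever datahub-email and flowmanager actually expose.

```python
from typing import Callable, List

# Hypothetical status constant; the real value lives in flowmanager (specstore).
STATE_FAILED = "FAILED"


def should_notify(flow_status: str, errors: List[str]) -> bool:
    """Return True when the flow failed outright or reported any errors."""
    return flow_status == STATE_FAILED or len(errors) > 0


def notify_on_failure(
    flow_status: str,
    errors: List[str],
    recipient: str,
    send_incident: Callable[[str], None],
) -> bool:
    """Call send_incident(recipient) if the flow is broken; report whether it was called."""
    if should_notify(flow_status, errors):
        send_incident(recipient)
        return True
    return False


# Usage with a stand-in sender that only records the address.
sent: List[str] = []
notify_on_failure(STATE_FAILED, [], "publisher@example.com", sent.append)
notify_on_failure("SUCCESS", [], "publisher@example.com", sent.append)
```

Passing the sender in as a callable keeps the decision logic testable without actually sending mail.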

Services

https://apex.sh/ping/


Plan | BASIC | PLUS | PRO
-- | -- | -- | --
MONITORS: Number of URLs which can be monitored. | 15 | 50 | 100
TEAM MEMBERS: Number of team members who can view and manage monitors, alerts, and status pages. | 5 | 10 | 20
ALERT RULES: Number of alert rules for triggering downtime or performance notifications. | 2 | 10 | 30
REGIONS: Number of regions a single monitor will perform requests from. | 15 | 15 | 15
STATUS PAGES: Number of status pages you may create. Typically you'd have one per domain, such as status.apex.sh. | ∞ | ∞ | ∞
  | 10 | 30 | 60

This one seems to have everything in one place and is relatively cheap. There is no public API available, though, so we would have to build everything manually, and we won't be able to create an incident on failure.

Pingdom (https://www.pingdom.com/)

Starter | Standard | Advanced | Professional
-- | -- | -- | --
$11.95/mo | $36.00/mo | $72.00/mo | $199/mo

This one is really awesome: it monitors the whole application live, checking speed, state, performance, etc. But it is not quite what we need for this task, since it does not allow us to create custom incidents and alert users about them.

https://status.io/

All plans include

Authentication is pretty simple: only the provided API_ID and API_KEY are required. Once authenticated, there are several useful methods available:

import statusio
api = statusio.Api(api_id='api_id', api_key='api_key')

api.IncidentCreate(statuspage_id, notify_email=0, ...)
api.SubscriberList(statuspage_id)
api.SubscriberAdd(statuspage_id, method, address, silent=1, granular='')
api.SubscriberUpdate(statuspage_id, subscriber_id, address, granular='')
api.SubscriberRemove(statuspage_id, subscriber_id)
...

I assume a statuspage_id is assigned to each status page, which would also mean that multiple status pages may be created, e.g. one for each dataset (not sure, though).

I think we can start experimenting with it.
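
One way such an experiment might start is by subscribing publishers through `SubscriberAdd`, using the call shape listed above. This is only a sketch: the helper `subscribe_publishers` and the `RecordingApi` stand-in are hypothetical, and passing `"email"` as the method value and `silent=1` are assumptions that should be checked against the status.io docs.

```python
from typing import Any, Iterable, List


def subscribe_publishers(api: Any, statuspage_id: str, addresses: Iterable[str]) -> List[Any]:
    """Add each address as a subscriber and collect the raw API responses."""
    responses = []
    for address in addresses:
        # "email" as the method value and silent=1 are assumptions, not confirmed.
        responses.append(api.SubscriberAdd(statuspage_id, "email", address, silent=1))
    return responses


class RecordingApi:
    """Stand-in for statusio.Api that records calls instead of hitting the network."""

    def __init__(self) -> None:
        self.calls = []

    def SubscriberAdd(self, statuspage_id, method, address, silent=1, granular=""):
        self.calls.append((statuspage_id, method, address, silent))
        return {"status": "recorded"}


# Dry run against the stand-in; swap in a real statusio.Api instance to try it live.
fake_api = RecordingApi()
subscribe_publishers(fake_api, "page-1", ["a@example.com", "b@example.com"])
```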

https://www.statuspage.io/

I like this one the most.

Hobby

Simple authentication with an API key in the request header.

Useful API endpoints

POST https://api.statuspage.io/v1/pages/{page_id}/incidents/{incident_id}/subscribers
GET https://api.statuspage.io/v1/pages/{page_id}/incidents/{incident_id}/subscribers

POST https://api.statuspage.io/v1/pages/{page_id}/incidents
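
For illustration, a request to the incident-creation endpoint could be built like this with the standard library. The `Authorization: OAuth <key>` header format and the `{"incident": {...}}` payload shape are assumptions to verify against the statuspage.io docs, and the helper name and placeholder values are hypothetical. The request is only built here, not sent.

```python
import json
import urllib.request

API_BASE = "https://api.statuspage.io/v1"


def build_create_incident_request(
    page_id: str, api_key: str, name: str, body: str
) -> urllib.request.Request:
    """Build (but do not send) a POST to /pages/{page_id}/incidents."""
    payload = {"incident": {"name": name, "body": body}}  # assumed payload shape
    return urllib.request.Request(
        url=f"{API_BASE}/pages/{page_id}/incidents",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"OAuth {api_key}",  # assumed header format
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_create_incident_request(
    "PAGE_ID", "API_KEY", "Flow failed", "The dataset flow is broken."
)
# urllib.request.urlopen(req) would actually send it; left out on purpose.
```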

These two (statuspage.io and status.io) seem quite similar in their functionality and API, so we can choose either of them.

http://cachethq.io/

Not sure how exactly we can use this and don't really like it tbh:

IFTTT

I don't really understand how this might work with our services. It seems to me that if I sit down and really dig into their docs, it will take much more time than any of the options listed above, so I'm skipping it at this point.

zelima commented 5 years ago

FIXED