automatically switch ESP during downtime

tgehrs commented 7 years ago

Downtime from ESPs is a pain (4 hours into SendGrid send delays right now), and in reading about this library it would be a very neat feature if during downtime from your preferred ESP, anymail would switch to a backup or SMTP. Would this fit into the scope of anymail?

How this might work

settings contain backups

INSTALLED_APPS = (
    ...
    "anymail"
)

ANYMAIL = {
    # now would need to have multiple
    "MAILGUN_API_KEY": "<your Mailgun key>",
    "MAILGUN_SENDER_DOMAIN": 'mg.example.com',
}
EMAIL_BACKEND = "anymail.backends.mailgun.MailgunBackend" #this is the default
BACKUP_BACKENDS = ["anymail.backends.sendgrid.SendGridBackend", "django.core.mail.backends.smtp.EmailBackend"]
DEFAULT_FROM_EMAIL = "you@example.com"

webhook flow

webhooks for ESP status pages hosted on statuspage.io can be set up within Anymail (SendGrid, Mailgun, SparkPost)
confirm the unauthenticated webhook with a call to that status page's API (example)
check the backups in order to find one that is up and running, then save the chosen backup as the current preferred connection
when the webhook for the resolved issue comes in, revert to the default as the preferred connection

add extra step to sending an email

current_backend = get_anymail_connection()
send_mail("Password reset", "Here you go", "noreply@example.com", ["user@example.com"],
          connection=current_backend)

medmunds commented 7 years ago

Really interesting idea -- thanks for the thought you've put into this. (Believe me, I understand the pain you're feeling with SendGrid recently.)

And it's clever to use the status APIs to determine if an ESP is usable, because the ESP's sending APIs often stay up, accepting and queuing messages, even when the ESP is significantly delaying delivery. So you have to find some other way to determine if the ESP is "down" for sending purposes. (BTW, is that statuspage.io v2/summary.json endpoint documented anywhere?)

There would be a handful of problems to work through:

1. Keeping track of the current preferred connection

Anymail deliberately doesn't maintain any persistent state, which avoids a whole lot of additional configuration overhead for users. I'm not sure how we could implement "save chosen backup as current preferred connection" within Anymail.

2. Data can diverge when mixing ESPs

This is probably more of a user-education issue: sending to the same set of recipients through multiple ESPs leads to potentially-confusing data divided among those ESPs.

If you're using your ESP's unsubscribe management, for example, you'd want to avoid backup ESPs sending to recipients on the primary ESP's unsubscribe list. And you'd have to figure out how to handle unsubscribes coming from messages sent through a backup ESP. That sort of syncing is beyond the scope of Anymail. (Similar issues with ESPs' bounced/blocked recipient lists.)

You'd also have to be careful how you interpret open and click rate data collected by one of the ESPs. (Or really, anything where ESPs maintain data on your behalf.)

3. Deciding if an ESP is "up"

As I said earlier, I think using the status API is clever. But it also looks like it could be tricky to decide whether the ESP is "up" from the status summary.

The simplest approach would be, if there are any open incidents, consider it down. But a lot of incidents aren't about sending/delivery -- e.g., Mailgun recently had a control panel outage, and SparkPost had delays in metrics and reporting. I don't think I'd have wanted to switch to a backup ESP in either of these cases.

We could try to be smarter by looking for outages only in particular components. Each ESP labels its status components differently, so that would take some research. (E.g., SendGrid is currently showing "Mail Sending" as "Degraded Performance," but all three of its subcomponents as "Operational" -- including the ones that track "the flow of mail generated by mail.send API requests.") It could also be fragile to future status page changes.
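A minimal sketch of the per-component idea, assuming the status summary has already been fetched and parsed into a dict with a `components` list whose entries carry `name` and `status` keys. That payload shape, the function name `components_working`, and the sample data are assumptions for illustration, not a confirmed description of any ESP's status page.

```python
# Sketch only: the payload shape is an assumption about a statuspage.io-style
# summary, and the component names below are made up for the example.
WORKING_STATUSES = {"operational"}


def components_working(summary, component_names):
    """Return True if every named component reports a working status.

    A named component that is missing from the summary counts as not working,
    so a renamed status component triggers the fallback instead of being
    silently ignored.
    """
    status_by_name = {
        component["name"]: component["status"]
        for component in summary.get("components", [])
    }
    return all(
        status_by_name.get(name) in WORKING_STATUSES
        for name in component_names
    )


# Hypothetical payload: reporting is degraded, but sending is fine.
summary = {
    "components": [
        {"name": "Mail Sending", "status": "operational"},
        {"name": "Reporting", "status": "degraded_performance"},
    ]
}

sending_ok = components_working(summary, ["Mail Sending"])
```

Whether a missing component should count as "down" is a judgment call: it avoids silently sending through a broken ESP after a status page reorganization, at the cost of falling back unnecessarily.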

Another problem is that (some) ESPs may not promptly list (some) service issues in their status pages. (Though I suppose you're no worse off in that case than you are now.)

All that said, though, I threw together a quick and untested implementation of a backend that checks status APIs and sends through the first "up" ESP. To avoid the whole persistent-state problem, it checks the status API(s) on every send. (Optimization is "left as an exercise for the reader." As is figuring out my bugs in it. :smile:) Feel free to try it out, fork and edit, and if you end up with something useful we can either add it to Anymail or at least to the docs as an advanced example.
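For readers who want the shape of "send through the first 'up' ESP" without opening the gist, here is a minimal sketch. It is not the gist's code: the helper name `first_working_backend` and the one-argument signature assumed for `is_backend_working` are illustrative, and the status check is passed in as a plain callable so the selection logic stays independent of any HTTP or Django code.

```python
def first_working_backend(backend_paths, is_backend_working):
    """Return the first backend path whose status check passes, else None.

    `is_backend_working` is a caller-supplied callable (assumed signature:
    it takes a backend path and returns a bool).
    """
    for path in backend_paths:
        try:
            if is_backend_working(path):
                return path
        except Exception:
            # Assumption: if the status check itself fails, we cannot tell
            # that the ESP is up, so move on to the next candidate.
            continue
    return None


# Backend paths as written in the settings proposal earlier in the thread.
BACKENDS = [
    "anymail.backends.mailgun.MailgunBackend",
    "anymail.backends.sendgrid.SendGridBackend",
    "django.core.mail.backends.smtp.EmailBackend",
]

# Stubbed statuses for the example: the primary ESP is reported as down.
stub_status = {
    "anymail.backends.mailgun.MailgunBackend": False,
    "anymail.backends.sendgrid.SendGridBackend": True,
    "django.core.mail.backends.smtp.EmailBackend": True,
}

chosen = first_working_backend(BACKENDS, lambda path: stub_status[path])
```

Calling this on every send keeps the approach stateless, which is the trade-off described above: no configuration overhead, but one or more status lookups per message.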

tgehrs commented 7 years ago

Thanks for the thoughtful response, definitely some things to work through here. I will fork it and try to work through your points here a bit.

Statuspage.io's "undocumented" v2 documentation: appending /api to a status page's URL will show it. There is also v1 documentation, which seems to be more for the page owners, but includes schema such as component status:

component[status] - The status, one of operational|degraded_performance|partial_outage|major_outage.

1. Keeping track of preferred connection

This would probably need a simple model. I noticed in another issue that keeping any sort of state is outside the scope of Anymail; with that in mind, maybe this belongs in a separate project.

2. Data divergence between ESPs

This is likely the deal breaker, especially if you are using ASM.

If you're using your ESP's unsubscribe management, for example, you'd want to avoid backup ESPs sending to recipients on the primary ESP's unsubscribe list.

With that in mind, two potential workarounds for the simplest use case:

Though in this current SendGrid fiasco, and from what I can tell in most of the cases this would deal with, their API is up, so simply syncing unsubscribes that way would not be the end of the world. Once again, though, that seems a bit outside the scope of Anymail
Once the status is resolved, sync unsubscribes from the fallback ESP to

3. Deciding if an ESP is up

Good points on the fragility of my suggested implementation. I reached out to StatusPage about if/how companies can make these "breaking" changes to their status page. Unfortunately, they do not send notifications for changes in component structure, only for changes in status. I plan to map these out as-is (shouldn't take long), but this could cause issues of not falling back.

medmunds commented 7 years ago

I thought a little more about keeping track of the preferred connection, and it seems like it would be perfectly reasonable to use Django's cache framework for this. That avoids the performance hit of checking status on every send, but without the complexity of adding models.

I updated the (still untested) gist with code to cache the preferred ESP, and added a webhook view that invalidates the cache: https://gist.github.com/medmunds/e6b837bb3b382098d775cb412b889632
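The caching idea can be sketched roughly as follows. This is not the gist's code. `InMemoryCache` is a stand-in that mimics only the `get`/`set`/`delete` subset of Django's cache API so the example runs without Django; in a real project, `django.core.cache.cache` would be passed in instead. The key name, the 300-second timeout, and the function names are assumptions for illustration.

```python
import time


class InMemoryCache:
    """Stand-in exposing the get/set/delete subset of Django's cache API."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        value, expires_at = self._data.get(key, (default, None))
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._data[key]  # entry has expired
            return default
        return value

    def set(self, key, value, timeout=None):
        expires_at = None if timeout is None else time.monotonic() + timeout
        self._data[key] = (value, expires_at)

    def delete(self, key):
        self._data.pop(key, None)


CACHE_KEY = "preferred_esp_backend"  # hypothetical key name


def get_preferred_backend(cache, backend_paths, is_backend_working, timeout=300):
    """Return the cached backend path, re-checking status only on a cache miss."""
    cached = cache.get(CACHE_KEY)
    if cached is not None:
        return cached
    for path in backend_paths:
        if is_backend_working(path):
            cache.set(CACHE_KEY, path, timeout)
            return path
    return None  # nothing is up, so there is nothing to cache


def handle_status_webhook(cache):
    """Forget the cached choice so that the next send re-checks status."""
    cache.delete(CACHE_KEY)
```

The timeout acts as a safety net if a status webhook is missed: the cached choice expires on its own and the next send re-checks.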

tgehrs commented 7 years ago

good call with caching, seems like the perfect use case.

I forked the gist and added a way to check a list of components, to isolate what the issue actually is, so we are not falling back if, for example, SendGrid's marketing service is down. (Un)luckily SendGrid is having some more downtime today, so I was able to confirm the is_component_working function works as expected.

Do you think checking components should replace is_backend_working, or should this be a setting? I will try to work on some testing soon.

medmunds commented 7 years ago

Nice. Yeah, I'd probably just move the component checking into is_backend_working. is_backend_working should represent our best guess at determining if the ESP is "up." If you've figured out the right components to check, that's a better test than the overall status check I had in there.

medmunds commented 7 years ago

I don't think this code really belongs in the core Anymail, but I'm going to add a link to your gist to the docs on using multiple backends.
