forwardemail / status.forwardemail.net

Status Page

https://status.forwardemail.net

MIT License

Monitoring TTI #1201

Status: Closed (titanism closed this issue 1 month ago)

titanism commented 1 month ago

We are actively monitoring our TTI (time to inbox) after identifying several issues. See https://github.com/forwardemail/status.forwardemail.net/issues/1201#issuecomment-2267260595.

If you have any issues receiving or sending mail, please email support@forwardemail.net.

Thank you! 🙏

titanism commented 1 month ago

We believe the culprit is Cloudflare DNS rate limiting our MX servers (or not responding in time). We have switched DNS providers to Google and are running tests over the next hour to confirm whether this is the cause. If it is, we will try to notify the team at Cloudflare.

Note that we do cache DNS responses using 🍊 Tangerine, but most DNS TTLs are set to 300 seconds, so cached entries expire every 5 minutes and each expiry results in another DNS request.
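The effect of a 300-second TTL on upstream query volume can be illustrated with a minimal sketch. This is not Tangerine's implementation; the `TTLCache` class, the `resolve` callback, and the injectable `clock` are hypothetical names introduced only for illustration.

```python
import time
from typing import Callable, Dict, List, Tuple


class TTLCache:
    """Cache DNS answers until their TTL elapses (illustrative only)."""

    def __init__(
        self,
        resolve: Callable[[str], Tuple[List[str], int]],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolve = resolve    # hypothetical resolver: name -> (answers, ttl_seconds)
        self._clock = clock        # injectable so expiry can be simulated
        self._entries: Dict[str, Tuple[List[str], float]] = {}
        self.upstream_queries = 0  # number of lookups actually sent upstream

    def lookup(self, name: str) -> List[str]:
        now = self._clock()
        cached = self._entries.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]       # still fresh: answered from cache
        # Missing or expired: ask upstream again and store the new expiry time.
        answers, ttl = self._resolve(name)
        self.upstream_queries += 1
        self._entries[name] = (answers, now + ttl)
        return answers
```

With a TTL of 300, a name that is looked up continuously still costs one upstream query every 5 minutes, so a slow or rate-limited resolver would surface periodically even with caching in place.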

titanism commented 1 month ago

Cloudflare DNS is not the culprit.

titanism commented 1 month ago

We've determined the culprit to be sporadically high CPU usage on our MX1/MX2 servers. We are investigating whether this is a DDoS attack or a performance issue related to code or server resources.

titanism commented 1 month ago

We believe the issue is network maintenance in DigitalOcean's AMS3 region, where one of our SMTP servers is hosted. See https://status.digitalocean.com/incidents/ks2j22y8jm6k.

We are taking the AMS3 SMTP server offline until DigitalOcean support follows up with us.

This issue first appeared for outbound SMTP and TTI monitoring on 7/31-8/2, which overlaps with the maintenance window posted at https://status.digitalocean.com/incidents/ks2j22y8jm6k.

Additionally, as a test we connected from our AMS3 SMTP server to the MX1 server, and we can confirm extremely slow network performance:

openssl s_client -starttls smtp -connect mx1.forwardemail.net:25

(takes at least 10 seconds to connect)

However, the same test from our San Jose SMTP region works fine.
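The manual check above can also be timed programmatically, which makes it easier to compare regions. The following is a rough sketch using Python's standard `smtplib`; the helper names `timed` and `starttls_probe` and the 30-second timeout are assumptions for illustration, not part of our tooling.

```python
import smtplib
import ssl
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(fn: Callable[[], T], clock: Callable[[], float] = time.monotonic) -> Tuple[T, float]:
    """Run fn and return (result, elapsed_seconds)."""
    start = clock()
    result = fn()
    return result, clock() - start


def starttls_probe(host: str, port: int = 25, timeout: float = 30.0) -> int:
    """Connect, send EHLO, upgrade with STARTTLS, and return the STARTTLS reply code."""
    # The constructor opens the TCP connection and reads the server greeting.
    with smtplib.SMTP(host, port, timeout=timeout) as client:
        client.ehlo()
        # Unlike a bare `openssl s_client`, the default context verifies the
        # certificate chain and hostname, so a bad certificate raises here.
        code, _ = client.starttls(context=ssl.create_default_context())
        return code
```

For example, `timed(lambda: starttls_probe("mx1.forwardemail.net"))` returns the reply code together with the elapsed seconds. It needs outbound access to port 25, which many networks block.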

We are closing this issue, as we have resolved it by taking the affected server offline and out of rotation.

titanism commented 1 month ago

We are re-opening this issue, as we are still seeing problems unrelated to AMS3. Stay tuned...

titanism commented 1 month ago

After a lengthy, all-day investigation, we identified three issues:

1. The SMTP server in the AMS3 region has had degraded network performance since DigitalOcean reported maintenance (see https://status.digitalocean.com/incidents/ks2j22y8jm6k). This affected outbound SMTP.
2. An unmonitored DDoS attack on our MX1/MX2 servers resulted in thousands of open sockets. This affected inbound forwarding and IMAP, and subsequently caused our TTI monitoring checks to report lengthy durations. We fixed it by blocking the spammers, and we are moving our socket rate-limiting check earlier in the code path (before greylisting even occurs).
3. Google DNS was timing out on our MX1/MX2 servers for non-application DNS requests. We fixed this by switching to Cloudflare only at the server level (see https://github.com/forwardemail/forwardemail.net/commit/a0ee91f2313e87df046842e3d85e495c23af1dd7).
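The idea of checking socket limits before greylisting can be sketched as follows. This is a simplified illustration, not our actual code: `SocketLimiter`, `on_connect`, the per-IP limit, and the reply strings are hypothetical, and greylisting is reduced to a single callback.

```python
from collections import defaultdict
from typing import Callable, DefaultDict


class SocketLimiter:
    """Track open sockets per remote IP and refuse new ones over a limit."""

    def __init__(self, max_open_per_ip: int) -> None:
        self.max_open_per_ip = max_open_per_ip
        self._open: DefaultDict[str, int] = defaultdict(int)

    def try_open(self, ip: str) -> bool:
        """Reserve a socket slot for ip, or return False if it is at the limit."""
        if self._open[ip] >= self.max_open_per_ip:
            return False
        self._open[ip] += 1
        return True

    def close(self, ip: str) -> None:
        """Release one slot for ip when its socket closes."""
        if self._open[ip] > 0:
            self._open[ip] -= 1


def on_connect(ip: str, limiter: SocketLimiter, greylist_check: Callable[[str], bool]) -> str:
    """Return the reply for a new connection, running the cheap check first."""
    if not limiter.try_open(ip):
        # Rejected before any greylisting work is done for this client.
        return "421 too many connections"
    if not greylist_check(ip):
        limiter.close(ip)  # the connection is dropped, so free its slot
        return "450 greylisted, try again later"
    return "220 ready"
```

Putting the counter check first means a client that opens thousands of sockets is turned away with a dictionary lookup, without triggering the more expensive greylisting logic on every connection.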

titanism commented 1 month ago

Another update: we have removed the AMS3 SMTP server completely and spun up a new server today in DigitalOcean's SFO3 region. It is in production and online with a clean IP reputation.