PostHog / meta

This is a place to discuss non-product issues in public.

Messaging: Startup Cancellation Apology #248

Closed: joethreepwood closed 1 month ago

joethreepwood commented 1 month ago

Something got messed up somewhere, presumably by me, and we told approximately 500 users on the Startup / YC backfill campaign that their credits were cancelled - when they weren't.

I've temporarily paused the backfill campaign and it seems like the new campaign didn't spike so we can presume that's clear for now. That buys us time to investigate, after which we'll know the cause and can apologise.

joethreepwood commented 1 month ago

Just a note to self: We've seen repeated issues with these campaigns, some caused by events not triggering and some caused by logic issues, anonymized users, etc.

We've solved the anonymized users issue, but long-term we think the correct way to solve this will be to split the PostHog for Startups and YC campaign into several component campaigns, one for each event. Best case, this avoids logic issues and solves the problem. Worst case, any remaining problems are at least compartmentalized.

joethreepwood commented 1 month ago

Looks like there are up to 410 affected users.

I think I've figured out why this happened and it was basically a confluence of two issues. The good news is that it won't repeat and that I can easily identify everyone who was impacted.

What happened here is that there was a previously known issue with the startup_plan_customer_added event and the startup_plan_customer_added_backfill event, the latter of which was used for people who joined the program before the former was created. The issue meant that some people triggered both events, according to @zlwaterfield here.

The solution was simple: within both campaigns, we basically added a step to filter out people who shouldn't be in that campaign and skipped them to the end of the workflow.
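For anyone reading later, here is a minimal Python sketch of the kind of filter step described above. It is not how the campaigns are actually built: the two event names come from this thread, but the `User` class, the `route` function, the step names and the rule that a user who fired both events belongs only to the non-backfill campaign are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Event names are taken from the thread; everything else here is hypothetical.
LIVE_EVENT = "startup_plan_customer_added"
BACKFILL_EVENT = "startup_plan_customer_added_backfill"


@dataclass
class User:
    email: str
    events: set[str] = field(default_factory=set)


def route(user: User, campaign_event: str) -> str:
    """Return the next step for a user entering the campaign keyed on campaign_event."""
    # Users who never fired this campaign's event do not belong in it.
    if campaign_event not in user.events:
        return "exit"
    # Assumed rule: a user who fired both events is handled by the live campaign only.
    if campaign_event == BACKFILL_EVENT and LIVE_EVENT in user.events:
        return "exit"
    return "first_message"


both = User("a@example.com", {LIVE_EVENT, BACKFILL_EVENT})
backfill_only = User("b@example.com", {BACKFILL_EVENT})
```

The point of the sketch is that filtered-out users are routed to `"exit"`, which sends nothing, rather than to any step that delivers a message.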

This would seem to be fine, except that apparently I (judging by the save history) accidentally made a change in the flow which meant that people who were filtered out were sent the final message in the flow instead of exiting without one. There were 410 people this happened to and they all triggered at the same time -- 5:40am BST.

The timestamp here indicates a compounding factor. Normally I could have spotted this as soon as it happened because the metrics in C.io would have immediately spiked. However, every message in the startup flow has a 12-hour delay before it to accommodate the billing system updating. I made the change at 5:40pm BST, but didn't see the spike because all users were immediately put into the 12-hour delay block, at which point they were hidden among the users who are correctly in that block. Thus, I didn't spot it.
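To make the timing concrete, here is a small Python sketch of why nothing showed up in the send metrics at the time of the change. The 12-hour delay, the 5:40pm change time and the 410 users are from this thread; the calendar date and the `sends_visible` helper are assumptions for illustration, and this does not use any C.io API.

```python
from datetime import datetime, timedelta

DELAY = timedelta(hours=12)  # per the thread: every message sits behind a 12-hour delay


def sends_visible(entered_at: list[datetime], checked_at: datetime) -> int:
    """Count users whose delay has elapsed, i.e. whose message has gone out, by checked_at."""
    return sum(1 for t in entered_at if t + DELAY <= checked_at)


# Illustrative date (an assumption); the 17:40 time and the 410 users come from the thread.
change_made = datetime(2024, 10, 9, 17, 40)
misrouted = [change_made] * 410

soon_after = sends_visible(misrouted, change_made + timedelta(minutes=20))
next_morning = sends_visible(misrouted, change_made + DELAY)
print(soon_after, next_morning)
```

Checking twenty minutes after the change shows zero sends, and all 410 only appear once the delay elapses at 5:40am, by which point the damage is done.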

joethreepwood commented 1 month ago
[Screenshot, 2024-10-10 at 11:05:17]

Draft apology. I've got a segment all set up.

joethreepwood commented 1 month ago
[Screenshot, 2024-10-10 at 11:10:10]

Scheduled