Open lobsterkatie opened 8 months ago
The only other thing is that I assume the only way this can happen is via some kind of transaction-rollback situation. Like: something to do with merging/unmerging groups. We increment the counter on unmerge, I believe, so this seems like a reasonable place to look as well. Maybe we don't handle transactions right there?

To test this theory out, we can look at `Activity`, with whatever enum values are merging/unmerging, to see if the timing works.
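To make the rollback theory concrete, here's a minimal sketch using an in-memory sqlite database as a stand-in for `sentry_projectcounter` and `sentry_groupedmessage` (toy schema and names, not our actual tables or save path): if the counter increment runs in a transaction that rolls back while the group insert that consumed the new `short_id` commits anyway, the counter lands exactly one behind, and every later group creation hits the unique constraint.

```python
import sqlite3

# Toy stand-ins for sentry_projectcounter and sentry_groupedmessage, purely
# to illustrate the rollback theory -- not Sentry's actual schema or code.
db = sqlite3.connect(":memory:", isolation_level=None)  # autocommit mode
db.executescript("""
    CREATE TABLE projectcounter (project_id INTEGER PRIMARY KEY, value INTEGER);
    CREATE TABLE groupedmessage (
        project_id INTEGER, short_id INTEGER,
        UNIQUE (project_id, short_id)
    );
    INSERT INTO projectcounter VALUES (1, 41597);
    INSERT INTO groupedmessage VALUES (1, 41597);
""")

# Suppose the counter increment runs inside a transaction that later rolls
# back (the hypothesized bug)...
db.execute("BEGIN")
db.execute("UPDATE projectcounter SET value = value + 1 WHERE project_id = 1")
new_short_id = db.execute(
    "SELECT value FROM projectcounter WHERE project_id = 1"
).fetchone()[0]  # 41598
db.execute("ROLLBACK")  # some later failure aborts the transaction

# ...while the group insert that consumed the incremented value happens
# outside that transaction and sticks. The counter is now one behind.
db.execute("INSERT INTO groupedmessage VALUES (1, ?)", (new_short_id,))
counter = db.execute("SELECT value FROM projectcounter").fetchone()[0]
max_short_id = db.execute("SELECT MAX(short_id) FROM groupedmessage").fetchone()[0]
print(counter, max_short_id)  # 41597 41598

# Every subsequent new group now repeats the same doomed dance: increment,
# collide with the already-taken short_id, roll back, stay stuck.
conflict = None
db.execute("BEGIN")
db.execute("UPDATE projectcounter SET value = value + 1 WHERE project_id = 1")
next_id = db.execute(
    "SELECT value FROM projectcounter WHERE project_id = 1"
).fetchone()[0]  # 41598 again
try:
    db.execute("INSERT INTO groupedmessage VALUES (1, ?)", (next_id,))
except sqlite3.IntegrityError as e:
    conflict = str(e)  # a "UNIQUE constraint failed" error, the sqlite
                       # analogue of our UniqueViolation
db.execute("ROLLBACK")
print(conflict)
```

An unmerge that increments the counter inside an atomic block which later aborts would fit this shape, which is why the `Activity` timing check above seems worth doing.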
About a month ago, we started seeing a bunch of `IntegrityError`s in the `store.save_event` task. More specifically, they were `UniqueViolation('duplicate key value violates unique constraint "sentry_groupedmessage_project_id_<random internal hexcode>_uniq"\nDETAIL: Key (project_id, short_id)=(12311121, 41598) already exists.')` errors. (Legal folks, don't worry - those aren't real numbers, they're my dogs' birthdays and adoption dates.)

@wedamija eventually tracked it down to the projects' `Counter` and `Group.short_id` values being out of sync. Each project has one row in the `sentry_projectcounter` table, whose job it is to keep track of the most recent/highest `short_id` assigned to an issue in that project, such that a new group coming in can take the `short_id` one higher than that and never run into a conflict. (We have a unique constraint on the `project_id`/`short_id` pairing in the `sentry_groupedmessage` table.) Normally, that works great.

For each project throwing the `UniqueViolation` errors, however, the project counter value had fallen one behind the highest assigned `short_id`. Therefore, every time we tried to create a new issue for that project, we tried to give it what the project counter thought was the next available `short_id`, only to find that that `short_id` was already taken by the most recently created existing issue. At that point we'd just error out, so the existing newest `short_id` would stay newest, the project counter wouldn't get incremented, and the group wouldn't get created. Then the next novel event would come in, we'd try to create a group for it, all the same things would happen, and we'd be effectively stuck. Projects in this state can still accept events into existing issues, but can't create any new ones.

Steps to resolve this:
[x] Manually fix broken projects by forcing the counter to have the right value. When this was a one-off, we figured it was gremlins and hoped it never came back. Then it did, so we fixed those projects, and then... it was clear we needed a better solution. PRs for those manual fixes:
[x] Add a workaround, so that any project found to be stuck is immediately unstuck and no new groups are dropped. PRs for that:
[ ] Figure out how common this is, using this DD query and this GCP logs query. (NOTE: You may have to update the timeframes on each of these, as I can't seem to get them to cover a generic timeframe which will update itself.)
[ ] Figure out how the heck our data is getting out of whack in the first place.
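For reference, here's a rough sketch of what detecting and unsticking a broken project looks like at the database level, again with a toy sqlite schema standing in for the real tables (this is not our actual workaround code): find projects whose counter has fallen behind the highest assigned `short_id`, then force the counter up to match, so the next group gets `max(short_id) + 1`.

```python
import sqlite3

# Toy stand-ins for the real tables; names and schema are illustrative only.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE projectcounter (project_id INTEGER PRIMARY KEY, value INTEGER);
    CREATE TABLE groupedmessage (
        project_id INTEGER, short_id INTEGER,
        UNIQUE (project_id, short_id)
    );
    -- Project 1 is healthy; project 2's counter is one behind.
    INSERT INTO projectcounter VALUES (1, 5), (2, 41597);
    INSERT INTO groupedmessage VALUES (1, 5), (2, 41597), (2, 41598);
""")

# Detection: projects where the highest assigned short_id exceeds the counter.
stuck = db.execute("""
    SELECT c.project_id
    FROM projectcounter c
    JOIN groupedmessage g USING (project_id)
    GROUP BY c.project_id
    HAVING MAX(g.short_id) > c.value
""").fetchall()
print(stuck)  # [(2,)]

# Fix: force each stuck counter up to the highest assigned short_id.
for (project_id,) in stuck:
    db.execute("""
        UPDATE projectcounter
        SET value = (SELECT MAX(short_id) FROM groupedmessage
                     WHERE project_id = ?)
        WHERE project_id = ?
    """, (project_id, project_id))
db.commit()

fixed = db.execute(
    "SELECT value FROM projectcounter WHERE project_id = 2"
).fetchone()[0]
print(fixed)  # 41598
```

The detection query is also roughly the shape of the "how common is this" question in the third checkbox, just asked of the database instead of the logs.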
Two ideas from @wedamija:
- Simplify the code we use to increment the project counter, so that it's a simple update. This would require creating a `Counter` record for each project when it's created, and backfilling any projects which currently don't have a `Counter`.
- The timing of these errors appearing lines up pretty well with our upgrade to Django 5. See if we can figure out anything that changed there which might be causing this.
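For the first idea, here's a sketch of what the "simple update" increment could look like (toy sqlite schema; the real thing would be Django ORM against `sentry_projectcounter`): once every project is guaranteed a `Counter` row, allocating the next `short_id` is one atomic `UPDATE` plus a read in the same transaction, with no get-or-create branch left to race or roll back separately.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE projectcounter (project_id INTEGER PRIMARY KEY, value INTEGER)"
)
# Precondition of this approach: every project already has a counter row
# (created alongside the project, plus a backfill for existing projects).
db.execute("INSERT INTO projectcounter VALUES (1, 41597)")
db.commit()

def next_short_id(project_id):
    """Atomically increment the project's counter and return the new value."""
    with db:  # one transaction: the update and the read commit together
        cur = db.execute(
            "UPDATE projectcounter SET value = value + 1 WHERE project_id = ?",
            (project_id,),
        )
        if cur.rowcount == 0:
            # No row means the backfill missed this project; surface that
            # loudly rather than silently handing out a reused id.
            raise LookupError(f"no counter row for project {project_id}")
        return db.execute(
            "SELECT value FROM projectcounter WHERE project_id = ?",
            (project_id,),
        ).fetchone()[0]

print(next_short_id(1))  # 41598
print(next_short_id(1))  # 41599
```

The appeal of this shape is that the database serializes concurrent increments via the row lock the `UPDATE` takes, so there's much less of our own transaction handling left to get wrong.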