Open lobsterkatie opened 8 months ago
The only other thing is that I assume the only way this can happen is via some kind of transaction-rollback situation. Like: something to do with merging/unmerging groups. We increment the counter on unmerge, I believe, so this seems like a reasonable place to look as well. Maybe we don't handle transactions right there?

To test this theory out, we can look at `Activity`, with whatever enum values are merging/unmerging, to see if the timing works.
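To make the rollback theory concrete, here's a minimal sketch using an in-memory sqlite database as a stand-in for `sentry_projectcounter` and `sentry_groupedmessage` (toy schema and names, not our actual tables or save path): if the counter increment runs in a transaction that rolls back while the group insert that consumed the new `short_id` commits anyway, the counter lands exactly one behind, and every later group creation hits the unique constraint.

```python
import sqlite3

# Toy stand-ins for sentry_projectcounter and sentry_groupedmessage, purely
# to illustrate the rollback theory -- not Sentry's actual schema or code.
db = sqlite3.connect(":memory:", isolation_level=None)  # autocommit mode
db.executescript("""
    CREATE TABLE projectcounter (project_id INTEGER PRIMARY KEY, value INTEGER);
    CREATE TABLE groupedmessage (
        project_id INTEGER, short_id INTEGER,
        UNIQUE (project_id, short_id)
    );
    INSERT INTO projectcounter VALUES (1, 41597);
    INSERT INTO groupedmessage VALUES (1, 41597);
""")

# Suppose the counter increment runs inside a transaction that later rolls
# back (the hypothesized bug)...
db.execute("BEGIN")
db.execute("UPDATE projectcounter SET value = value + 1 WHERE project_id = 1")
new_short_id = db.execute(
    "SELECT value FROM projectcounter WHERE project_id = 1"
).fetchone()[0]  # 41598
db.execute("ROLLBACK")  # some later failure aborts the transaction

# ...while the group insert that consumed the incremented value happens
# outside that transaction and sticks. The counter is now one behind.
db.execute("INSERT INTO groupedmessage VALUES (1, ?)", (new_short_id,))
counter = db.execute("SELECT value FROM projectcounter").fetchone()[0]
max_short_id = db.execute("SELECT MAX(short_id) FROM groupedmessage").fetchone()[0]
print(counter, max_short_id)  # 41597 41598

# Every subsequent new group now repeats the same doomed dance: increment,
# collide with the already-taken short_id, roll back, stay stuck.
conflict = None
db.execute("BEGIN")
db.execute("UPDATE projectcounter SET value = value + 1 WHERE project_id = 1")
next_id = db.execute(
    "SELECT value FROM projectcounter WHERE project_id = 1"
).fetchone()[0]  # 41598 again
try:
    db.execute("INSERT INTO groupedmessage VALUES (1, ?)", (next_id,))
except sqlite3.IntegrityError as e:
    conflict = str(e)  # a "UNIQUE constraint failed" error, the sqlite
                       # analogue of our UniqueViolation
db.execute("ROLLBACK")
print(conflict)
```

An unmerge that increments the counter inside an atomic block which later aborts would fit this shape, which is why the `Activity` timing check above seems worth doing.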
About a month ago, we started seeing a bunch of `IntegrityError`s in the `store.save_event` task. More specifically, they were `UniqueViolation('duplicate key value violates unique constraint "sentry_groupedmessage_project_id_<random internal hexcode>_uniq"\nDETAIL: Key (project_id, short_id)=(12311121, 41598) already exists.')` errors. (Legal folks, don't worry - those aren't real numbers, they're my dogs' birthdays and adoption dates.)

@wedamija eventually tracked it down to the projects' `Counter` and `Group.short_id` values being out of sync. Each project has one row in the `sentry_projectcounter` table, whose job it is to keep track of the most recent/highest `short_id` assigned to an issue in that project, such that a new group coming in can take the `short_id` one higher than that and never run into a conflict. (We have a unique constraint on the `project_id`/`short_id` pairing in the `sentry_groupedmessage` table.) Normally, that works great.

For each project throwing the `UniqueViolation` errors, however, the project counter value had fallen one behind the highest assigned `short_id`. Therefore, every time we tried to create a new issue for that project, we tried to give it what the project counter thought was the next available `short_id`, only to find that that `short_id` was already taken by the most recently created existing issue. At that point we'd just error out, so the existing newest `short_id` would stay newest, the project counter wouldn't get incremented, and the group wouldn't get created. Then the next novel event would come in, we'd try to create a group for it, all the same things would happen, and we'd be effectively stuck. Projects in this state can still accept events into existing issues, but can't create any new ones.

Steps to resolve this:
[x] Manually fix broken projects by forcing the counter to have the right value. When this was a one-off, we figured it was gremlins and hoped it never came back. Then it did, so we fixed those projects, and then... it was clear we needed a better solution. PRs for those manual fixes:
[x] Add a workaround, so that any project found to be stuck is immediately unstuck and no new groups are dropped. PRs for that:
[ ] Figure out how common this is, using this DD query and this GCP logs query. (NOTE: You may have to update the timeframes on each of these, as I can't seem to get them to cover a generic timeframe which will update itself.)
[ ] Figure out how the heck our data is getting out of whack in the first place.
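For reference, here's a rough sketch of what detecting and unsticking a broken project looks like at the database level, again with a toy sqlite schema standing in for the real tables (this is not our actual workaround code): find projects whose counter has fallen behind the highest assigned `short_id`, then force the counter up to match, so the next group gets `max(short_id) + 1`.

```python
import sqlite3

# Toy stand-ins for the real tables; names and schema are illustrative only.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE projectcounter (project_id INTEGER PRIMARY KEY, value INTEGER);
    CREATE TABLE groupedmessage (
        project_id INTEGER, short_id INTEGER,
        UNIQUE (project_id, short_id)
    );
    -- Project 1 is healthy; project 2's counter is one behind.
    INSERT INTO projectcounter VALUES (1, 5), (2, 41597);
    INSERT INTO groupedmessage VALUES (1, 5), (2, 41597), (2, 41598);
""")

# Detection: projects where the highest assigned short_id exceeds the counter.
stuck = db.execute("""
    SELECT c.project_id
    FROM projectcounter c
    JOIN groupedmessage g USING (project_id)
    GROUP BY c.project_id
    HAVING MAX(g.short_id) > c.value
""").fetchall()
print(stuck)  # [(2,)]

# Fix: force each stuck counter up to the highest assigned short_id.
for (project_id,) in stuck:
    db.execute("""
        UPDATE projectcounter
        SET value = (SELECT MAX(short_id) FROM groupedmessage
                     WHERE project_id = ?)
        WHERE project_id = ?
    """, (project_id, project_id))
db.commit()

fixed = db.execute(
    "SELECT value FROM projectcounter WHERE project_id = 2"
).fetchone()[0]
print(fixed)  # 41598
```

The detection query is also roughly the shape of the "how common is this" question in the third checkbox, just asked of the database instead of the logs.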
Two ideas from @wedamija:
- Simplify the code we use to increment the project counter, so that it's a simple update. This would require creating a `Counter` record for each project when it's created, and backfilling any projects which currently don't have a `Counter`.
- The timing of these errors appearing lines up pretty well with our upgrade to Django 5. See if we can figure out anything that changed there which might be causing this.
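For the first idea, here's a sketch of what the "simple update" increment could look like (toy sqlite schema; the real thing would be Django ORM against `sentry_projectcounter`): once every project is guaranteed a `Counter` row, allocating the next `short_id` is one atomic `UPDATE` plus a read in the same transaction, with no get-or-create branch left to race or roll back separately.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE projectcounter (project_id INTEGER PRIMARY KEY, value INTEGER)"
)
# Precondition of this approach: every project already has a counter row
# (created alongside the project, plus a backfill for existing projects).
db.execute("INSERT INTO projectcounter VALUES (1, 41597)")
db.commit()

def next_short_id(project_id):
    """Atomically increment the project's counter and return the new value."""
    with db:  # one transaction: the update and the read commit together
        cur = db.execute(
            "UPDATE projectcounter SET value = value + 1 WHERE project_id = ?",
            (project_id,),
        )
        if cur.rowcount == 0:
            # No row means the backfill missed this project; surface that
            # loudly rather than silently handing out a reused id.
            raise LookupError(f"no counter row for project {project_id}")
        return db.execute(
            "SELECT value FROM projectcounter WHERE project_id = ?",
            (project_id,),
        ).fetchone()[0]

print(next_short_id(1))  # 41598
print(next_short_id(1))  # 41599
```

The appeal of this shape is that the database serializes concurrent increments via the row lock the `UPDATE` takes, so there's much less of our own transaction handling left to get wrong.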