getsentry / sentry

Developer-first error tracking and performance monitoring
https://sentry.io

Project counters get stuck and projects can't create new issues #65745

Open lobsterkatie opened 8 months ago

lobsterkatie commented 8 months ago

About a month ago, we started seeing a bunch of IntegrityErrors in the store.save_event task. More specifically, they were UniqueViolation('duplicate key value violates unique constraint "sentry_groupedmessage_project_id_<random internal hexcode>_uniq"\nDETAIL: Key (project_id, short_id)=(12311121, 41598) already exists.') errors. (Legal folks, don't worry - those aren't real numbers, they're my dogs' birthdays and adoption dates.)

@wedamija eventually tracked it down to the projects' Counter and Group.short_id values being out of sync. Each project has one row in the sentry_projectcounter table, whose job it is to keep track of the most recent/highest short_id assigned to an issue in that project, such that a new group coming in can take the short_id one higher than that, and never run into a conflict. (We have a unique constraint on the project_id/short_id pairing in the sentry_groupedmessage table.) Normally, that works great.
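To make the invariant concrete, here is a rough sketch of how one might look for projects whose counter has fallen behind. It runs against a toy in-memory SQLite schema, not the real tables: the table names `project_counter` and `grouped_message` and the counter column `value` are assumptions for illustration, and only `project_id` and `short_id` come from the error message above.

```python
import sqlite3

# Toy stand-ins for sentry_projectcounter and sentry_groupedmessage.
# Table names and the `value` column are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE project_counter (
        project_id INTEGER PRIMARY KEY,
        value INTEGER NOT NULL
    );
    CREATE TABLE grouped_message (
        id INTEGER PRIMARY KEY,
        project_id INTEGER NOT NULL,
        short_id INTEGER NOT NULL,
        UNIQUE (project_id, short_id)
    );
    """
)

# Project 1 is healthy (counter == highest short_id).
# Project 2 is out of sync (counter is one behind the highest short_id).
conn.executemany("INSERT INTO project_counter VALUES (?, ?)", [(1, 3), (2, 4)])
conn.executemany(
    "INSERT INTO grouped_message (project_id, short_id) VALUES (?, ?)",
    [(1, 1), (1, 2), (1, 3), (2, 3), (2, 4), (2, 5)],
)

# The invariant is counter.value >= MAX(short_id) for each project;
# any row returned here violates it.
OUT_OF_SYNC_QUERY = """
    SELECT c.project_id, c.value, MAX(g.short_id) AS max_short_id
    FROM project_counter AS c
    JOIN grouped_message AS g ON g.project_id = c.project_id
    GROUP BY c.project_id, c.value
    HAVING c.value < MAX(g.short_id)
    ORDER BY c.project_id
"""

out_of_sync = conn.execute(OUT_OF_SYNC_QUERY).fetchall()
print(out_of_sync)
```

Against the real tables, an aggregate like this could be expensive, so it would probably need to be scoped to the projects already known to be throwing errors.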

For each project throwing the UniqueViolation errors, however, the project counter value had fallen one behind the highest assigned short_id. Therefore, every time we tried to create a new issue for that project, we tried to give it what the project counter thought was the next available short_id, only to find that that short_id was already taken by the most recently created existing issue. At that point we'd just error out, so the existing newest short_id would stay newest, the project counter wouldn't get incremented, and the group wouldn't get created. Then the next novel event would come in, we'd try to create a group for it, all the same things would happen, and we'd be effectively stuck. Projects in this state can still accept events into existing issues, but can't create any new ones.
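The stuck loop described above can be reproduced in miniature. This is only a sketch: it assumes the counter increment and the group insert happen in one transaction (so a failed insert also rolls back the increment), and it uses a toy SQLite schema whose table names, `value` column, and `create_group` helper are all hypothetical rather than Sentry's actual code.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode so that
# transactions can be controlled explicitly with BEGIN/COMMIT/ROLLBACK.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript(
    """
    CREATE TABLE project_counter (
        project_id INTEGER PRIMARY KEY,
        value INTEGER NOT NULL
    );
    CREATE TABLE grouped_message (
        id INTEGER PRIMARY KEY,
        project_id INTEGER NOT NULL,
        short_id INTEGER NOT NULL,
        UNIQUE (project_id, short_id)
    );
    """
)


def counter_value(conn, project_id):
    row = conn.execute(
        "SELECT value FROM project_counter WHERE project_id = ?", (project_id,)
    ).fetchone()
    return row[0]


def create_group(conn, project_id):
    """Hypothetical helper: bump the counter and insert a group atomically.

    Returns the new short_id, or None if the unique constraint was hit.
    """
    conn.execute("BEGIN")
    try:
        conn.execute(
            "UPDATE project_counter SET value = value + 1 WHERE project_id = ?",
            (project_id,),
        )
        short_id = counter_value(conn, project_id)
        conn.execute(
            "INSERT INTO grouped_message (project_id, short_id) VALUES (?, ?)",
            (project_id, short_id),
        )
        conn.execute("COMMIT")
        return short_id
    except sqlite3.IntegrityError:
        # The rollback also undoes the counter increment, so nothing changes.
        conn.execute("ROLLBACK")
        return None


# Bad state: the counter says 2, but short_id 3 has already been assigned.
conn.execute("INSERT INTO project_counter VALUES (1, 2)")
conn.executemany(
    "INSERT INTO grouped_message (project_id, short_id) VALUES (1, ?)",
    [(1,), (2,), (3,)],
)

# Every attempt picks short_id 3, collides, and rolls back: the project is stuck.
attempts = [create_group(conn, 1) for _ in range(3)]
counter_while_stuck = counter_value(conn, 1)

# One possible repair: move the counter up to the highest assigned short_id.
conn.execute(
    "UPDATE project_counter SET value = "
    "(SELECT MAX(short_id) FROM grouped_message WHERE project_id = 1) "
    "WHERE project_id = 1"
)
after_repair = create_group(conn, 1)
print(attempts, counter_while_stuck, after_repair)
```

The point of the sketch is that the failure sustains itself: nothing in the failing path moves the counter forward, so the project stays stuck until something outside that path corrects the counter.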

Steps to resolve this:

Two ideas from @wedamija:

wedamija commented 8 months ago

The only other thing is that I assume the only way this can happen is via some kind of transaction rollback situation. Like:

lobsterkatie commented 8 months ago

Something to do with merging/unmerging groups. We increment the counter on unmerge I believe, so this seems like a reasonable place to look as well. Maybe we don't handle transactions right there?

To test this theory out, we can look at Activity, with whatever enum values are merging/unmerging, to see if the timing works.