crons: same timeline is shown twice

JeremiaAu commented 1 month ago

Self-Hosted Version

24.5.0

CPU Architecture

x86_64

Docker Version

26.1.3

Docker Compose Version

2.27.0

Steps to Reproduce

Upgrade from version 24.4.1 to 24.5.0
(Wait for new check-ins)
Open Crons

Expected Result

Only one row per environment

Actual Result

Event ID

No response

getsantry[bot] commented 3 weeks ago

Assigning to @getsentry/support for routing ⏲️

getsantry[bot] commented 1 month ago

Routing to @getsentry/product-owners-crons for triage ⏲️

evanpurkhiser commented 1 month ago

Would you be able to share the API response for the monitors/ API request?

Specifically, I'm interested in whether the [].environments key has two environments in it, or whether this is a UI bug.
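A minimal sketch of how the response could be checked for repeated environment names. It assumes the response is a JSON list of monitors, each with a `slug` and an `environments` list whose entries carry a `name`; those field names and the sample payload are assumptions for illustration, not taken from the actual API response.

```python
from collections import Counter


def duplicated_environments(monitors):
    """Map each monitor slug to the environment names listed more than once."""
    report = {}
    for monitor in monitors:
        # "environments" and "name" are assumed field names, not confirmed here.
        names = [env["name"] for env in monitor.get("environments", [])]
        repeats = sorted(name for name, n in Counter(names).items() if n > 1)
        if repeats:
            report[monitor["slug"]] = repeats
    return report


# Hypothetical payload; the slugs and environments are made up.
sample = [
    {"slug": "nightly-backup",
     "environments": [{"name": "production"}, {"name": "production"}]},
    {"slug": "hourly-sync",
     "environments": [{"name": "production"}, {"name": "staging"}]},
]
print(duplicated_environments(sample))
```

A non-empty result would suggest the duplication is present in the API data itself rather than introduced by the UI.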

JeremiaAu commented 1 month ago

Hey, sorry it took me so long, there really are two production environments in the API response:

JeremiaAu commented 1 month ago

I just created another issue that might be related: https://github.com/getsentry/self-hosted/issues/3104

wedamija commented 1 month ago

Hi @JeremiaAu

It's pretty unusual that you have duplicated environments here. We have unique constraints to prevent this type of duplication - the main way I could think of this failing is that potentially you have environments with the same name in multiple organizations in your self hosted instance, and possibly you migrated the monitor over and something went wrong with the migration.

Does that sound like something that would be possible in your setup?

JeremiaAu commented 1 month ago

Hey, @wedamija,

We have only one organization on our server.

It may also be worth noting that this issue affects all three crons we currently have running in our organization.

wedamija commented 1 month ago

Hi @JeremiaAu, sorry for the delay in response here.

It might be most helpful to have a look at some of your data here. If you're comfortable running some SQL queries, you can post the results here or email them to dfuller@sentry.io if you would prefer they not be public.

Firstly, I'd like to see what is in your environments table: select * from sentry_environment

I'd also like to see the environments associated with one of the crons with the duplication problem

select sme.* 
from sentry_monitorenvironment sme
inner join sentry_monitor sm on sm.id = sme.monitor_id
where sm.slug in (<monitor_slug>)

JeremiaAu commented 1 month ago

Hey @wedamija,

I have sent you an e-mail.

I have also gotten around to applying the fix from the other issue (https://github.com/getsentry/self-hosted/issues/3104). The updated screenshot now looks like this:

wedamija commented 1 month ago

Ok, your problem here is quite weird - you have duplicate environments in your environment table, which is likely causing this problem. There should be a unique constraint in place to prevent this, so possibly something has gone wrong and removed the constraint.

Could you run \d sentry_environment and post/email the description?

I would expect to see an index like "sentry_environment_organization_id_name_95a37dc7_uniq" UNIQUE CONSTRAINT, btree (organization_id, name) on your table, possibly the name might be slightly different.

Could you also run select organization_id, name, count(*) from sentry_environment group by organization_id, name and email me the results? I want to confirm that the strings are also identical.
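As a rough sketch, the grouping query can be narrowed with a HAVING clause so that only duplicated (organization_id, name) pairs come back. The example below runs it against an in-memory SQLite table as a stand-in for PostgreSQL; the rows are made up, and the column list is reduced to the three columns the query touches.

```python
import sqlite3

# Same grouping as the query above, keeping only pairs that occur more than once.
DUPLICATES_SQL = """
SELECT organization_id, name, count(*) AS copies
FROM sentry_environment
GROUP BY organization_id, name
HAVING count(*) > 1
"""


def find_duplicates(conn):
    """Return (organization_id, name, copies) for every duplicated pair."""
    return conn.execute(DUPLICATES_SQL).fetchall()


# Stand-in table with made-up rows; the real table has more columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sentry_environment "
    "(id INTEGER PRIMARY KEY, organization_id INTEGER, name TEXT)"
)
conn.executemany(
    "INSERT INTO sentry_environment VALUES (?, ?, ?)",
    [
        (1, 1, "production"),
        (3, 1, "staging"),
        (8, 1, "production"),   # exact duplicate of row 1
        (9, 1, "production "),  # trailing space, so not an identical string
    ],
)
print(find_duplicates(conn))
```

Only byte-identical names are grouped together here, so a name that differs by a trailing space is not counted as a duplicate.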

wedamija commented 1 month ago

Based on the data in your system, it looks like there must be some kind of corruption with the unique constraint on (organization_id, name) that is causing it to not enforce the constraint. I'm not sure what caused it, but basically we need to figure out how to clean up your environment data to correct these duplicates. I'm going to discuss this internally and figure out the best person to help with this.

hubertdeng123 commented 1 month ago

I wonder if the reason behind this corruption of the unique constraint is that between 24.4.1 and 24.5.0 we upgraded postgres to 14 and started using the alpine image instead. Changing the OS might have led to some issues here. I wonder if cleaning the duplicates up and then perhaps using the postgres:14 image instead might work?

JeremiaAu commented 1 month ago

I also think that the postgres issue mentioned in https://github.com/getsentry/self-hosted/issues/3107 is to blame.

I ran the following command, meant to identify broken indices, and the sentry_environment_organization_id_name_95a37dc7_uniq showed up.

SELECT DISTINCT indrelid::regclass::text, indexrelid::regclass::text, collname, pg_get_indexdef(indexrelid) 
FROM (SELECT indexrelid, indrelid, indcollation[i] coll FROM pg_index, generate_subscripts(indcollation, 1) g(i)) s 
  JOIN pg_collation c ON coll=c.oid
WHERE collprovider IN ('d', 'c') AND collname NOT IN ('C', 'POSIX');

source: https://wiki.postgresql.org/wiki/Locale_data_changes#What_to_do

hubertdeng123 commented 1 month ago

Yep, I have a feeling that is the case too. We've changed the postgres image used back to a debian based image here. Do you happen to have a backup of your postgres data before you upgraded? If so, depending on your needs it may be better to restore that data.

Otherwise, I think there are a few roads forward from here. It may be a good idea to perform a backup before proceeding.

Delete all data in the duplicate environment. This may include legitimate data, and if there is legitimate data we'd want to set data there to the original environment id. This may prove to be a manual process, since we don't have foreign keys for everything that references the environment.
Afterwards, reindex the broken indices.
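A minimal sketch of the "set data there to the original environment id" variant, shown against an in-memory SQLite database as a stand-in for PostgreSQL. The helper name `merge_environment`, the choice of `sentry_deploy` as the example table, and the sample rows are assumptions for illustration. Tables with their own unique constraints involving `environment_id` may reject the UPDATE and would need separate handling.

```python
import sqlite3


def merge_environment(conn, tables, duplicate_id, original_id):
    """Re-point rows at the original environment, then drop the duplicate."""
    with conn:  # commits on success, rolls back if any statement fails
        for table in tables:
            # Table names cannot be bound as parameters; pass only trusted names.
            conn.execute(
                f"UPDATE {table} SET environment_id = ? WHERE environment_id = ?",
                (original_id, duplicate_id),
            )
        conn.execute(
            "DELETE FROM sentry_environment WHERE id = ?", (duplicate_id,)
        )


# Stand-in schema and made-up rows, reduced to the columns used here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sentry_environment (id INTEGER PRIMARY KEY, organization_id INTEGER, name TEXT);
CREATE TABLE sentry_deploy (id INTEGER PRIMARY KEY, environment_id INTEGER);
INSERT INTO sentry_environment VALUES (1, 1, 'production'), (8, 1, 'production');
INSERT INTO sentry_deploy VALUES (100, 1), (101, 8);
""")

merge_environment(conn, ["sentry_deploy"], duplicate_id=8, original_id=1)
print(conn.execute("SELECT id, environment_id FROM sentry_deploy ORDER BY id").fetchall())
print(conn.execute("SELECT id FROM sentry_environment").fetchall())
```

Running everything in one transaction means a failure partway through leaves the data as it was, which matters when the process is manual. The reindex step is PostgreSQL-specific and is not covered by this sketch.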

JeremiaAu commented 1 month ago

Sadly, I do not have a Sentry backup that old. But I would not really mind losing the data generated since the upgrade.

Can you provide postgresql commands for deleting the duplicate environments and associated data, or point me to the relevant docs?

hubertdeng123 commented 1 month ago

Note: We do not have an official guide for this and I am not sure if the instructions I'm giving you are completely comprehensive. This is not guaranteed to work and could result in data loss!

It looks like these models in Sentry are the ones that reference an environment by environment_id:

sentry_deploy
sentry_latestrelease
sentry_rule
sentry_userreport

So, I'd probably try something like

DELETE FROM sentry_environment WHERE id = <duplicate_environment_id>; (pick the duplicate environment with the higher id)
DELETE FROM sentry_deploy WHERE environment_id = <duplicate_environment_id>;
DELETE FROM sentry_latestrelease WHERE environment_id = <duplicate_environment_id>;
DELETE FROM sentry_rule WHERE environment_id = <duplicate_environment_id>;
DELETE FROM sentry_userreport WHERE environment_id = <duplicate_environment_id>;
REINDEX INDEX sentry_environment_organization_id_name_95a37dc7_uniq;

I believe the models with foreign key relations should be cleaned up automatically.

JeremiaAu commented 3 weeks ago

Hey @hubertdeng123, I just got around to applying your suggested fix.

The models with foreign key relations were not cleaned up automatically, so I had to expand the delete commands:

DELETE from sentry_deploy WHERE environment_id='8';
DELETE from sentry_latestrelease WHERE environment_id='8';
DELETE from sentry_rule WHERE environment_id='8';
DELETE from sentry_userreport WHERE environment_id='8';

DELETE from sentry_environmentproject WHERE environment_id='8';
DELETE from sentry_releaseprojectenvironment WHERE environment_id='8';
DELETE from sentry_environment WHERE id='8';
REINDEX INDEX sentry_environment_organization_id_name_95a37dc7_uniq;

Unfortunately Sentry is behaving weirdly now.

The cron overview does not show check-ins across environments (but the check mark and fire symbols are displayed correctly):

I also can't access individual crons without specifying an environment; it only returns the error message "The monitor you were looking for was not found"

Viewing a single environment does work:

Thanks for your help so far! Should I create a new issue?

getsantry[bot] commented 3 weeks ago

Routing to @getsentry/product-owners-crons for triage ⏲️

hubertdeng123 commented 3 weeks ago

Are there any logs that may give us a clue as to why you're getting "The monitor you were looking for was not found"? I suspect there is something we're missing but I'm not sure what it is.
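One possible check, sketched here against an in-memory SQLite database as a stand-in for PostgreSQL, is to look for rows that still point at an environment id that no longer exists. The helper name `orphaned_rows` and the sample rows are made up, and the sketch assumes that sentry_monitorenvironment has an `environment_id` column. Whether such leftover rows are actually behind the error is not established.

```python
import sqlite3


def orphaned_rows(conn, table):
    """Return (id, environment_id) for rows whose environment no longer exists."""
    # Table names cannot be bound as parameters; pass only trusted names.
    sql = f"""
        SELECT t.id, t.environment_id
        FROM {table} AS t
        LEFT JOIN sentry_environment AS e ON e.id = t.environment_id
        WHERE e.id IS NULL AND t.environment_id IS NOT NULL
        ORDER BY t.id
    """
    return conn.execute(sql).fetchall()


# Stand-in schema and made-up rows; environment 8 has already been deleted.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sentry_environment (id INTEGER PRIMARY KEY, organization_id INTEGER, name TEXT);
CREATE TABLE sentry_monitorenvironment (id INTEGER PRIMARY KEY, monitor_id INTEGER, environment_id INTEGER);
INSERT INTO sentry_environment VALUES (1, 1, 'production');
INSERT INTO sentry_monitorenvironment VALUES (10, 5, 1), (11, 5, 8);
""")
print(orphaned_rows(conn, "sentry_monitorenvironment"))
```

The same LEFT JOIN pattern applies to any table that stores an environment id without a foreign key constraint.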

JeremiaAu commented 3 weeks ago

Where can I find the log you are mentioning?

hubertdeng123 commented 2 weeks ago

I would be curious what the command docker compose logs web shows. Hopefully that should give some information?
