cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.
https://cortexmetrics.io/
Apache License 2.0

TestAlertmanagerSharding is flaky due to a logic issue #3927

Closed pracucci closed 3 years ago

pracucci commented 3 years ago

Describe the bug
The TestAlertmanagerSharding test, which was updated in #3839, is flaky. As an example, you can see it failing here and here.

To Reproduce
I've reproduced it locally with debug logs. This is a snippet of the logs:

# we create the silence here

17:57:39 alertmanager-3: level=debug ts=2021-03-08T16:57:39.9726694Z caller=multitenant.go:929 component=MultiTenantAlertmanager msg="user does not have an alertmanager in this instance" user=user-5
17:57:39 alertmanager-1: level=debug ts=2021-03-08T16:57:39.9740552Z caller=multitenant.go:907 component=MultiTenantAlertmanager msg="user not found while trying to replicate state" user=user-5 key=sil:user-5

# the periodic sync realises the shard has changed and loads user-5 too, but it's "too late"

17:57:40 alertmanager-3: level=debug ts=2021-03-08T16:57:40.1726301Z caller=multitenant.go:673 component=MultiTenantAlertmanager msg="setting config" user=user-5
17:57:40 alertmanager-3: level=debug ts=2021-03-08T16:57:40.1730083Z caller=multitenant.go:726 component=MultiTenantAlertmanager msg="initializing new per-tenant alertmanager" user=user-5
17:57:40 alertmanager-3: level=debug ts=2021-03-08T16:57:40.1732274Z caller=alertmanager.go:139 user=user-5 msg="starting tenant alertmanager with ring-based replication"

The problem is that if the silence is created soon after a resharding, the replication may fail.

Expected behavior
The replication should not fail when it happens right after a resharding.
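
For illustration, here is a minimal sketch of the race, using hypothetical names and types rather than the actual Cortex code: the instance receiving the replicated state update looks up the tenant's per-tenant alertmanager, and if the periodic sync hasn't created it yet (because the reshard just happened), the update is dropped.

```go
// Hypothetical sketch of the race; names and types are illustrative,
// not the actual Cortex implementation.
package sketch

import (
	"errors"
	"sync"
)

type perTenantAM struct{}

func (am *perTenantAM) mergePartialState(data []byte) error {
	// Merge the replicated silence/notification-log entry into local state.
	return nil
}

type multitenantAM struct {
	mtx           sync.Mutex
	alertmanagers map[string]*perTenantAM // created by the periodic sync
}

// replicateStateForUser is what a receiving instance does when a peer pushes
// a partial state update (e.g. a freshly created silence) for a tenant.
func (am *multitenantAM) replicateStateForUser(userID string, data []byte) error {
	am.mtx.Lock()
	userAM, ok := am.alertmanagers[userID]
	am.mtx.Unlock()

	if !ok {
		// Right after a reshard this instance may already own the tenant in
		// the ring, but the periodic sync hasn't created the per-tenant
		// alertmanager yet, so the update is dropped and replication fails.
		return errors.New("user not found while trying to replicate state")
	}
	return userAM.mergePartialState(data)
}
```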

pracucci commented 3 years ago

When we receive a broadcast state message and we don't have an alertmanager instance for that tenant, why don't we just create it?

pracucci commented 3 years ago

> When we receive a broadcast state message and we don't have an alertmanager instance for that tenant, why don't we just create it?

After a brief discussion with @gotjosh, this problem should go away with the full state request on creation, which is something we're currently working on.
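
For context, here is a rough sketch of what a "full state request on creation" could look like, with hypothetical names rather than the actual implementation being worked on: when a per-tenant alertmanager is created after a reshard, it asks the other replicas for the tenant's full state and merges it, so a broadcast missed during the reshard is recovered anyway.

```go
// Hypothetical sketch of a full-state request on creation; illustrative only.
package sketch

import "context"

// replicaClient is assumed to expose a call that returns a tenant's complete
// state (silences and notification log) from a peer replica.
type replicaClient interface {
	ReadFullState(ctx context.Context, userID string) (parts [][]byte, err error)
}

type stateReplicator struct {
	peers []replicaClient
	merge func(part []byte) error // merges one state part into local state
}

// settleOnCreation is called once, right after the per-tenant alertmanager is
// created. It tries each replica until one returns the full state and merges
// it locally; only then does the alertmanager start serving.
func (s *stateReplicator) settleOnCreation(ctx context.Context, userID string) error {
	var lastErr error
	for _, peer := range s.peers {
		parts, err := peer.ReadFullState(ctx, userID)
		if err != nil {
			lastErr = err
			continue // try the next replica
		}
		for _, p := range parts {
			if err := s.merge(p); err != nil {
				return err
			}
		}
		return nil // state settled; proceed
	}
	return lastErr
}
```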

stevesg commented 3 years ago

This should be fixed by #3925. I will re-test after the remaining review comments are addressed and we've got it merged, then re-enable the test.

Reproducing the same failure as indicated above (fails 3-5 times per 20 runs):

09:50:15 alertmanager-2: level=debug ts=2021-03-19T08:50:15.139725934Z caller=multitenant.go:979 component=MultiTenantAlertmanager msg="message received for replication" user=user-5 key=sil:user-5
09:50:15 alertmanager-2: level=debug ts=2021-03-19T08:50:15.139755268Z caller=logging.go:66 traceID=f45c15938b27753 msg="POST /api/prom/api/v1/silences (200) 7.958334ms"

...

09:50:15 alertmanager-2: level=debug ts=2021-03-19T08:50:15.156434679Z caller=multitenant.go:1004 component=MultiTenantAlertmanager msg="user not found while trying to replicate state" user=user-5 key=sil:user-5

With the changes:

... silence posted

09:59:48 alertmanager-2: level=debug ts=2021-03-19T08:59:48.055128761Z caller=multitenant.go:980 component=MultiTenantAlertmanager msg="message received for replication" user=user-5 key=sil:user-5
09:59:48 alertmanager-2: level=debug ts=2021-03-19T08:59:48.055184005Z caller=logging.go:66 traceID=1e6763fc033bf8cd msg="POST /api/prom/api/v1/silences (200) 766.924µs"

... replicated to instance-3, ignored

09:59:48 alertmanager-3: level=debug ts=2021-03-19T08:59:48.123309783Z caller=multitenant.go:1100 component=MultiTenantAlertmanager msg="user does not have an alertmanager in this instance" user=user-5
09:59:48 alertmanager-3: level=debug ts=2021-03-19T08:59:48.123450931Z caller=grpc_logging.go:41 method=/alertmanagerpb.Alertmanager/UpdateState duration=67.549728ms msg="gRPC (success)"

... instance-3 configured with user-5

09:59:48 alertmanager-3: level=debug ts=2021-03-19T08:59:48.123331643Z caller=multitenant.go:762 component=MultiTenantAlertmanager msg="setting config" user=user-5
09:59:48 alertmanager-3: level=debug ts=2021-03-19T08:59:48.123646766Z caller=multitenant.go:815 component=MultiTenantAlertmanager msg="initializing new per-tenant alertmanager" user=user-5
09:59:48 alertmanager-3: level=debug ts=2021-03-19T08:59:48.123926409Z caller=alertmanager.go:163 user=user-5 msg="starting tenant alertmanager with ring-based replication"

... replication failure logged on instance-2

09:59:48 alertmanager-2: level=debug ts=2021-03-19T08:59:48.123951203Z caller=multitenant.go:1005 component=MultiTenantAlertmanager msg="user not found while trying to replicate state" user=user-5 key=sil:user-5

... alertmanager-3 starts up, syncs from alertmanager-2

09:59:48 alertmanager-3: level=info ts=2021-03-19T08:59:48.124248656Z caller=state_replication.go:162 user=user-5 msg="Waiting for notification and silences to settle..."
09:59:48 alertmanager-3: level=debug ts=2021-03-19T08:59:48.124334281Z caller=multitenant.go:1039 component=MultiTenantAlertmanager msg="contacting replica for full state" user=user-5 addr=192.168.112.5:9095

... call into alertmanager-2

09:59:48 alertmanager-2: level=debug ts=2021-03-19T08:59:48.124808712Z caller=grpc_logging.go:41 duration=81.574µs method=/alertmanagerpb.Alertmanager/ReadState msg="gRPC (success)"

... silence obtained and stored in alertmanager-3

09:59:48 alertmanager-3: level=debug ts=2021-03-19T08:59:48.125051759Z caller=state_replication.go:208 user=user-5 msg="merging full state" user=user-5 key=sil:user-5 bytes=149
09:59:48 alertmanager-3: level=debug ts=2021-03-19T08:59:48.125362691Z caller=silence.go:834 user=user-5 component=silences msg="Gossiping new silence" silence="silence:<id:\"9b89648b-e59e-4cd2-8601-38b19c1be6e4\" matchers:<name:\"instance\" pattern:\"prometheus-one\" > starts_at:<seconds:1616144388 nanos:54731435 > ends_at:<seconds:1616147988 nanos:52825402 > updated_at:<seconds:1616144388 nanos:54731435 > comment:\"Created for a test case.\" > expires_at:<seconds:1616579988 nanos:52825402 > "
09:59:48 alertmanager-3: level=debug ts=2021-03-19T08:59:48.128973544Z caller=state_replication.go:208 user=user-5 msg="merging full state" user=user-5 key=nfl:user-5 bytes=0
09:59:48 alertmanager-3: level=info ts=2021-03-19T08:59:48.129008675Z caller=state_replication.go:179 user=user-5 msg="state settled; proceeding" attempt=1

... test passes later on

The passing cases still pass, of course, but you can see that the initial sync doesn't yield anything.

10:00:04 alertmanager-3: level=debug ts=2021-03-19T09:00:04.598941981Z caller=multitenant.go:762 component=MultiTenantAlertmanager msg="setting config" user=user-5
10:00:04 alertmanager-3: level=debug ts=2021-03-19T09:00:04.59912224Z caller=multitenant.go:815 component=MultiTenantAlertmanager msg="initializing new per-tenant alertmanager" user=user-5
10:00:04 alertmanager-3: level=debug ts=2021-03-19T09:00:04.599438481Z caller=alertmanager.go:163 user=user-5 msg="starting tenant alertmanager with ring-based replication"
10:00:04 alertmanager-3: level=debug ts=2021-03-19T09:00:04.599898105Z caller=logging.go:66 traceID=2c52269b82a08046 msg="GET /metrics (200) 17.440718ms"
10:00:04 alertmanager-3: level=info ts=2021-03-19T09:00:04.600024238Z caller=state_replication.go:162 user=user-5 msg="Waiting for notification and silences to settle..."
10:00:04 alertmanager-3: level=debug ts=2021-03-19T09:00:04.600112448Z caller=multitenant.go:1039 component=MultiTenantAlertmanager msg="contacting replica for full state" user=user-5 addr=192.168.128.4:9095
10:00:04 alertmanager-1: level=debug ts=2021-03-19T09:00:04.600577869Z caller=grpc_logging.go:41 method=/alertmanagerpb.Alertmanager/ReadState duration=53.847µs msg="gRPC (success)"
10:00:04 alertmanager-3: level=debug ts=2021-03-19T09:00:04.600869944Z caller=state_replication.go:208 user=user-5 msg="merging full state" user=user-5 key=nfl:user-5 bytes=0
10:00:04 alertmanager-3: level=debug ts=2021-03-19T09:00:04.600975544Z caller=state_replication.go:208 user=user-5 msg="merging full state" user=user-5 key=sil:user-5 bytes=0
10:00:04 alertmanager-3: level=info ts=2021-03-19T09:00:04.601014445Z caller=state_replication.go:179 user=user-5 msg="state settled; proceeding" attempt=1

... later on, replication succeeds normally

10:00:04 alertmanager-1: level=debug ts=2021-03-19T09:00:04.698264402Z caller=multitenant.go:980 component=MultiTenantAlertmanager msg="message received for replication" user=user-5 key=sil:user-5
10:00:04 alertmanager-1: level=debug ts=2021-03-19T09:00:04.698333056Z caller=logging.go:66 traceID=13772ea2e5ec76c msg="POST /api/prom/api/v1/silences (200) 717.408µs"
10:00:04 alertmanager-3: level=debug ts=2021-03-19T09:00:04.700260111Z caller=multitenant.go:980 component=MultiTenantAlertmanager msg="message received for replication" user=user-5 key=sil:user-5
10:00:04 alertmanager-3: level=debug ts=2021-03-19T09:00:04.700182658Z caller=silence.go:834 user=user-5 component=silences msg="Gossiping new silence" silence="silence:<id:\"68d7aa38-1a95-4426-b176-8e2e592b10dc\" matchers:<name:\"instance\" pattern:\"prometheus-one\" > starts_at:<seconds:1616144404 nanos:697948580 > ends_at:<seconds:1616148004 nanos:696989453 > updated_at:<seconds:1616144404 nanos:697948580 > comment:\"Created for a test case.\" > expires_at:<seconds:1616580004 nanos:696989453 > "
10:00:04 alertmanager-3: level=debug ts=2021-03-19T09:00:04.700356492Z caller=grpc_logging.go:41 method=/alertmanagerpb.Alertmanager/UpdateState duration=465.002µs msg="gRPC (success)"
10:00:04 alertmanager-1: level=debug ts=2021-03-19T09:00:04.70083225Z caller=grpc_logging.go:41 method=/alertmanagerpb.Alertmanager/UpdateState duration=95.054µs msg="gRPC (success)"
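
The "replicated to instance-3, ignored" step above is the interesting part: the receiving instance no longer needs the per-tenant alertmanager to exist when the update arrives. A hypothetical sketch of that receiving side (illustrative names, not the actual gRPC handler):

```go
// Hypothetical sketch of the receiving side; illustrative names only.
package sketch

import "context"

type updateStateRequest struct {
	UserID string
	Part   []byte
}

type updateStateResponse struct{}

type stateMerger interface {
	Merge(part []byte) error
}

type multitenantServer struct {
	// lookup returns the tenant's local alertmanager state, if it exists.
	lookup func(userID string) (stateMerger, bool)
}

// UpdateState handles a partial-state push from a peer replica.
func (s *multitenantServer) UpdateState(ctx context.Context, req *updateStateRequest) (*updateStateResponse, error) {
	userAM, ok := s.lookup(req.UserID)
	if !ok {
		// "user does not have an alertmanager in this instance": acknowledge
		// and drop the update; the full-state sync performed when the
		// per-tenant alertmanager is created will recover it.
		return &updateStateResponse{}, nil
	}
	if err := userAM.Merge(req.Part); err != nil {
		return nil, err
	}
	return &updateStateResponse{}, nil
}
```
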
pracucci commented 3 years ago

TestAlertmanagerSharding is still flaky. I've got a CI run failing with this output:

2021-06-22T15:01:42.5842285Z === RUN   TestAlertmanagerSharding/RF_=_3
2021-06-22T15:01:42.5845071Z 14:41:44 Starting consul
2021-06-22T15:01:42.5845509Z 14:41:45 consul: ==> Starting Consul agent...
2021-06-22T15:01:42.5846185Z 14:41:45 consul: Version: '1.8.4'
2021-06-22T15:01:42.5851377Z 14:41:45 consul: Node ID: 'aece4fca-c8f9-0bf9-a4bf-34cdc2fdc092'
2021-06-22T15:01:42.5852201Z 14:41:45 consul: Node name: 'consul'
2021-06-22T15:01:42.5852825Z 14:41:45 consul: Datacenter: 'dc1' (Segment: '<all>')
2021-06-22T15:01:42.5853375Z 14:41:45 consul: Server: true (Bootstrap: false)
2021-06-22T15:01:42.5854121Z 14:41:45 consul: Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, gRPC: 8502, DNS: 8600)
2021-06-22T15:01:42.5854659Z 14:41:45 consul: Cluster Addr: 127.0.0.1 (LAN: 8301, WAN: 8302)
2021-06-22T15:01:42.5855599Z 14:41:45 consul: Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false, Auto-Encrypt-TLS: false
2021-06-22T15:01:42.5856380Z 14:41:45 consul: ==> Log data will now stream in as it occurs:
2021-06-22T15:01:42.5856879Z 14:41:45 consul: ==> Consul agent running!
2021-06-22T15:01:42.5860826Z 14:41:45 Ports for container: e2e-cortex-test-consul Mapping: map[8500:49233]
2021-06-22T15:01:42.5861625Z 14:41:45 Starting minio-9000
2021-06-22T15:01:42.5862456Z 14:41:46 minio-9000: Attempting encryption of all config, IAM users and policies on MinIO backend
2021-06-22T15:01:42.5863529Z 14:41:46 Ports for container: e2e-cortex-test-minio-9000 Mapping: map[9000:49235]
2021-06-22T15:01:42.5864356Z 14:41:47 Starting alertmanager-1
2021-06-22T15:01:42.5865490Z 14:41:48 alertmanager-1: level=warn ts=2021-06-22T14:41:48.208554093Z caller=experimental.go:19 msg="experimental feature in use" feature="Alertmanager sharding"
2021-06-22T15:01:42.5866854Z 14:41:48 Ports for container: e2e-cortex-test-alertmanager-1 Mapping: map[80:49241 9094:49239 9095:49237]
2021-06-22T15:01:42.5867730Z 14:41:49 Starting alertmanager-2
2021-06-22T15:01:42.5868911Z 14:41:50 alertmanager-2: level=warn ts=2021-06-22T14:41:50.05556098Z caller=experimental.go:19 msg="experimental feature in use" feature="Alertmanager sharding"
2021-06-22T15:01:42.5870286Z 14:41:50 Ports for container: e2e-cortex-test-alertmanager-2 Mapping: map[80:49247 9094:49245 9095:49243]
2021-06-22T15:01:42.5871334Z 14:41:52 Starting alertmanager-3
2021-06-22T15:01:42.5872637Z 14:41:52 alertmanager-3: level=warn ts=2021-06-22T14:41:52.861184798Z caller=experimental.go:19 msg="experimental feature in use" feature="Alertmanager sharding"
2021-06-22T15:01:42.5873995Z 14:41:52 Ports for container: e2e-cortex-test-alertmanager-3 Mapping: map[80:49253 9094:49251 9095:49249]
2021-06-22T15:01:42.5874756Z     alertmanager_test.go:544: 
2021-06-22T15:01:42.5875254Z            Error Trace:    alertmanager_test.go:544
2021-06-22T15:01:42.5875759Z            Error:          elements differ
2021-06-22T15:01:42.5876137Z                            
2021-06-22T15:01:42.5876516Z                            extra elements in list A:
2021-06-22T15:01:42.5876968Z                            ([]interface {}) (len=1) {
2021-06-22T15:01:42.5877385Z                             (string) (len=7) "alert_3"
2021-06-22T15:01:42.5877822Z                            }
2021-06-22T15:01:42.5878115Z                            
2021-06-22T15:01:42.5878414Z                            
2021-06-22T15:01:42.5878727Z                            listA:
2021-06-22T15:01:42.5879093Z                            ([]string) (len=3) {
2021-06-22T15:01:42.5879486Z                             (string) (len=7) "alert_1",
2021-06-22T15:01:42.5879912Z                             (string) (len=7) "alert_2",
2021-06-22T15:01:42.5880321Z                             (string) (len=7) "alert_3"
2021-06-22T15:01:42.5880685Z                            }
2021-06-22T15:01:42.5880985Z                            
2021-06-22T15:01:42.5881276Z                            
2021-06-22T15:01:42.5881592Z                            listB:
2021-06-22T15:01:42.5881941Z                            ([]string) (len=2) {
2021-06-22T15:01:42.5882347Z                             (string) (len=7) "alert_1",
2021-06-22T15:01:42.5882789Z                             (string) (len=7) "alert_2"
2021-06-22T15:01:42.5883150Z                            }
2021-06-22T15:01:42.5883660Z            Test:           TestAlertmanagerSharding/RF_=_3
2021-06-22T15:01:42.5884258Z     alertmanager_test.go:544: 
2021-06-22T15:01:42.5884867Z            Error Trace:    alertmanager_test.go:544
2021-06-22T15:01:42.5885369Z            Error:          elements differ
2021-06-22T15:01:42.5885735Z                            
2021-06-22T15:01:42.5886125Z                            extra elements in list A:
2021-06-22T15:01:42.5886579Z                            ([]interface {}) (len=1) {
2021-06-22T15:01:42.5886996Z                             (string) (len=7) "alert_3"
2021-06-22T15:01:42.5887360Z                            }
2021-06-22T15:01:42.5887652Z                            
2021-06-22T15:01:42.5887951Z                            
2021-06-22T15:01:42.5888263Z                            listA:
2021-06-22T15:01:42.5888626Z                            ([]string) (len=3) {
2021-06-22T15:01:42.5889018Z                             (string) (len=7) "alert_1",
2021-06-22T15:01:42.5889440Z                             (string) (len=7) "alert_2",
2021-06-22T15:01:42.5889854Z                             (string) (len=7) "alert_3"
2021-06-22T15:01:42.5890213Z                            }
2021-06-22T15:01:42.5890516Z                            
2021-06-22T15:01:42.5890807Z                            
2021-06-22T15:01:42.5891126Z                            listB:
2021-06-22T15:01:42.5891476Z                            ([]string) (len=2) {
2021-06-22T15:01:42.5891884Z                             (string) (len=7) "alert_1",
2021-06-22T15:01:42.5892293Z                             (string) (len=7) "alert_2"
2021-06-22T15:01:42.5892655Z                            }
2021-06-22T15:01:42.5893161Z            Test:           TestAlertmanagerSharding/RF_=_3
2021-06-22T15:01:42.5893936Z 14:41:56 Killing alertmanager-3
2021-06-22T15:01:42.5894530Z 14:41:56 Killing alertmanager-2
2021-06-22T15:01:42.5895130Z 14:41:56 Killing alertmanager-1
2021-06-22T15:01:42.5895680Z 14:41:56 Killing minio-9000
2021-06-22T15:01:42.5896074Z 14:41:57 Killing consul
2021-06-22T15:01:42.5896749Z --- FAIL: TestAlertmanagerSharding (24.97s)
2021-06-22T15:01:42.5899663Z     --- PASS: TestAlertmanagerSharding/RF_=_2 (11.99s)
2021-06-22T15:01:42.5900559Z     --- FAIL: TestAlertmanagerSharding/RF_=_3 (12.98s)
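
For reference, the failing assertion above is effectively comparing the set of alert names each replica returns; a minimal sketch of that kind of check (illustrative, not the actual integration test code):

```go
// Hypothetical sketch of the kind of set comparison that fails above;
// not the actual integration test code.
package sketch

import (
	"fmt"
	"sort"
)

// assertSameAlerts checks that a replica returned exactly the expected alert
// names. In the failing run above, the expected list contained alert_1,
// alert_2 and alert_3, but one replica returned only alert_1 and alert_2.
func assertSameAlerts(expected, got []string) error {
	expected = append([]string(nil), expected...)
	got = append([]string(nil), got...)
	sort.Strings(expected)
	sort.Strings(got)

	if len(expected) != len(got) {
		return fmt.Errorf("elements differ: expected %v, got %v", expected, got)
	}
	for i := range expected {
		if expected[i] != got[i] {
			return fmt.Errorf("elements differ: expected %v, got %v", expected, got)
		}
	}
	return nil
}
```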