Closed: pracucci closed this issue 3 years ago
When we receive a broadcast state message and we don't have an alertmanager instance for that tenant, why don't we just create it?
After a brief discussion with @gotjosh, this problem should go away with the full state request on creation, which is something we're currently working on.
This should be fixed by #3925. I will re-test after the remaining review comments are addressed and we've got it merged, then re-enable the test.
Reproducing the same failure as indicated above: (fails 3-5 times per 20 runs)
09:50:15 alertmanager-2: level=debug ts=2021-03-19T08:50:15.139725934Z caller=multitenant.go:979 component=MultiTenantAlertmanager msg="message received for replication" user=user-5 key=sil:user-5
09:50:15 alertmanager-2: level=debug ts=2021-03-19T08:50:15.139755268Z caller=logging.go:66 traceID=f45c15938b27753 msg="POST /api/prom/api/v1/silences (200) 7.958334ms"
...
09:50:15 alertmanager-2: level=debug ts=2021-03-19T08:50:15.156434679Z caller=multitenant.go:1004 component=MultiTenantAlertmanager msg="user not found while trying to replicate state" user=user-5 key=sil:user-5
With the changes:
... silence posted
09:59:48 alertmanager-2: level=debug ts=2021-03-19T08:59:48.055128761Z caller=multitenant.go:980 component=MultiTenantAlertmanager msg="message received for replication" user=user-5 key=sil:user-5
09:59:48 alertmanager-2: level=debug ts=2021-03-19T08:59:48.055184005Z caller=logging.go:66 traceID=1e6763fc033bf8cd msg="POST /api/prom/api/v1/silences (200) 766.924µs"
... replicated to instance-3, ignored
09:59:48 alertmanager-3: level=debug ts=2021-03-19T08:59:48.123309783Z caller=multitenant.go:1100 component=MultiTenantAlertmanager msg="user does not have an alertmanager in this instance" user=user-5
09:59:48 alertmanager-3: level=debug ts=2021-03-19T08:59:48.123450931Z caller=grpc_logging.go:41 method=/alertmanagerpb.Alertmanager/UpdateState duration=67.549728ms msg="gRPC (success)"
... instance-3 configured with user-5
09:59:48 alertmanager-3: level=debug ts=2021-03-19T08:59:48.123331643Z caller=multitenant.go:762 component=MultiTenantAlertmanager msg="setting config" user=user-5
09:59:48 alertmanager-3: level=debug ts=2021-03-19T08:59:48.123646766Z caller=multitenant.go:815 component=MultiTenantAlertmanager msg="initializing new per-tenant alertmanager" user=user-5
09:59:48 alertmanager-3: level=debug ts=2021-03-19T08:59:48.123926409Z caller=alertmanager.go:163 user=user-5 msg="starting tenant alertmanager with ring-based replication"
... replication failure logged on instance-2
09:59:48 alertmanager-2: level=debug ts=2021-03-19T08:59:48.123951203Z caller=multitenant.go:1005 component=MultiTenantAlertmanager msg="user not found while trying to replicate state" user=user-5 key=sil:user-5
... alertmanager-3 starts up, syncs from alertmanager-2
09:59:48 alertmanager-3: level=info ts=2021-03-19T08:59:48.124248656Z caller=state_replication.go:162 user=user-5 msg="Waiting for notification and silences to settle..."
09:59:48 alertmanager-3: level=debug ts=2021-03-19T08:59:48.124334281Z caller=multitenant.go:1039 component=MultiTenantAlertmanager msg="contacting replica for full state" user=user-5 addr=192.168.112.5:9095
... call into alertmanager-2
09:59:48 alertmanager-2: level=debug ts=2021-03-19T08:59:48.124808712Z caller=grpc_logging.go:41 duration=81.574µs method=/alertmanagerpb.Alertmanager/ReadState msg="gRPC (success)"
... silence obtained and stored in alertmanager-3
09:59:48 alertmanager-3: level=debug ts=2021-03-19T08:59:48.125051759Z caller=state_replication.go:208 user=user-5 msg="merging full state" user=user-5 key=sil:user-5 bytes=149
09:59:48 alertmanager-3: level=debug ts=2021-03-19T08:59:48.125362691Z caller=silence.go:834 user=user-5 component=silences msg="Gossiping new silence" silence="silence:<id:\"9b89648b-e59e-4cd2-8601-38b19c1be6e4\" matchers:<name:\"instance\" pattern:\"prometheus-one\" > starts_at:<seconds:1616144388 nanos:54731435 > ends_at:<seconds:1616147988 nanos:52825402 > updated_at:<seconds:1616144388 nanos:54731435 > comment:\"Created for a test case.\" > expires_at:<seconds:1616579988 nanos:52825402 > "
09:59:48 alertmanager-3: level=debug ts=2021-03-19T08:59:48.128973544Z caller=state_replication.go:208 user=user-5 msg="merging full state" user=user-5 key=nfl:user-5 bytes=0
09:59:48 alertmanager-3: level=info ts=2021-03-19T08:59:48.129008675Z caller=state_replication.go:179 user=user-5 msg="state settled; proceeding" attempt=1
... test passes later on
The passing cases still pass, of course, but you can see that the initial sync doesn't yield anything.
10:00:04 alertmanager-3: level=debug ts=2021-03-19T09:00:04.598941981Z caller=multitenant.go:762 component=MultiTenantAlertmanager msg="setting config" user=user-5
10:00:04 alertmanager-3: level=debug ts=2021-03-19T09:00:04.59912224Z caller=multitenant.go:815 component=MultiTenantAlertmanager msg="initializing new per-tenant alertmanager" user=user-5
10:00:04 alertmanager-3: level=debug ts=2021-03-19T09:00:04.599438481Z caller=alertmanager.go:163 user=user-5 msg="starting tenant alertmanager with ring-based replication"
10:00:04 alertmanager-3: level=debug ts=2021-03-19T09:00:04.599898105Z caller=logging.go:66 traceID=2c52269b82a08046 msg="GET /metrics (200) 17.440718ms"
10:00:04 alertmanager-3: level=info ts=2021-03-19T09:00:04.600024238Z caller=state_replication.go:162 user=user-5 msg="Waiting for notification and silences to settle..."
10:00:04 alertmanager-3: level=debug ts=2021-03-19T09:00:04.600112448Z caller=multitenant.go:1039 component=MultiTenantAlertmanager msg="contacting replica for full state" user=user-5 addr=192.168.128.4:9095
10:00:04 alertmanager-1: level=debug ts=2021-03-19T09:00:04.600577869Z caller=grpc_logging.go:41 method=/alertmanagerpb.Alertmanager/ReadState duration=53.847µs msg="gRPC (success)"
10:00:04 alertmanager-3: level=debug ts=2021-03-19T09:00:04.600869944Z caller=state_replication.go:208 user=user-5 msg="merging full state" user=user-5 key=nfl:user-5 bytes=0
10:00:04 alertmanager-3: level=debug ts=2021-03-19T09:00:04.600975544Z caller=state_replication.go:208 user=user-5 msg="merging full state" user=user-5 key=sil:user-5 bytes=0
10:00:04 alertmanager-3: level=info ts=2021-03-19T09:00:04.601014445Z caller=state_replication.go:179 user=user-5 msg="state settled; proceeding" attempt=1
... later on, replication succeeds normally
10:00:04 alertmanager-1: level=debug ts=2021-03-19T09:00:04.698264402Z caller=multitenant.go:980 component=MultiTenantAlertmanager msg="message received for replication" user=user-5 key=sil:user-5
10:00:04 alertmanager-1: level=debug ts=2021-03-19T09:00:04.698333056Z caller=logging.go:66 traceID=13772ea2e5ec76c msg="POST /api/prom/api/v1/silences (200) 717.408µs"
10:00:04 alertmanager-3: level=debug ts=2021-03-19T09:00:04.700260111Z caller=multitenant.go:980 component=MultiTenantAlertmanager msg="message received for replication" user=user-5 key=sil:user-5
10:00:04 alertmanager-3: level=debug ts=2021-03-19T09:00:04.700182658Z caller=silence.go:834 user=user-5 component=silences msg="Gossiping new silence" silence="silence:<id:\"68d7aa38-1a95-4426-b176-8e2e592b10dc\" matchers:<name:\"instance\" pattern:\"prometheus-one\" > starts_at:<seconds:1616144404 nanos:697948580 > ends_at:<seconds:1616148004 nanos:696989453 > updated_at:<seconds:1616144404 nanos:697948580 > comment:\"Created for a test case.\" > expires_at:<seconds:1616580004 nanos:696989453 > "
10:00:04 alertmanager-3: level=debug ts=2021-03-19T09:00:04.700356492Z caller=grpc_logging.go:41 method=/alertmanagerpb.Alertmanager/UpdateState duration=465.002µs msg="gRPC (success)"
10:00:04 alertmanager-1: level=debug ts=2021-03-19T09:00:04.70083225Z caller=grpc_logging.go:41 method=/alertmanagerpb.Alertmanager/UpdateState duration=95.054µs msg="gRPC (success)"
TestAlertmanagerSharding is still flaky. I've got a CI run failing with this output:
2021-06-22T15:01:42.5842285Z === RUN TestAlertmanagerSharding/RF_=_3
2021-06-22T15:01:42.5845071Z 14:41:44 Starting consul
2021-06-22T15:01:42.5845509Z 14:41:45 consul: ==> Starting Consul agent...
2021-06-22T15:01:42.5846185Z 14:41:45 consul: Version: '1.8.4'
2021-06-22T15:01:42.5851377Z 14:41:45 consul: Node ID: 'aece4fca-c8f9-0bf9-a4bf-34cdc2fdc092'
2021-06-22T15:01:42.5852201Z 14:41:45 consul: Node name: 'consul'
2021-06-22T15:01:42.5852825Z 14:41:45 consul: Datacenter: 'dc1' (Segment: '<all>')
2021-06-22T15:01:42.5853375Z 14:41:45 consul: Server: true (Bootstrap: false)
2021-06-22T15:01:42.5854121Z 14:41:45 consul: Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, gRPC: 8502, DNS: 8600)
2021-06-22T15:01:42.5854659Z 14:41:45 consul: Cluster Addr: 127.0.0.1 (LAN: 8301, WAN: 8302)
2021-06-22T15:01:42.5855599Z 14:41:45 consul: Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false, Auto-Encrypt-TLS: false
2021-06-22T15:01:42.5856380Z 14:41:45 consul: ==> Log data will now stream in as it occurs:
2021-06-22T15:01:42.5856879Z 14:41:45 consul: ==> Consul agent running!
2021-06-22T15:01:42.5860826Z 14:41:45 Ports for container: e2e-cortex-test-consul Mapping: map[8500:49233]
2021-06-22T15:01:42.5861625Z 14:41:45 Starting minio-9000
2021-06-22T15:01:42.5862456Z 14:41:46 minio-9000: Attempting encryption of all config, IAM users and policies on MinIO backend
2021-06-22T15:01:42.5863529Z 14:41:46 Ports for container: e2e-cortex-test-minio-9000 Mapping: map[9000:49235]
2021-06-22T15:01:42.5864356Z 14:41:47 Starting alertmanager-1
2021-06-22T15:01:42.5865490Z 14:41:48 alertmanager-1: level=warn ts=2021-06-22T14:41:48.208554093Z caller=experimental.go:19 msg="experimental feature in use" feature="Alertmanager sharding"
2021-06-22T15:01:42.5866854Z 14:41:48 Ports for container: e2e-cortex-test-alertmanager-1 Mapping: map[80:49241 9094:49239 9095:49237]
2021-06-22T15:01:42.5867730Z 14:41:49 Starting alertmanager-2
2021-06-22T15:01:42.5868911Z 14:41:50 alertmanager-2: level=warn ts=2021-06-22T14:41:50.05556098Z caller=experimental.go:19 msg="experimental feature in use" feature="Alertmanager sharding"
2021-06-22T15:01:42.5870286Z 14:41:50 Ports for container: e2e-cortex-test-alertmanager-2 Mapping: map[80:49247 9094:49245 9095:49243]
2021-06-22T15:01:42.5871334Z 14:41:52 Starting alertmanager-3
2021-06-22T15:01:42.5872637Z 14:41:52 alertmanager-3: level=warn ts=2021-06-22T14:41:52.861184798Z caller=experimental.go:19 msg="experimental feature in use" feature="Alertmanager sharding"
2021-06-22T15:01:42.5873995Z 14:41:52 Ports for container: e2e-cortex-test-alertmanager-3 Mapping: map[80:49253 9094:49251 9095:49249]
2021-06-22T15:01:42.5874756Z alertmanager_test.go:544:
2021-06-22T15:01:42.5875254Z Error Trace: alertmanager_test.go:544
2021-06-22T15:01:42.5875759Z Error: elements differ
2021-06-22T15:01:42.5876137Z
2021-06-22T15:01:42.5876516Z extra elements in list A:
2021-06-22T15:01:42.5876968Z ([]interface {}) (len=1) {
2021-06-22T15:01:42.5877385Z (string) (len=7) "alert_3"
2021-06-22T15:01:42.5877822Z }
2021-06-22T15:01:42.5878115Z
2021-06-22T15:01:42.5878414Z
2021-06-22T15:01:42.5878727Z listA:
2021-06-22T15:01:42.5879093Z ([]string) (len=3) {
2021-06-22T15:01:42.5879486Z (string) (len=7) "alert_1",
2021-06-22T15:01:42.5879912Z (string) (len=7) "alert_2",
2021-06-22T15:01:42.5880321Z (string) (len=7) "alert_3"
2021-06-22T15:01:42.5880685Z }
2021-06-22T15:01:42.5880985Z
2021-06-22T15:01:42.5881276Z
2021-06-22T15:01:42.5881592Z listB:
2021-06-22T15:01:42.5881941Z ([]string) (len=2) {
2021-06-22T15:01:42.5882347Z (string) (len=7) "alert_1",
2021-06-22T15:01:42.5882789Z (string) (len=7) "alert_2"
2021-06-22T15:01:42.5883150Z }
2021-06-22T15:01:42.5883660Z Test: TestAlertmanagerSharding/RF_=_3
2021-06-22T15:01:42.5884258Z alertmanager_test.go:544:
2021-06-22T15:01:42.5884867Z Error Trace: alertmanager_test.go:544
2021-06-22T15:01:42.5885369Z Error: elements differ
2021-06-22T15:01:42.5885735Z
2021-06-22T15:01:42.5886125Z extra elements in list A:
2021-06-22T15:01:42.5886579Z ([]interface {}) (len=1) {
2021-06-22T15:01:42.5886996Z (string) (len=7) "alert_3"
2021-06-22T15:01:42.5887360Z }
2021-06-22T15:01:42.5887652Z
2021-06-22T15:01:42.5887951Z
2021-06-22T15:01:42.5888263Z listA:
2021-06-22T15:01:42.5888626Z ([]string) (len=3) {
2021-06-22T15:01:42.5889018Z (string) (len=7) "alert_1",
2021-06-22T15:01:42.5889440Z (string) (len=7) "alert_2",
2021-06-22T15:01:42.5889854Z (string) (len=7) "alert_3"
2021-06-22T15:01:42.5890213Z }
2021-06-22T15:01:42.5890516Z
2021-06-22T15:01:42.5890807Z
2021-06-22T15:01:42.5891126Z listB:
2021-06-22T15:01:42.5891476Z ([]string) (len=2) {
2021-06-22T15:01:42.5891884Z (string) (len=7) "alert_1",
2021-06-22T15:01:42.5892293Z (string) (len=7) "alert_2"
2021-06-22T15:01:42.5892655Z }
2021-06-22T15:01:42.5893161Z Test: TestAlertmanagerSharding/RF_=_3
2021-06-22T15:01:42.5893936Z 14:41:56 Killing alertmanager-3
2021-06-22T15:01:42.5894530Z 14:41:56 Killing alertmanager-2
2021-06-22T15:01:42.5895130Z 14:41:56 Killing alertmanager-1
2021-06-22T15:01:42.5895680Z 14:41:56 Killing minio-9000
2021-06-22T15:01:42.5896074Z 14:41:57 Killing consul
2021-06-22T15:01:42.5896749Z --- FAIL: TestAlertmanagerSharding (24.97s)
2021-06-22T15:01:42.5899663Z --- PASS: TestAlertmanagerSharding/RF_=_2 (11.99s)
2021-06-22T15:01:42.5900559Z --- FAIL: TestAlertmanagerSharding/RF_=_3 (12.98s)
Describe the bug
The TestAlertmanagerSharding test, which was updated in #3839, is flaky. As an example, you can see it here and here.

To Reproduce
I've reproduced it locally with debug logs; a snippet of the logs is shown above.

The problem is that if the silence is created soon after a resharding, the replication may fail.

Expected behavior
The replication should not fail if it happens right after a resharding.