Closed: avongluck-r1soft closed this issue 4 years ago.
Testing:
While it isn't Graylog's fault, since this is a 3rd-party plugin, the ability of a 3rd-party plugin to potentially break all alerting is a serious issue: "backup" notification methods will silently fail.
@avongluck-r1soft Are there any alerting or OpsGenie related exceptions in the logs of your Graylog node(s)?
So I tried to narrow the issue down, but the behavior seems erratic. When does an alert "resolve"? I noticed that editing the conditions and unchecking the "Repeat notifications (optional)" box seems to "resolve" the alert. However, the alert then refuses to trigger any additional notifications after this point.
(I can re-check repeat notifications and still can't trigger any notifications 30 minutes after the last successful one.)
If I restart graylog-server (leaving repeat notifications checked), I get a flood of them after startup; then an alert is open and doesn't ever seem to resolve.
Here is the alert log:
2017-03-31 17:31:29.108
Graylog sent Full filesystem (Slack alarm callback) notification
2017-03-31 17:31:41.049
Graylog sent Email On-call about full filesystems (Email Alert Callback) notification
2017-03-31 17:44:48.391
Condition is configured to repeat notifications, Graylog will send notifications when evaluating the condition until it is no longer satisfied
2017-03-31 17:44:48.392
Condition is still satisfied, alert is unresolved
So 12 minutes later it is still considered open, even though the threshold window is 5 minutes.
Full filesystem (Message Count Alert Condition)
Alerting on stream Full filesystems
Configuration: Alert is triggered when there are more than 0 messages in the last 5 minutes. Grace period: 0 minutes. Including last 5 messages in alert notification. Configured to repeat notifications.
20 minutes later, still unresolved. So:
Maybe the OpsGenie notification is at fault somehow and "holds" the alerts open?
2017-03-31 17:31:41.049
Graylog sent Email On-call about full filesystems (Email Alert Callback) notification
2017-03-31 17:57:33.698
Condition is configured to repeat notifications, Graylog will send notifications when evaluating the condition until it is no longer satisfied
2017-03-31 17:57:33.698
Condition is still satisfied, alert is unresolved
Attaching a server.log which doesn't tell us much.
OK. Here is some solidly documented oddness:
2017-03-31 17:31:41.049
Graylog sent Email On-call about full filesystems (Email Alert Callback) notification
2017-03-31 18:12:01.520
Condition is configured to repeat notifications, Graylog will send notifications when evaluating the condition until it is no longer satisfied
2017-03-31 18:12:01.520
Condition is still satisfied, alert is unresolved
Notice that as of 18:12 there have been "no notifications since 17:31". However, "messages evaluated" shows:
2017-03-31 17:32:37.081 XXX -- WARNING XXX.net (XXX) DiskUsage high = 98 mount_point = /JUST_A_TEST
2017-03-31 17:34:48.928 XXX -- WARNING XXX.net (XXX) DiskUsage high = 98 mount_point = /JUST_A_TEST
2017-03-31 17:42:12.168 XXX -- WARNING XXX.net (XXX) DiskUsage high = 100 mount_point = /storage06
2017-03-31 18:02:39.432 XXX -- WARNING XXX.net (XXX) DiskUsage high = 100 mount_point = /storage06
2017-03-31 18:11:07.225 XXX -- WARNING XXX.net (XXX) DiskUsage high = 98 mount_point = /JUST_A_TEST
So, to be clear: all of the testing above was with the OpsGenie plugin installed.
I uninstalled the OpsGenie plugin again and removed the notification; now events always seem to fire for open alerts, as expected. No errors seen in server.log up to now.
I painfully realized over the weekend at 3am that "repeat notifications" means a notification every minute while an alert is open, even if only one condition was matched. With that understanding, the behavior described above is still out of line with intended behaviour.
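For anyone else puzzled by this, here is a rough model of the cadence. This is an illustrative sketch only, not Graylog source code; the only things taken from this thread are the AlertScannerThread periodical and its 60-second poll interval, visible in the startup log below.

```java
// Sketch of the repeat-notification cadence (all names here are
// illustrative; only the 60s AlertScannerThread poll interval is
// taken from the startup log in this thread).
class RepeatNotificationSketch {
    // pollsWhileSatisfied = number of 60s scanner polls during which
    // the condition stayed satisfied (i.e. minutes the alert was open).
    static int notificationsSent(boolean repeatEnabled, int pollsWhileSatisfied) {
        if (pollsWhileSatisfied <= 0) {
            return 0; // condition never satisfied: nothing is sent
        }
        // Without repeat: a single notification when the alert opens.
        // With repeat: one notification per poll while it stays open.
        return repeatEnabled ? pollsWhileSatisfied : 1;
    }
}
```

So an alert that stays open for half an hour with repeat enabled means roughly 30 notifications, which matches the 3am flood.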
Hi @avongluck-r1soft,
Sorry to read that 😞. We have an issue with grace period and repeat notifications (https://github.com/Graylog2/graylog2-server/issues/3579), and it will be fixed in an upcoming patch release.
Ok. I think this issue is related to odd alert behavior. The alerts never seem to resolve. Screenshots taken at 9:48am the same day:
Alert begin 1 hour 12 minutes ago:
Alert unresolved still after 1 hour 12 minutes:
Only one message evaluated 1 hour 12 minutes ago.
So at this point in time:
Anyone have any idea what's going on here? When should alerts consider themselves closed? I re-enabled the OpsGenie plugin and set it up after upgrading to 2.2.3. The OpsGenie guys said they couldn't reproduce the issue, however.
The first screenshot shows: "Stream had 1 messages in the last 5 minutes" (which shouldn't be true any longer). Maybe "more than 0 messages" is evaluated incorrectly? Normally "more than 0" means 1 or more... maybe the logic is >= 0 vs. > 0?
Nah. https://github.com/Graylog2/graylog2-server/blob/master/graylog2-server/src/main/java/org/graylog2/alerts/types/MessageCountAlertCondition.java#L184 The logic seems fine there (MORE is >).
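For reference, the check at the linked line boils down to the following. This is a simplified sketch: the ThresholdType enum and its MORE/LESS values follow the linked source, but the surrounding class and method are illustrative, not the actual Graylog code.

```java
// Simplified sketch of the threshold check in MessageCountAlertCondition.
// ThresholdType and its values mirror the linked source; the rest is
// illustrative.
enum ThresholdType { MORE, LESS }

class ThresholdCheck {
    // "more than 0 messages" triggers only when count > 0, i.e. count >= 1,
    // because MORE is a strict greater-than comparison.
    static boolean isTriggered(ThresholdType type, long count, long threshold) {
        switch (type) {
            case MORE:
                return count > threshold; // strictly greater, not >=
            case LESS:
                return count < threshold;
            default:
                return false;
        }
    }
}
```

So with a threshold of 0, a count of 0 does not trigger, which rules out the >= 0 theory.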
@avongluck-r1soft: Is this still an issue with the most recent Graylog version (2.2.3)?
Yeah. Here's an awesome screenshot showing some weird alert state:
Given the configuration, I'd expect the alert to close after 2 minutes. However, it's been open for 15 minutes with no signs of closing.
Ah, OK. I got some great screenshots showing the problem:
Without the OpsGenie plugin installed: working as expected; the alert closes after 2 minutes.
With the OpsGenie plugin installed and configured: not working; the alert gets stuck open and no OpsGenie notification arrives (other callbacks are listed as triggered, but never seem to fire?).
@avongluck-r1soft Anything interesting in the logs of your Graylog nodes at that time?
Nothing at all interesting in the logs (quiet). I'm not crazy, right? That second screenshot shows an alert in a state it shouldn't be in.
2017-05-16T10:33:12.031-05:00 INFO [CmdLineTool] Loaded plugin: OpsGenieAlarmCallback 1.0.0 [com.opsgenie.plugin.graylog.OpsGenieAlarmCallbackPlugin]
2017-05-16T10:33:12.033-05:00 INFO [CmdLineTool] Loaded plugin: Elastic Beats Input 2.2.3 [org.graylog.plugins.beats.BeatsInputPlugin]
2017-05-16T10:33:12.033-05:00 INFO [CmdLineTool] Loaded plugin: Collector 2.2.3 [org.graylog.plugins.collector.CollectorPlugin]
2017-05-16T10:33:12.034-05:00 INFO [CmdLineTool] Loaded plugin: Enterprise Integration Plugin 2.2.3 [org.graylog.plugins.enterprise_integration.EnterpriseIntegrationPlugin]
2017-05-16T10:33:12.034-05:00 INFO [CmdLineTool] Loaded plugin: MapWidgetPlugin 2.2.3 [org.graylog.plugins.map.MapWidgetPlugin]
2017-05-16T10:33:12.040-05:00 INFO [CmdLineTool] Loaded plugin: Pipeline Processor Plugin 2.2.3 [org.graylog.plugins.pipelineprocessor.ProcessorPlugin]
2017-05-16T10:33:12.040-05:00 INFO [CmdLineTool] Loaded plugin: Anonymous Usage Statistics 2.2.3 [org.graylog.plugins.usagestatistics.UsageStatsPlugin]
2017-05-16T10:33:12.041-05:00 INFO [CmdLineTool] Loaded plugin: Slack 2.4.0 [org.graylog2.plugins.slack.callback.SlackAlarmCallback]
2017-05-16T10:33:12.194-05:00 INFO [CmdLineTool] Running with JVM arguments: -Xms10g -Xmx10g -XX:NewRatio=1 -XX:PermSize=128m -XX:MaxPermSize=256m -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC -XX:-OmitStackTraceInFastThrow -Dlog4j.configurationFile=file:///etc/graylog/server/log4j2.xml -Djava.library.path=/usr/share/graylog-server/lib/sigar -Dgraylog2.installation_source=rpm
2017-05-16T10:33:12.324-05:00 INFO [Version] HV000001: Hibernate Validator null
2017-05-16T10:33:13.789-05:00 INFO [InputBufferImpl] Message journal is enabled.
2017-05-16T10:33:13.804-05:00 INFO [NodeId] Node ID: XXX-c572-4717-b7e8-f7cada1266c6
2017-05-16T10:33:13.933-05:00 INFO [LogManager] Loading logs.
2017-05-16T10:33:13.969-05:00 INFO [LogManager] Logs loading complete.
2017-05-16T10:33:13.969-05:00 INFO [KafkaJournal] Initialized Kafka based journal at /var/lib/graylog-server/journal
2017-05-16T10:33:14.083-05:00 INFO [InputBufferImpl] Initialized InputBufferImpl with ring size <65536> and wait strategy <BlockingWaitStrategy>, running 1 parallel message handlers.
2017-05-16T10:33:14.102-05:00 INFO [cluster] Cluster created with settings {hosts=[10.XXX.XXX.XXX:27017], mode=SINGLE, requiredClusterType=UNKNOWN, serverSelectionTimeout='30000 ms', maxWaitQueueSize=500}
2017-05-16T10:33:14.143-05:00 INFO [cluster] No server chosen by ReadPreferenceServerSelector{readPreference=primary} from cluster description ClusterDescription{type=UNKNOWN, connectionMode=SINGLE, serverDescriptions=[ServerDescription{address=10.XXX.XXX.XXX:27017, type=UNKNOWN, state=CONNECTING}]}. Waiting for 30000 ms before timing out
2017-05-16T10:33:14.162-05:00 INFO [connection] Opened connection [connectionId{localValue:1, serverValue:157}] to 10.XXX.XXX.XXX:27017
2017-05-16T10:33:14.164-05:00 INFO [cluster] Monitor thread successfully connected to server with description ServerDescription{address=10.XXX.XXX.XXX:27017, type=STANDALONE, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 2, 12]}, minWireVersion=0, maxWireVersion=4, maxDocumentSize=16777216, roundTripTimeNanos=435048}
2017-05-16T10:33:14.173-05:00 INFO [connection] Opened connection [connectionId{localValue:2, serverValue:158}] to 10.XXX.XXX.XXX:27017
2017-05-16T10:33:14.399-05:00 INFO [node] [graylog-d2636a89-c572-4717-b7e8-f7cada1266c6] version[2.4.4], pid[12887], build[fcbb46d/2017-01-03T11:33:16Z]
2017-05-16T10:33:14.399-05:00 INFO [node] [graylog-d2636a89-c572-4717-b7e8-f7cada1266c6] initializing ...
2017-05-16T10:33:14.404-05:00 INFO [plugins] [graylog-d2636a89-c572-4717-b7e8-f7cada1266c6] modules [], plugins [graylog-monitor], sites []
2017-05-16T10:33:15.359-05:00 INFO [node] [graylog-d2636a89-c572-4717-b7e8-f7cada1266c6] initialized
2017-05-16T10:33:15.399-05:00 INFO [ProcessBuffer] Initialized ProcessBuffer with ring size <65536> and wait strategy <BlockingWaitStrategy>.
2017-05-16T10:33:16.497-05:00 INFO [RulesEngineProvider] No static rules file loaded.
2017-05-16T10:33:16.805-05:00 INFO [OutputBuffer] Initialized OutputBuffer with ring size <65536> and wait strategy <BlockingWaitStrategy>.
2017-05-16T10:33:17.544-05:00 INFO [ServerBootstrap] Graylog server 2.2.3+7adc951 starting up
2017-05-16T10:33:17.545-05:00 INFO [ServerBootstrap] JRE: Oracle Corporation 1.8.0_121 on Linux 3.10.0-327.36.3.el7.x86_64
2017-05-16T10:33:17.545-05:00 INFO [ServerBootstrap] Deployment: rpm
2017-05-16T10:33:17.545-05:00 INFO [ServerBootstrap] OS: CentOS Linux 7 (Core) (centos)
2017-05-16T10:33:17.545-05:00 INFO [ServerBootstrap] Arch: amd64
2017-05-16T10:33:17.556-05:00 WARN [DeadEventLoggingListener] Received unhandled event of type <org.graylog2.plugin.lifecycles.Lifecycle> from event bus <AsyncEventBus{graylog-eventbus}>
2017-05-16T10:33:17.567-05:00 INFO [PeriodicalsService] Starting 26 periodicals ...
2017-05-16T10:33:17.568-05:00 INFO [Periodicals] Starting [org.graylog2.periodical.ThroughputCalculator] periodical in [0s], polling every [1s].
2017-05-16T10:33:17.569-05:00 INFO [node] [graylog-d2636a89-c572-4717-b7e8-f7cada1266c6] starting ...
2017-05-16T10:33:17.572-05:00 INFO [Periodicals] Starting [org.graylog2.periodical.AlertScannerThread] periodical in [10s], polling every [60s].
2017-05-16T10:33:17.575-05:00 INFO [Periodicals] Starting [org.graylog2.periodical.BatchedElasticSearchOutputFlushThread] periodical in [0s], polling every [1s].
2017-05-16T10:33:17.576-05:00 INFO [Periodicals] Starting [org.graylog2.periodical.ClusterHealthCheckThread] periodical in [120s], polling every [20s].
2017-05-16T10:33:17.576-05:00 INFO [Periodicals] Starting [org.graylog2.periodical.ContentPackLoaderPeriodical] periodical, running forever.
2017-05-16T10:33:17.576-05:00 INFO [Periodicals] Starting [org.graylog2.periodical.GarbageCollectionWarningThread] periodical, running forever.
2017-05-16T10:33:17.576-05:00 INFO [Periodicals] Starting [org.graylog2.periodical.IndexerClusterCheckerThread] periodical in [0s], polling every [30s].
2017-05-16T10:33:17.577-05:00 INFO [Periodicals] Starting [org.graylog2.periodical.IndexRetentionThread] periodical in [0s], polling every [300s].
2017-05-16T10:33:17.577-05:00 INFO [Periodicals] Starting [org.graylog2.periodical.IndexRotationThread] periodical in [0s], polling every [10s].
2017-05-16T10:33:17.577-05:00 INFO [Periodicals] Starting [org.graylog2.periodical.NodePingThread] periodical in [0s], polling every [1s].
2017-05-16T10:33:17.577-05:00 INFO [IndexRetentionThread] Elasticsearch cluster not available, skipping index retention checks.
2017-05-16T10:33:17.586-05:00 INFO [Periodicals] Starting [org.graylog2.periodical.VersionCheckThread] periodical in [300s], polling every [1800s].
2017-05-16T10:33:17.586-05:00 INFO [Periodicals] Starting [org.graylog2.periodical.ThrottleStateUpdaterThread] periodical in [1s], polling every [1s].
2017-05-16T10:33:17.587-05:00 INFO [Periodicals] Starting [org.graylog2.events.ClusterEventPeriodical] periodical in [0s], polling every [1s].
2017-05-16T10:33:17.587-05:00 INFO [Periodicals] Starting [org.graylog2.events.ClusterEventCleanupPeriodical] periodical in [0s], polling every [86400s].
2017-05-16T10:33:17.588-05:00 INFO [connection] Opened connection [connectionId{localValue:4, serverValue:160}] to 10.XXX.XXX.XXX:27017
2017-05-16T10:33:17.590-05:00 INFO [connection] Opened connection [connectionId{localValue:3, serverValue:159}] to 10.XXX.XXX.XXX:27017
2017-05-16T10:33:17.590-05:00 INFO [Periodicals] Starting [org.graylog2.periodical.ClusterIdGeneratorPeriodical] periodical, running forever.
2017-05-16T10:33:17.591-05:00 INFO [Periodicals] Starting [org.graylog2.periodical.IndexRangesMigrationPeriodical] periodical, running forever.
2017-05-16T10:33:17.592-05:00 INFO [Periodicals] Starting [org.graylog2.periodical.IndexRangesCleanupPeriodical] periodical in [15s], polling every [3600s].
2017-05-16T10:33:17.595-05:00 INFO [connection] Opened connection [connectionId{localValue:6, serverValue:162}] to 10.XXX.XXX.XXX:27017
2017-05-16T10:33:17.595-05:00 INFO [connection] Opened connection [connectionId{localValue:7, serverValue:163}] to 10.XXX.XXX.XXX:27017
2017-05-16T10:33:17.599-05:00 INFO [connection] Opened connection [connectionId{localValue:5, serverValue:161}] to 10.XXX.XXX.XXX:27017
2017-05-16T10:33:17.644-05:00 INFO [PeriodicalsService] Not starting [org.graylog2.periodical.UserPermissionMigrationPeriodical] periodical. Not configured to run on this node.
2017-05-16T10:33:17.644-05:00 INFO [Periodicals] Starting [org.graylog2.periodical.AlarmCallbacksMigrationPeriodical] periodical, running forever.
2017-05-16T10:33:17.644-05:00 INFO [Periodicals] Starting [org.graylog2.periodical.ConfigurationManagementPeriodical] periodical, running forever.
2017-05-16T10:33:17.646-05:00 INFO [Periodicals] Starting [org.graylog2.periodical.LdapGroupMappingMigration] periodical, running forever.
2017-05-16T10:33:17.647-05:00 INFO [Periodicals] Starting [org.graylog2.periodical.IndexFailuresPeriodical] periodical, running forever.
2017-05-16T10:33:17.647-05:00 INFO [Periodicals] Starting [org.graylog.plugins.usagestatistics.UsageStatsNodePeriodical] periodical in [300s], polling every [21600s].
2017-05-16T10:33:17.648-05:00 INFO [Periodicals] Starting [org.graylog.plugins.usagestatistics.UsageStatsClusterPeriodical] periodical in [300s], polling every [21600s].
2017-05-16T10:33:17.650-05:00 INFO [PeriodicalsService] Not starting [org.graylog.plugins.pipelineprocessor.periodical.LegacyDefaultStreamMigration] periodical. Not configured to run on this node.
2017-05-16T10:33:17.650-05:00 INFO [Periodicals] Starting [org.graylog.plugins.collector.periodical.PurgeExpiredCollectorsThread] periodical in [0s], polling every [3600s].
2017-05-16T10:33:17.718-05:00 INFO [IndexerClusterCheckerThread] Indexer not fully initialized yet. Skipping periodic cluster check.
2017-05-16T10:33:17.890-05:00 INFO [V20161130141500_DefaultStreamRecalcIndexRanges] Cluster not connected yet, delaying migration until it is reachable.
2017-05-16T10:33:17.911-05:00 INFO [transport] [graylog-d2636a89-c572-4717-b7e8-f7cada1266c6] publish_address {127.0.0.1:9350}, bound_addresses {[::1]:9350}, {127.0.0.1:9350}
2017-05-16T10:33:17.916-05:00 INFO [discovery] [graylog-d2636a89-c572-4717-b7e8-f7cada1266c6] graylog-prod/HPPt87SoSmewgUKqwfs6UQ
2017-05-16T10:33:18.054-05:00 INFO [JerseyService] Enabling CORS for HTTP endpoint
2017-05-16T10:33:20.919-05:00 WARN [discovery] [graylog-d2636a89-c572-4717-b7e8-f7cada1266c6] waited for 3s and no initial state was set by the discovery
2017-05-16T10:33:20.919-05:00 INFO [node] [graylog-d2636a89-c572-4717-b7e8-f7cada1266c6] started
2017-05-16T10:33:21.059-05:00 INFO [service] [graylog-d2636a89-c572-4717-b7e8-f7cada1266c6] detected_master {Neophyte}{p-qRoaYrS7Gb0yfRMgky9g}{10.XXX.XXX.XXX}{10.XXX.XXX.XXX:9300}, added {{Neophyte}{p-qRoaYrS7Gb0yfRMgky9g}{10.XXX.XXX.XXX}{10.XXX.XXX.XXX:9300},{graylog-bca2dfe8-77c8-4d01-8495-d08f7bdea82f}{TgllDbp2TOOcMijSrWLE1g}{10.XXX.XX.XX}{10.170.70.17:9350}{client=true, data=false, master=false},}, reason: zen-disco-receive(from master [{Neophyte}{p-qRoaYrS7Gb0yfRMgky9g}{10.XXX.XXX.XXX}{10.XXX.XXX.XXX:9300}])
2017-05-16T10:33:25.239-05:00 INFO [NetworkListener] Started listener bound to [10.XXX.XXX.XXX:12900]
2017-05-16T10:33:25.241-05:00 INFO [HttpServer] [HttpServer] Started.
2017-05-16T10:33:25.241-05:00 INFO [JerseyService] Started REST API at <http://10.XXX.XXX.XXX:12900/>
2017-05-16T10:33:27.199-05:00 INFO [NetworkListener] Started listener bound to [127.0.0.1:9000]
2017-05-16T10:33:27.199-05:00 INFO [HttpServer] [HttpServer-1] Started.
2017-05-16T10:33:27.200-05:00 INFO [JerseyService] Started Web Interface at <http://127.0.0.1:9000/>
2017-05-16T10:33:27.201-05:00 INFO [ServiceManagerListener] Services are healthy
2017-05-16T10:33:27.202-05:00 INFO [ServerBootstrap] Services started, startup times in ms: {KafkaJournal [RUNNING]=9, InputSetupService [RUNNING]=11, JournalReader [RUNNING]=19, OutputSetupService [RUNNING]=25, BufferSynchronizerService [RUNNING]=25, ConfigurationEtagService [RUNNING]=47, PeriodicalsService [RUNNING]=99, StreamCacheService [RUNNING]=258, IndexerSetupService [RUNNING]=3514, JerseyService [RUNNING]=9634}
2017-05-16T10:33:27.204-05:00 INFO [ServerBootstrap] Graylog server up and running.
2017-05-16T10:33:27.209-05:00 INFO [InputSetupService] Triggering launching persisted inputs, node transitioned from Uninitialized [LB:DEAD] to Running [LB:ALIVE]
2017-05-16T10:33:27.226-05:00 INFO [InputStateListener] Input [Syslog UDP/58f4bfb00ed1c97c79d8f863] is now STARTING
2017-05-16T10:33:27.264-05:00 WARN [NettyTransport] receiveBufferSize (SO_RCVBUF) for input SyslogUDPInput{title=Syslog UDP Input, type=org.graylog2.inputs.syslog.udp.SyslogUDPInput, nodeId=null} should be 262144 but is 212992.
2017-05-16T10:33:27.270-05:00 INFO [InputStateListener] Input [Syslog UDP/58f4bfb00ed1c97c79d8f863] is now RUNNING
I've finally found an interesting correlation with a repeatable passing scenario.
This really makes no sense to me unless binding Graylog on 127.0.0.1:9000 was somehow messing up outbound networking within the Java server (hanging the notification?). We don't see any HTTP calls failing, but the erratic behavior noticeably stopped once nginx was removed from in front of the web interface (i.e., we bound web_listen_uri to the private interface and killed off nginx).
I've attached the proxy + graylog config we were using.
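Since the attachments aren't inline here, the change amounted to something like the following sketch of server.conf (the addresses are placeholders, not our real config):

```
# Before: web interface on loopback, nginx proxying in front
#web_listen_uri = http://127.0.0.1:9000/

# After: bound directly to the private interface, nginx removed
rest_listen_uri = http://10.0.0.10:12900/
web_listen_uri  = http://10.0.0.10:9000/
```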
@avongluck-r1soft, I have the same issue with OpsGenie, but I have no nginx in front of the Graylog server. Could you please paste your Graylog server config so I can compare it to mine? Thanks.
This issue has been driving us nuts.
The old alerting system has been replaced with the alerts and events system in the meantime. This issue shouldn't happen anymore. Please open a new issue if this is still a problem. Thank you!
We see an issue where the OpsGenie plugin for Graylog prevents all alert conditions from firing in 2.2.2 (with zero error messages).
After uninstalling the OpsGenie plugin and restarting Graylog, the issue went away and alert conditions in streams started working. We've had an issue open with OpsGenie for a week now about it.