Graylog2 / graylog2-server

Free and open log management
https://www.graylog.org

Explicitly define when to first trigger the search when creating event definitions #11166

Open · boogity opened 3 years ago

boogity commented 3 years ago

What?

Provide a mechanism for explicitly setting the first search time when creating event definitions. There doesn't currently appear to be any way to define when Graylog should first begin searching for an event.

Why?

I imagine this is probably a somewhat niche desire, since most event definitions are set to "search within every x seconds" or "y minutes", but I have some searches that, for operational reasons, can only be run every 24 hours. I monitor logs for some embedded readers on the International Space Station that are only downlinked from orbit once per day. Ideally, the event definition would let me explicitly define when to first begin searching through logs for a defined event. This way I can ensure we get email alerts in a timely manner and know that updating/modifying the event definition doesn't change when the email alert is generated. See my Graylog community post for more details.
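To make the ask concrete, here's a minimal sketch of the semantics I'm imagining (hypothetical Java, not anything that exists in Graylog today; the 02:00 UTC anchor is just an example value):

```java
import java.time.Clock;
import java.time.Instant;
import java.time.LocalTime;
import java.time.ZonedDateTime;

public class FirstTriggerTime {
    // Sketch of the requested behavior: instead of first running "now",
    // clamp the first execution of a 24h-interval search to a fixed
    // anchor time of day.
    static Instant firstTargetTime(LocalTime anchor, Clock clock) {
        ZonedDateTime now = ZonedDateTime.now(clock);
        ZonedDateTime candidate = now.with(anchor);
        if (!candidate.isAfter(now)) {
            candidate = candidate.plusDays(1); // today's anchor already passed; start tomorrow
        }
        return candidate.toInstant();
    }

    public static void main(String[] args) {
        System.out.println(firstTargetTime(LocalTime.of(2, 0), Clock.systemUTC()));
    }
}
```

With that, saving or editing the definition wouldn't move the alert time; every run would stay pinned to the anchor.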


kroepke commented 3 years ago

Hey @boogity, do I understand correctly that this is both about understanding when it will run the job and also about specifying when the window "starts" (i.e. clamping it down to, let's say, midnight)?

Thanks!

boogity commented 3 years ago

Hi @kroepke, sorry for the late response; I've been out sick the last few days. Yes, I think you understand exactly right. The "perfect" feature I'm imagining would let you both specify when the window starts and provide some feedback on when the job will run.

Unrelated to the feature request: if there's currently a way to know exactly when a search will fire off in Graylog, I'd love to know what it is. I poked around for a while in MongoDB and found things that seemed promising (db.scheduler_job_definitions and db.event_definitions), but the data in those entries didn't match up with when I could empirically confirm the search was happening. No problem if you don't have an answer to this; I know it's outside the scope of this feature request.
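For anyone poking at the same thing, this is roughly what I was trying (assuming a scheduler_triggers collection with a next_time field, which may differ by Graylog version; the connection details are placeholders):

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class InspectTriggers {
    public static void main(String[] args) {
        // Placeholder connection string and database name; adjust for your deployment.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> triggers = client.getDatabase("graylog")
                    .getCollection("scheduler_triggers");
            // Print when each trigger is next due, alongside its job definition id.
            for (Document t : triggers.find()) {
                System.out.println(t.get("next_time") + "  job_definition_id=" + t.get("job_definition_id"));
            }
        }
    }
}
```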

kroepke commented 3 years ago

@boogity Hope you are feeling better!

Thanks for the confirmation. As to your question, there are two components to this system:

1. The event definition itself, containing all config relating to what the search/aggregation is, fields, notifications, etc.
2. The job schedule/job trigger part, which the scheduler uses to determine what to run next.

When an event definition is scheduled (i.e. it is created as enabled, or enabled later), a job trigger is created with a target time of "now". The scheduler will then run it at the next possible time (this depends on thread pool usage, whether it's running in cluster mode or not, and of course the scheduler's polling interval).
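Very roughly, the flow looks like this (an illustrative sketch, not our actual scheduler code; the one-second polling interval and the TriggerStore/JobTrigger types are made up for the example):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

// A newly enabled event definition gets a trigger with target time "now";
// the loop below picks up whatever is due on its next polling pass,
// capacity permitting.
record JobTrigger(String jobDefinitionId, Instant nextTime) {}

interface TriggerStore {
    // Atomically claim one trigger whose target time is <= now, if any.
    Optional<JobTrigger> lockNextDueTrigger(Instant now);
}

class SchedulerLoop {
    private final TriggerStore store;
    private final Duration pollInterval = Duration.ofSeconds(1); // made-up value

    SchedulerLoop(TriggerStore store) {
        this.store = store;
    }

    void run() throws InterruptedException {
        while (true) {
            store.lockNextDueTrigger(Instant.now())
                 .ifPresent(t -> System.out.println("running job " + t.jobDefinitionId()));
            Thread.sleep(pollInterval.toMillis());
        }
    }
}
```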

The key detail is the following: after each run is triggered, the scheduler asks the job when it needs to run again. Graylog currently has two schedules: Once and Interval. The latter simply adds the configured interval to the last target time (i.e. the scheduled time, not the actual time it ran, which might be later).
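So a 24h schedule that was due at 02:00 but actually ran at 02:10 still targets 02:00 the next day. Sketched out (again illustrative, not the real implementation):

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of the Interval schedule rule: the next target time is derived from
// the previous *target* time, not from when the job actually ran, so a late
// run doesn't shift the schedule.
final class IntervalSchedule {
    private final Duration interval;

    IntervalSchedule(Duration interval) {
        this.interval = interval;
    }

    Instant nextTime(Instant lastTargetTime) {
        return lastTargetTime.plus(interval);
    }
}
```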

I'd be curious to know what you are seeing in terms of odd times. The only explanation I have is that the single node that runs the jobs doesn't have enough threads to do so (in Enterprise, Graylog uses the entire cluster of machines to schedule jobs). This could potentially cause jobs to be delayed; however, Graylog makes sure to adjust the search times so they are all correct and you aren't missing any intervals, it's just that it will take longer to catch up. You can check Graylog's metrics for something named JobWorkerPool and look at the gauges in there (waiting_for_slots, free_slots, total_slots) to get an idea of whether it's backed up.
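If you want to script that check, something along these lines should work against the REST API (the metrics namespace below is a guess on my part, so confirm the exact JobWorkerPool metric names on your instance; host and credentials are placeholders):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class JobWorkerPoolCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder host and credentials; the metrics namespace is assumed.
        String auth = Base64.getEncoder()
                .encodeToString("admin:password".getBytes());
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://graylog.example.com:9000/api/system/metrics/namespace/org.graylog.scheduler"))
                .header("Authorization", "Basic " + auth)
                .header("Accept", "application/json")
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // Look for waiting_for_slots, free_slots and total_slots in the output.
        System.out.println(response.body());
    }
}
```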

I believe the thread pool size is currently hardcoded but would need to verify with some folks.

lordmundi commented 3 years ago

I think one of the fundamental issues is that perhaps there is an assumption that the data is streaming into graylog somewhat live. In this case, the spacecraft could be without signal for a day or two and then we download 2 days worth of logs. Also the time of day those logs arrive are unknown ahead of time as they aren't on a schedule. So once those 2 days of logs are loaded, we need some way to search through the new 2 days of data we just loaded. And the next time we load data (again unknown) we don't want to search and send duplicate alerts. Hopefully that makes sense.