
Event handling functions (for central reporting). #2442


hjoliver commented 6 years ago

Several sites (including mine) have a need for a central DB of routine events across all suites, to enable full-system analysis and reports without having to know where all the operational suites are.

IMO log scraping (or, similarly, suite DB scraping) is not a good idea, because:

- the scraper program would need to know about all suites;
- our suite log content is not well standardised;
- continual DB reads might affect suite performance;
- suite logs get rolled over, and logs and DBs get obliterated on cold start (so events could be missed if there's a scraper outage);
- and the scraper program itself would need monitoring, etc.

We could use the existing event handlers for this, but they may be too heavy for reporting every task event [because each call executes a script in a subshell].

So, I propose we allow suite daemons to push routine event data to user-defined functions that know what to do with it (e.g. publish an event message to Kafka, write to a central DB, or write to syslog). [By "user-defined" I mean that the core functionality is Cylc passing a defined data structure to a defined function interface; what the function does with the data is up to the user- (or site-) defined application, although we could supply some built-in examples, e.g. one that writes to syslog.]

This would be easy to implement, and I think it avoids all of the above problems with log scraping.
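For illustration only, a minimal sketch of the kind of function interface I have in mind (the function name and event keys here are hypothetical, not a settled Cylc API):

```python
import syslog

def report_event(event):
    """Hypothetical user-defined handler: the suite daemon passes in a
    dict of routine event data; what we do with it is up to the site."""
    # e.g. the built-in syslog example mentioned above:
    syslog.syslog(
        syslog.LOG_INFO,
        "suite={suite} task={task} event={event}".format(**event)
    )
```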

(This is motivated by the same project as the new external event triggers, and I think this will be widely useful as well).

@cylc/core - do you agree?

hjoliver commented 6 years ago

I suppose this could be thought of as event handler functions (rather than scripts)... but the information provided could possibly go beyond events as such.

hjoliver commented 6 years ago

Assuming we agree, thoughts on implementation:

Synchronous calls (i.e. in the main process) to a "logging function" (need a better term?) would be trivial to implement, if we can assume each call takes negligible time. But we should probably queue the calls to a background process (or process pool) in case they are slow.

matthewrmshin commented 6 years ago

Can we simply add a handler to our logger, using one of the Python standard library's logging.handlers with some filters?

See also #386.

hjoliver commented 6 years ago

I think std lib logging is only for logging to local files, no? For the purposes of this issue I'm using the term "logging" in the loosest possible sense, as in "the kind of information that typically gets logged". The point is, the central "log" is likely to be a DB, and it is likely to be on the other end of a message broker (e.g. Kafka) that aggregates information from multiple suites and from other sources such as a PBS log scraper. That being the case, it seems to me we either need "plugin" functions that receive the data and send it wherever it needs to go (Kafka in the BoM case), OR we need to log-scrape (or suite-DB-scrape) all suites - which has all the problems I mentioned above.

matthewrmshin commented 6 years ago

No, the logging library is very extensible. Even with just the standard set of handlers, you can send logs to the system log or to a socket, and the logging.Handler class can be extended to do anything. So I think it is best to exploit that instead of creating a custom protocol.
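For example, a minimal sketch of the standard-library route (the filter criterion and logger name below are illustrative only):

```python
import logging
import logging.handlers

class TaskEventFilter(logging.Filter):
    """Pass only records that look like task events (illustrative test)."""
    def filter(self, record):
        return "task event" in record.getMessage()

# Route matching suite log records to the local syslog daemon; a
# logging.Handler subclass could push them anywhere else instead.
handler = logging.handlers.SysLogHandler(address="/dev/log")
handler.addFilter(TaskEventFilter())
logging.getLogger("suite").addHandler(handler)
```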

hjoliver commented 6 years ago

OK, that's interesting. I'll look into this later ...

hjoliver commented 6 years ago

@matthewrmshin - on reflection, I'm not convinced by the logging library suggestion. I'm really proposing something simpler and more general, and it doesn't involve creating a custom protocol.

I envisage simply sending (periodically, as "loggable" events occur) a data structure of event data to a user-designated function that can do what it likes with the data. As far as Cylc is concerned that's it, job done, except for one thing: this function could be called a large number of times and we can't be sure that it will return super quickly, hence my musings about queuing calls to a background process (or pool).
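Something like this, say (purely a sketch; the names here are illustrative, not a settled design):

```python
from multiprocessing import Pool

# Keep the suite main process responsive by handing each handler call
# to a small worker pool rather than calling it synchronously.
pool = Pool(processes=2)

def dispatch(handler_func, event):
    """Fire-and-forget: the worker absorbs any slowness in handler_func."""
    # handler_func must be a picklable (module-level) function.
    pool.apply_async(handler_func, (event,))
```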

It may be appropriate to use the Python logging library inside one of these functions, but that is up to the user or site. Although I suppose we could supply a built-in function for logging to syslog.

If the intent is to send data to a central reporting DB via a message broker, the "message" formulated inside the function, from the event data, will likely not even be a string (e.g. a list of DB column data).
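For instance (with hypothetical keys, just to illustrate):

```python
def to_db_row(event):
    """Turn an event data structure into a list of column values for a
    central events table - no string formatting involved."""
    return [event["time"], event["suite"], event["task"], event["event"]]
```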

matthewrmshin commented 6 years ago

@hjoliver OK, I guess I got confused by the word "logging" here. I can now see that you are really talking about pushing data, on events, to a set of targets (or listeners? or observers?).

However, I am probably still missing the point here on event handlers. Can we not just have another built-in event handler that can do this sort of stuff? The email notification built-in event handler is effectively something like this - with multiple events being grouped together in a single message - the receiving end happens to be an SMTP server and the message happens to be a formatted email, but these can be anything really. (We'll probably need to refactor the event handler logic somewhat so all the different types of event handlers can have their own extension points in a plugin architecture.)

hjoliver commented 6 years ago

@matthewrmshin - fair enough, I can see how the word "logging" might have led you to believe I was actually talking about logging :grinning: [UPDATE: better title added to issue]

Maybe I'm wrong, but I was concerned that event handlers - being executables launched in a sub-shell - are too heavy-weight to use for every event (this is for routine events, not exceptional events).

Hence my suggestion to use functions rather than scripts. As per my comment above (https://github.com/cylc/cylc/issues/2442#issuecomment-335304902), my proposal essentially amounts to event handler functions (presumably lighter weight than scripts: just the Python pool process, with no additional execution of a standalone script in a sub-shell).

In fact I've already added the capability to execute functions in the process pool, in #2423.

We could additionally allow aggregation (like the emails as you say) over some interval.

So I think we are actually in agreement now, if you agree to handler functions instead of (or as well as) scripts.

> ... plugin architecture

We should talk more about this via email. Here, and in #2423, I'm using the term "plugin" very loosely: you can make and activate a new plugin by simply writing a new function of the right form and putting it in the right place.

matthewrmshin commented 6 years ago

OK.

I can see that we can probably do the same for something like the GUI. A suite currently generates the suite state summary regardless of whether a GUI is connected to it or not. It would be nice to only do so when a GUI is connected: a GUI would start up a listener and ask the suite to push data to it on events, and a suite at a quiet time would no longer get polled by connecting GUIs all the time.

About plugins: I think we are on the same wavelength here. I am really talking about a common interface for a set of functional modules. I am not suggesting a system for plugin installation.

hjoliver commented 6 years ago

> we can probably do the same for something like the GUI ...

That is a good idea! I had not thought of that.

> I am not suggesting a system for plugin installation.

I was just wondering if you were thinking of some kind of "registration" system, where the user determines what plugins are activated. However, I guess you weren't, and on reflection, in this context that would be pointless because "activated" just means available for use, not necessarily being used.

hjoliver commented 6 years ago

[Description and title above updated for - hopefully - better clarity]

hjoliver commented 6 years ago

@matthewrmshin - this proposal basically amounts to supporting function (in addition to script) event handlers, with the functions called asynchronously in the process pool in case they're a bit slow. I am assuming this would provide a significant performance advantage under heavy use (e.g. for reporting all events routinely) even when executing these functions in the process pool - would you agree that's a valid assumption?

matthewrmshin commented 6 years ago

The assumption is most likely correct.