TurboTurtle / rig

A lightweight, flexible, easy-to-use system monitoring and event handling utility
GNU General Public License v2.0

Rig v2 - new design, renewed focus on applicability of project #51


TurboTurtle commented 1 year ago

It's been a long time since rig has received any attention or updates. That's completely on me as I've been tied up with other projects at work that have taken me away from here.

In the interim, however, I've been scoping out a new design for rig that makes it easier to extend and maintain, while also being easier on end users and more in line with other modern tools.

New design

As things currently stand, rig is designed around the concept of "a rig watches for one condition and then does one or many actions in response". While simple in concept, the underlying code for building "a rig" was...not the cleanest design. CLI commands were conflated with the handling of rig creation at a fundamental level, which leads to extensibility issues.

The new design changes this by instead making a rig "the" backgrounded process, from which one or many "monitors" may be launched; when any of those monitors detect their specified conditions, the rig triggers one or many actions in response. In other words, whereas before we would have "a logs rig" that watches for specific messages, we now have "a rig that monitors logs for a message, and possibly monitors other things as well".

By making this abstraction, we can also re-arrange a number of code flows, making it easier to establish new commands and new functionality without such large knock-on effects on the whole project.
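
In rough pseudocode terms, the shape of that abstraction is something like the sketch below. To be clear, these class and method names are just shorthand for this discussion, not the actual rig-v2 code:

# Illustrative sketch only: names mirror the design described above and
# are not taken from the rig-v2 codebase.
class Monitor:
    """Watches for a single condition (a log message, tcp flags, etc)."""
    def watch(self) -> bool:
        raise NotImplementedError

class Action:
    """A response (an sos report, a noop, etc) run when a monitor fires."""
    def trigger(self) -> None:
        raise NotImplementedError

class Rig:
    """The single backgrounded process owning many monitors and actions."""
    def __init__(self, monitors: list[Monitor], actions: list[Action]):
        self.monitors = monitors
        self.actions = actions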

Further, with so many rigs specifying rig-specific options, the CLI experience was, frankly, painful. One rig may have used the --interface option, while another used --iface and yet another needed --ifname, all to reference the same physical piece of hardware.

v2 will resolve this by transitioning to yaml-formatted configuration files, or "rigfiles". Similar to ansible playbooks, these rigfiles will serve as the way to configure new rigs on a system. Rather than having a CLI subcommand for each rig type, there will simply be a main rig create command that specifies a rigfile to use, which we will then parse to configure and deploy a rig as described.

By moving to this kind of deployment model, we simplify a number of aspects of the project.

As an example, the rig that would previously have been deployed by the CLI command

rig logs --interval 5 --message 'this is my test message' --logfile='/var/log/messages','/var/log/foo/bar.log' --journals='foo' --count=5 --sosreport --only-plugins kernel,logs --initial-sos

would instead be described by this rigfile:

name: myrig
interval: 5
monitors:
  logs:
    message: this is my test message
    files:
      - /var/log/messages
      - /var/log/foo/bar.log
    journals: foo
    count: 5
actions:
  sos:
    report:
      initial_sos: true
      only_plugins:
        - kernel
        - logs

Which is far more grokkable, and far more reusable.
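
To sketch the general deployment flow that implies (the function and dispatch here are simplified illustration, not the actual create implementation):

# Simplified illustration of rigfile-driven deployment; not the real
# create code, just the general flow it implies.
import yaml

def load_rigfile(path):
    with open(path) as f:
        return yaml.safe_load(f)

config = load_rigfile("myrig.yaml")
print(f"deploying rig {config['name']!r}")
for name, opts in config.get("monitors", {}).items():
    print(f"  would launch monitor {name!r} with {opts}")
for name, opts in config.get("actions", {}).items():
    print(f"  would register action {name!r} with {opts}")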

I am beginning these changes on the rig-v2 branch and will be working my way through transitioning the various monitors, actions, and commands to this new design. Once done, I'll flip the changes over to master (or rather, it will be main at that point most likely).

Comments, feedback, and suggestions are surely welcome.

TurboTurtle commented 1 year ago

@juanmasg I'm not sure if you're still interested in this project or not - but I've created the base create command, ported the existing rigs to monitors, and ported the noop action to use for testing, if you'd like to take a look at this point. The packet monitor has been working in my local tests, but there are definitely areas of it I haven't been able to test directly.

Docs have yet to be updated, so check the git logs if you're interested. A basic rigfile for the new packet monitor would be something like:

name: mypacketrig
monitors:
  packet:
    interface: eth0
    tcpflags:
      - FIN
      - RST
actions:
  noop:
    enabled: true

Which you would then feed to rig via rig create -f mypacketrig.yaml.

juanmasg commented 1 year ago

@TurboTurtle yes, I'm still interested. I've been quite busy lately with work and some personal stuff, but just this week I was thinking about how to resume contributing to this project.

It's great to read about the new changes; they will also help a lot with all the testing when writing new rigs. I'll try to get familiar with the new code and start working on some ideas I have in mind, and also on all the other networking-related rigs we discussed.

TurboTurtle commented 1 year ago

I've just completed porting the existing actions to the new design. In no particular order, I still need to do the following before merging the rig-v2 branch to main:

  1. Pivot the repo from master to main ;)
  2. Port the info command
  3. Port the trigger command
  4. Port the destroy command
  5. Port the list command
  6. Update docs

2 thru 5 will all generally require the ability to communicate with the backgrounded processes via the rig's socket. For this version, I want to do that using pickle, whereas the current implementation avoided pickles as heavy-handed for what we were doing.
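
Roughly, I'm picturing the exchange looking something like the sketch below; the socket path and the message shape are placeholders, not the final protocol:

# Placeholder sketch of pickle-over-socket messaging; the path and
# message format here are illustrative, not the final protocol.
import pickle
import socket

SOCK_PATH = "/var/run/rig/myrig.sock"  # hypothetical socket location

def send_command(command, **kwargs):
    """Pickle a command, send it over the rig's socket, unpickle the reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(SOCK_PATH)
        sock.sendall(pickle.dumps({"command": command, **kwargs}))
        # a real implementation would loop on recv() for larger replies
        return pickle.loads(sock.recv(4096))

# e.g. roughly what an info subcommand would do under the hood
print(send_command("info"))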

juanmasg commented 1 year ago

@TurboTurtle I've been testing the packet monitor from rig-v2, everything seems to be working as expected.

I'm working on the new networking-related monitors; all the current work is available here. Let me know if you need a hand with porting all the code to the new design or the new communication protocol.

TurboTurtle commented 1 year ago

As a small tracking update: as of #59, the subcommands are all now ported. After some internal discussions, there is a desire for a repository of sorts for rigfiles. This way, support engineers could build reference configurations for known issues, instead of relying on providing the raw rigfiles to customers.

This leaves, again in no particular order, the repo work, the test suite, and the remaining docs updates.

As interesting as the repo work is, by far the most critical item is the test suite. Right now it's just been local per-PR testing, but we need something more flexible and scalable. As with sos, the problem we run into here is that our project not only requires root privileges, but also has specific environment requirements for validating tests, which can make automated test suites more difficult to build, or at least more fragile.

I think what I'm going to do personally is plan out a test suite design while working through the docs update. By the time docs are done (lol, sure) I should at least have a decent idea of what a test suite would look like. Can we get by with the base unittest? Do we need to use something like avocado-framework? I'm 99.9% certain the testing environment infrastructure won't be a problem, because we can run this on GCP courtesy of Red Hat. I think leveraging Cirrus CI as the orchestration mechanism there is also the smart play - but I'm certainly open to hearing other opinions (tagging @juanmasg for visibility on this).
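
For a sense of scale, a base-unittest case could be as small as the sketch below, assuming rig is installed and the suite runs as root; the rigfile contents are throwaway:

# Illustrative only: assumes the rig CLI is on PATH and the suite runs
# with root privileges; the rigfile below is a throwaway for the test.
import subprocess
import tempfile
import unittest

RIGFILE = """\
name: testrig
monitors:
  logs:
    message: test suite trigger
    files:
      - /var/log/messages
actions:
  noop:
    enabled: true
"""

class RigCreateTest(unittest.TestCase):
    def test_create_from_rigfile(self):
        with tempfile.NamedTemporaryFile("w", suffix=".yaml") as rf:
            rf.write(RIGFILE)
            rf.flush()
            result = subprocess.run(["rig", "create", "-f", rf.name],
                                    capture_output=True, text=True)
            self.assertEqual(result.returncode, 0, result.stderr)

if __name__ == "__main__":
    unittest.main()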

TurboTurtle commented 1 year ago

Right, went silent there for a moment. I have since changed employers and am no longer sure of the future of this project as it relates to EL distributions, which would be the main use case of something like rig.

Part of me says that if we don't have a distribution story for EL, then there isn't much point in continuing the project. Another part however sees that this project could theoretically be picked up elsewhere, provided there is some form of evangelism for it.

At the moment, v2 is close to being ready for a formal release, so let me jot down what's left and the general state of things.

Docs are almost complete, just missing a manpage entry for the packet monitor, I believe. We could release v2 without this particular doc if needed, however (not great, but it's not release-blocking imo).

A hosted repo solution is a ways off, I think. Somewhat ironically, the easiest way to drive uptake of rig is this very integration. I am unsure if I have the time to dedicate to building out the possibility of a hosted repo solution without it being part of my day-to-day employment.

Jinja templating is also mainly useful in conjunction with a repo. It buys us some limited flexibility locally, but what really drives the usefulness of jinja integration is something like rig create -t some_rig_repo/some_template --vars /some/vars/file, and/or a local yaml file that includes some_template from a hosted repo and defines the vars there.
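
As a purely local illustration of what the templating buys us (the template text and vars here are invented, not anything shipped with rig):

# Invented example of rendering a jinja-templated rigfile locally;
# the template and variables are hypothetical, not from rig itself.
from jinja2 import Template

TEMPLATE = """\
name: {{ rig_name }}
monitors:
  logs:
    message: {{ message }}
actions:
  noop:
    enabled: true
"""

rendered = Template(TEMPLATE).render(
    rig_name="myrig",
    message="this is my test message",
)
print(rendered)  # a complete rigfile, ready to write to disk and deploy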

If we don't have a distribution path with EL, we also intrinsically lose the GCP access from RH. Any CI testing would be out of pocket, which I have mixed feelings about, mainly due to the concern above about whether people and teams will actually use rig.