brexhq / substation

Substation is a toolkit for routing, normalizing, and enriching security event and audit logs.
https://substation.readme.io
MIT License

Support for Unit Testing of Transforms in Substation #239

Closed britton-from-notion closed 1 month ago

britton-from-notion commented 1 month ago

Is your feature request related to a problem? Please describe.

When looking into adopting Substation as a potential solution, one of the pieces that made it difficult to commit fully was the absence of built-in unit testing to ensure that transforms function as expected prior to being deployed.

I was hoping for unit tests that would let me write a transform for a given data source and then, on a PR or through a command-line utility, run the test to verify that the transform works. The test would ideally print a success or failure result to the console and return an exit code. This would help me trust my changes and speed up collaboration with teammates.

Describe the solution you'd like

One of the components that I think could really enhance the adoption and utility of Substation is integrated support for unit testing transforms.

What I would hope to see is the ability to specify a unit test very close to the transform itself, whether in the same file or a neighboring file. There you could assert things like "field foo matches string bar" or "field foo is a number"; any of the other conditional statements baked into the Substation transform library could be used as validation checks to guarantee that the output field matches the assertion.

One idea for what this could look like came out of chatting about this on Slack (100% @jshlbrd's cool idea here):

// Toy example of an embedded unit testing framework that reuses existing
// Substation packages. The `substation-test` app has this workflow:
// 1. Compile `transforms` (which modify data).
// 2. Compile `tests[].transforms` (which create input test data).
// 3. Compile `tests[].condition` (which checks the output test data).
// 4. Run the tests against the `transforms`.
local sub = import '../../../substation.libsonnet';

local utest = [
  sub.tf.obj.insert({ obj: { target_key: 'foo' }, value: 'bar' }),
  sub.tf.send.stdout(),  // Optional: print the input data to stdout for debugging.
];

{
  tests: [
    {
      name: "substation-pass",
      // Use transforms to create input data for the test.
      transforms: utest,
      // Apply a condition to the output data to check the test result.
      condition: sub.cnd.str.eq({ obj: { source_key: 'baz' }, value: 'qux' }),
    },
    {
      name: "substation-fail",
      // Use transforms to create input data for the test.
      transforms: utest,
      // Apply a condition to the output data to check the test result.
      condition: sub.cnd.num.len.eq({ value: 0 }),
    },
  ],
  // Each test is run against these transforms.
  transforms: [ 
    sub.tf.obj.insert({ obj: { target_key: 'baz' }, value: 'qux' }),
  ]
}
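To make the four-step workflow in the comments above concrete, here is a minimal Python sketch of what such a test runner could do. It is not Substation's implementation: `obj_insert`, `str_eq`, `len_eq`, and `run_tests` are hypothetical stand-ins for the Jsonnet functions in the toy example, and `len_eq` counts message fields, which is a simplification made only for this sketch.

```python
from typing import Any, Callable, Dict, List

Message = Dict[str, Any]
Transform = Callable[[Message], Message]
Condition = Callable[[Message], bool]


def obj_insert(target_key: str, value: Any) -> Transform:
    """Hypothetical stand-in for sub.tf.obj.insert: set one key on the message."""
    def apply(msg: Message) -> Message:
        return {**msg, target_key: value}
    return apply


def str_eq(source_key: str, value: str) -> Condition:
    """Hypothetical stand-in for sub.cnd.str.eq: compare one field to a string."""
    def check(msg: Message) -> bool:
        return msg.get(source_key) == value
    return check


def len_eq(value: int) -> Condition:
    """Hypothetical stand-in for sub.cnd.num.len.eq (assumption: counts fields)."""
    def check(msg: Message) -> bool:
        return len(msg) == value
    return check


def run_tests(config: Dict[str, Any]) -> Dict[str, bool]:
    """Run every test against the shared transforms and record pass/fail by name."""
    results: Dict[str, bool] = {}
    for test in config["tests"]:
        msg: Message = {}
        # Use the test's own transforms to create the input data.
        for tf in test["transforms"]:
            msg = tf(msg)
        # Apply the transforms under test.
        for tf in config["transforms"]:
            msg = tf(msg)
        # Check the output data with the test's condition.
        results[test["name"]] = test["condition"](msg)
    return results


utest: List[Transform] = [obj_insert("foo", "bar")]

config = {
    "tests": [
        {"name": "substation-pass", "transforms": utest, "condition": str_eq("baz", "qux")},
        {"name": "substation-fail", "transforms": utest, "condition": len_eq(0)},
    ],
    "transforms": [obj_insert("baz", "qux")],
}

results = run_tests(config)
for name, passed in results.items():
    print(f"{'ok' if passed else 'FAIL'}\t{name}")
```

Under these assumptions the first test passes because the transform under test adds `baz: qux`, and the second fails because the output message is not empty.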

I like that the data, tests, and transforms are all baked into a single file, which makes the connection between the tests and their business logic very transparent.

Where my idea might deviate from this is that I would hope to see this functionality built into a single substation binary. I would prefer to be able to run substation test *.jsonnet directly from my CLI, and likewise substation build *.jsonnet to build a Substation app or substation run *.json to run it. (This could be opened as a separate issue, since I know it deviates from how Substation works, but a central, vended entrypoint for my Substation workflow, including unit testing, clicks a lot better in my head than a dedicated unit testing CLI.)
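As a rough illustration of the single-entrypoint idea, the sketch below dispatches `test`, `build`, and `run` subcommands from one program and returns an exit code. Everything here is hypothetical: Substation does not ship this CLI, the handlers are placeholders that only print what they would do, and the file name passed at the end is made up for the example.

```python
import argparse
from typing import Callable, Dict, List, Optional


def cmd_test(paths: List[str]) -> int:
    """Placeholder: a real command would run each file's tests; 0 means all passed."""
    print(f"test: {len(paths)} file(s)")
    return 0


def cmd_build(paths: List[str]) -> int:
    """Placeholder: a real command would compile Jsonnet into JSON configs."""
    print(f"build: {len(paths)} file(s)")
    return 0


def cmd_run(paths: List[str]) -> int:
    """Placeholder: a real command would run an app from a compiled config."""
    print(f"run: {len(paths)} file(s)")
    return 0


# Map each subcommand name to its handler.
COMMANDS: Dict[str, Callable[[List[str]], int]] = {
    "test": cmd_test,
    "build": cmd_build,
    "run": cmd_run,
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="substation")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name in COMMANDS:
        sub = subparsers.add_parser(name)
        # The shell expands globs such as *.jsonnet before the program sees them.
        sub.add_argument("paths", nargs="+")
    return parser


def main(argv: Optional[List[str]] = None) -> int:
    """Parse the arguments, dispatch to the handler, and return its exit code."""
    args = build_parser().parse_args(argv)
    return COMMANDS[args.command](args.paths)


# Equivalent to running: substation test config.jsonnet
exit_code = main(["test", "config.jsonnet"])
```

A real entrypoint would pass the returned value to `sys.exit` so that CI can fail a PR when a test fails.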

Describe alternatives you've considered

The alternatives I've considered are primarily built into other tools. A great example is Vector's built-in support for unit testing with their VRL language. In Vector, you can easily map an input source, a transform, and an output destination for your unit test. If the input data matches your output assertions, the unit test passes.

This feature allows you to create detailed, comprehensive unit tests that validate the presence and structure of each expected field. It is also run through the same executable that runs the pipelines, just through a different CLI entrypoint, which makes the workflow very easy.

Additional context

https://vector.dev/docs/reference/configuration/unit-tests/

Example of a unit test in Vector that validates the existence of fields from the incoming data.

source: |
  assert!(exists(.message), "no message field provided")
  assert!(!is_nullish(.message), "message field is an empty string")
  assert!(is_string(.message), "message field has an unexpected type")
  assert_eq!(.message, "success", "message field had an unexpected value")
  assert!(exists(.timestamp), "no timestamp provided")
  assert!(is_timestamp(.timestamp), "timestamp is invalid")
  assert!(!exists(.other), "extraneous other field present")

jshlbrd commented 1 month ago

Thanks @britton-from-notion for the detailed write-up! It shouldn't take much effort to get this working because the necessary parts already exist.

Where my idea might deviate from this is that I would hope to see this functionality built into a single substation binary. I would prefer to be able to run substation test *.jsonnet directly from my CLI, and likewise substation build *.jsonnet to build a Substation app or substation run *.json to run it. (This could be opened as a separate issue, since I know it deviates from how Substation works, but a central, vended entrypoint for my Substation workflow, including unit testing, clicks a lot better in my head than a dedicated unit testing CLI.)

Agreed that this is a separate issue. I think what you're describing is a utility application for managing source code and configurations. Most of the apps under cmd/development could fit within that, with some changes. For example:

I'm not sure what the behavior of substation run would be, but it might encapsulate the existing file and Kinesis development apps. Since deployments are decentralized and managed by Terraform, it's unlikely that it would ever support a command like substation deploy.