anderseknert / roast

Roast is an optimized JSON format for Rego ASTs, as well as some common utilities for working with it.
Apache License 2.0
4 stars 0 forks source link
ast golang golang-library library opa open-policy-agent regal rego static-analysis

Roast (Regal's Optimized AST)

Roast is an optimized JSON format for Rego ASTs, as well as some common utilities for working with it.

Roast is used by Regal, where the JSON representation of Rego's AST is used input for static analysis performed by Rego itself to determine whether policies conform to Regal's linter rules.

Goals

While this module provides a way to encode an ast.Module to an optimized JSON format, it does not provide a decoder. In other words, there's no way to turn optimized AST JSON back into an ast.Module (or other AST types). While this would be possible to do, there's no real need for that given our current use-case for this format, which is to help work with the AST efficiently in Rego. Roast should not be considered a general purpose format for serializing the Rego AST.

Differences

The following section outlines the differences between the original AST JSON format and the Roast format.

Compact location format

The perhaps most visually apparent change to the AST JSON format is how location attributes are represented. These attributes are everywhere in the AST, so optimizing these for fast traversal has a huge impact on both the size of the format and the speed at which it can be processed.

In the original AST, a location is represented as an object:

{
  "file": "p.rego",
  "row": 5,
  "col": 1,
  "text": "Y29sbGVjdGlvbg=="
}

And in the optimized format as a string:

"5:1:5:11"

The first two numbers are present in both formats, i.e. row and col. In the optimized format, the third and fourth number is the end location, determined by the length of the text attribute decoded. In this case Y29sbGVjdGlvbg== decodes to collection, which is 10 characters long. The end location is therefore 5:11. The text can later be retrieved when needed using the original source document as a lookup table of sorts.

While this may come with a small cost for when the location is actually needed, it's a huge win for when it's not. Having to split the result and parse the row and column values when needed occurs some overhead, but only a small percentage of location attributes are commonly used in practice.

Note that the file attribute is omitted entirely in the optimized format, as this would otherwise have to be repeated for each location value. This can easily be retrieved by other means.

"Empty" rule and else bodies

Rego rules don't necessarily have a body, or at least not one that's printed. Examples of this include:

package policy

default rule := "value"

map["key"] := "value"

collection contains "value"

OPA represents such rules internally (that is, in the AST) as having a body with a single expression containing the boolean value true. This creates a uniform way to represent rules, so a rule like:

collection contains "value"

Would in the AST be identical to:

collection contains "value" if {
    true
}

And in the OPA JSON AST format:

{
  "body": [
    {
      "index": 0,
      "location": {
        "file": "p.rego",
        "row": 5,
        "col": 1,
        "text": "Y29sbGVjdGlvbg=="
      },
      "terms": {
        "location": {
          "file": "p.rego",
          "row": 5,
          "col": 1,
          "text": "Y29sbGVjdGlvbg=="
        },
        "type": "boolean",
        "value": true
      }
    }
  ],
  "head": {
    "name": "collection",
    "key": {
      "location": {
        "file": "p.rego",
        "row": 5,
        "col": 21,
        "text": "InZhbHVlIg=="
      },
      "type": "string",
      "value": "value"
    },
    "ref": [
      {
        "location": {
          "file": "p.rego",
          "row": 5,
          "col": 1,
          "text": "Y29sbGVjdGlvbg=="
        },
        "type": "var",
        "value": "collection"
      }
    ],
    "location": {
      "file": "p.rego",
      "row": 5,
      "col": 1,
      "text": "Y29sbGVjdGlvbiBjb250YWlucyAidmFsdWUi"
    }
  },
  "location": {
    "file": "p.rego",
    "row": 5,
    "col": 1,
    "text": "Y29sbGVjdGlvbg=="
  }
}

Notice how there's 20 lines of JSON just to represent the body, even though there isn't really one!

The optimized Rego AST format discards generated bodies entirely, and the same rule would be represented as:

{
  "head": {
    "location": "5:1:5:11",
    "name": "collection",
    "ref": [
      {
        "location": "5:1:5:11",
        "type": "var",
        "value": "collection"
      }
    ],
    "key": {
      "type": "string",
      "value": "value",
      "location": "5:21:5:27"
    }
  },
  "location": "5:1:5:11"
}

Note that this applies equally to empty else bodies, which are represented the same way in the original AST, and omitted entirely in the optimized format.

Similarly, Roast discards location attributes from attributes that don't have an actual location in the source code. An example of this is the data term of a package path, which is present only in the AST.

Removed annotations attribute from module

OPA already attaches annotations to rules. With the Roast format attaching package and subpackages scoped annotations to the package as well, there is no need to store annotations at the module level, as that's effectively just duplicating data. Having this removed can save a considerable amount of space in well-documented policies, as they should be!

Removed index attribute from body expressions

In the original AST, each expression in a body carries a numeric index attribute. While this doesn't take much space, it is largely redundant, as the same number can be inferred from the order of the expressions in the body array. It's therefore been removed from the Roast format.

Removedname attribute from rule heads

The name attribute found in the OPA AST for rules is unreliable, as it's not always present. The ref attribute however always is. While this doesn't come with any real cost in terms of AST size or performance, consistency is key.

Fixed inconsistencies in the original Rego AST

A few inconsistencies exist in the original AST JSON format:

Fixing these in the original format would be a breaking change. The Roast format corrects these inconsistencies, and uses text and location consistently.

Performance

While the numbers may vary some, the Roast format is currently about 40-50% smaller in size than the original AST JSON format, and can be processed (in Rego, using walk and so on) about 1.25 times faster.

Potential improvements

Replace text in location string with end location

While it's not known how much of an impact it has on traversal speed / performance, a large chunk of the bytes in the current optimized AST are base64-encoded text values in location strings. These could be replaced with the end location of any given node. Rather than base64-encoding the the bytes of the text in serialization, we could instead count the number of newlines in the text, and from that plus the number of bytes on the last line (if more than one) determine the end location. It would then be assumed that the client has the means to translate that into the equivalence of text where necessary. Regal could for example easily do this from input.regal.file.lines.