accordproject / concerto

Business schema language and runtime
https://concerto.accordproject.org
Apache License 2.0

Compression for serialized objects #886

Open dselman opened 4 months ago

dselman commented 4 months ago

Feature Request 🛍️

Support compression of serialised Concerto objects.

Use Case

ASTs and serialised objects in general are verbose. They compress well due to repeated JSON properties, like $class.

Possible Solution

Provide compress/decompress functions within Concerto core or util.

Context

Detailed Description

Two approaches, which may be complementary, have been explored.

Class Map

This specifically targets the $class properties within the JSON objects produced by the Serializer. The JSON tree is visited to build a Map of all $class values in the JSON. $class entries that start with the same prefix as the root $class are shortened by removing the common prefix.

This map is used to replace the $class properties with indexes into the map, resulting in a JSON object that looks like:

{
  "$class": "1",
  "models": [
    {
      "$class": "2",
      "decorators": [],
      "namespace": "test@1.0.0",
      "imports": [],
      "declarations": [
        {
          "$class": "3",
          "name": "SSN",
          "location": {
            "$class": "4",
            "start": {
              "offset": 22,
              "line": 3,
              "column": 1,
              "$class": "5"
            },
            "end": {
              "offset": 124,
              "line": 9,
              "column": 1,
              "$class": "5"
            }
          },
          ...
        },
        ...
      ]
    }
  ],
  "$version": 1,
  "$classMap": {
    "1": ".Models",
    "2": ".Model",
    "3": ".StringScalar",
    "4": ".Range",
    "5": ".Position",
    "6": ".Decorator",
    "7": ".ConceptDeclaration",
    "8": ".StringProperty",
    "9": ".DecoratorString",
    "10": ".ObjectProperty",
    "11": ".TypeIdentifier",
    "12": ".IntegerProperty",
    "13": ".MapDeclaration",
    "14": ".StringMapKeyType",
    "15": ".StringMapValueType",
    "16": ".EnumDeclaration",
    "17": ".EnumProperty"
  },
  "$prefix": "concerto.metamodel@1.0.0"
}
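
The Class Map transformation described above could be sketched roughly as follows. This is a minimal illustration, not the implementation explored in this issue: the function names `compressClassMap`, `decompressClassMap` and `prefixOf` are hypothetical, and the rule for deriving `$prefix` (everything before the last `.` of the root `$class`) is an assumption inferred from the example output.

```javascript
// Hypothetical sketch of the Class Map approach. The function names and the
// prefix rule are assumptions, not Concerto APIs.

// Assumption: the prefix is the root $class up to (not including) its last '.'.
function prefixOf(rootClass) {
  const i = rootClass.lastIndexOf('.');
  return i < 0 ? '' : rootClass.slice(0, i);
}

function compressClassMap(root) {
  const prefix = prefixOf(root.$class);
  const indexByClass = new Map(); // full $class value -> index string

  // Copy the tree, replacing each $class value with its index in the map.
  const visit = (node) => {
    if (Array.isArray(node)) return node.map(visit);
    if (node === null || typeof node !== 'object') return node;
    const out = {};
    for (const [key, value] of Object.entries(node)) {
      if (key === '$class' && typeof value === 'string') {
        if (!indexByClass.has(value)) {
          indexByClass.set(value, String(indexByClass.size + 1));
        }
        out.$class = indexByClass.get(value);
      } else {
        out[key] = visit(value);
      }
    }
    return out;
  };

  const result = visit(root);
  const classMap = {};
  for (const [fullName, index] of indexByClass) {
    // Drop the shared prefix where present; keep other names in full.
    classMap[index] = prefix !== '' && fullName.startsWith(prefix + '.')
      ? fullName.slice(prefix.length)
      : fullName;
  }
  return { ...result, $version: 1, $classMap: classMap, $prefix: prefix };
}

function decompressClassMap(compressed) {
  const { $version, $classMap, $prefix, ...body } = compressed;
  // Assumption: shortened names start with '.', full names never do.
  const expand = (name) => (name.startsWith('.') ? $prefix + name : name);
  const visit = (node) => {
    if (Array.isArray(node)) return node.map(visit);
    if (node === null || typeof node !== 'object') return node;
    const out = {};
    for (const [key, value] of Object.entries(node)) {
      out[key] = key === '$class' ? expand($classMap[value]) : visit(value);
    }
    return out;
  };
  return visit(body);
}
```

Because indexes are assigned in visiting order, the root `$class` always receives index "1", which matches the example above. A real implementation would also need to decide how to handle source objects that already contain `$version`, `$classMap` or `$prefix` properties.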

LZ Compression

LZ compression is applied to the JSON object (either the source object as-is, or the object after the Class Map has been built), resulting in a JSON object that looks like:

{
  "compressed": "ᯡࠩƬ΀䌦㧤Ɛ䄣氧ァ☢㠥暠㨡㛻熤娠䷒䀠䁦ᄠ᥺၌䛛ࠣK嚴≄ú ",
  "format": "LZ_UTF16"
}
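
A compress/decompress pair with this kind of wrapper could look roughly like the sketch below. The issue does not say which LZ library produced the `LZ_UTF16` output, so this example swaps in DEFLATE (an LZ77-based codec) from Node's built-in `zlib` module and base64-encodes the result. The function names `compressJson` and `decompressJson` and the `DEFLATE_BASE64` format label are hypothetical.

```javascript
// Stand-in sketch: DEFLATE from Node's built-in zlib, base64-encoded.
// This is NOT the LZ_UTF16 format shown above; the names and the
// "DEFLATE_BASE64" label are hypothetical.
const zlib = require('node:zlib');

function compressJson(obj) {
  const deflated = zlib.deflateSync(Buffer.from(JSON.stringify(obj), 'utf8'));
  return { compressed: deflated.toString('base64'), format: 'DEFLATE_BASE64' };
}

function decompressJson(wrapper) {
  if (wrapper.format !== 'DEFLATE_BASE64') {
    throw new Error(`Unsupported format: ${wrapper.format}`);
  }
  const inflated = zlib.inflateSync(Buffer.from(wrapper.compressed, 'base64'));
  return JSON.parse(inflated.toString('utf8'));
}

// Repetitive input, similar in spirit to a serialised AST.
const positions = Array.from({ length: 200 }, (_, i) => ({
  $class: 'concerto.metamodel@1.0.0.Position',
  offset: i,
  line: i,
  column: 1,
}));
const wrapped = compressJson(positions);

// Compare the size of the plain JSON with the size of the wrapper JSON.
console.log(JSON.stringify(positions).length, JSON.stringify(wrapped).length);
```

The `format` field lets a reader reject payloads it does not understand, which matters if more than one encoding is ever supported.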

Results

Class Map: approximately 1.6x compression
Class Map + LZ: approximately 12x compression
Just LZ: approximately 10x compression

DS-AdamMilazzo commented 4 months ago

For LZ compression, how is the byte stream converted into a string in your example? I think you'd want to consider two things.

That said, if you're considering LZ-type compression at all, you may consider storing the result natively in binary if the storage system can handle it, rather than encoding it into a string, then encoding the string into JSON, and then encoding the JSON into UTF-8. Storing as binary wastes 0% of the bits. Cosmos DB supported binary attachments, but that feature is deprecated and they recommend moving to Azure Blob Storage instead; that has the downside of needing to talk to two services. You might consider using a different database than Cosmos DB if you're going to be storing a lot of binary data.
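
To make the cost of layering string encodings on top of binary data concrete, here is a small sketch that measures it for one case. Base64 is used purely as an assumed example of a binary-to-string encoding; it is not necessarily the encoding used in the example earlier in this issue, and the 300-byte payload is arbitrary filler.

```javascript
// Illustration only: base64 is an assumed string encoding, and the payload
// is arbitrary filler standing in for compressed bytes.
const raw = Buffer.alloc(300, 0xab);            // 300 bytes of binary data
const asBase64 = raw.toString('base64');        // 4 characters per 3 bytes
const asJson = JSON.stringify({ compressed: asBase64 });
const storedBytes = Buffer.byteLength(asJson, 'utf8');

console.log(raw.length);       // 300
console.log(asBase64.length);  // 400
console.log(storedBytes);      // larger again: adds the JSON key, quotes and braces
```

Storing `raw` directly as binary would cost exactly 300 bytes, whereas the base64 string alone is a third larger before any JSON framing is added.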

For the class map: