eHarmony / aloha

A Scala-based feature generation and modeling framework
http://eharmony.github.io/aloha
MIT License

Consider Avro Score format changes #191

Open deaktator opened 7 years ago

deaktator commented 7 years ago

Current Implementation

The current aloha_avro_score.avdl uses Avro's built-in union type to encode co-products. As such, the description of a score looks like:

record Score {
    union { null, ModelId } model = null;

    union {
      null, boolean, int, long, float, double, string,
      array<union{boolean, int, long, float, double, string}>
    } value = null;

    union { array<Score>, null } subvalues = [];
    union { array<string>, null } errorMsgs = [];
    union { array<string>, null } missingVarNames = [];
    union { null, float } prob = null;
}

Deficiencies

There are some deficiencies with this approach. Using the avro-tools tool chain to compile this schema to Java code results in a Score.value of type Object, since Java has poor support for co-products.

Most non-default tool chains don't support union types with three or more constituent types, and they only support two-type unions when one of the types is null. The latter is supported in order to encode Option types in Scala.

Additionally, Avro unions can't contain more than one array type. Therefore, an array type appearing in a union must use a mixed element type, even if that's not desired (it's a laxer type than necessary).

Proposal

This proposal is to make the Avro Score message more like the protobuf version, which encodes co-products via multiple nullable fields (one per desired output type) plus a type indicator field.

While the type indicator field isn't strictly necessary, it can speed lookups from O(T) (where T is the number of types) to O(1). Obviously, for this to be type-safe, we need type classes (one instance per possible type) to extract the value from the Score.

This approach will work better with current tool chains, which means we will no longer have to special-case Aloha scores.

Avro IDL for Aloha Score

So, the proposed Avro description of Score would be something like:

record Score {
    union { null, ModelId } model = null;

    enum ValueType {
        NO_VALUE,

        BOOLEAN, INT, LONG, FLOAT, DOUBLE, STRING, 

        ARRAY_BOOLEAN, ARRAY_INT, ARRAY_LONG, ARRAY_FLOAT, 
        ARRAY_DOUBLE, ARRAY_STRING
    }

    /*
      Indicates which one of the *Value fields is populated.  If the model 
      could not produce a value, value_type should be set to NO_VALUE.
    */
    union { null, ValueType      } value_type = null;

    /* 
      Value types:  Since this is a representation of a coproduct, either 0 or
      1 of the following should be populated.  If 0 *Value fields are 
      populated, the value_type field should be NO_VALUE.  Otherwise, 
      value_type should be set to the type whose *Value field is populated.
    */
    union { null, boolean        } booleanValue      = null;
    union { null, int            } intValue          = null;
    union { null, long           } longValue         = null;
    union { null, float          } floatValue        = null;
    union { null, double         } doubleValue       = null;
    union { null, string         } stringValue       = null;
    union { null, array<boolean> } booleanArrayValue = null;
    union { null, array<int>     } intArrayValue     = null;
    union { null, array<long>    } longArrayValue    = null;
    union { null, array<float>   } floatArrayValue   = null;
    union { null, array<double>  } doubleArrayValue  = null;
    union { null, array<string>  } stringArrayValue  = null;

    /*
      Previously, these three values had a default value of an empty list 
      rather than null, but certain consumers of Aloha have said this is 
      an issue with their Avro tooling.
    */
    union { null, array<Score>   } subvalues         = null;
    union { null, array<string>  } errorMsgs         = null;
    union { null, array<string>  } missingVarNames   = null;

    union { null, float          } prob              = null;
}

Aloha Score extensions

Scala Extraction Syntax

As in the protobuf case, we should have type classes to extract the value from a Score. For simplicity, it would be nice to have get and apply syntax analogous to Scala's scala.collection.GenMapLike trait. This should be very easy to accomplish and would look like this:

val s: Score = ???
val oi: Option[Int] = s.get[Int] // Safe.
val i: Int = s[Int]              // May throw java.util.NoSuchElementException.

Using this should only involve importing the syntax; the conversions should all be in the appropriate implicit scope so that individual converters don't need to be explicitly pulled into scope.
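For concreteness, a minimal sketch of the type class and syntax follows. Score and ValueType refer to the (assumed) Avro-generated classes from the proposed schema, and the getter names (getValueType, getIntValue, etc.) are hypothetical, so details may differ in practice.

// Sketch only: Score and ValueType are the (assumed) Avro-generated classes
// from the proposed schema; the getter names are hypothetical.
object ScoreSyntax {

  // One instance per supported output type.
  trait ScoreValue[A] {
    def extract(s: Score): Option[A]
  }

  object ScoreValue {
    implicit val intScoreValue: ScoreValue[Int] = new ScoreValue[Int] {
      def extract(s: Score): Option[Int] =
        if (s.getValueType == ValueType.INT) Option(s.getIntValue).map(_.intValue)
        else None
    }

    implicit val stringScoreValue: ScoreValue[String] = new ScoreValue[String] {
      def extract(s: Score): Option[String] =
        if (s.getValueType == ValueType.STRING) Option(s.getStringValue).map(_.toString)
        else None
    }

    // ... instances for the remaining scalar and array types elided ...
  }

  // get / apply syntax analogous to GenMapLike.
  implicit class RichScore(private val s: Score) extends AnyVal {
    def get[A](implicit sv: ScoreValue[A]): Option[A] = sv.extract(s)
    def apply[A](implicit sv: ScoreValue[A]): A =
      get[A].getOrElse(
        throw new java.util.NoSuchElementException("Score has no value of the requested type"))
  }
}

With import ScoreSyntax._ in scope, s.get[Int] and s[Int] from the snippet above resolve to these methods; because the instances live in the ScoreValue companion, they are found without any additional imports.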

Additional Coproduct Support

Perhaps it's also useful to consider coproducts at the type level using an encoding with something like cats or iota.

Java Extraction Syntax

T.B.D.

Perhaps something like a rich score wrapper with getT methods for each output type T. These should probably throw when the incorrect type is requested. Whether to allow up-casting should also be considered: for instance, what happens when the consumer asks for a Double but the type was a Long or a Float? The current protobuf Score code does this, but it should be readdressed at this time, and the Avro and protobuf behavior should be made consistent. Perhaps common interfaces could be extracted, and we could have rich implementations that provide more seamless interoperability.

Implementation Details

Map-valued scores are coming soon (for multilabel learning), and this presents a problem: encoding them in this kind of coproduct would require quadratically many fields, one per combination of key type and value type. Therefore, it might be a good time to think about generating the Avro schema programmatically. This might be possible with something like the sbt-boilerplate plugin; if not, we might want to think about writing such a plugin.
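As one illustration (not part of the proposal), the repetitive per-type fields could also be produced with Avro's SchemaBuilder API instead of hand-maintained IDL. The record name and namespace below are placeholders, only the value fields are shown, and a real generator would additionally loop over the key/value type combinations needed for map-valued scores.

import org.apache.avro.{Schema, SchemaBuilder}

// Sketch of building the proposed value fields programmatically.  ModelId,
// subvalues, errorMsgs, etc. are omitted, and the record name / namespace
// are placeholders.
object ScoreSchemaSketch {
  val scoreSchema: Schema =
    SchemaBuilder
      .record("Score").namespace("com.example.aloha")
      .fields()
      // scalar fields: union { null, T } with a null default
      .optionalBoolean("booleanValue")
      .optionalInt("intValue")
      .optionalLong("longValue")
      .optionalFloat("floatValue")
      .optionalDouble("doubleValue")
      .optionalString("stringValue")
      // array fields: union { null, array<T> } with a null default
      .name("booleanArrayValue").`type`().optional().array().items().booleanType()
      .name("intArrayValue").`type`().optional().array().items().intType()
      .name("longArrayValue").`type`().optional().array().items().longType()
      .name("floatArrayValue").`type`().optional().array().items().floatType()
      .name("doubleArrayValue").`type`().optional().array().items().doubleType()
      .name("stringArrayValue").`type`().optional().array().items().stringType()
      .endRecord()
}

Both the optional* shortcuts and .type().optional() should produce a union of null and the given type with a null default, which is what the IDL above declares.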

ryan-deak-zefr commented 7 years ago

Here's a question for the consumer. Is it easier to consume:

  1. a score whose value has a polymorphic type? OR
  2. a score with many key-value pairs of monomorphic type, where exactly zero or one key-value pair is populated (zero for a "no prediction", one for a prediction)? Note that a type hint could still allow for O(1) value extraction; the dispatch could be based on the @switch annotation, and the byte code should be checked for a tableswitch instruction (see here for more details, and see the sketch after the examples below).

For instance:

Polymorphic Type

{ 
  "scores": [
    { "value": 1     }, 
    { "value": "car" }, 
    { "value": true  }
  ]
}

Many Monomorphic Types

{
  "scores": [
    { "intValue":    1,      "valueType": "int"     }, 
    { "stringValue": "car",  "valueType": "string"  }, 
    { "booleanValue": true,  "valueType": "boolean" }
  ]
}
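For the O(1) extraction mentioned in option 2, a sketch of the @switch-based dispatch is below. Score and ValueType are the assumed generated classes, the getter names are hypothetical, and the emitted byte code would still need to be checked for a tableswitch instruction.

import scala.annotation.switch

// Sketch: dispatch on the type hint's ordinal.  Matching on integer literals
// with @switch asks scalac for a tableswitch/lookupswitch instead of a chain
// of comparisons.  The case numbers assume the enum order in the proposal
// (NO_VALUE, BOOLEAN, INT, LONG, FLOAT, DOUBLE, STRING, ...).
def rawValue(s: Score): Option[Any] = {
  val vt = s.getValueType
  if (vt == null) None
  else (vt.ordinal(): @switch) match {
    case 0 => None                        // NO_VALUE
    case 1 => Option(s.getBooleanValue)   // BOOLEAN
    case 2 => Option(s.getIntValue)       // INT
    case 3 => Option(s.getLongValue)      // LONG
    case 4 => Option(s.getFloatValue)     // FLOAT
    case 5 => Option(s.getDoubleValue)    // DOUBLE
    case 6 => Option(s.getStringValue)    // STRING
    // ... array cases elided ...
    case _ => None
  }
}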
ryan-deak-zefr commented 7 years ago

Another question: what to do about null?

{ "value": null } vs {}. AND

{ "intValue": null, "valueType": "int" } vs { "valueType": "int" }.

I lean toward the latter in both scenarios.

deaktator commented 7 years ago

Something like the following is also possible, where each score type has an associated record type. This has icky syntax implications but does allow homogeneous arrays:

Avro

@namespace("com.github.deaktator.scores")
protocol ScoreProtocol {
  record IntScore { int _1 = 0; }
  record StringScore { string _1 = ""; }
  record DoubleVectorScore { array<double> _1 = []; }
  record Score {
    union { null, IntScore, StringScore, DoubleVectorScore } value = null;
  }
}

Calling code

val s = Score.newBuilder.setValue(IntScore.newBuilder.set_1(1).build).build
val value = s.value._1     //  <-- YUCK!

Again, this stuff could likely be made pretty with type classes, so it's not really a big deal, but this doesn't seem to provide much over the current implementation except for homogeneous vector types.
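For reference, a sketch of how type classes could tidy the syntax, mirroring the field-access style of the calling code above (the real Avro-generated accessors may differ):

// Sketch only: one type class instance per wrapper record hides the `._1`
// access behind the same get[A] syntax proposed earlier.
object WrappedValueSyntax {
  trait WrappedValue[A] { def get(s: Score): Option[A] }

  implicit val intWrapped: WrappedValue[Int] = new WrappedValue[Int] {
    def get(s: Score): Option[Int] = s.value match {
      case i: IntScore => Some(i._1)
      case _           => None
    }
  }
  // ... instances for StringScore, DoubleVectorScore, etc. elided ...

  implicit class WrappedScoreOps(private val s: Score) extends AnyVal {
    def get[A](implicit wv: WrappedValue[A]): Option[A] = wv.get(s)
  }
}

After import WrappedValueSyntax._, the call site becomes score.get[Int] instead of score.value._1.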

deaktator commented 7 years ago

Something like this might also be possible:

@namespace("com.github.deaktator.scores")
protocol ScoreProtocol {
  record ModelId {
    union { null, long } id = null; 
    union { null, string } name = null;
  }

  record IntScore {
    union { null, ModelId } modelId = null;
    union { null, int } value = null; 
    union { null, array<Score> } subvalues = null;
  }

  record DoubleVectorScore { 
    union { null, ModelId } modelId = null;
    union { null, array<double> } value = null; 
    union { null, array<Score> } subvalues = null;
  }

  record Score {
    union { null, IntScore, DoubleVectorScore } value = null;
  }
}
deaktator commented 7 years ago

This last suggestion would be nice as it reflects the idea of RootedTreeAuditor; however, there are many problems with this in Avro and it might not be feasible. Several related issues have been filed against Avro.

All of these issues are manifestations of the same problem: Avro doesn't seem to support mutually recursive types.

AVRO-1723 seems promising. It has been merged to master but doesn't appear to be in any release yet (as of 1.8.2), because the issue still exists.

Example of problem

Using Avro 1.8.2

avro-tools 2>&1 | head -n1 | ggrep -Po '\d(\.\d+)*'

outputs 1.8.2.

Schema

// mut_rec_protocol.avdl
@namespace("test")
protocol MutRecProtocol {
  record C1 {
    union { null, C2 } next = null;
  }

  record C2 {
    union { null, int } value = null;
    union { null, C1 } next = null;
  }
}

Running avro-tools still produces the error:

avro-tools idl mut_rec_protocol.avdl

outputs:

Exception in thread "main" org.apache.avro.compiler.idl.ParseException: Undefined name 'test.C2', at line 4, column 19
    at org.apache.avro.compiler.idl.Idl.error(Idl.java:68)
    at org.apache.avro.compiler.idl.Idl.ReferenceType(Idl.java:875)
    at org.apache.avro.compiler.idl.Idl.Type(Idl.java:789)
    at org.apache.avro.compiler.idl.Idl.UnionDefinition(Idl.java:195)
    at org.apache.avro.compiler.idl.Idl.Type(Idl.java:807)
    at org.apache.avro.compiler.idl.Idl.FieldDeclaration(Idl.java:590)
    at org.apache.avro.compiler.idl.Idl.RecordDeclaration(Idl.java:554)
    at org.apache.avro.compiler.idl.Idl.NamedSchemaDeclaration(Idl.java:155)
    at org.apache.avro.compiler.idl.Idl.ProtocolBody(Idl.java:389)
    at org.apache.avro.compiler.idl.Idl.ProtocolDeclaration(Idl.java:229)
    at org.apache.avro.compiler.idl.Idl.CompilationUnit(Idl.java:116)
    at org.apache.avro.tool.IdlTool.run(IdlTool.java:65)
    at org.apache.avro.tool.Main.run(Main.java:87)
    at org.apache.avro.tool.Main.main(Main.java:76)

Will this be sufficiently fixed anytime soon?

Many of these tickets are not only unresolved, but it seems that some of the authors are ideologically against the idea. Additionally, patches and PRs are very slow to get into Avro. AVRO-1723 was started in 2015 and was only merged May 2017.

And here are some other comments:

Thinking a bit more about this request.... I am not sure it is a very good idea. It is simple, of course, to see how this works from a Java implementation. Java already supports circular references, so I thought maybe Avro could as well. However, Avro is not Java. Just consider the 'getSchema().toString()' that this would cause. Normally, you would get the completely de-referenced schema of the object. If Avro supported circular references, this would be infinitely long, continually bouncing back and forth between the references. Changes could be made to still support how that is represented, but I am not sure I like the impact. —Doug Houck